Topological Alignment of Shared Vision-Language Embedding Space

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multilingual vision-language models (VLMs) are constrained by English-centric multimodal data and rely solely on instance-level cross-modal alignment, neglecting the global geometric structure of the embedding space. To address this, the authors propose ToMCLIP, a framework that, for the first time, brings persistent homology into multilingual VLMs. A topology-aware alignment mechanism explicitly models the global topological structure shared between visual and multilingual textual embeddings, while a graph sparsification strategy enables efficient approximation of topological features within a theoretically guaranteed error bound. By enforcing topological consistency directly in the shared embedding space, ToMCLIP improves both the structural coherence and the robustness of semantic alignment. Experiments show substantial gains: higher zero-shot classification accuracy on CIFAR-100 and multilingual image-text retrieval performance on xFlickr&CO that surpasses state-of-the-art baselines.
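The summary above mentions approximating topological features efficiently on a sparsified graph. The paper's specific construction and its error bound are not detailed here, so the sketch below shows only the generic idea as an illustration: sparsify the pairwise-distance graph by keeping each point's k nearest-neighbor edges (the function name and the choice of k are hypothetical), cutting the edge count from O(n^2) to O(nk) before any persistence computation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_sparsify(points, k=8):
    """Keep only each point's k nearest-neighbor edges of the full
    pairwise-distance graph; missing edges are marked with infinity.
    A downstream persistence computation then touches O(n*k) edges
    instead of O(n^2)."""
    dist = cdist(points, points)
    n = len(points)
    # For each row, indices of the k nearest neighbors (column 0 is self).
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]
    sparse = np.full_like(dist, np.inf)
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    sparse[rows, cols] = dist[rows, cols]
    sparse[cols, rows] = dist[rows, cols]  # keep the graph symmetric
    return sparse
```

This is not the paper's sparsification scheme; it only demonstrates why sparsifying the distance graph makes topological computations cheaper.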

📝 Abstract
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot capabilities. However, their cross-modal alignment remains biased toward English due to limited multilingual multimodal data. Recent multilingual extensions have narrowed this gap, but they enforce instance-level alignment while neglecting the global geometry of the shared embedding space. We address this problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a topology-aware framework that aligns embedding spaces under topology-preserving constraints. The proposed method applies persistent homology to define a topological alignment loss and approximates persistence diagrams within theoretical error bounds using a graph sparsification strategy. We validate the approach, showing enhanced structural coherence of multilingual representations, higher zero-shot accuracy on CIFAR-100, and stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the proposed approach offers a general method for incorporating topological alignment into representation learning.
Problem

Research questions and friction points this paper is trying to address.

Addressing English bias in multilingual vision-language model alignment
Improving global geometry of shared embedding space topology
Enhancing multilingual representation coherence and cross-modal retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topological alignment loss with persistent homology
Graph sparsification for persistence diagram approximation
Topology-preserving constraints for multilingual embedding spaces
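The first bullet, a topological alignment loss built from persistent homology, can be illustrated with a minimal sketch. For the 0-dimensional homology of a Vietoris-Rips filtration, every connected component is born at scale 0 and dies at a minimum-spanning-tree edge length, so H0 persistence reduces to the MST edge weights of the pairwise-distance graph. The toy loss below compares sorted H0 death vectors of two embedding clouds; it is a deliberately simplified assumption, not the paper's actual loss, which presumably also handles higher-dimensional features and a proper diagram metric such as bottleneck or Wasserstein distance.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def h0_deaths(points):
    """H0 persistence of a Vietoris-Rips filtration: every component is
    born at scale 0 and dies at a minimum-spanning-tree edge length."""
    dist = cdist(points, points)
    mst = minimum_spanning_tree(dist).toarray()
    return np.sort(mst[mst > 0])  # the n-1 death times

def topo_alignment_loss(emb_a, emb_b):
    """Toy topology-alignment penalty: squared L2 distance between the
    sorted H0 death vectors of two embedding point clouds. Assumes both
    clouds contain the same number of points."""
    return float(np.sum((h0_deaths(emb_a) - h0_deaths(emb_b)) ** 2))
```

Because the loss depends only on pairwise distances, any isometry of one cloud (e.g. a translation) leaves it at zero; it penalizes genuine differences in cluster structure rather than per-instance mismatches, which is the spirit of the global alignment the paper advocates.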
Junwon You
Department of Mathematics, POSTECH, Republic of Korea
Dasol Kang
Dololo Research Engineer
Jae-Hun Jung
SUNY at Buffalo