TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

📅 2025-10-22
📈 Citations: 0
✹ Influential: 0
đŸ€– AI Summary
Existing vision-language models (VLMs) are predominantly English-centric, which severely limits their multilingual understanding and generation capabilities. To address this, the paper systematically investigates critical design factors, including training data composition, vision encoder choice, and text backbone selection, and introduces the TowerVision model family alongside VisionBlocks, a high-quality multilingual multimodal dataset. Built on the multilingual Tower+ text backbone and fine-tuned with visual and cultural context, TowerVision unifies support for image-text and video-text tasks and challenges the prevailing assumption that instruction-tuned LLMs are the best initialization for multilingual VLMs. Evaluated on ALM-Bench, Multi30K, and ViMUL-Bench, TowerVision outperforms same-scale and even larger models, particularly on culturally grounded tasks and multimodal translation, and all models, data, and training recipes are released as an open foundation for multilingual VLMs.

📝 Abstract
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
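
Since the authors state that all models and training recipes are publicly released, a TowerVision checkpoint should be usable with standard tooling. Below is a minimal inference sketch assuming a LLaVA-style checkpoint on the Hugging Face Hub; the model id, prompt template, and exact model class are assumptions for illustration, not details confirmed by the paper.

```python
# Minimal sketch: querying a TowerVision-style VLM with Hugging Face transformers.
# The model id below is hypothetical; check the authors' release for the real one.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "utter-project/TowerVision-7B"  # hypothetical id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder path
# Multilingual prompting is the point of the model; the chat template is assumed.
prompt = "USER: <image>\nDescreve esta imagem em português.\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```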
Problem

Research questions and friction points this paper is trying to address.

Analyzing multilingual design choices in vision-language models
Developing open multilingual VLMs for image-text and video-text tasks
Improving cross-lingual generalization through multilingual training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual VLMs built on the Tower+ multilingual text backbone (a minimal sketch of the vision-to-text pairing follows this list)
Fine-tuning that integrates visual and cultural context
Release of VisionBlocks, a curated vision-language training dataset
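
The pairing named in the first bullet follows the now-standard VLM recipe: patch features from a pretrained vision encoder are projected into the embedding space of a multilingual LLM. The PyTorch sketch below illustrates such a projector; the dimensions, module shape, and patch count are illustrative assumptions, not the paper's configuration.

```python
# Toy sketch of a LLaVA-style vision-to-text projector feeding a multilingual
# backbone such as Tower+. All dimensions are illustrative only.
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Two-layer MLP mapping vision features to the LLM's embedding size."""
    def __init__(self, vision_dim: int = 1152, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, text_dim)

# The projected visual tokens are concatenated with text token embeddings
# before being fed to the multilingual text backbone.
projector = VisionToTextProjector()
dummy_patches = torch.randn(1, 729, 1152)  # e.g. a 27x27 patch grid
visual_tokens = projector(dummy_patches)
print(visual_tokens.shape)  # torch.Size([1, 729, 4096])
```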
Authors

André G. Viveiros
Instituto Superior Técnico, Universidade de Lisboa; Instituto de TelecomunicaçÔes
Patrick Fernandes
Carnegie Mellon University & Instituto Superior Técnico
Saul Santos
Instituto Superior Técnico, Universidade de Lisboa; Instituto de TelecomunicaçÔes
Sonal Sannigrahi
Instituto Superior Técnico, Universidade de Lisboa; Instituto de TelecomunicaçÔes
Emmanouil Zaranis
Instituto Superior Técnico, Universidade de Lisboa; Instituto de TelecomunicaçÔes
Nuno M. Guerreiro
Sword Health
Amin Farajian
Unbabel
Pierre Colombo
CS of Equall; Associate Professor, Université Paris-Saclay (CentraleSupélec)
Graham Neubig
Carnegie Mellon University, All Hands AI
André F. T. Martins
Instituto Superior Técnico, Universidade de Lisboa; Instituto de TelecomunicaçÔes; TransPerfect; ELLIS Unit Lisbon