🤖 AI Summary
Existing vision-language models (VLMs) are predominantly English-centric, severely limiting their multilingual understanding and generation capabilities. To address this, we systematically investigate critical design factors, including training data composition, encoder architecture, and textual backbone selection, and propose the TowerVision model family alongside VisionBlocks, a high-quality multilingual multimodal dataset. Challenging the prevailing paradigm that instruction-tuned large models serve as the default initialization, our approach integrates a Tower architecture with multilingual text encoders and employs a multi-stage alignment training strategy that is aware of both visual and cultural context, unifying support for image-text and video-text tasks. Evaluated on ALM-Bench, Multi30K, and ViMUL-Bench, TowerVision significantly outperforms same-scale and even larger models, particularly on culture-sensitive tasks and multimodal translation, establishing a new design paradigm and open-source foundation for multilingual VLMs.
📄 Abstract
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization, both from high-resource to underrepresented languages and vice versa, and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.