🤖 AI Summary
Existing vision-language models (VLMs) are predominantly English-centric, severely limiting their multilingual understanding and generation capabilities. To address this, we systematically investigate critical design factors, including training data composition, encoder architecture, and textual backbone selection, and propose the TowerVision model family alongside VisionBlocks, a high-quality multilingual multimodal dataset. Challenging the prevailing paradigm that instruction-tuned large models serve as the default initialization, our approach integrates a Tower architecture with multilingual text encoders and employs a multi-stage alignment training strategy that is aware of both visual and cultural context, unifying support for image-text and video-text tasks. Evaluated on ALM-Bench, Multi30K, and ViMUL-Bench, TowerVision significantly outperforms same-scale and even larger models, particularly on culture-sensitive tasks and multimodal translation, establishing a new design paradigm and open-source foundation for multilingual VLMs.
📄 Abstract
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization, both from high-resource to underrepresented languages and vice versa, and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.