AI Summary
Existing open-source multimodal large language models (MLLMs) exhibit severe performance degradation on non-English languages and in cross-cultural scenarios, hindering equitable global deployment. To address this, we introduce Pangea, the first fully open-source MLLM supporting 39 languages. Methodologically, we construct PangeaIns, a culturally diverse multimodal instruction dataset of 6 million samples that combines high-quality English instructions, carefully machine-translated instructions, and culturally adapted tasks; we design PangeaBench, a comprehensive evaluation suite of 14 datasets covering 47 languages; and we perform end-to-end multilingual multimodal joint training, complemented by systematic ablation studies. Experiments demonstrate that Pangea significantly outperforms existing open-source MLLMs on multilingual and multicultural benchmarks. Crucially, our analysis is the first to empirically validate the critical influence of the English data proportion, language popularity, and multimodal sample volume on model performance. We fully open-source all components: data, code, and model weights.
Abstract
Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and Western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the impact of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
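The ablations mentioned above center on how the share of English data in the instruction mixture affects downstream performance. As a purely illustrative aid, the sketch below shows how one might budget a multilingual instruction mixture around a target English proportion; the subset names and example counts are hypothetical assumptions and do not reproduce the actual PangeaIns recipe or ratios.

```python
# Minimal sketch: budgeting a multilingual instruction mixture around a target
# English proportion. All subset names and sizes below are illustrative
# assumptions, not the actual PangeaIns composition.

def plan_mixture(subset_sizes, english_key, english_ratio, total):
    """Return how many examples to draw from each subset.

    The English subset receives english_ratio * total examples; the remaining
    budget is split across non-English subsets in proportion to their sizes.
    """
    english_budget = int(total * english_ratio)
    rest = total - english_budget
    non_english = {k: v for k, v in subset_sizes.items() if k != english_key}
    denom = sum(non_english.values())
    plan = {english_key: english_budget}
    for name, size in non_english.items():
        plan[name] = rest * size // denom
    return plan

# Hypothetical subsets mirroring the three PangeaIns ingredients named in the abstract.
subsets = {
    "english_instructions": 3_000_000,   # high-quality English instructions
    "machine_translated":   2_400_000,   # machine-translated instructions
    "culturally_grounded":    600_000,   # culturally relevant multimodal tasks
}

print(plan_mixture(subsets, "english_instructions", english_ratio=0.4, total=1_000_000))
```

Sweeping english_ratio in a sketch like this is the kind of knob the paper's ablations vary; the reported findings on English data proportion, language popularity, and multimodal sample counts come from the paper itself, not from this toy calculation.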