🤖 AI Summary
This work addresses the lack of systematic evaluation and inconsistent training/evaluation protocols in multimodal large language models (MLLMs), which hinder fair cross-model comparisons. We propose the first unified framework for scale- and architecture-agnostic benchmarking of vision-language backbone co-design. Methodologically, we instantiate LLaVA with diverse small-to-medium language models (e.g., Phi-4, LLaMA-3.1, Gemma-2) and vision encoders (e.g., CLIP, DINOv2, SigLIP/SigLIP2), conducting visual instruction tuning under standardized data, training protocols, and evaluation benchmarks. Key findings reveal that compact LLMs—when paired with state-of-the-art vision backbones like SigLIP2—achieve multimodal performance on par with or exceeding that of significantly larger LLMs; moreover, model scale, image resolution, and vision pretraining data exhibit strong nonlinear interactions. We open-source all code, trained models, and a standardized evaluation suite to advance principled, reproducible MLLM co-design research.
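The co-design evaluation described above amounts to sweeping a grid of language-model and vision-encoder pairings under one fixed protocol. A minimal sketch of that grid enumeration is shown below; the model names mirror those listed in the summary, while the `build_eval_grid` helper and dictionary layout are illustrative assumptions, not the authors' actual code.

```python
# Hypothetical sketch of the backbone co-design grid: every LLM is paired
# with every vision encoder and evaluated under the same protocol.
from itertools import product

LLMS = ["Phi-4", "LLaMA-3.1", "Gemma-2"]
VISION_ENCODERS = ["CLIP", "DINOv2", "SigLIP", "SigLIP2"]

def build_eval_grid(llms, encoders):
    """Enumerate every (LLM, vision encoder) pairing for benchmarking."""
    return [{"llm": llm, "encoder": enc} for llm, enc in product(llms, encoders)]

grid = build_eval_grid(LLMS, VISION_ENCODERS)
print(len(grid))  # 3 LLMs x 4 encoders -> 12 configurations
```

Each configuration would then undergo identical visual instruction tuning and evaluation, which is what makes the resulting cross-model comparisons fair.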
📝 Abstract
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the impact of the LLM on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.