An Effective Router for Vision-Language Model Selection

📅 2026-06-07

🤖 AI Summary

Selecting among vision-language models (VLMs) remains difficult because routers for this task lack specialized data, rely on ineffective feature representations, and assume a rigid model space that is costly to extend. To address this, the work introduces a multimodal dataset built for VLM selection, comprising the outputs of seven mainstream VLMs on 32,626 unique image-text queries. The authors also propose ARMS, a lightweight router with only 0.8 billion parameters that combines VLM profiles with query features in a simple architecture to match each query to a suitable model. ARMS supports two extension strategies, incremental training and independent training, for adapting to newly added VLMs. On both in-distribution and out-of-distribution test sets, ARMS is reported to outperform commercial models hundreds of times larger, such as GPT-4o.

📝 Abstract

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection remains a critical yet challenging problem, which faces three main obstacles: 1) lack of specialized data, 2) ineffective feature representation, and 3) a rigid model space with costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles and employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS' adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategy, ARMS (only 800M parameters) can adapt to a broader VLM space and outperform commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in the anonymous repository.
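
The routing idea described above (match each query against per-model profiles and pick the best candidate) can be illustrated with a toy sketch. This is not the ARMS architecture, which the abstract does not specify in detail. The names `VLMProfile` and `ToyRouter`, the hand-written capability vectors, and the cosine-similarity scoring are all assumptions made for illustration; a learned router would produce these representations from the image-text query and the profile text.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class VLMProfile:
    """Hypothetical profile: a model name plus a fixed capability vector."""
    name: str
    capability: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyRouter:
    """Assumed stand-in for a router: scores each registered profile
    against the query features and returns the best-matching model name."""

    def __init__(self) -> None:
        self._profiles: dict[str, VLMProfile] = {}

    def register(self, profile: VLMProfile) -> None:
        # Adding a candidate touches only its own entry, loosely mirroring
        # the goal of extending the model space without redoing the rest.
        self._profiles[profile.name] = profile

    def route(self, query_features: tuple[float, ...]) -> str:
        if not self._profiles:
            raise ValueError("no candidate VLMs registered")
        best = max(
            self._profiles.values(),
            key=lambda p: cosine(query_features, p.capability),
        )
        return best.name


# Toy usage with made-up three-dimensional features.
router = ToyRouter()
router.register(VLMProfile("vlm_ocr", (0.9, 0.1, 0.2)))
router.register(VLMProfile("vlm_chart", (0.1, 0.9, 0.3)))
print(router.route((0.8, 0.2, 0.1)))  # prints vlm_ocr

# A newly introduced candidate is registered without changing the others.
router.register(VLMProfile("vlm_math", (0.2, 0.1, 0.9)))
print(router.route((0.1, 0.1, 0.9)))  # prints vlm_math
```

In this sketch, adding a model is just a dictionary insert. The paper's incremental and independent training strategies address the harder case where the router's learned parameters must also adapt to the new candidate.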

Problem

Research questions and friction points this paper is trying to address.

vision-language models

model selection

router

multimodal dataset

adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model Selection

Multimodal Router

Model Profiling