Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing test-time compute (TTC) methods yield limited gains for vision-language models (VLMs), primarily because a single model's predictions lack diversity, which restricts the benefit of voting. This work systematically studies TTC in VLMs and proposes Entropy-based TTC (ETTC), which selects the most confident prediction according to predictive entropy: it reduces to majority voting for a single model and, when multiple models are available, prioritizes the outputs of more confident models. Experiments across seven VLMs and six benchmarks show that ETTC consistently outperforms both majority voting and the best individual model. The results also show that smaller models can improve larger ones when combined, yielding ensembling gains that standard strategies do not achieve.

📝 Abstract

Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
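
The selection rule described in the abstract can be illustrated with a minimal sketch. The abstract does not say how predictive entropy is computed, so this example assumes it is the Shannon entropy of the empirical distribution over each model's sampled final answers; the function names (`answer_entropy`, `majority_vote`, `entropy_select`), the model names, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers.

    Assumption: confidence is measured on final answers only, not on token
    probabilities. The paper may define predictive entropy differently.
    """
    counts = Counter(answers)
    total = len(answers)
    return sum((c / total) * log(total / c) for c in counts.values())


def majority_vote(answers):
    """Most frequent answer; ties go to the answer that appeared first."""
    return Counter(answers).most_common(1)[0][0]


def entropy_select(samples_by_model):
    """Return the majority answer of the model with the lowest answer entropy.

    With a single model this is plain majority voting over its samples.
    """
    most_confident = min(
        samples_by_model, key=lambda name: answer_entropy(samples_by_model[name])
    )
    return majority_vote(samples_by_model[most_confident])


# Illustrative data only: five sampled answers per (hypothetical) model.
samples = {
    "model_a": ["B", "B", "B", "A", "C"],
    "model_b": ["B", "B", "C", "D", "B"],
    "model_c": ["A", "A", "A", "A", "B"],
}

pooled = [answer for answers in samples.values() for answer in answers]
pooled_choice = majority_vote(pooled)      # vote over all 15 samples
entropy_choice = entropy_select(samples)   # follow the lowest-entropy model
```

On this toy input, pooled majority voting returns "B" because two less consistent models outvote the third, while the entropy-based rule follows `model_c`, whose samples agree most, and returns "A". The example only shows that the two rules can disagree; it makes no claim about which answer is correct.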

Problem

Research questions and friction points this paper is trying to address.

test-time compute

vision-language models

prediction diversity

model ensembles

majority voting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Compute

Vision-Language Models

Prediction Diversity