Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches cannot explain, before training, how individual visual encoders interact or how much each one contributes in a multi-encoder vision-language model, which hinders efficient architectural design. This work addresses the gap by training all 31 non-empty subsets of five vision encoders from scratch under a unified pipeline and evaluating them on the Cambrian-1 benchmark suite. It proposes a two-axis Capacity-Necessity decomposition and shows that an encoder's standalone performance (Capacity) does not determine its Necessity, the drop in score when it is removed from the full pool under joint training. The analysis shows that the best ensembles are not simply combinations of the highest-Capacity encoders, and it introduces per-encoder pre-projector effective rank as a predictor of how well encoders work together. Experiments show that pairing one high-Capacity anchor encoder with one complementary encoder nearly matches the full five-encoder model, with only marginal gains from additional encoders.
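The two axes can be read directly off a table of per-subset scores. The sketch below is illustrative only: it takes Capacity as the score of the single-encoder subset and Necessity as the drop in score when an encoder is removed from the full pool. The encoder names, the helper functions, and the toy score table are hypothetical placeholders, not the paper's encoders, code, or results.

```python
from itertools import combinations

# Hypothetical encoder names; the five actual encoders are not listed here.
ENCODERS = ("enc_a", "enc_b", "enc_c", "enc_d", "enc_e")


def nonempty_subsets(items):
    """Return every non-empty subset of `items` as a frozenset."""
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def capacity(scores, encoder):
    """Capacity: the score of the model trained with `encoder` alone."""
    return scores[frozenset([encoder])]


def necessity(scores, encoder, pool):
    """Necessity: score drop when `encoder` is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {encoder}]


subsets = nonempty_subsets(ENCODERS)  # 2**5 - 1 subsets for five encoders

# Placeholder scores for demonstration only (NOT results from the paper):
# each subset's score grows linearly with the number of encoders it contains.
toy_scores = {subset: 50.0 + 2.0 * len(subset) for subset in subsets}

for name in ENCODERS:
    print(name, capacity(toy_scores, name), necessity(toy_scores, name, ENCODERS))
```

In practice each entry of the score table would come from a separately retrained model, which is what distinguishes this from masking encoders on a fixed checkpoint.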

📝 Abstract

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline on the 16-benchmark Cambrian-1 suite (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes: Capacity, the score an encoder reaches on its own, and Necessity, the drop in score when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design and offer concrete primitives for closing it.
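The abstract does not spell out how effective rank is computed. As a rough sketch, one common definition is the exponential of the Shannon entropy of the normalized singular values of a feature matrix. Whether the paper uses this exact definition, and how it collects the pre-projector features, are assumptions here; the function name and the `eps` threshold are illustrative.

```python
import numpy as np


def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a (n_samples, dim) feature matrix.

    Assumed definition: exp(H(p)), where p is the singular-value spectrum
    normalized to sum to 1. This may differ from the paper's definition.
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions before taking logs
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Hypothetical stand-in for the features one encoder feeds into its projector.
rng = np.random.default_rng(0)
fake_features = rng.standard_normal((256, 64))
print(effective_rank(fake_features))
```

Under this definition the value ranges from 1 (all variance in a single direction, i.e. collapsed inputs) up to the feature dimension (a perfectly flat spectrum), so a higher value corresponds to the "less-collapsed" projector inputs the abstract describes.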

Problem

Research questions and friction points this paper is trying to address.

multi-encoder VLMs

encoder interaction

vision-language models

parameter-efficient configuration

joint training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Capacity-Necessity decomposition

multi-encoder VLMs

pre-projector effective rank