Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

📅 2026-06-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work investigates how multimodal large language models (MLLMs) extract query-relevant information from noisy visual contexts. The authors use a token-level metric, Retrieval Attention Mass (RAM), to identify a class of context-aware retrieval (CoRe) attention heads and reveal their functional sparsity: the top 5% of CoRe heads dominate cross-modal information extraction, while lower-ranked heads contribute little. Attention visualization, causal intervention, and ablation studies together validate the role of these heads. Exploiting this sparsity, the authors show that inference can be substantially accelerated while maintaining robust task performance, which points toward more interpretable and more efficient multimodal architectures.
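As a rough illustration of the ranking step described above, the sketch below scores each head by the share of its attention that lands on query-relevant visual tokens and keeps the top 5%. The exact RAM definition is not given in this summary, so the formula here is an assumption, and the names `retrieval_attention_mass` and `top_fraction_heads` and the toy tensor shapes are hypothetical.

```python
import numpy as np

def retrieval_attention_mass(attn, relevant_mask):
    """Assumed RAM: share of each head's attention on relevant tokens.

    attn: (layers, heads, keys) attention weights for one query position,
          each row summing to 1 over keys.
    relevant_mask: (keys,) boolean mask of query-relevant visual tokens.
    Returns a (layers, heads) array of scores in [0, 1].
    """
    return attn[..., relevant_mask].sum(axis=-1)

def top_fraction_heads(scores, fraction=0.05):
    """Return (layer, head) pairs for the highest-scoring fraction of heads."""
    _, n_heads = scores.shape
    k = max(1, int(round(fraction * scores.size)))
    flat = np.argsort(scores, axis=None)[::-1][:k]  # descending by score
    return [tuple(divmod(int(i), n_heads)) for i in flat]

# Toy example: 2 layers x 10 heads, 8 key tokens, the first 2 are relevant.
attn = np.full((2, 10, 8), 1.0 / 8.0)   # every head attends uniformly...
attn[1, 3] = 0.1 / 7.0                   # ...except layer 1, head 3,
attn[1, 3, 0] = 0.9                      # which focuses on a relevant token
relevant_mask = np.array([True, True] + [False] * 6)

scores = retrieval_attention_mass(attn, relevant_mask)
core_heads = top_fraction_heads(scores, fraction=0.05)
```

With 20 heads in total, a 5% cutoff keeps a single head, and in this toy setup it is the one head whose attention concentrates on the relevant tokens.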
📝 Abstract
While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.
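The ablation comparison can be sketched on a single attention layer as follows. This is a minimal illustration under stated assumptions: it zeroes the selected heads' outputs before they are concatenated, which is one common intervention. The abstract does not say whether the authors use zero ablation, mean ablation, or another scheme. The function name `ablate_heads` and the toy shapes are hypothetical.

```python
import numpy as np

def ablate_heads(head_outputs, heads_to_ablate):
    """Zero the outputs of selected heads in one layer (assumed zero ablation).

    head_outputs: (heads, tokens, head_dim) per-head outputs before the
                  output projection.
    heads_to_ablate: iterable of head indices to silence.
    Returns a new array; the input is left unchanged.
    """
    out = np.array(head_outputs, dtype=float, copy=True)
    idx = np.asarray(list(heads_to_ablate), dtype=int)
    out[idx] = 0.0
    return out

# Toy layer: 4 heads, 3 tokens, head dimension 2.
head_outputs = np.ones((4, 3, 2))
ablated = ablate_heads(head_outputs, [1])

# Concatenate heads per token, as would happen before the output projection.
merged = ablated.transpose(1, 0, 2).reshape(3, 4 * 2)
```

Running the same evaluation once with the top-ranked heads passed to `heads_to_ablate` and once with an equal number of lower-ranked heads gives the kind of contrast the abstract reports.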
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs
functional sparsity
cross-modal retrieval
mechanistic interpretability
visual feature extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

functional sparsity
CoRe heads
Retrieval Attention Mass
multimodal LLMs
mechanistic interpretability