A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the opacity of decision-making mechanisms in large vision-language models (LVLMs), which often obscures whether predictions rely on multimodal synergy or unimodal priors. The authors propose the first model-agnostic evaluation framework based on Partial Information Decomposition (PID), integrating a scalable PID estimator to systematically quantify redundant, unique, and synergistic information in model decisions across 26 LVLMs and 4 datasets. Their analysis uncovers two distinct task mechanisms—synergy-driven and knowledge-driven—and two model strategies—fusion-centric and language-centric. Furthermore, they identify a three-stage pattern in layer-wise processing, highlighting the critical role of visual instruction tuning. This study establishes a new paradigm for fine-grained LVLM evaluation that moves beyond mere accuracy metrics.
📝 Abstract
Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine whether their success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis.
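To make the decomposition concrete: for two sources (e.g. a vision signal X1 and a language signal X2) and a target decision Y, PID splits the joint mutual information I(Y; X1, X2) into redundant, unique, and synergistic terms. The paper uses a scalable estimator for LVLM outputs; the sketch below is instead a minimal, hypothetical illustration for small discrete variables using the classic Williams-Beer I_min redundancy measure (the function name and interface are our own, not the authors' pipeline):

```python
from collections import Counter
import math

def pid_two_sources(samples):
    """Williams-Beer PID for two discrete sources and a target,
    estimated from a list of (x1, x2, y) samples.
    Returns (redundant, unique1, unique2, synergy) in bits."""
    n = len(samples)
    p_joint = {k: v / n for k, v in Counter(samples).items()}
    p_y, p_x1, p_x2 = Counter(), Counter(), Counter()
    p_x1y, p_x2y, p_x12 = Counter(), Counter(), Counter()
    for (x1, x2, y), p in p_joint.items():
        p_y[y] += p; p_x1[x1] += p; p_x2[x2] += p
        p_x1y[(x1, y)] += p; p_x2y[(x2, y)] += p; p_x12[(x1, x2)] += p

    def mi(p_xy, p_x):
        # I(X;Y) = sum_{x,y} p(x,y) log2 [p(x,y) / (p(x) p(y))]
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in p_xy.items())

    def spec_info(y, p_xy, p_x):
        # specific information the source carries about the outcome Y=y
        return sum((p_xy[(x, y)] / p_y[y]) *
                   math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))
                   for x in p_x if p_xy[(x, y)] > 0)

    # redundancy: expected minimum specific information across sources
    red = sum(p_y[y] * min(spec_info(y, p_x1y, p_x1),
                           spec_info(y, p_x2y, p_x2)) for y in p_y)
    i1, i2 = mi(p_x1y, p_x1), mi(p_x2y, p_x2)
    # joint mutual information I(Y; X1, X2)
    i12 = sum(p * math.log2(p / (p_x12[(x1, x2)] * p_y[y]))
              for (x1, x2, y), p in p_joint.items())
    unq1, unq2 = i1 - red, i2 - red
    return red, unq1, unq2, i12 - red - unq1 - unq2
```

On a purely synergistic toy example such as Y = X1 XOR X2 with uniform inputs, neither source alone is informative, so all of the one bit of joint information lands in the synergy term; a purely knowledge-driven (unimodal-prior) decision would instead concentrate in one unique term.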
Problem

Research questions and friction points this paper is trying to address.

large vision-language models
multimodal fusion
unimodal priors
model interpretability
decision-making opacity
Innovation

Methods, ideas, or system contributions that make the work stand out.

partial information decomposition
vision-language models
multimodal fusion
information spectrum
model interpretability