Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

📅 2026-05-30
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing approaches struggle to dissect, at a fine-grained level, how modalities interact in multimodal large language models and how that interaction depends on the task. This work introduces Partial Information Decomposition (PID) into the analysis of multimodal foundation models and proposes the Sensory PID framework to quantify, at the decision level, the unique, redundant, and synergistic contributions of visual, auditory, and linguistic inputs, extending PID to tri-modal systems for the first time. The study reveals general patterns of modality usage: reasoning and localization tasks rely heavily on cross-modal synergy, whereas knowledge-intensive tasks depend predominantly on language. It also uncovers a vision-dominated information bottleneck in audio-visual fusion. Building on these insights, the authors devise a PID-guided reweighting strategy that yields preliminary performance gains in multimodal reasoning and localization.
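For readers unfamiliar with PID, the standard two-source bookkeeping may help make "unique, redundant, and synergistic" concrete. The symbols below are illustrative and not taken from the paper: X_1 and X_2 stand for two input modalities, Y for the model's decision, and R, U_1, U_2, S for the redundant, unique, and synergistic terms.

```latex
\begin{align}
I(X_1, X_2; Y) &= R + U_1 + U_2 + S, \\
I(X_1; Y)      &= R + U_1, \\
I(X_2; Y)      &= R + U_2.
\end{align}
```

These three equations constrain four unknowns, so the decomposition is fixed only once a definition of redundancy R is chosen; given R, the synergy follows as S = I(X_1, X_2; Y) - I(X_1; Y) - I(X_2; Y) + R. Which redundancy measure the authors use is not stated in this summary.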
📝 Abstract
Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision-language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video-audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio-visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.
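As a rough illustration of what a "high synergy" versus "redundant" profile means, the following minimal sketch decomposes a small discrete joint distribution over two sources and a target. It is not the paper's estimator: it assumes the simple minimum-mutual-information (MMI) redundancy, R = min(I(X1;Y), I(X2;Y)), and the names `mutual_information`, `pid_mmi`, `xor_task`, and `copy_task` are hypothetical, introduced only for this example.

```python
import math
from collections import defaultdict


def mutual_information(joint, source_idx):
    """I(X_S; Y) in bits; joint maps (x1, x2, y) -> probability."""
    p_sy = defaultdict(float)  # joint of selected sources and target
    p_s = defaultdict(float)   # marginal of selected sources
    p_y = defaultdict(float)   # marginal of target
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in source_idx)
        y = outcome[-1]
        p_sy[(s, y)] += p
        p_s[s] += p
        p_y[y] += p
    return sum(
        p * math.log2(p / (p_s[s] * p_y[y]))
        for (s, y), p in p_sy.items()
        if p > 0
    )


def pid_mmi(joint):
    """Two-source PID under the MMI redundancy assumption (not the paper's method)."""
    i1 = mutual_information(joint, (0,))
    i2 = mutual_information(joint, (1,))
    i12 = mutual_information(joint, (0, 1))
    redundant = min(i1, i2)  # assumed redundancy measure
    return {
        "redundant": redundant,
        "unique_1": i1 - redundant,
        "unique_2": i2 - redundant,
        "synergy": i12 - i1 - i2 + redundant,
    }


# XOR target: neither source alone is informative, both together are.
xor_task = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Copy target: both sources carry the same bit as the target.
copy_task = {(a, a, a): 0.5 for a in (0, 1)}

print(pid_mmi(xor_task))
print(pid_mmi(copy_task))
```

Under these assumptions the XOR task comes out as purely synergistic (one bit of synergy, nothing else) and the copy task as purely redundant. In the paper's setting, the two sources would be modality inputs such as vision and language and the target would be the model's decision, which requires estimating the distributions from model behaviour rather than enumerating them as done here.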
Problem

Research questions and friction points this paper is trying to address.

modality interaction
multimodal language models
Partial Information Decomposition
sensory synergy
decision-level analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial Information Decomposition
Multimodal Language Models
Modality Interaction
Sensory PID
Synergy Bottleneck