Information-Theoretic Decomposition for Multimodal Interaction Learning

📅 2026-06-09
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
Existing multimodal learning approaches struggle to model how redundant, unique, and synergistic information vary from sample to sample. This work proposes an information-theoretic, decomposition-based paradigm for multimodal interaction learning and offers the first systematic analysis of why sample-level interactions matter. A variational architecture explicitly disentangles the three interaction components, and a fine-tuning strategy then uses those components to adapt learning to each sample. Experiments across diverse tasks and architectures show that the approach consistently achieves superior performance, supporting its effectiveness and generality for modeling sample-level multimodal interactions.
📝 Abstract
Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at https://github.com/GeWu-Lab/DMIL.
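To make the redundancy/synergy distinction concrete, here is a toy sketch that is not the DMIL method and is not taken from the paper. It assumes two binary modalities `x1`, `x2` and a label `y` with equally likely outcomes, and compares the mutual information each modality carries alone with what both carry together. Under the standard partial information decomposition accounting, I(X1;Y) + I(X2;Y) - I(X1,X2;Y) equals redundancy minus synergy, so a negative value indicates synergy dominates and a positive value indicates redundancy dominates. The function names and the two datasets are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Return I(A;B) in bits for a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), c in joint.items()
    )


def interaction_summary(samples):
    """samples: equally likely (x1, x2, y) triples.

    Returns (I(X1;Y), I(X2;Y), I(X1,X2;Y), redundancy_minus_synergy).
    """
    i1 = mutual_information([(x1, y) for x1, _, y in samples])
    i2 = mutual_information([(x2, y) for _, x2, y in samples])
    i12 = mutual_information([((x1, x2), y) for x1, x2, y in samples])
    return i1, i2, i12, i1 + i2 - i12


# Synergy-only toy task: the label is the XOR of the two modalities,
# so neither modality alone says anything about the label.
xor_task = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

# Redundancy-only toy task: both modalities are copies of the label.
copy_task = [(a, a, a) for a in [0, 1]]

print("xor :", interaction_summary(xor_task))
print("copy:", interaction_summary(copy_task))
```

In the XOR task each modality alone carries zero bits about the label while the pair carries one bit, which is the kind of case where combining independent unimodal predictions cannot help. In the copy task either modality alone already carries the full bit.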
Problem

Research questions and friction points this paper is trying to address.

multimodal interaction
sample-specific
information decomposition
dynamic interaction
redundancy and synergy
Innovation

Methods, ideas, or system contributions that make the work stand out.

information-theoretic decomposition
multimodal interaction learning
sample-specific interactions
variational decomposition
adaptive multimodal fusion