🤖 AI Summary
This study addresses the poorly understood internal mechanisms of audio-visual signal propagation in audio-visual large language models (AVLLMs). It presents the first systematic characterization of how audio and visual information is routed, utilized, and fused within AVLLMs under two input configurations: audio-visual video and multiple interleaved audio-visual items. The work also proposes improving inference efficiency by identifying and discarding tokens whose information has already been propagated to the rest of the model. Using information flow tracing and ablation analyses across tasks on 3B- and 7B-scale models, including Qwen2.5-Omni and Video-SALMONN2 Plus, the study finds that for audio-visual video, information flows along a sequential pathway, with each modality contributing in proportion to how much the task relies on it, and that routing shifts to parallel streams when multiple audio-visual items are interleaved. These findings advance the interpretability of AVLLMs and point to more efficient inference.
📝 Abstract
Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information has been transferred to the LLM, with minimal impact on the model's predictions and in some cases a slight improvement. This result generalizes across multiple tasks and datasets and enables more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B) and lead to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.
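
The token-discarding idea mentioned in the abstract can be illustrated with a small, purely hypothetical sketch. It is not the authors' implementation: the cutoff layer `drop_after`, the token-type labels, the helper names, and the toy layer are all assumptions made for illustration, and the abstract does not specify how the point at which information has been "transferred" is determined. The sketch only shows the mechanical step of removing audio and video positions from the hidden-state sequence after a chosen layer so that later layers process fewer tokens.

```python
import numpy as np

def drop_tokens(hidden, token_types, drop_types):
    """Remove rows of `hidden` whose token type is listed in `drop_types`."""
    keep = np.array([t not in drop_types for t in token_types], dtype=bool)
    kept_types = [t for t, k in zip(token_types, keep) if k]
    return hidden[keep], kept_types

def run_with_token_dropping(hidden, token_types, layers, drop_after,
                            drop_types=("audio", "video")):
    """Apply `layers` in order; right after layer index `drop_after`,
    discard every token whose type is in `drop_types` (hypothetical cutoff)."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i == drop_after:
            hidden, token_types = drop_tokens(hidden, token_types, drop_types)
    return hidden, token_types

def toy_layer(h):
    """Stand-in for a transformer block: uniform causal mixing plus a residual."""
    n = h.shape[0]
    mask = np.tril(np.ones((n, n)))
    attn = mask / mask.sum(axis=1, keepdims=True)
    return h + attn @ h

# Toy sequence: 4 audio tokens, 6 video tokens, 3 text tokens, hidden size 8.
rng = np.random.default_rng(0)
token_types = ["audio"] * 4 + ["video"] * 6 + ["text"] * 3
hidden = rng.normal(size=(len(token_types), 8))

layers = [toy_layer] * 4
out, out_types = run_with_token_dropping(hidden, token_types, layers, drop_after=1)
```

In this toy setup the first two layers see all 13 tokens, while the last two see only the 3 text tokens. In a real AVLLM the same idea would also require pruning the key-value cache and position information consistently, which this sketch does not attempt.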