Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual large language models (AV-LLMs) process audio and video jointly throughout the decoder, which often induces modality bias and impairs balanced multimodal understanding. To address this, the authors propose Fork-Merge Decoding (FMD), a zero-shot, architecture-agnostic inference-time decoding strategy: audio-only and video-only inputs are first processed independently through the early decoder layers, and the resulting hidden states are then merged for joint multimodal reasoning in the remaining layers. FMD requires no retraining or architectural modification beyond a layer-wise split of the decoder, and can be applied directly to mainstream AV-LLMs such as VideoLLaMA2 and video-SALMONN. Evaluated across three major benchmarks, FMD consistently improves performance on audio-focused, video-focused, and joint audio-visual reasoning tasks, showing that inference-time intervention alone, without altering training objectives or model parameters, can mitigate modality dependency and make multimodal understanding more robust and balanced.

📝 Abstract
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without requiring additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model over-relies on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (a fork phase), and then merges the resulting hidden states for joint reasoning in the remaining layers (a merge phase). This approach promotes balanced modality contributions and leverages complementary information across modalities. We evaluate our method on two representative AV-LLMs, VideoLLaMA2 and video-SALMONN, using three benchmark datasets. Experimental results show consistent performance improvements on tasks focused on audio, video, and combined audio-visual reasoning, confirming the effectiveness of inference-time interventions for robust multimodal understanding.
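The fork and merge phases described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the decoder is modeled as a plain list of layer functions, and the element-wise mean used to fuse hidden states, along with the `fork_depth` split point, are assumptions made for clarity.

```python
def fmd_decode(layers, audio_hidden, video_hidden, fork_depth):
    """Toy sketch of Fork-Merge Decoding.

    Fork phase: run each modality's hidden state alone through the early
    decoder layers. Merge phase: fuse the two hidden states and continue
    jointly through the remaining layers.
    """
    early, late = layers[:fork_depth], layers[fork_depth:]

    # Fork: modality-specific forward passes through the early layers.
    h_audio, h_video = audio_hidden, video_hidden
    for layer in early:
        h_audio = layer(h_audio)
        h_video = layer(h_video)

    # Merge: fuse hidden states (element-wise mean here, an assumed choice),
    # then reason jointly through the remaining layers.
    h = [(a + v) / 2 for a, v in zip(h_audio, h_video)]
    for layer in late:
        h = layer(h)
    return h
```

In a real AV-LLM the "layers" would be transformer decoder blocks and the fork would mask out the opposite modality's tokens rather than feed separate vectors, but the control flow, two independent early passes followed by one fused late pass, is the same.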
Problem

Research questions and friction points this paper is trying to address.

Addressing modality bias in AV-LLMs without retraining
Balancing audio-video contributions via Fork-Merge Decoding
Enhancing multimodal reasoning via inference-time intervention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fork-Merge Decoding balances multimodal understanding
Modality-specific reasoning before joint processing
No additional training or architectural changes needed