🤖 AI Summary
This work addresses two open questions in audio-visual captioning: how complementary the audio and visual modalities actually are in current models, and how robust such models remain when one modality is degraded, with particular attention to a prevalent audio-dominance bias. To this end, we propose a systematic modality robustness evaluation framework that quantifies a model's reliance on the audio stream. Using LAVCap, a state-of-the-art audio-visual captioning model, we selectively suppress or perturb the unimodal inputs and measure the resulting performance degradation under modality mismatch; the analysis reveals a pronounced bias toward audio. To assess how balanced models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps benchmark. LAVCap trained on AudioVisualCaps exhibits noticeably less modality bias than when trained on AudioCaps. Our core contributions are threefold: (1) establishing a principled evaluation paradigm for modality bias in multimodal captioning; (2) releasing AudioVisualCaps, a benchmark dataset for joint audio-visual description; and (3) empirically demonstrating that carefully constructed multimodal data is critical for balanced and robust use of both modalities.
📝 Abstract
Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps, and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
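Below is a minimal sketch of the kind of modality robustness probe described above. It is illustrative only: `model.generate_caption(audio, visual)` and `score_captions(hypotheses, references)` are hypothetical placeholders for LAVCap's actual decoding and caption-metric code (e.g., CIDEr/SPICE scoring), which the paper does not expose here. The idea is simply to zero out (suppress) or add noise to (corrupt) one modality's features before fusion and compare the resulting drop in caption quality across modalities.

```python
# Sketch of a modality-robustness probe for an audio-visual captioner.
# Hypothetical interfaces: model.generate_caption(audio, visual) and
# score_captions(hyps, refs) are assumptions, not the paper's actual API.
import torch


def perturb(features: torch.Tensor, mode: str) -> torch.Tensor:
    """Suppress or corrupt one modality's feature stream before fusion."""
    if mode == "clean":
        return features
    if mode == "drop":      # suppress: replace the stream with zeros
        return torch.zeros_like(features)
    if mode == "noise":     # corrupt: add Gaussian noise scaled to the features
        return features + torch.randn_like(features) * features.std()
    raise ValueError(f"unknown perturbation mode: {mode}")


@torch.no_grad()
def modality_robustness_scores(model, dataset, score_captions):
    """Measure caption quality as audio or visual inputs are degraded.

    A large drop when the audio stream is perturbed, but only a small drop
    when the visual stream is perturbed, indicates audio dominance
    (and vice versa).
    """
    conditions = [
        ("none", "clean"),
        ("audio", "drop"), ("audio", "noise"),
        ("visual", "drop"), ("visual", "noise"),
    ]
    results = {}
    for target, mode in conditions:
        hypotheses, references = [], []
        for audio, visual, reference in dataset:
            a = perturb(audio, mode) if target == "audio" else audio
            v = perturb(visual, mode) if target == "visual" else visual
            hypotheses.append(model.generate_caption(a, v))
            references.append(reference)
        results[(target, mode)] = score_captions(hypotheses, references)
    return results
```

Comparing `results[("audio", "drop")]` against `results[("visual", "drop")]` (relative to the clean condition) gives a simple, model-agnostic signal of which modality the captioner actually relies on.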