🤖 AI Summary
This work addresses two open questions in audio-visual captioning: how complementary the audio and visual modalities actually are in current models, and how robust such models remain when one modality is degraded, with particular attention to a prevalent audio-dominance bias. To this end, we propose a systematic modality robustness evaluation framework that quantifies a model's reliance on the audio stream. Using LAVCap, a state-of-the-art audio-visual captioning model, we selectively suppress or perturb the unimodal inputs and measure the resulting performance degradation under modality mismatch; the analysis reveals a pronounced bias toward audio. To assess how balanced models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps benchmark. LAVCap trained on AudioVisualCaps exhibits noticeably less modality bias than when trained on AudioCaps. Our core contributions are threefold: (1) establishing a principled evaluation paradigm for modality bias in multimodal captioning; (2) releasing AudioVisualCaps, a benchmark dataset for joint audio-visual description; and (3) empirically demonstrating that carefully constructed multimodal data is critical for balanced and robust use of both modalities.
📝 Abstract
Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps, and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
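Below is a minimal sketch of the kind of modality robustness probe described above. It is illustrative only: `model.generate_caption(audio, visual)` and `score_captions(hypotheses, references)` are hypothetical placeholders for LAVCap's actual decoding and caption-metric code (e.g., CIDEr/SPICE scoring), which the paper does not expose here. The idea is simply to zero out (suppress) or add noise to (corrupt) one modality's features before fusion and compare the resulting drop in caption quality across modalities.

```python
# Sketch of a modality-robustness probe for an audio-visual captioner.
# Hypothetical interfaces: model.generate_caption(audio, visual) and
# score_captions(hyps, refs) are assumptions, not the paper's actual API.
import torch


def perturb(features: torch.Tensor, mode: str) -> torch.Tensor:
    """Suppress or corrupt one modality's feature stream before fusion."""
    if mode == "clean":
        return features
    if mode == "drop":      # suppress: replace the stream with zeros
        return torch.zeros_like(features)
    if mode == "noise":     # corrupt: add Gaussian noise scaled to the features
        return features + torch.randn_like(features) * features.std()
    raise ValueError(f"unknown perturbation mode: {mode}")


@torch.no_grad()
def modality_robustness_scores(model, dataset, score_captions):
    """Measure caption quality as audio or visual inputs are degraded.

    A large drop when the audio stream is perturbed, but only a small drop
    when the visual stream is perturbed, indicates audio dominance
    (and vice versa).
    """
    conditions = [
        ("none", "clean"),
        ("audio", "drop"), ("audio", "noise"),
        ("visual", "drop"), ("visual", "noise"),
    ]
    results = {}
    for target, mode in conditions:
        hypotheses, references = [], []
        for audio, visual, reference in dataset:
            a = perturb(audio, mode) if target == "audio" else audio
            v = perturb(visual, mode) if target == "visual" else visual
            hypotheses.append(model.generate_caption(a, v))
            references.append(reference)
        results[(target, mode)] = score_captions(hypotheses, references)
    return results
```

Comparing `results[("audio", "drop")]` against `results[("visual", "drop")]` (relative to the clean condition) gives a simple, model-agnostic signal of which modality the captioner actually relies on.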