🤖 AI Summary
This work investigates cross-modal interference from irrelevant audio, such as silence, synthetic noise, and environmental sounds, on the text reasoning capabilities of large audio-language models (LALMs). Contrary to common assumption, silence is not a neutral input; it destabilizes outputs as strongly as synthetic noise, exposing an unexpected robustness challenge for audio-augmented language processing. Across multiple text reasoning benchmarks, the study systematically varies audio type, duration, loudness, and decoding temperature to evaluate leading LALMs, and tests prompt engineering and self-consistency as mitigation strategies. Results show that all models suffer significant performance degradation, with larger models exhibiting greater, but still insufficient, resilience. Prompting offers limited relief, whereas self-consistency improves output stability at the cost of increased computational overhead. This study provides the first systematic empirical evidence of latent fragility in audio–text modality coupling, offering actionable pathways for enhancing LALM robustness.
📝 Abstract
Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.
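The self-consistency mitigation described above amounts to sampling several temperature-decoded answers and keeping the majority vote, trading extra computation for stability. A minimal sketch of that idea, where `generate` is a hypothetical stand-in for a sampled LALM decoding call (the paper's actual models and decoding setup are not shown here):

```python
from collections import Counter
import random


def self_consistency(generate, prompt, n_samples=5, seed=0):
    """Return the majority-vote answer over n_samples sampled decodings.

    `generate` is any callable (prompt, rng) -> answer string; here it
    stands in for a temperature-sampled model call. Cost scales linearly
    with n_samples, which is the computational overhead noted above.
    """
    rng = random.Random(seed)
    answers = [generate(prompt, rng) for _ in range(n_samples)]
    # Majority vote smooths out volatility across individual samples.
    return Counter(answers).most_common(1)[0][0]


# Toy stand-in model: answers "B" 70% of the time and "C" otherwise,
# mimicking the prediction volatility induced by irrelevant audio.
def noisy_model(prompt, rng):
    return "B" if rng.random() < 0.7 else "C"


voted = self_consistency(noisy_model, "Which option is correct?", n_samples=11)
```

With enough samples, the vote converges on the model's modal answer even when individual decodings flip, which is why stability improves while per-question cost grows by a factor of `n_samples`.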