Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-aware large language models (ALLMs) frequently hallucinate acoustic events that are not present in the audio, undermining their reliability in practice. To address this, the authors propose LISTEN, a method that explicitly models the "what not to hear" capability through LLM-driven negative-sample synthesis and contrastive-like training of a lightweight adapter. LISTEN is the first to reframe the discrimination of absent sounds as a learnable contrastive task, and it does so without modifying the backbone LLM's parameters: only the adapter is trained. It achieves state-of-the-art performance on audio question answering and reasoning while significantly reducing auditory hallucination rates. Notably, LISTEN cuts training-data requirements and computational overhead by 37% and 29%, respectively. Its core innovations are: (1) the first use of an LLM to synthesize audio negative samples; (2) the first formulation of sound absence as a contrastive learning objective; and (3) highly efficient, low-overhead hallucination suppression.

📝 Abstract
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.
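The synthesized-negative-sample idea from the abstract can be sketched as follows. Note this is only an illustrative stand-in: the paper has the backbone LLM generate the extended negatives, whereas this sketch samples absent sound labels from a hypothetical vocabulary (`SOUND_VOCAB`, `make_qa_pairs`, and the question template are all assumptions, not the authors' pipeline).

```python
import random

# Hypothetical label vocabulary; LISTEN itself uses the backbone LLM to
# synthesize negatives, so this random sampler is only a stand-in.
SOUND_VOCAB = ["dog barking", "car horn", "rain", "speech", "siren", "door knock"]

def make_qa_pairs(present_labels, num_negatives=2, rng=None):
    """Build yes/no QA pairs: positives for sounds present in the clip,
    negatives for vocabulary sounds that are absent from it."""
    rng = rng or random.Random(0)
    absent = [s for s in SOUND_VOCAB if s not in present_labels]
    negatives = rng.sample(absent, min(num_negatives, len(absent)))
    pairs = [(f"Is there a sound of {s} in the audio?", "Yes") for s in present_labels]
    pairs += [(f"Is there a sound of {s} in the audio?", "No") for s in negatives]
    return pairs

# A clip containing only a barking dog yields one positive and two negatives.
pairs = make_qa_pairs(["dog barking"], num_negatives=2)
```

Training on the "No" pairs is what teaches the model to reject sounds it does not hear, rather than only confirming sounds it does.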
Problem

Research questions and friction points this paper is trying to address.

Mitigating hallucinations in audio-aware large language models
Distinguishing present and absent sounds using synthesized data
Enhancing reliability without modifying LLM parameters
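One simple reading of the present-versus-absent discrimination objective is a binary cross-entropy over per-question "sound is present" scores. The paper describes its method only as "contrastive-like", so the exact loss below is an assumption for illustration, not the authors' formulation.

```python
import math

def presence_bce_loss(scores, labels):
    """Binary cross-entropy over 'sound is present' logits.
    labels: 1 if the queried sound is present (answer 'Yes'), 0 if absent."""
    total = 0.0
    for z, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns logit into probability
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(scores)

# One present sound scored high, two absent sounds scored low: small loss.
loss = presence_bce_loss([2.0, -1.5, -3.0], [1, 0, 0])
```

A model that confidently claims absent sounds are present (a hallucination) receives a large penalty from the zero-labeled terms, which is the behavior the negative samples exist to suppress.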
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive training with synthesized negative samples
Lightweight adapter for audio integration
No modification to LLM parameters required
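The "lightweight adapter, frozen LLM" design can be sketched as a small projection from the audio-feature space into the LLM embedding space, with the backbone held fixed. Class names, dimensions, and the toy linear map are illustrative assumptions; the real adapter architecture is not specified here.

```python
import random

class AudioAdapter:
    """Toy linear adapter mapping audio features (dim d_in) into the LLM's
    embedding space (dim d_out). These are the only weights that train."""
    def __init__(self, d_in, d_out, rng=None):
        rng = rng or random.Random(0)
        self.w = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]

    def __call__(self, x):
        # Matrix-vector product: project one audio feature vector.
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

class FrozenBackbone:
    """Stand-in for the backbone LLM: its parameters are never updated."""
    trainable = False

adapter = AudioAdapter(d_in=4, d_out=3)
llm_input = adapter([0.5, -0.2, 0.1, 0.9])  # fed to the frozen backbone
```

Keeping the backbone frozen is what makes the approach cheap: gradient updates touch only the adapter's d_in x d_out weights, not billions of LLM parameters.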