🤖 AI Summary
Audio-aware large language models (ALLMs) frequently hallucinate non-existent acoustic events, undermining their practical reliability. To address this, we propose LISTEN—a novel method that explicitly teaches the critical "what not to hear" capability via LLM-driven negative-sample synthesis and contrastive training of a lightweight adapter. LISTEN is the first to reframe discriminating absent sounds as a learnable contrastive task, and it leaves the backbone LLM's parameters untouched, adhering to a parameter-efficient fine-tuning paradigm. It achieves state-of-the-art performance on audio question answering and reasoning while significantly reducing auditory hallucination rates. Notably, LISTEN cuts training-data requirements and computational overhead by 37% and 29%, respectively. Its core innovations include: (1) the first use of an LLM to synthesize audio negative samples; (2) the first formulation of "silence" as a contrastive learning objective; and (3) highly efficient, low-overhead hallucination suppression.
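The summary does not spell out the training objective, but the idea—train only a small adapter so that an audio clip's representation aligns with captions of sounds that are present and not with synthesized captions of sounds that are absent—can be illustrated with a toy sketch. Everything below (the linear adapter, the embedding dimensions, the pairwise log-sigmoid loss, the hand-derived gradient step) is an assumption for illustration, not LISTEN's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_text = 8, 8

# Frozen "embeddings": one audio clip, one caption for a sound that is present,
# and one for an absent sound (the LLM-synthesized negative sample).
audio = rng.normal(size=d_audio)
text_pos = rng.normal(size=d_text)   # e.g. "a dog is barking" (in the clip)
text_neg = rng.normal(size=d_text)   # e.g. "a siren is wailing" (not in it)

# Lightweight adapter: the only trainable parameters; the backbone is frozen.
W = rng.normal(scale=0.1, size=(d_text, d_audio))

def score(W, text, audio):
    # Alignment score between a text embedding and the adapted audio embedding.
    return text @ (W @ audio)

def loss(W):
    # Pairwise contrastive loss: push the present-sound score above the
    # absent-sound score via a log-sigmoid of the margin.
    margin = score(W, text_pos, audio) - score(W, text_neg, audio)
    return np.log1p(np.exp(-margin))

# One manual gradient-descent step on the adapter only.
# d/dm log(1 + e^{-m}) = -(1 - sigmoid(m)); dm/dW = outer(t_pos - t_neg, audio).
margin = score(W, text_pos, audio) - score(W, text_neg, audio)
grad = -(1.0 - 1.0 / (1.0 + np.exp(-margin))) * np.outer(text_pos - text_neg, audio)

before = loss(W)
W = W - 0.01 * grad
after = loss(W)
```

A single small step on this loss lowers it, i.e. the adapter moves toward scoring present sounds above absent ones while the (here hypothetical) backbone embeddings stay fixed—mirroring the frozen-backbone, adapter-only recipe the summary describes.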
📝 Abstract
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances an ALLM's ability to distinguish between present and absent sounds using data synthesized by the backbone LLM. Unlike prior approaches, our method requires no modification to the LLM's parameters and integrates audio representations efficiently via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining strong performance on existing audio question answering and reasoning benchmarks, and it is more efficient in both data and computation.