AI Summary
Existing auditory foundation models neglect the human auditory selective attention mechanism, leading to poor alignment with listeners' subjective perception in multi-speaker scenarios. To address this, we propose Intention-Informed Auditory Scene Understanding (II-ASU), a paradigm that, for the first time, integrates intracranial electroencephalography (iEEG)-decoded neural attention signals from listeners into large audio language models to enable intention-driven auditory understanding. Methodologically, we design an end-to-end architecture comprising an iEEG feature encoder, an attention-state classifier, and a conditional response generation module. Evaluated on multi-speaker description, transcription, source separation, and question-answering tasks, the system achieves substantial improvements: +23.6% in subjective intention alignment, an 18.4% reduction in word error rate (WER), and a 15.2% increase in BLEU score. These results demonstrate the effectiveness and feasibility of neurofeedback-enhanced auditory AI.
Abstract
Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo and code available: https://aad-llm.github.io.
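The two-stage pipeline the abstract describes (first decode the attended speaker from neural activity, then condition response generation on that inferred attentional state) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the cosine-similarity decoder, and the trivial "restrict context to the attended speaker" conditioning are all assumptions for exposition.

```python
# Hypothetical sketch of the two-stage pipeline from the abstract.
# All names and the decoding/conditioning strategies are illustrative,
# not AAD-LLM's actual architecture.

def decode_attended_speaker(ieeg_features, speaker_embeddings):
    """Stage 1: infer which speaker the listener attends to, here by
    comparing a decoded neural feature vector against per-speaker
    embeddings with cosine similarity (a stand-in for the classifier)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    scores = [cosine(ieeg_features, emb) for emb in speaker_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

def respond_with_attention(question, transcripts, attended_idx):
    """Stage 2: condition response generation on the inferred attentional
    state -- trivially here, by restricting the context the language
    model would see to the attended speaker's stream."""
    context = transcripts[attended_idx]
    return f"[attending speaker {attended_idx}] {context}"

# Toy usage: two talkers; the neural features align with speaker 1.
neural = [0.1, 0.9]
speaker_embs = [[1.0, 0.0], [0.0, 1.0]]
idx = decode_attended_speaker(neural, speaker_embs)
reply = respond_with_attention("What did they say?",
                               ["hello there", "see you later"], idx)
```

In the real system, stage 1 would be a trained classifier over iEEG features and stage 2 an auditory LLM whose generation is conditioned on the attention signal; the sketch only shows how the two stages compose.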