Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models

📅 2026-06-05
🤖 AI Summary
Existing audio language models perform poorly in complex affective interactions for two reasons: text-dominant semantic representations overshadow subtle acoustic cues, and limited cognitive depth yields generic responses that ignore emotion. To address this, the work proposes the CogAudio-LLM framework. It introduces LIME-440K, a lexically identical, multi-emotion dataset designed to support acoustic-semantic decoupling. It adds EIPS, a four-step Chain-of-Thought mechanism that incorporates psychological reasoning; the chain is first established explicitly through supervised fine-tuning and then distilled into an implicit generation process for efficient inference. Finally, it uses DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the reasoning chain with the empathetic quality of the direct response. The reported experiments indicate improved emotional understanding and empathetic response generation, supporting more natural human-machine affective interaction.
📝 Abstract
While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM (https://github.com/zxzhao0/CogAudio-LLM), a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a "lexically-identical, multi-emotion" dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.
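To make the "lexically-identical, multi-emotion" idea concrete, the following is a minimal sketch of how utterances that share a transcript but differ in emotion label could be paired, so that only the acoustics distinguish the two items. It is an illustration under stated assumptions, not the LIME-440K construction pipeline: the field names (`audio`, `text`, `emotion`), the file names, and the helper `lexically_identical_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lexically_identical_pairs(samples):
    """Pair samples that share a transcript but carry different emotion labels.

    `samples` is a list of dicts with hypothetical keys "audio", "text",
    and "emotion"; this schema is assumed for illustration only.
    """
    by_text = defaultdict(list)
    for sample in samples:
        # Normalize case and whitespace so identical wording groups together.
        key = " ".join(sample["text"].lower().split())
        by_text[key].append(sample)

    pairs = []
    for group in by_text.values():
        for a, b in combinations(group, 2):
            # Keep only pairs where the words match but the emotion differs.
            if a["emotion"] != b["emotion"]:
                pairs.append((a["audio"], b["audio"]))
    return pairs


# Toy data with made-up file names.
samples = [
    {"audio": "a1.wav", "text": "I got the results today.", "emotion": "happy"},
    {"audio": "a2.wav", "text": "I got the results today.", "emotion": "sad"},
    {"audio": "a3.wav", "text": "See you tomorrow.", "emotion": "neutral"},
]
pairs = lexically_identical_pairs(samples)
```

Because each pair has identical text, a model cannot tell the two items apart from the words alone, which is the property the abstract relies on for acoustic-semantic decoupling.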
Problem

Research questions and friction points this paper is trying to address.

Audio Language Models
Semantic Dominance
Affective Reasoning
Empathetic Response
Acoustic-semantic Decoupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

cognitive affective reasoning
acoustic-semantic decoupling
Chain-of-Thought
empathetic response alignment
dual-route optimization