Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem that emotional cues in everyday speech are often subtle or ambiguous, which limits how accurately empathetic dialogue systems can respond. The authors propose a speech-to-speech emotionally adaptive dialogue framework controlled through a continuous valence-arousal (VA) space. The framework conditions on affect inferred from the user's speech and can also accept VA estimates from other sources, including facial expressions, physiological signals, and textual affect descriptions. It is trained on a newly constructed synthetic dataset, Sympatheia-18k, built around 12 emotion anchors; its neutral split pairs emotionally neutral queries with multiple emotion-conditioned responses so that explicit emotion control can be learned separately from the vocal cues in the query. Experiments show that the method generates responses that are more emotionally appropriate in both semantic content and spoken delivery than existing speech-based dialogue baselines. The study further shows that multimodal affective signals improve response alignment when speech alone provides limited emotional evidence.

📝 Abstract

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence-arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
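
The shared VA interface described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `AffectEstimate` structure, the `fuse_va` function, the [-1, 1] value ranges, and the confidence-weighted averaging rule are all assumptions chosen only to show how estimates from several sensing modules might be reduced to a single continuous control signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectEstimate:
    """One module's guess at the user's affect (hypothetical structure)."""
    source: str        # e.g. "speech", "face", "biosignal", "text"
    valence: float     # assumed range [-1, 1]: unpleasant to pleasant
    arousal: float     # assumed range [-1, 1]: calm to excited
    confidence: float  # assumed range [0, 1]


def clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Keep a value inside the assumed VA range."""
    return max(lo, min(hi, x))


def fuse_va(estimates: list[AffectEstimate]) -> tuple[float, float]:
    """Confidence-weighted mean of the estimates (an assumed fusion rule).

    Falls back to a neutral (0.0, 0.0) signal when there is no evidence.
    """
    total = sum(e.confidence for e in estimates)
    if total <= 0.0:
        return (0.0, 0.0)
    valence = sum(e.valence * e.confidence for e in estimates) / total
    arousal = sum(e.arousal * e.confidence for e in estimates) / total
    return (clamp(valence), clamp(arousal))


# Speech sounds neutral, but a facial-expression module reports distress.
estimates = [
    AffectEstimate("speech", valence=0.0, arousal=0.0, confidence=0.25),
    AffectEstimate("face", valence=-0.8, arousal=0.4, confidence=0.75),
]
va_control = fuse_va(estimates)
print(va_control)
```

In this toy case the fused signal moves toward negative valence even though the speech estimate is neutral, which mirrors the situation the abstract highlights: speech alone provides limited emotional evidence and another module supplies the missing cue.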

Problem

Research questions and friction points this paper is trying to address.

emotionally adaptive voice assistant

affective cues

emotion inference

spoken dialogue systems

emotional ambiguity

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous affect conditioning

emotionally adaptive voice assistant

valence-arousal control