🤖 AI Summary
To address the limitations of existing voice assistants (coarse-grained emotion understanding, low response naturalness, and no real-time tool invocation), this paper proposes LUCY, an end-to-end (E2E) voice agent. Methodologically, LUCY is the first E2E speech model to jointly model linguistic-level emotional instructions and paralinguistic cues (e.g., intonation, pauses) for fine-grained, controllable emotional generation. It incorporates an LLM-assisted naturalness evaluation mechanism to make responses more concise and fluent, and it features a structured function-calling interface for dynamically integrating external tools. Experiments show that LUCY controls emotion more accurately than baseline models, that external language models judge its responses to be more natural, and that it can use function calls to answer open-domain real-time questions outside its knowledge scope, without sacrificing much performance on general question answering.
📝 Abstract
The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative, and sensitive to emotional subtleties. Building on recent advances in end-to-end (E2E) speech systems and moving one step toward a more sophisticated audio agent, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, both in generating emotional responses from linguistic emotional instructions and in responding to paralinguistic emotional cues. LUCY also generates responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that fall outside its knowledge scope.