Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

📅 2025-08-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the limitations of single-modality emotion modeling and the absence of visual-audio synergy in expressive speech generation. We propose an end-to-end audiovisual language model that deeply integrates full-face visual cues—including facial actions and micro-expressions—with a pre-trained audio language model. A lightweight visual encoder and a learnable cross-modal attention fusion mechanism are introduced to jointly align and co-optimize speech–vision emotion representations. Unlike conventional unimodal approaches, this work is the first to systematically investigate the efficacy of full-face visual signals for expressive speech generation, empirically validating their critical role in enhancing emotion perception and generation consistency. Experiments demonstrate a 5.0% improvement in F1 score over audio-only baselines on multi-emotion speech generation and cross-modal emotion recognition tasks, significantly boosting the naturalness and affective understanding capability of dialogue systems.

Technology Category

Application Category

📝 Abstract
We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
Problem

Research questions and friction points this paper is trying to address.

Integrating visual cues for expressive speech generation
Exploring effective multimodal fusion strategies
Improving emotion recognition and expressive dialogue performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates full-face visual cues into speech model
Explores visual encoders and multimodal fusion strategies
Fine-tunes on emotion recognition and dialogue tasks
🔎 Similar Papers
No similar papers found.