EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis

📅 2025-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D talking-head generation methods are typically trained on data with limited diversity in facial expressions, which hinders fine-grained emotional control. To address this, we introduce the continuous valence-arousal (VA) emotion space into a 3D Gaussian Splatting (3DGS)-based talking-head framework and propose an emotion-conditioned model, EmoTalkingGaussian, trained with the help of a lip-aligned emotional face generator. We further design a self-supervised lip-sync method that adapts to in-the-wild audio without manual annotations by combining a text-to-speech (TTS) network with a visual-audio synchronization network. Experiments on publicly available videos show better results than prior state-of-the-art methods in image quality (PSNR, SSIM, LPIPS), emotion expression (valence RMSE (V-RMSE), arousal RMSE (A-RMSE), Emotion Accuracy), and lip synchronization (landmark distance, LMD), yielding high-fidelity, real-time, and emotionally controllable 3D talking-head synthesis.

📝 Abstract
3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images at real-time inference speed. However, because such a model is typically trained on only a short video that lacks diversity in facial emotions, the resulting talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. The model can manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while keeping lip movements synchronized with the input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C).
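As a rough illustration of three of the metric families named in the abstract, the sketch below computes PSNR from a mean squared error, RMSE over valence and arousal values, and a landmark distance over lip keypoints. It is not the paper's evaluation code: the function names, the sample values, the [-1, 1] range for valence and arousal, and the reading of V-RMSE, A-RMSE, and LMD as valence RMSE, arousal RMSE, and mean landmark distance are all assumptions made for this example.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of floats."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


def psnr(mse, max_value=1.0):
    """PSNR in dB from a mean squared error, for pixels in [0, max_value]."""
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def landmark_distance(pred_frames, target_frames):
    """Mean Euclidean distance between matching 2D landmarks over all frames."""
    total, count = 0.0, 0
    for pred, target in zip(pred_frames, target_frames):
        for (px, py), (tx, ty) in zip(pred, target):
            total += math.hypot(px - tx, py - ty)
            count += 1
    if count == 0:
        raise ValueError("no landmarks to compare")
    return total / count


# Hypothetical per-frame (valence, arousal) pairs, assumed to lie in [-1, 1].
target_va = [(0.5, 0.2), (0.5, 0.2), (-0.3, 0.6)]
pred_va = [(0.4, 0.2), (0.6, 0.1), (-0.3, 0.8)]

# Valence and arousal are scored separately, one RMSE per axis.
v_rmse = rmse([v for v, _ in pred_va], [v for v, _ in target_va])
a_rmse = rmse([a for _, a in pred_va], [a for _, a in target_va])

print(f"V-RMSE={v_rmse:.4f}  A-RMSE={a_rmse:.4f}")
```

Lower is better for the RMSE and landmark-distance values, while higher is better for PSNR.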
Problem

Research questions and friction points this paper is trying to address.

3D Avatar Generation
Emotional Expression
Lip Synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

EmoTalkingGaussian Model
Audio-Visual Synchronization Network
Emotion-Adaptive Expression Adjustment