🤖 AI Summary
Existing 3D talking-head generation methods are trained on data with limited diversity in facial expression, which hinders fine-grained emotional control and accurate lip synchronization. To address this, we introduce the continuous valence-arousal (VA) emotion space into a 3D Gaussian Splatting (3DGS)-based talking-head framework for the first time, proposing an emotion-conditioned generation network. We further design a self-supervised lip-sync module that adapts to in-the-wild audio without manual annotations by combining a text-to-speech (TTS) network with a visual-audio synchronization network. Experiments demonstrate state-of-the-art performance across multiple metrics, including PSNR, SSIM, LPIPS, valence RMSE (V-RMSE), arousal RMSE (A-RMSE), Emotion Accuracy, and landmark distance (LMD), achieving high-fidelity, real-time, and emotionally controllable 3D talking-head synthesis.
📝 Abstract
3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images at real-time inference speed. However, because such a model is typically trained on only a short video that lacks diversity in facial emotions, the resulting talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. The model can manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while keeping lip movements synchronized with the input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C).
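
To make the emotion metrics concrete, here is a minimal sketch of how V-RMSE and A-RMSE could be computed. It assumes each frame yields a (valence, arousal) pair, with values in [-1, 1], estimated from the rendered face and compared against the target conditioning values. The function name `va_rmse` and the input layout are hypothetical, and the abstract does not describe the paper's actual evaluation protocol.

```python
import math
from typing import Sequence, Tuple

# (valence, arousal) pair; the [-1, 1] range is an assumption of this sketch.
VA = Tuple[float, float]


def va_rmse(predicted: Sequence[VA], target: Sequence[VA]) -> Tuple[float, float]:
    """Return (V-RMSE, A-RMSE) over paired per-frame emotion values.

    `predicted` holds values estimated from rendered frames and `target`
    holds the values the frames were conditioned on (hypothetical layout).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(predicted)
    # Accumulate squared errors separately for the two emotion axes.
    v_sq = sum((p[0] - t[0]) ** 2 for p, t in zip(predicted, target))
    a_sq = sum((p[1] - t[1]) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(v_sq / n), math.sqrt(a_sq / n)


if __name__ == "__main__":
    pred = [(0.5, 0.0), (0.5, 0.0)]
    tgt = [(0.0, 0.0), (0.0, 0.4)]
    v_err, a_err = va_rmse(pred, tgt)
    print(f"V-RMSE={v_err:.4f}  A-RMSE={a_err:.4f}")
```

Reporting the two axes separately, as the abstract's metric names suggest, shows whether a model misses on pleasantness (valence) or on intensity (arousal) rather than folding both into one number.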