🤖 AI Summary
To address low lip-sync accuracy, weak identity preservation, and high computational cost in audio-driven talking-head generation, this paper proposes ATL-Diff, a landmarks-guided noise diffusion framework. Methodologically: (1) a lightweight landmark generator extracts audio-synchronized 2D/3D facial keypoints in real time; (2) a landmarks-guided noise scheduling strategy explicitly decouples audio semantics from stochastic noise, improving the robustness of temporal alignment; (3) a 3D identity-aware diffusion network jointly models identity invariance and dynamic expression details in the latent space. Trained end-to-end on MEAD and CREMA-D, the method achieves state-of-the-art performance on SyncNet score (+2.1), FID (−14.3%), and identity similarity (+8.7%). It runs at 22 FPS on an NVIDIA RTX 4090, enabling near-real-time inference while significantly improving synchronization fidelity, identity preservation, and fine-grained expression reconstruction.
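To make the landmarks-guided noise idea concrete, here is a minimal PyTorch sketch that shapes the initial diffusion noise with a landmark-derived spatial weight map. The Gaussian-bump mask, the `base_scale` parameter, and the image size are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def landmarks_guided_noise(landmarks, img_size=64, sigma=2.0, base_scale=0.5):
    """Sketch: modulate Gaussian diffusion noise with a landmark-derived map.

    landmarks: (K, 2) tensor of (x, y) pixel coordinates in [0, img_size).
    Returns noise of shape (1, img_size, img_size) whose magnitude is
    larger near landmark locations (mouth, eyes, ...) than elsewhere.
    The Gaussian-bump weighting is an assumption for illustration only.
    """
    ys = torch.arange(img_size).view(-1, 1).float()
    xs = torch.arange(img_size).view(1, -1).float()
    weight = torch.zeros(img_size, img_size)
    for x, y in landmarks:
        # Accumulate a Gaussian bump centered on each landmark.
        d2 = (ys - y) ** 2 + (xs - x) ** 2
        weight = torch.maximum(weight, torch.exp(-d2 / (2 * sigma ** 2)))
    # Scale noise from base_scale (far from landmarks) up to 1.0 (at landmarks).
    scale = base_scale + (1.0 - base_scale) * weight
    return torch.randn(1, img_size, img_size) * scale

# Example: emphasize noise around a stylized mouth region.
mouth = torch.tensor([[30.0, 44.0], [34.0, 46.0], [26.0, 46.0]])
noise = landmarks_guided_noise(mouth)
print(noise.shape)  # torch.Size([1, 64, 64])
```

Concentrating noise around the landmark regions is one plausible way to let the audio-derived geometry, rather than uniform randomness, drive where the diffusion model spends its capacity.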
📝 Abstract
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach that addresses synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module that converts audio into facial landmarks, a Landmarks-Guide Noise approach that decouples the audio signal by distributing noise according to the landmarks, and a 3D Identity Diffusion network that preserves identity characteristics. Experiments on the MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at https://github.com/sonvth/ATL-Diff
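The abstract describes a three-stage inference flow (audio to landmarks, landmarks to shaped noise, identity-conditioned denoising). The sketch below wires these stages together; all callables and tensor shapes are stand-ins for illustration and do not reflect the released ATL-Diff API.

```python
import torch

@torch.no_grad()
def atl_diff_inference(audio_feats, identity_frame,
                       landmark_gen, guided_noise_fn, denoiser, steps=50):
    """Hypothetical sketch of the three-stage flow from the abstract.

    landmark_gen: audio features -> (T, K, 2) per-frame landmark sequence
    guided_noise_fn: (K, 2) landmarks -> (C, H, W) shaped noise
                     (e.g., the landmarks_guided_noise sketch above)
    denoiser: (frames, timestep, identity_frame) -> less-noisy frames
    """
    # 1) Landmark Generation Module: audio -> per-frame facial landmarks.
    landmarks = landmark_gen(audio_feats)                       # (T, K, 2)
    # 2) Landmarks-Guide Noise: initial noise shaped by the landmarks.
    x = torch.stack([guided_noise_fn(lm) for lm in landmarks])  # (T, C, H, W)
    # 3) 3D Identity Diffusion: iterative denoising conditioned on identity.
    for t in reversed(range(steps)):
        x = denoiser(x, t, identity_frame)
    return x  # (T, C, H, W) generated frames
```

Passing the three stages in as callables keeps the skeleton runnable with any stand-in modules while making the data flow between the paper's components explicit.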