ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low lip-sync accuracy, weak identity preservation, and high computational cost in audio-driven talking-head generation, this paper proposes a landmarks-guided noise diffusion framework. Methodologically: (1) a lightweight landmarks generator extracts audio-synchronized 2D/3D facial keypoints in real time; (2) a landmarks-guided noise scheduling strategy explicitly decouples audio semantics from stochastic noise, enhancing temporal alignment robustness; (3) a 3D identity-aware diffusion network jointly models identity invariance and dynamic expression details in the latent space. Trained end-to-end on MEAD and CREMA-D, our method achieves state-of-the-art performance across SyncNet score (+2.1), FID (−14.3%), and identity similarity (+8.7%). It runs at 22 FPS on an NVIDIA RTX 4090, enabling near-real-time inference while significantly improving synchronization fidelity, identity preservation, and fine-grained expression reconstruction.

Technology Category

Application Category

📝 Abstract
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: href{https://github.com/sonvth/ATL-Diff}{https://github.com/sonvth/ATL-Diff}
Problem

Research questions and friction points this paper is trying to address.

Improving audio-facial synchronization in talking head generation
Reducing noise and computational costs in animation
Preserving identity characteristics during facial animation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early Landmarks-Guide Noise Diffusion for synchronization
3D Identity Diffusion network preserves identity
Landmark Generation Module converts audio to landmarks
🔎 Similar Papers
No similar papers found.
Hoang-Son Vo
Hoang-Son Vo
AI Convergence, Chonnam National University
Computer VisionMedical Image ProcessingImage Generation3D Image
Q
Quang-Vinh Nguyen
Chonnam National University , Gwangju, Republic of Korea
S
Seungwon Kim
Chonnam National University , Gwangju, Republic of Korea
H
Hyung-Jeong Yang
Chonnam National University , Gwangju, Republic of Korea
Soonja Yeom
Soonja Yeom
IEEE Senior Member, School of Information & Communication Technology, University of Tasmania
Educational TechnologyLearning AnalyticsAffective ComputingHapticsAuthentication
S
Soo-Hyung Kim
Chonnam National University , Gwangju, Republic of Korea