🤖 AI Summary
To address low lip-sync accuracy, weak identity preservation, and high computational cost in audio-driven talking-head generation, this paper proposes ATL-Diff, a landmarks-guided noise diffusion framework. Methodologically: (1) a lightweight landmark generator extracts audio-synchronized 2D/3D facial keypoints in real time; (2) a landmarks-guided noise scheduling strategy explicitly decouples audio semantics from stochastic noise, improving the robustness of temporal alignment; (3) a 3D identity-aware diffusion network jointly models identity invariance and dynamic expression details in the latent space. Trained end-to-end on MEAD and CREMA-D, the method achieves state-of-the-art performance on SyncNet score (+2.1), FID (−14.3%), and identity similarity (+8.7%). It runs at 22 FPS on an NVIDIA RTX 4090, enabling near-real-time inference while significantly improving synchronization fidelity, identity preservation, and fine-grained expression reconstruction.
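To make the landmarks-guided noise idea concrete, here is a minimal PyTorch sketch that shapes the initial diffusion noise with a landmark-derived spatial weight map. The Gaussian-bump mask, the `base_scale` parameter, and the image size are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def landmarks_guided_noise(landmarks, img_size=64, sigma=2.0, base_scale=0.5):
    """Sketch: modulate Gaussian diffusion noise with a landmark-derived map.

    landmarks: (K, 2) tensor of (x, y) pixel coordinates in [0, img_size).
    Returns noise of shape (1, img_size, img_size) whose magnitude is
    larger near landmark locations (mouth, eyes, ...) than elsewhere.
    The Gaussian-bump weighting is an assumption for illustration only.
    """
    ys = torch.arange(img_size).view(-1, 1).float()
    xs = torch.arange(img_size).view(1, -1).float()
    weight = torch.zeros(img_size, img_size)
    for x, y in landmarks:
        # Accumulate a Gaussian bump centered on each landmark.
        d2 = (ys - y) ** 2 + (xs - x) ** 2
        weight = torch.maximum(weight, torch.exp(-d2 / (2 * sigma ** 2)))
    # Scale noise from base_scale (far from landmarks) up to 1.0 (at landmarks).
    scale = base_scale + (1.0 - base_scale) * weight
    return torch.randn(1, img_size, img_size) * scale

# Example: emphasize noise around a stylized mouth region.
mouth = torch.tensor([[30.0, 44.0], [34.0, 46.0], [26.0, 46.0]])
noise = landmarks_guided_noise(mouth)
print(noise.shape)  # torch.Size([1, 64, 64])
```

Concentrating noise around the landmark regions is one plausible way to let the audio-derived geometry, rather than uniform randomness, drive where the diffusion model spends its capacity.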
📝 Abstract
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach that addresses synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module that converts audio into facial landmarks, a Landmarks-Guide Noise approach that decouples the audio signal by distributing noise according to the landmarks, and a 3D Identity Diffusion network that preserves identity characteristics. Experiments on the MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at https://github.com/sonvth/ATL-Diff
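The abstract describes a three-stage inference flow (audio to landmarks, landmarks to shaped noise, identity-conditioned denoising). The sketch below wires these stages together; all callables and tensor shapes are stand-ins for illustration and do not reflect the released ATL-Diff API.

```python
import torch

@torch.no_grad()
def atl_diff_inference(audio_feats, identity_frame,
                       landmark_gen, guided_noise_fn, denoiser, steps=50):
    """Hypothetical sketch of the three-stage flow from the abstract.

    landmark_gen: audio features -> (T, K, 2) per-frame landmark sequence
    guided_noise_fn: (K, 2) landmarks -> (C, H, W) shaped noise
                     (e.g., the landmarks_guided_noise sketch above)
    denoiser: (frames, timestep, identity_frame) -> less-noisy frames
    """
    # 1) Landmark Generation Module: audio -> per-frame facial landmarks.
    landmarks = landmark_gen(audio_feats)                       # (T, K, 2)
    # 2) Landmarks-Guide Noise: initial noise shaped by the landmarks.
    x = torch.stack([guided_noise_fn(lm) for lm in landmarks])  # (T, C, H, W)
    # 3) 3D Identity Diffusion: iterative denoising conditioned on identity.
    for t in reversed(range(steps)):
        x = denoiser(x, t, identity_frame)
    return x  # (T, C, H, W) generated frames
```

Passing the three stages in as callables keeps the skeleton runnable with any stand-in modules while making the data flow between the paper's components explicit.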