🤖 AI Summary
Traditional lip-sync video dubbing is limited to mouth-region editing, causing desynchronization between facial expressions and body motions and degrading immersion. This paper proposes a novel sparse-frame video dubbing paradigm that synthesizes audio-driven full-body motion while retaining only a few key reference frames. The contributions include: (1) formally defining the sparse-frame dubbing task for the first time; (2) designing an adaptive conditional control mechanism that jointly models temporal context and performs fine-grained reference frame localization to ensure long-term consistency in identity, signature gestures, and camera motion; and (3) developing a streaming audio-driven architecture with optimized sampling strategies to enhance controllability and long-sequence stability of image-to-video diffusion models. Extensive experiments on HDTF, CelebV-HQ, and EMTD demonstrate state-of-the-art performance, with significant improvements in visual realism, emotional alignment, and full-body synchronization.
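To make the task formulation concrete, here is a minimal sketch of how sparse-frame conditioning could be packaged: a handful of reference keyframes retained from the source video plus the driving audio, with everything else left for the model to synthesize. The names (`DubbingCondition`, `select_reference_frames`) and the uniform keyframe selection are illustrative assumptions, not the paper's API.

```python
# Sketch of sparse-frame dubbing inputs (assumed structure, not the paper's code).
from dataclasses import dataclass
import numpy as np

@dataclass
class DubbingCondition:
    ref_frames: np.ndarray   # (K, H, W, 3) sparse reference keyframes
    ref_indices: np.ndarray  # (K,) positions of the keyframes on the target timeline
    audio: np.ndarray        # driving waveform at a fixed sample rate

def select_reference_frames(video: np.ndarray, num_refs: int):
    """Keep only a few evenly spaced keyframes from the source video.

    The paper targets identity, signature gestures, and camera motion when
    choosing keyframes; uniform spacing here is just a stand-in.
    """
    total = video.shape[0]
    indices = np.linspace(0, total - 1, num_refs).round().astype(int)
    return video[indices], indices

# Example: a 300-frame source clip reduced to 4 reference keyframes.
source_video = np.zeros((300, 512, 512, 3), dtype=np.uint8)
driving_audio = np.zeros(16000 * 12, dtype=np.float32)  # ~12 s at 16 kHz

refs, idx = select_reference_frames(source_video, num_refs=4)
condition = DubbingCondition(ref_frames=refs, ref_indices=idx, audio=driving_audio)
print(condition.ref_frames.shape, condition.ref_indices)
```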
📝 Abstract
Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth-region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail at this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length, long-sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
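The streaming design described above can be pictured as chunk-wise generation: each new chunk is conditioned on the trailing frames of the previous chunk (temporal context) and on the reference keyframe nearest to its position in the timeline. The sketch below assumes a diffusion sampler hidden behind a placeholder `generate_chunk` callable; the function names, chunk sizes, and nearest-keyframe heuristic are assumptions for illustration, not InfiniteTalk's actual interface.

```python
# Minimal sketch of streaming chunk-wise dubbing with temporal context frames
# (assumed interface, not the released InfiniteTalk code).
import numpy as np

CHUNK_FRAMES = 16    # frames generated per step
CONTEXT_FRAMES = 4   # trailing frames carried into the next chunk

def generate_chunk(context, ref_frame, audio_segment):
    """Placeholder for the diffusion sampler: returns CHUNK_FRAMES new frames."""
    h, w, c = ref_frame.shape
    return np.zeros((CHUNK_FRAMES, h, w, c), dtype=np.uint8)

def stream_dub(ref_frames, ref_indices, audio_per_frame, total_frames):
    """Generate an arbitrarily long video chunk by chunk.

    Each chunk sees (a) the last CONTEXT_FRAMES of the previous chunk for
    seamless transitions and (b) the reference keyframe closest to the chunk's
    position, standing in for fine-grained reference frame positioning.
    """
    frames = [np.repeat(ref_frames[:1], CONTEXT_FRAMES, axis=0)]  # bootstrap context
    generated = 0
    while generated < total_frames:
        context = frames[-1][-CONTEXT_FRAMES:]
        nearest = int(np.argmin(np.abs(ref_indices - generated)))
        audio_seg = audio_per_frame[generated:generated + CHUNK_FRAMES]
        frames.append(generate_chunk(context, ref_frames[nearest], audio_seg))
        generated += CHUNK_FRAMES
    return np.concatenate(frames[1:], axis=0)[:total_frames]

# Example: dub 120 frames from 4 reference keyframes and per-frame audio features.
refs = np.zeros((4, 512, 512, 3), dtype=np.uint8)
idx = np.array([0, 40, 80, 119])
audio_feats = np.zeros((120, 768), dtype=np.float32)  # assumed per-frame embeddings
video = stream_dub(refs, idx, audio_feats, total_frames=120)
print(video.shape)  # (120, 512, 512, 3)
```

The loop makes the two claims in the abstract tangible: sequence length is unbounded because only a fixed-size context window is carried forward, and control strength depends on where the retained keyframes fall relative to each chunk.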