🤖 AI Summary
Traditional lip-sync video dubbing is limited to mouth-region editing, causing desynchronization between facial expressions and body motions and degrading immersion. This paper proposes a novel sparse-frame video dubbing paradigm that synthesizes audio-driven full-body motion while retaining only a few key reference frames. The contributions include: (1) formally defining the sparse-frame dubbing task for the first time; (2) designing an adaptive conditional control mechanism that jointly models temporal context and performs fine-grained reference frame localization to ensure long-term consistency in identity, signature gestures, and camera motion; and (3) developing a streaming audio-driven architecture with optimized sampling strategies to enhance controllability and long-sequence stability of image-to-video diffusion models. Extensive experiments on HDTF, CelebV-HQ, and EMTD demonstrate state-of-the-art performance, with significant improvements in visual realism, emotional alignment, and full-body synchronization.
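To make the task formulation concrete, here is a minimal sketch of how sparse-frame conditioning could be packaged: a handful of reference keyframes retained from the source video plus the driving audio, with everything else left for the model to synthesize. The names (`DubbingCondition`, `select_reference_frames`) and the uniform keyframe selection are illustrative assumptions, not the paper's API.

```python
# Sketch of sparse-frame dubbing inputs (assumed structure, not the paper's code).
from dataclasses import dataclass
import numpy as np

@dataclass
class DubbingCondition:
    ref_frames: np.ndarray   # (K, H, W, 3) sparse reference keyframes
    ref_indices: np.ndarray  # (K,) positions of the keyframes on the target timeline
    audio: np.ndarray        # driving waveform at a fixed sample rate

def select_reference_frames(video: np.ndarray, num_refs: int):
    """Keep only a few evenly spaced keyframes from the source video.

    The paper targets identity, signature gestures, and camera motion when
    choosing keyframes; uniform spacing here is just a stand-in.
    """
    total = video.shape[0]
    indices = np.linspace(0, total - 1, num_refs).round().astype(int)
    return video[indices], indices

# Example: a 300-frame source clip reduced to 4 reference keyframes.
source_video = np.zeros((300, 512, 512, 3), dtype=np.uint8)
driving_audio = np.zeros(16000 * 12, dtype=np.float32)  # ~12 s at 16 kHz

refs, idx = select_reference_frames(source_video, num_refs=4)
condition = DubbingCondition(ref_frames=refs, ref_indices=idx, audio=driving_audio)
print(condition.ref_frames.shape, condition.ref_indices)
```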
📝 Abstract
Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth-region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail at this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length, long-sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
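The streaming design described above can be pictured as chunk-wise generation: each new chunk is conditioned on the trailing frames of the previous chunk (temporal context) and on the reference keyframe nearest to its position in the timeline. The sketch below assumes a diffusion sampler hidden behind a placeholder `generate_chunk` callable; the function names, chunk sizes, and nearest-keyframe heuristic are assumptions for illustration, not InfiniteTalk's actual interface.

```python
# Minimal sketch of streaming chunk-wise dubbing with temporal context frames
# (assumed interface, not the released InfiniteTalk code).
import numpy as np

CHUNK_FRAMES = 16    # frames generated per step
CONTEXT_FRAMES = 4   # trailing frames carried into the next chunk

def generate_chunk(context, ref_frame, audio_segment):
    """Placeholder for the diffusion sampler: returns CHUNK_FRAMES new frames."""
    h, w, c = ref_frame.shape
    return np.zeros((CHUNK_FRAMES, h, w, c), dtype=np.uint8)

def stream_dub(ref_frames, ref_indices, audio_per_frame, total_frames):
    """Generate an arbitrarily long video chunk by chunk.

    Each chunk sees (a) the last CONTEXT_FRAMES of the previous chunk for
    seamless transitions and (b) the reference keyframe closest to the chunk's
    position, standing in for fine-grained reference frame positioning.
    """
    frames = [np.repeat(ref_frames[:1], CONTEXT_FRAMES, axis=0)]  # bootstrap context
    generated = 0
    while generated < total_frames:
        context = frames[-1][-CONTEXT_FRAMES:]
        nearest = int(np.argmin(np.abs(ref_indices - generated)))
        audio_seg = audio_per_frame[generated:generated + CHUNK_FRAMES]
        frames.append(generate_chunk(context, ref_frames[nearest], audio_seg))
        generated += CHUNK_FRAMES
    return np.concatenate(frames[1:], axis=0)[:total_frames]

# Example: dub 120 frames from 4 reference keyframes and per-frame audio features.
refs = np.zeros((4, 512, 512, 3), dtype=np.uint8)
idx = np.array([0, 40, 80, 119])
audio_feats = np.zeros((120, 768), dtype=np.float32)  # assumed per-frame embeddings
video = stream_dub(refs, idx, audio_feats, total_frames=120)
print(video.shape)  # (120, 512, 512, 3)
```

The loop makes the two claims in the abstract tangible: sequence length is unbounded because only a fixed-size context window is carried forward, and control strength depends on where the retained keyframes fall relative to each chunk.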