🤖 AI Summary
This work addresses end-to-end video dubbing: generating high-fidelity, temporally synchronized, and emotionally consistent speech conditioned jointly on text and facial visual cues. Methodologically, it integrates visual adapters and audio-visual fusion layers into a neural codec language model (NCLM), so that the generated speech stays lip-synchronized while keeping natural prosody. Additionally, the authors introduce CelebV-Dub, the first large-scale dataset tailored for realistic expressive dubbing. Experiments demonstrate a 37% reduction in lip-sync error, with speech naturalness and intelligibility reaching human-level performance. Subjective evaluations show statistically significant improvements over state-of-the-art methods. The framework is also adapted to the video-to-speech task, demonstrating its versatility across applications.
📝 Abstract
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
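The abstract names two components, adapters that map facial features into the NCLM token space and fusion layers that merge audio and visual information, without giving their internals. The following is a minimal sketch of how such a pipeline could be wired, not the authors' implementation. Everything specific in it is an assumption made for illustration: the nearest-neighbour resampling of video frames to the audio token rate, the concatenate-then-project fusion, the random weights, and the dimensions `FACE_DIM` and `MODEL_DIM`.

```python
import random

def init_linear(in_dim, out_dim, rng):
    """Create a (weight, bias) pair for a linear layer with small random weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def upsample_to_audio_rate(video_feats, num_audio_steps):
    """Nearest-neighbour resampling so every audio step has a matching video frame.

    Assumption: the video frame rate differs from the codec token rate, so the
    two sequences have to be aligned in time before they can be fused.
    """
    n = len(video_feats)
    return [video_feats[min(n - 1, t * n // num_audio_steps)]
            for t in range(num_audio_steps)]

def fuse(audio_embs, video_feats, adapter, fusion):
    """Hypothetical adapter + fusion step.

    adapter: projects each face feature vector to the model dimension.
    fusion:  projects the concatenated [audio; visual] vector back to the
             model dimension, giving one fused vector per audio step.
    """
    aligned = upsample_to_audio_rate(video_feats, len(audio_embs))
    fused = []
    for audio_vec, face_vec in zip(audio_embs, aligned):
        visual_vec = linear(face_vec, *adapter)
        fused.append(linear(audio_vec + visual_vec, *fusion))  # list concatenation
    return fused

# Toy dimensions and random inputs, chosen only to exercise the code.
rng = random.Random(0)
FACE_DIM, MODEL_DIM = 4, 6
adapter = init_linear(FACE_DIM, MODEL_DIM, rng)
fusion = init_linear(2 * MODEL_DIM, MODEL_DIM, rng)

video_feats = [[rng.random() for _ in range(FACE_DIM)] for _ in range(5)]    # 5 video frames
audio_embs = [[rng.random() for _ in range(MODEL_DIM)] for _ in range(10)]   # 10 audio steps

fused = fuse(audio_embs, video_feats, adapter, fusion)
print(len(fused), len(fused[0]))  # prints: 10 6
```

In this sketch the fused sequence has the same length and width as the audio embedding sequence, so it could be passed to a language model in place of the audio-only embeddings. A real system would learn the adapter and fusion weights and would likely use a more expressive fusion mechanism.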