VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

📅 2025-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of end-to-end video dubbing—generating high-fidelity, temporally synchronized, and emotionally consistent speech conditioned jointly on text and facial visual cues. Methodologically, it pioneers the integration of a visual adapter and an audio-visual cross-modal fusion layer into a neural codec language model (NCLM), enabling precise lip-sync alignment and natural prosody generation. Additionally, the authors introduce CelebV-Dub, the first large-scale dataset tailored for realistic expressive dubbing. Experiments demonstrate a 37% reduction in lip-sync error, with speech naturalness and intelligibility reaching human-level performance. Subjective evaluations show statistically significant improvements over state-of-the-art methods. The framework further supports cross-video generalization, enabling robust dubbing across unseen video content.

Technology Category

Application Category

📝 Abstract
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
Problem

Research questions and friction points this paper is trying to address.

Automated video dubbing with text and facial cues
Synchronizing speech with facial movements naturally
Creating high-quality speech for diverse applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Codec Language Models for speech synthesis
Adapters align facial features with NCLM tokens
Audio-visual fusion layers merge facial and speech data
🔎 Similar Papers
No similar papers found.