🤖 AI Summary
Audio-driven video generation suffers from scarce open-source data, inconsistent benchmarks, and high computational costs. To address these challenges, we introduce TalkVerse—the first large-scale, open-source audiovisual corpus comprising 2.3 million high-resolution video segments—and propose a 5B-parameter DiT-based architecture trained on it. Our method incorporates a video VAE with aggressive downsampling, sliding-window generation with motion-frame contextual modeling, 2D skeleton guidance, and MLLM-driven prompt rewriting and style control. It enables minute-long, low-drift, lip-sync-accurate monologue video generation with zero-shot voice reenactment and coherent long-video narration. Quantitatively, our approach matches the quality of a 14B-parameter baseline while reducing inference cost by 10×. All data, training recipes, and model checkpoints are publicly released.
📝 Abstract
We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10× lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
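The sliding-window mechanism with motion-frame context described above can be sketched as follows. This is a minimal illustrative outline, not the paper's implementation: `sample_window` is a hypothetical stub standing in for the DiT diffusion sampler, and all shapes, names, and hyperparameters (window length, number of carried motion frames, latent dimension) are assumptions for illustration only.

```python
import numpy as np

def generate_long_video(audio_feats, window_len=48, motion_ctx=4, seed=0):
    """Sketch: generate a long clip window by window, carrying the last
    few generated frames forward as motion context to reduce drift.

    audio_feats: array of per-frame audio features, shape (T, d_audio).
    Returns per-frame latents, shape (T, 16). All details are illustrative.
    """
    rng = np.random.default_rng(seed)

    def sample_window(audio_chunk, ctx_frames):
        # Placeholder for the real sampler: a DiT denoising loop conditioned
        # on the audio chunk and the motion-context latents. Here we just
        # emit random 16-dim latents so the control flow is runnable.
        return rng.standard_normal((len(audio_chunk), 16))

    frames = []
    ctx = np.zeros((motion_ctx, 16))  # initial context, e.g. from a reference image
    for start in range(0, len(audio_feats), window_len):
        chunk = audio_feats[start:start + window_len]
        new = sample_window(chunk, ctx)
        frames.append(new)
        ctx = new[-motion_ctx:]  # carry the last frames as motion context
    return np.concatenate(frames, axis=0)
```

The key design point is that each window sees only the previous window's tail frames rather than the whole history, keeping per-window cost constant while preserving motion continuity across window boundaries.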