🤖 AI Summary
Existing 3D lip-sync methods require speaker-specific model training, resulting in high computational cost and poor generalization. This paper proposes GenSync—the first audio-driven talking-head synthesis framework for multiple identities based on 3D Gaussian Splatting. Its core innovation lies in a unified generative network coupled with an identity-audio disentanglement module, enabling zero-shot, cross-identity lip synchronization with a single model—no fine-tuning or retraining required. GenSync achieves state-of-the-art lip-sync accuracy (reduced LSE) and visual quality (lower FID and LPIPS), while improving training efficiency by 6.8×. To our knowledge, this is the first work to introduce 3D Gaussian Splatting into multi-identity lip-sync synthesis, significantly enhancing generation efficiency and scalability.
📝 Abstract
We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity, GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8× faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.
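The disentanglement idea above can be illustrated with a minimal sketch: identity-specific information lives in a small per-speaker embedding table, while a shared set of weights maps audio features into the generator's latent space. All names, dimensions, and the fusion rule here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch of GenSync-style identity-audio disentanglement.
# Dimensions and the additive fusion are assumptions for illustration only.
rng = np.random.default_rng(0)

NUM_IDS, ID_DIM, AUDIO_DIM, LATENT_DIM = 4, 16, 32, 24

# One shared network, many identities: speaker identity is captured by a
# small per-speaker embedding, not by a separate per-speaker model.
id_table = rng.normal(size=(NUM_IDS, ID_DIM))

# Shared projection weights (stand-ins for the unified generator).
W_audio = rng.normal(size=(AUDIO_DIM, LATENT_DIM))
W_id = rng.normal(size=(ID_DIM, LATENT_DIM))

def synthesize_latent(speaker_id: int, audio_feat: np.ndarray) -> np.ndarray:
    """Fuse a disentangled identity code with audio features in one shared model."""
    id_code = id_table[speaker_id]               # identity-specific branch
    audio_code = audio_feat @ W_audio            # identity-agnostic audio branch
    return np.tanh(audio_code + id_code @ W_id)  # input to a shared decoder

audio = rng.normal(size=AUDIO_DIM)
# Same audio drives two identities: the audio pathway is shared,
# so zero-shot cross-identity sync only swaps the identity code.
lat_a = synthesize_latent(0, audio)
lat_b = synthesize_latent(1, audio)
print(lat_a.shape)                   # (24,)
print(np.allclose(lat_a, lat_b))     # False
```

Because the audio pathway is shared across speakers, adding or switching an identity only changes the embedding lookup, which is what allows a single model to serve multiple identities without retraining.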