UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lip-syncing methods struggle to simultaneously maintain global coherence and preserve fine-grained local details in complex real-world scenarios, such as stylized avatars, facial occlusions, or extreme lighting conditions. This work proposes UniSync, a unified framework that combines mask-free pose-anchored training with a mask-based fusion inference strategy to achieve high-fidelity lip synchronization while preserving natural head motion. Through fine-tuning on a small yet diverse set of videos, the model demonstrates strong generalization, effectively handling challenging real-world edge cases. To enable comprehensive evaluation, the authors introduce RealWorld-LipSync, a new benchmark covering both real human faces and stylized avatars. Experimental results show that UniSync significantly outperforms current state-of-the-art methods on this benchmark.

📝 Abstract
Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to preserve head motion and eliminate synthesis color artifacts, while employing mask-based, blending-consistent inference to ensure structural precision and smooth compositing. Notably, fine-tuning on a compact but diverse set of videos gives our model exceptional domain adaptability, handling complex corner cases effectively. We also introduce the RealWorld-LipSync benchmark to evaluate models under real-world demands, covering diverse application scenarios including both human faces and stylized avatars. Extensive experiments demonstrate that UniSync significantly outperforms state-of-the-art methods, advancing the field towards truly generalizable and production-ready lip synchronization.
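For intuition, the mask-based compositing described above amounts to alpha-blending a generated lower-face region into the original frame under a soft mask. The sketch below illustrates that generic operation only, not the paper's actual pipeline; all array names and the toy frame are illustrative assumptions:

```python
import numpy as np

def blend_mouth_region(original, generated, mask):
    """Alpha-composite a generated lower-face region onto the original frame.

    original, generated: float32 arrays in [0, 1], shape (H, W, 3)
    mask: float32 array in [0, 1], shape (H, W); 1 inside the mouth region,
          with soft (feathered) edges for a seamless transition
    """
    alpha = mask[..., None]  # add a channel axis to broadcast over RGB
    return alpha * generated + (1.0 - alpha) * original

# Usage: a 4x4 toy frame whose bottom half is replaced by generated pixels
frame = np.zeros((4, 4, 3), dtype=np.float32)   # original (all black)
synth = np.ones((4, 4, 3), dtype=np.float32)    # generated (all white)
mask = np.zeros((4, 4), dtype=np.float32)
mask[2:, :] = 1.0                               # blend only the lower half
out = blend_mouth_region(frame, synth, mask)
```

A soft (feathered) mask rather than a hard binary one is what avoids visible seams at the blend boundary, which is the structural-precision and smooth-blending concern the abstract raises.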
Problem

Research questions and friction points this paper is trying to address.

lip synchronization
generalization
high-fidelity
challenging scenarios
real-world conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

lip synchronization
mask-free training
pose-anchored strategy
domain adaptability
real-world benchmark
Ruidi Fan
Mango TV
Yang Zhou
Mango TV
Siyuan Wang
Mango TV
Tian Yu
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
Yutong Jiang
Mango TV
Xusheng Liu
Mango TV