UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lip-syncing methods struggle to simultaneously maintain global coherence and preserve fine-grained local details in complex real-world scenarios, such as stylized avatars, facial occlusions, or extreme lighting conditions. This work proposes UniSync, a unified framework that combines mask-free pose-anchored training with a mask-based fusion inference strategy to achieve high-fidelity lip synchronization while preserving natural head motion. Through fine-tuning on a small yet diverse set of videos, the model demonstrates strong generalization, effectively handling challenging real-world edge cases. To enable comprehensive evaluation, the authors introduce RealWorld-LipSync, a new benchmark covering both real human faces and stylized avatars. Experimental results show that UniSync significantly outperforms current state-of-the-art methods on this benchmark.

📝 Abstract
Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to preserve head motion and eliminate synthesis color artifacts, while employing mask-based, blending-consistent inference to ensure structural precision and smooth compositing. Notably, fine-tuning on a compact but diverse set of videos gives our model exceptional domain adaptability, handling complex corner cases effectively. We also introduce the RealWorld-LipSync benchmark to evaluate models under real-world demands, covering diverse application scenarios including both human faces and stylized avatars. Extensive experiments demonstrate that UniSync significantly outperforms state-of-the-art methods, advancing the field towards truly generalizable and production-ready lip synchronization.
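For intuition, the mask-based compositing described above amounts to alpha-blending a generated lower-face region into the original frame under a soft mask. The sketch below illustrates that generic operation only, not the paper's actual pipeline; all array names and the toy frame are illustrative assumptions:

```python
import numpy as np

def blend_mouth_region(original, generated, mask):
    """Alpha-composite a generated lower-face region onto the original frame.

    original, generated: float32 arrays in [0, 1], shape (H, W, 3)
    mask: float32 array in [0, 1], shape (H, W); 1 inside the mouth region,
          with soft (feathered) edges for a seamless transition
    """
    alpha = mask[..., None]  # add a channel axis to broadcast over RGB
    return alpha * generated + (1.0 - alpha) * original

# Usage: a 4x4 toy frame whose bottom half is replaced by generated pixels
frame = np.zeros((4, 4, 3), dtype=np.float32)   # original (all black)
synth = np.ones((4, 4, 3), dtype=np.float32)    # generated (all white)
mask = np.zeros((4, 4), dtype=np.float32)
mask[2:, :] = 1.0                               # blend only the lower half
out = blend_mouth_region(frame, synth, mask)
```

A soft (feathered) mask rather than a hard binary one is what avoids visible seams at the blend boundary, which is the structural-precision and smooth-blending concern the abstract raises.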
Problem

Research questions and friction points this paper is trying to address.

lip synchronization
generalization
high-fidelity
challenging scenarios
real-world conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

lip synchronization
mask-free training
pose-anchored strategy
domain adaptability
real-world benchmark
Ruidi Fan
Mango TV
Yang Zhou
Mango TV
Siyuan Wang
Mango TV
Tian Yu
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
Yutong Jiang
Mango TV
Xusheng Liu
Mango TV