HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

📅 2026-05-16

🤖 AI Summary
Existing methods for generating talking-face videos often fail to achieve high visual fidelity and precise lip-sync accuracy at the same time, and their outputs frequently show visual degradation or temporal inconsistencies. This work proposes HighSync, an end-to-end diffusion framework that directly synthesizes 512×512 videos with both high perceptual quality and accurate audio alignment, and which the authors describe as the first lip-sync model to operate natively at this resolution. Built on a latent diffusion architecture, the approach integrates audio-driven temporal modeling with high-resolution image generation. The authors also identify and systematically eliminate a data leakage issue that biased temporal modeling in prior work and kept models from developing a genuine dependence on the audio signal. According to the reported experiments, the method achieves state-of-the-art performance in both perceptual quality and lip-sync accuracy, making it suitable for professional production scenarios such as film and broadcasting.
📝 Abstract
We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip-sync model to operate natively at 512×512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync
Problem

Research questions and friction points this paper is trying to address.

lip synchronization
talking-face generation
temporal consistency
audio-visual alignment
photorealistic video
Innovation

Methods, ideas, or system contributions that make the work stand out.

lip synchronization
latent diffusion models
high-resolution video generation
temporal modeling
audio-visual alignment