HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for generating talking-face videos often struggle to achieve high visual fidelity and precise lip-sync accuracy at the same time, and frequently suffer from visual degradation or temporal inconsistencies. This work proposes an end-to-end diffusion framework that directly synthesizes 512×512 videos with both high perceptual quality and accurate audio alignment; the authors state that, to their knowledge, it is the first lip-sync model to operate natively at this resolution. Built on a latent diffusion architecture, the approach integrates audio-driven temporal modeling with high-resolution image generation. The work also identifies and systematically eliminates a data leakage problem that biased temporal modeling in prior work. Experiments reported by the authors show state-of-the-art performance in both perceptual quality and lip-sync accuracy, which makes the method a candidate for professional production scenarios such as film and broadcasting.

📝 Abstract

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip-sync model to operate natively at 512×512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at https://github.com/saeed5959/high_sync

Problem

Research questions and friction points this paper is trying to address.

lip synchronization

talking-face generation

temporal consistency

audio-visual alignment

photorealistic video

Innovation

Methods, ideas, or system contributions that make the work stand out.

lip synchronization

latent diffusion models

high-resolution video generation