🤖 AI Summary
Existing audio-visual synchronization methods suffer from limited representational capacity and suboptimal learning strategies, leading to degraded performance in complex scenarios. This paper proposes the first cross-modal unified synchronization evaluation framework, supporting heterogeneous audio-visual representations, including Mel spectrograms, HuBERT features, 3D Morphable Model (3DMM) parameters, and facial landmarks. It introduces a margin-based contrastive loss and cross-speaker unsynchronized negative samples to enhance discriminability and generalizability, maps the heterogeneous audio and visual features into a shared embedding space, and quantifies synchronization via embedding similarity. Evaluated on standard benchmarks, the method significantly outperforms state-of-the-art approaches. It also improves lip-sync quality in talking-face generation systems, demonstrating robustness across both natural videos and AI-generated content.
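To make the similarity-modeling idea concrete, below is a minimal sketch of how synchronization could be scored by projecting an audio feature and a visual feature into a shared embedding space and taking their cosine similarity. The module names, feature dimensions, and projection architecture are illustrative assumptions, not UniSync's actual implementation.

```python
# Minimal sketch (not the authors' code): scoring audio-visual sync via
# cosine similarity of embeddings projected into a shared space.
# Encoder structure, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncScorer(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=136, embed_dim=512):
        super().__init__()
        # Separate projections absorb the dimensional gap between
        # heterogeneous inputs (e.g., HuBERT features vs. facial landmarks).
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim)
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        v = F.normalize(self.visual_proj(visual_feat), dim=-1)
        # Cosine similarity in [-1, 1]; higher means better lip-sync.
        return (a * v).sum(dim=-1)

# Usage: score a batch of audio-visual window pairs.
scorer = SyncScorer()
scores = scorer(torch.randn(8, 1024), torch.randn(8, 136))
```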
📝 Abstract
Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
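As a rough illustration of the margin-based contrastive component mentioned in the abstract, the sketch below penalizes cases where an unsynchronized pair (e.g., a time-shifted clip or audio swapped in from a different speaker) scores within a margin of a synchronized pair. The margin value and negative-sampling details are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a margin-based contrastive objective over sync scores.
import torch
import torch.nn.functional as F

def margin_contrastive_loss(pos_sim, neg_sim, margin=0.2):
    """pos_sim: similarities of synchronized audio-visual pairs, shape (B,).
       neg_sim: similarities of unsynchronized pairs, e.g. time-shifted
                clips or cross-speaker audio, shape (B,)."""
    # Push positive similarities above negative ones by at least `margin`.
    return F.relu(margin - pos_sim + neg_sim).mean()

# Example: positives from aligned clips, negatives mixing temporal offsets
# within a speaker and audio swapped across speakers.
pos = torch.rand(8)
neg = torch.rand(8)
loss = margin_contrastive_loss(pos, neg)
```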