🤖 AI Summary
This paper addresses the challenge of learning synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Methodologically, it proposes a disentangled self-supervised pretraining framework that introduces three frame-level prompt tokens to separately model speaker identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). It couples masked visual modeling with cross-modal contrastive alignment, using temporally aligned token pairs as positives and misaligned pairs as negatives to achieve fine-grained, frame-level audio-visual synchronization awareness. The pretrained tokens form a unified interface for five downstream tasks spanning four task families: audio-visual synchronization, facial emotion recognition, head/face action recognition, visual speech recognition, and visual dubbing. The framework achieves state-of-the-art performance across all four task families, demonstrating the effectiveness and generalizability of jointly modeling synchronization awareness and disentangled representation learning for self-supervised audio-visual understanding.
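For illustration, below is a minimal PyTorch-style sketch of the prompt-token idea described above; the class name, dimensions, and layer choices are hypothetical and not taken from the paper. Three learnable tokens are prepended to a frame's visible patch tokens so that identity, vocal motion, and ambient motion are each read out from a dedicated slot (the masked-reconstruction decoder is omitted):

```python
import torch
import torch.nn as nn

class PromptedFrameEncoder(nn.Module):
    """Sketch: three learnable prompt tokens (identity, vocal motion,
    ambient motion) are prepended to a frame's visible patch tokens
    before a shared Transformer encoder, so each factor is read out
    from its own dedicated token slot."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        # One learnable prompt per factor: [identity, vocal, ambient].
        self.prompts = nn.Parameter(torch.randn(3, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens):                  # (B, N_visible, dim)
        b = patch_tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)      # (B, 3, dim)
        out = self.encoder(torch.cat([prompts, patch_tokens], dim=1))
        # Per-frame factor embeddings, one per prompt slot.
        return out[:, 0], out[:, 1], out[:, 2]        # identity, vocal, ambient
```

In such a scheme, the vocal-motion slot would be the one aligned against audio tokens in the contrastive objective, while the other slots carry the audio-agnostic factors.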
📝 Abstract
We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame: identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable interchangeable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.
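To make the contrastive objective concrete, here is a minimal sketch assuming a PyTorch implementation; the function name, shapes, and temperature are our own illustrative choices, not the paper's. Time-aligned (vocal-motion, audio) token pairs sit on the diagonal of the similarity matrix and act as positives, while every misaligned pairing within the clip acts as a negative:

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(vocal_tokens, audio_tokens, temperature=0.07):
    """Symmetric InfoNCE over per-frame (vocal-motion, audio) tokens.

    vocal_tokens, audio_tokens: (T, D) embeddings for T frames of one
    clip. Diagonal entries of the similarity matrix are the time-aligned
    positives; every off-diagonal (misaligned) pairing is a negative.
    """
    v = F.normalize(vocal_tokens, dim=-1)
    a = F.normalize(audio_tokens, dim=-1)
    logits = v @ a.t() / temperature                    # (T, T) similarities
    targets = torch.arange(v.size(0), device=v.device)  # aligned frame indices
    # Average the video->audio and audio->video retrieval losses.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Optimizing a loss of this form pulls each frame's vocal-motion token toward its own audio token and away from audio tokens at other time steps, which is what gives the learned embeddings their frame-level synchronization awareness.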