🤖 AI Summary
Audio-driven talking-head generation often suffers from poor temporal coherence and offers only limited fine-grained motion control. To address these challenges, we propose a disentangled diffusion-based generation framework. First, we design a motion autoencoder that learns approximately orthogonal, disentangled motion representations for lip articulation, head pose, and eye movement within a structured latent space. Second, we introduce an optimal transport-based flow matching mechanism, jointly optimized with a Transformer-based predictor, to enable independent and precise control over multiple facial motion dimensions. By modeling smooth motion trajectories directly in the latent space, our method significantly improves lip-sync accuracy, motion naturalness, and visual realism. Extensive evaluations on multiple benchmarks demonstrate consistent and substantial improvements over state-of-the-art methods. Our approach establishes a new paradigm for high-fidelity, controllable talking head synthesis.
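The disentangled motion representation described above can be pictured with a minimal sketch: separate encoder heads for lip, pose, and eye latents, plus a soft penalty that pushes the heads' row spaces toward orthogonality. All dimensions, layer sizes, and the loss form here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Hypothetical encoder splitting a motion code into factor sub-latents."""
    def __init__(self, in_dim=256, lip_dim=64, pose_dim=32, eye_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        # One linear head per motion factor (lip articulation, head pose, eyes)
        self.heads = nn.ModuleDict({
            "lip": nn.Linear(256, lip_dim),
            "pose": nn.Linear(256, pose_dim),
            "eye": nn.Linear(256, eye_dim),
        })

    def forward(self, x):
        h = self.backbone(x)
        return {name: head(h) for name, head in self.heads.items()}

def orthogonality_penalty(heads):
    """Penalize overlap between the row spaces of the factor heads,
    encouraging approximately orthogonal (disentangled) sub-latents."""
    loss = 0.0
    ws = [h.weight for h in heads.values()]  # each: (factor_dim, 256)
    for i in range(len(ws)):
        for j in range(i + 1, len(ws)):
            loss = loss + (ws[i] @ ws[j].T).pow(2).mean()
    return loss

enc = MotionEncoder()
z = enc(torch.randn(8, 256))            # batch of 8 motion frames
reg = orthogonality_penalty(enc.heads)  # added to a reconstruction loss in training
```

In practice such a penalty would be one term alongside the autoencoder's reconstruction loss; the paper's exact disentanglement objective may differ.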
📝 Abstract
Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
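The optimal-transport flow matching step can also be sketched in a few lines: minibatch noise and motion latents are paired by an exact OT assignment, and the predictor regresses the constant velocity along straight-line interpolants. The toy MLP predictor, feature dimensions, and audio conditioning are placeholders for the paper's Transformer predictor.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

def ot_pair(noise, data):
    """Re-order noise samples to minimize total squared distance to data
    (exact minibatch OT coupling via the Hungarian algorithm)."""
    cost = torch.cdist(noise, data).pow(2).cpu().numpy()
    row, col = linear_sum_assignment(cost)
    return noise[row], data[col]

def flow_matching_loss(velocity_fn, data, audio):
    """Conditional flow matching loss with OT-paired endpoints."""
    noise = torch.randn_like(data)
    x0, x1 = ot_pair(noise, data)
    t = torch.rand(data.size(0), 1)
    xt = (1 - t) * x0 + t * x1   # straight-line interpolant
    target_v = x1 - x0           # constant target velocity along the path
    pred_v = velocity_fn(xt, t, audio)
    return (pred_v - target_v).pow(2).mean()

# Toy MLP standing in for the Transformer predictor; a real model would
# attend over the full audio feature sequence.
predictor = nn.Sequential(nn.Linear(64 + 1 + 32, 128), nn.ReLU(), nn.Linear(128, 64))
velocity_fn = lambda xt, t, a: predictor(torch.cat([xt, t, a], dim=-1))

data = torch.randn(16, 64)    # motion latents (assumed 64-dim)
audio = torch.randn(16, 32)   # audio features (assumed 32-dim)
loss = flow_matching_loss(velocity_fn, data, audio)
loss.backward()
```

At inference time the learned velocity field would be integrated from noise to a motion trajectory with an ODE solver; the OT pairing only affects training, where it straightens the target paths.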