DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-driven talking head generation suffers from poor temporal coherence and difficulty in fine-grained motion control. To address these challenges, we propose a disentangled diffusion-based generation framework. First, we design a motion autoencoder that learns approximately orthogonal, disentangled motion representations for lip articulation, head pose, and eye movement within a structured latent space. Second, we introduce an optimal transport-based flow matching mechanism, jointly optimized with a Transformer-based predictor, to enable independent and precise control over multiple facial motion dimensions. By modeling smooth motion trajectories directly in the latent space, our method significantly improves lip-sync accuracy, motion naturalness, and visual realism. Extensive evaluations on multiple benchmarks demonstrate consistent and substantial improvements over state-of-the-art methods. Our approach establishes a new paradigm for high-fidelity, controllable talking head synthesis.

📝 Abstract
Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
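The optimal-transport flow matching the abstract describes can be illustrated with a minimal rectified-flow-style training pair in the motion latent space: sample a noise endpoint, interpolate linearly toward a data latent, and regress the constant straight-line velocity. This is a generic sketch of the technique, not the paper's code; the audio conditioning and the Transformer predictor are omitted, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x1, rng):
    """Build one OT/rectified-flow training pair for motion latents x1.

    Returns (x_t, t, v_target): a point on the straight-line path from
    noise x0 to data x1, the sampled time, and the constant target
    velocity x1 - x0 that a velocity predictor would regress.
    (Hypothetical helper; DEMO's exact formulation may differ.)
    """
    x0 = rng.standard_normal(x1.shape)      # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))  # per-sample time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1           # linear interpolant
    v_target = x1 - x0                      # straight-line velocity
    return x_t, t, v_target

def fm_loss(pred_v, v_target):
    """Mean-squared flow-matching loss on the predicted velocity."""
    return float(np.mean((pred_v - v_target) ** 2))

# Toy usage: a perfect velocity predictor attains zero loss.
x1 = rng.standard_normal((4, 8))            # batch of 8-dim motion latents
x_t, t, v = flow_matching_targets(x1, rng)
assert fm_loss(v, v) == 0.0
```

In practice the predictor (here a Transformer conditioned on audio) replaces the identity in the last line, and sampling integrates the learned velocity field from noise to a motion trajectory.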
Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained motion control in talking portrait synthesis
Generating temporally coherent videos with disentangled motion factors
Improving lip synchronization and motion fidelity in video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled motion latent space for fine-grained control
Flow matching with transformer predictor for smooth trajectories
Motion auto-encoder with orthogonalized factor representation
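One common way to encourage the "approximately orthogonal" factor representation listed above is to penalize cross-covariance between latent blocks for lip, pose, and eye motion. The following is an illustrative decorrelation regularizer under that assumption; the paper's actual loss is not specified here.

```python
import numpy as np

def cross_covariance_penalty(z_lip, z_pose, z_eye):
    """Sum of squared cross-covariances between motion factor blocks.

    Encourages approximately orthogonal (decorrelated) latent factors
    over a batch: each block is mean-centered, and every off-block
    cross-covariance is driven toward zero.
    (Illustrative regularizer, not DEMO's exact objective.)
    """
    blocks = [z_lip, z_pose, z_eye]
    centered = [z - z.mean(axis=0, keepdims=True) for z in blocks]
    n = blocks[0].shape[0]
    penalty = 0.0
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            cov = centered[i].T @ centered[j] / (n - 1)  # cross-covariance
            penalty += float(np.sum(cov ** 2))
    return penalty
```

Adding such a term to the autoencoder's reconstruction loss lets each block be edited independently at inference, which is what enables the fine-grained control over individual motion dimensions.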
Peiyin Chen
College of Artificial Intelligence and Automation, Hohai University, Changzhou, China
Zhuowei Yang
College of DaYu, Hohai University, Nanjing, China
Hui Feng
Associate Professor, Fudan University, Shanghai, China
Graph Signal Processing, Graph Machine Learning, Unmanned Systems
Sheng Jiang
College of Water Conservancy and Hydropower Engineering, Hohai University, Nanjing, China
Rui Yan
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China