🤖 AI Summary
Audio-driven talking-head generation often suffers from poor temporal coherence and offers only limited fine-grained motion control. To address these challenges, we propose a disentangled diffusion-based generation framework. First, we design a motion autoencoder that learns approximately orthogonal, disentangled motion representations for lip articulation, head pose, and eye movement within a structured latent space. Second, we introduce an optimal transport-based flow matching mechanism, jointly optimized with a Transformer-based predictor, to enable independent and precise control over multiple facial motion dimensions. By modeling smooth motion trajectories directly in the latent space, our method significantly improves lip-sync accuracy, motion naturalness, and visual realism. Extensive evaluations on multiple benchmarks demonstrate consistent and substantial improvements over state-of-the-art methods. Our approach establishes a new paradigm for high-fidelity, controllable talking head synthesis.
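The disentangled motion representation described above can be pictured with a minimal sketch: separate encoder heads for lip, pose, and eye latents, plus a soft penalty that pushes the heads' row spaces toward orthogonality. All dimensions, layer sizes, and the loss form here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Hypothetical encoder splitting a motion code into factor sub-latents."""
    def __init__(self, in_dim=256, lip_dim=64, pose_dim=32, eye_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        # One linear head per motion factor (lip articulation, head pose, eyes)
        self.heads = nn.ModuleDict({
            "lip": nn.Linear(256, lip_dim),
            "pose": nn.Linear(256, pose_dim),
            "eye": nn.Linear(256, eye_dim),
        })

    def forward(self, x):
        h = self.backbone(x)
        return {name: head(h) for name, head in self.heads.items()}

def orthogonality_penalty(heads):
    """Penalize overlap between the row spaces of the factor heads,
    encouraging approximately orthogonal (disentangled) sub-latents."""
    loss = 0.0
    ws = [h.weight for h in heads.values()]  # each: (factor_dim, 256)
    for i in range(len(ws)):
        for j in range(i + 1, len(ws)):
            loss = loss + (ws[i] @ ws[j].T).pow(2).mean()
    return loss

enc = MotionEncoder()
z = enc(torch.randn(8, 256))            # batch of 8 motion frames
reg = orthogonality_penalty(enc.heads)  # added to a reconstruction loss in training
```

In practice such a penalty would be one term alongside the autoencoder's reconstruction loss; the paper's exact disentanglement objective may differ.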
📝 Abstract
Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.
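The optimal-transport flow matching step can also be sketched in a few lines: minibatch noise and motion latents are paired by an exact OT assignment, and the predictor regresses the constant velocity along straight-line interpolants. The toy MLP predictor, feature dimensions, and audio conditioning are placeholders for the paper's Transformer predictor.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

def ot_pair(noise, data):
    """Re-order noise samples to minimize total squared distance to data
    (exact minibatch OT coupling via the Hungarian algorithm)."""
    cost = torch.cdist(noise, data).pow(2).cpu().numpy()
    row, col = linear_sum_assignment(cost)
    return noise[row], data[col]

def flow_matching_loss(velocity_fn, data, audio):
    """Conditional flow matching loss with OT-paired endpoints."""
    noise = torch.randn_like(data)
    x0, x1 = ot_pair(noise, data)
    t = torch.rand(data.size(0), 1)
    xt = (1 - t) * x0 + t * x1   # straight-line interpolant
    target_v = x1 - x0           # constant target velocity along the path
    pred_v = velocity_fn(xt, t, audio)
    return (pred_v - target_v).pow(2).mean()

# Toy MLP standing in for the Transformer predictor; a real model would
# attend over the full audio feature sequence.
predictor = nn.Sequential(nn.Linear(64 + 1 + 32, 128), nn.ReLU(), nn.Linear(128, 64))
velocity_fn = lambda xt, t, a: predictor(torch.cat([xt, t, a], dim=-1))

data = torch.randn(16, 64)    # motion latents (assumed 64-dim)
audio = torch.randn(16, 32)   # audio features (assumed 32-dim)
loss = flow_matching_loss(velocity_fn, data, audio)
loss.backward()
```

At inference time the learned velocity field would be integrated from noise to a motion trajectory with an ODE solver; the OT pairing only affects training, where it straightens the target paths.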