FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

📅 2025-12-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing research treats talking-face editing and generation as disjoint tasks. This paper proposes a unified paradigm, speech-conditioned facial motion infilling, which formulates both tasks as reconstruction of masked motion segments conditioned on speech. Methodologically, the authors introduce FacEDiT (Face Editing Diffusion Transformer), which combines a diffusion-based Transformer, flow-matching training, masked autoencoding, biased attention, and temporal smoothness constraints to support fine-grained editing operations (substitution, insertion, and deletion) while preserving identity consistency, lip-sync accuracy, and spatiotemporal boundary continuity. The contributions are threefold: (1) the first unified framework bridging editing and generation; (2) FacEDiTBench, the first benchmark designed specifically for talking-face editing, accompanied by dedicated evaluation metrics; and (3) zero-shot generalization to talking-face generation, achieving state-of-the-art performance on standard generation benchmarks.
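
To make the training objective in the summary concrete, here is a minimal PyTorch sketch of speech-conditioned motion infilling trained with flow matching. The tensor shapes, the linear interpolation path, and the `model` call signature are illustrative assumptions, not the paper's published interface.

```python
import torch

def infilling_flow_matching_loss(model, motion, speech, mask):
    """One training step: reconstruct masked facial motion via flow matching.

    motion: (B, T, D) ground-truth facial motion features
    speech: (B, T, C) time-aligned speech features (conditioning)
    mask:   (B, T)   bool, True for frames the model must synthesize
    NOTE: shapes and the model signature are assumptions for exposition.
    """
    b = motion.size(0)
    noise = torch.randn_like(motion)               # x_0 ~ N(0, I)
    t = torch.rand(b, 1, 1, device=motion.device)  # flow time in [0, 1)
    x_t = (1 - t) * noise + t * motion             # linear probability path
    target = motion - noise                        # constant velocity field

    # Masked-autoencoder-style input: clean context outside the mask,
    # noisy interpolant inside the region being infilled.
    x_in = torch.where(mask.unsqueeze(-1), x_t, motion)

    v_pred = model(x_in, t.view(b), speech, mask)  # predicted velocity (B, T, D)
    return ((v_pred - target) ** 2)[mask].mean()   # supervise masked frames only
```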

📝 Abstract
Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.
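
The three edit types named in the abstract all reduce to choosing which motion frames to mask before infilling. The helper below is a hypothetical illustration of that reduction; the function name, the frame-level granularity, and the small "halo" re-synthesized around a deletion are assumptions, not the paper's API.

```python
import torch

def build_edit_mask(num_frames, edit, start, end=None, insert_len=0, halo=2):
    """Map an edit request onto an infilling mask over motion frames.

    Returns (new_num_frames, mask), where mask[i] is True for frames the
    infilling model must synthesize. Hypothetical helper for exposition.
    """
    if edit == "substitute":              # regenerate frames [start, end)
        mask = torch.zeros(num_frames, dtype=torch.bool)
        mask[start:end] = True
        return num_frames, mask

    if edit == "insert":                  # open a gap of new frames at `start`
        new_len = num_frames + insert_len
        mask = torch.zeros(new_len, dtype=torch.bool)
        mask[start:start + insert_len] = True
        return new_len, mask

    if edit == "delete":                  # cut [start, end), then heal the seam
        new_len = num_frames - (end - start)
        mask = torch.zeros(new_len, dtype=torch.bool)
        # Re-synthesize a small halo around the cut to keep the boundary smooth.
        mask[max(0, start - halo):min(new_len, start + halo)] = True
        return new_len, mask

    raise ValueError(f"unknown edit type: {edit}")
```

In all three cases the unmasked frames on either side serve as conditioning context, which is what the abstract credits for seamless transitions with unedited regions.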
Problem

Research questions and friction points this paper is trying to address.

Unify talking face editing and generation via facial motion infilling
Enable localized facial edits like substitution, insertion, and deletion
Address the lack of a standard editing benchmark with a new dataset and metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified facial motion infilling via Diffusion Transformer
Masked autoencoder approach for localized edits and generation
Biased attention and smoothness constraints enhance synchronization
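
The last item above pairs two mechanisms; here is one plausible reading of each in code: a distance-decayed additive attention bias that keeps edited frames anchored to nearby context, and a penalty on acceleration around mask boundaries. Both forms are assumptions sketched for illustration; the paper's exact bias and constraint may differ.

```python
import torch

def distance_biased_attention(q, k, v, strength=0.1):
    """Self-attention with an additive bias favoring temporally close frames.
    q, k, v: (T, d). The linear-decay bias is an assumed form of the
    'biased attention' mentioned above."""
    T, d = q.shape
    scores = (q @ k.t()) / d ** 0.5                     # (T, T)
    idx = torch.arange(T)
    dist = (idx[:, None] - idx[None, :]).abs().float()  # pairwise frame distance
    attn = torch.softmax(scores - strength * dist, dim=-1)
    return attn @ v

def boundary_smoothness_loss(motion, mask):
    """Penalize acceleration around edit boundaries for temporal continuity.
    motion: (B, T, D); mask: (B, T) bool for edited frames. Assumed form."""
    acc = motion[:, 2:] - 2 * motion[:, 1:-1] + motion[:, :-2]  # (B, T-2, D)
    # A frame sits at a boundary if the mask flips in its local neighborhood.
    flip = mask[:, 1:] ^ mask[:, :-1]                           # (B, T-1)
    near = flip[:, 1:] | flip[:, :-1]                           # (B, T-2)
    if not near.any():
        return motion.new_zeros(())
    return (acc ** 2)[near].mean()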