🤖 AI Summary
Video Face Enhancement (VFE) faces three core challenges: distorted facial-texture modeling with temporal inconsistency, poor generalization caused by scarce high-quality training data, and slow inference from repeated denoising. To address these, we propose VividFace, a one-step diffusion-based framework. It integrates single-step flow matching for efficient denoising, employs joint latent- and pixel-space optimization with randomized switching during training to improve reconstruction fidelity and temporal stability, and leverages multimodal large language models (MLLMs) to curate high-quality facial video data. Built on the pretrained WANX architecture, VividFace is trained with a progressive two-stage strategy. Extensive experiments demonstrate that VividFace achieves state-of-the-art perceptual quality, identity preservation, and temporal consistency while significantly accelerating inference. To foster community advancement, we publicly release both the model and the curated dataset.
📝 Abstract
Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods, which rely primarily on video super-resolution and generative frameworks, continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages its powerful spatiotemporal priors through a single-step flow matching paradigm, enabling a direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further improve reconstruction fidelity and temporal stability, we propose a Joint Latent-Pixel Face-Focused Training strategy that stochastically switches between facial-region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for the automated selection of high-quality face video datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.
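The two core ideas of the abstract can be illustrated with a minimal sketch: a single Euler step along a learned flow maps the degraded latent directly to a clean one, and training stochastically switches between face-focused and global reconstruction losses in both latent and pixel space. This is an illustrative assumption, not the released implementation; `velocity_net`, `face_mask`, and `p_face` are hypothetical stand-ins for VividFace's actual network, face-region masks, and switching probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step_restore(z_lq, velocity_net):
    """Single-step flow matching (sketch): one predicted velocity carries
    the degraded latent z_lq straight to the restored latent, replacing
    the iterative denoising loop of standard diffusion samplers."""
    v = velocity_net(z_lq)   # predicted velocity field at the input
    return z_lq + v          # one Euler step across the whole flow

def joint_latent_pixel_loss(z_pred, z_hq, x_pred, x_hq, face_mask, p_face=0.5):
    """Joint latent-pixel supervision with stochastic switching (sketch):
    with probability p_face the pixel loss is restricted to the facial
    region; otherwise the whole frame is supervised. A latent-space term
    is always applied."""
    if rng.random() < p_face:
        # face-focused optimization: mask out non-facial pixels
        pix = np.mean(face_mask * (x_pred - x_hq) ** 2)
    else:
        # global reconstruction over the full frame
        pix = np.mean((x_pred - x_hq) ** 2)
    lat = np.mean((z_pred - z_hq) ** 2)  # latent-space supervision
    return lat + pix
```

In this reading, efficiency comes entirely from the single Euler step at inference, while the randomized face/global switching only changes which pixels receive gradient during training.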