VividFace: High-Quality and Efficient One-Step Diffusion for Video Face Enhancement

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video Face Enhancement (VFE) faces three core challenges: distorted facial texture modeling, temporal inconsistency, and poor generalization coupled with inefficient inference. To address these, we propose VividFace, a one-step diffusion-based framework. It integrates single-step flow matching for efficient denoising, employs joint latent- and pixel-space optimization with randomized switching during training to strengthen reconstruction fidelity and temporal stability, and leverages multimodal large language models (MLLMs) to curate high-quality facial video data. Built upon the pretrained WANX architecture, VividFace adopts a progressive two-stage training strategy. Extensive experiments demonstrate state-of-the-art performance in perceptual quality, identity preservation, and temporal consistency, with significantly faster inference. To foster community advancement, we publicly release both the model and the curated dataset.
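The efficiency gain comes from collapsing the denoising chain into a single step: the network predicts a flow-matching velocity once and takes one Euler step from the degraded latent to the enhanced latent. Below is a minimal PyTorch sketch of that inference path; the module interfaces (encoder, velocity_net, decoder) and the t = 0 conditioning are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class OneStepFlowEnhancer(nn.Module):
    """One Euler step of a flow-matching model: z1 = z0 + v(z0, t=0)."""

    def __init__(self, encoder: nn.Module, velocity_net: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder            # VAE encoder: frames -> latent
        self.velocity_net = velocity_net  # pretrained video backbone predicting velocity
        self.decoder = decoder            # VAE decoder: latent -> frames

    @torch.no_grad()
    def forward(self, degraded: torch.Tensor) -> torch.Tensor:
        # degraded: (B, C, T, H, W) low-quality clip
        z0 = self.encoder(degraded)                     # start of the probability path
        t = torch.zeros(z0.shape[0], device=z0.device)  # single step, so t = 0
        v = self.velocity_net(z0, t)                    # predicted velocity field
        z1 = z0 + v                                     # one Euler step with step size 1
        return self.decoder(z1)                         # enhanced frames
```

Because the step size spans the whole path, inference costs one backbone forward pass instead of the tens of denoising iterations a standard diffusion sampler would need.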

📝 Abstract
Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.
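The curation pipeline is described only at a high level, so the following is a hedged sketch of what an MLLM-based quality filter could look like: sample a few frames per clip, ask a multimodal model for structured quality scores, and keep only clips above a threshold. The prompt wording, the mllm callable, the score axes, clip.sample_frames, and the threshold are all illustrative assumptions, not the paper's pipeline.

```python
import json

PROMPT = (
    "Rate this face video clip from 1 (worst) to 10 (best) on sharpness, "
    "face visibility, freedom from compression artifacts, and motion "
    'stability. Reply as JSON, e.g. {"sharpness": 8, "visibility": 9, '
    '"artifacts": 7, "stability": 8}.'
)

def curate(clips, mllm, min_score=7.0):
    """Keep clips whose average MLLM quality score clears the threshold.

    `mllm(frames, prompt)` stands in for any multimodal LLM call that
    returns a text reply for a handful of sampled frames.
    """
    kept = []
    for clip in clips:
        reply = mllm(clip.sample_frames(n=8), PROMPT)
        try:
            scores = json.loads(reply)  # expect the JSON schema from PROMPT
        except json.JSONDecodeError:
            continue  # drop clips the model could not score cleanly
        if sum(scores.values()) / len(scores) >= min_score:
            kept.append(clip)
    return kept
```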
Problem

Research questions and friction points this paper is trying to address.

Reconstructing high-quality facial regions from degraded video sequences
Overcoming temporal inconsistency and limited generalization in face enhancement
Eliminating the inefficiency of repeated multi-step denoising during inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-step diffusion framework for video face enhancement
Joint latent-pixel training with stochastic switching between face-focused and global objectives (see the sketch after this list)
MLLM-driven data curation pipeline for automated selection of high-quality face videos
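To make the stochastic switching concrete, here is a minimal training-step sketch, assuming a VAE interface, an LPIPS-style per-frame perceptual loss, and a binary face mask. With probability p_face both losses are restricted to the facial region; otherwise they supervise the whole frame. None of this is the released code: the 50/50 switch, mask layout, and unweighted loss sum are placeholders.

```python
import random
import torch
import torch.nn.functional as F

def per_frame(loss_fn, a, b):
    # Fold time into the batch so an image loss (e.g. LPIPS) applies per frame.
    B, C, T, H, W = a.shape
    a = a.transpose(1, 2).reshape(B * T, C, H, W)
    b = b.transpose(1, 2).reshape(B * T, C, H, W)
    return loss_fn(a, b).mean()

def training_step(model, vae, perceptual, lq, hq, face_mask, p_face=0.5):
    # lq, hq: (B, 3, T, H, W) clips; face_mask: (B, 1, T, H, W) in {0, 1}.
    z_tgt = vae.encode(hq)           # clean target latent
    z_pred = model(vae.encode(lq))   # one-step enhanced latent
    x_pred = vae.decode(z_pred)      # pixel-space prediction

    if random.random() < p_face:
        # Face-focused branch: resize the mask to latent resolution and
        # restrict both losses to the facial region.
        m = F.interpolate(face_mask.float(), size=z_tgt.shape[-3:], mode="nearest")
        latent_loss = F.mse_loss(z_pred * m, z_tgt * m)
        pixel_loss = per_frame(perceptual, x_pred * face_mask, hq * face_mask)
    else:
        # Global branch: full-frame supervision in both spaces.
        latent_loss = F.mse_loss(z_pred, z_tgt)
        pixel_loss = per_frame(perceptual, x_pred, hq)
    return latent_loss + pixel_loss
```

Supervising in both spaces matches the abstract's description: the pixel branch gives the face region explicit perceptual supervision, while the latent branch keeps the objective aligned with the diffusion backbone's working space.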
👥 Authors
Shulian Zhang (South China University of Technology)
Yong Guo (Max Planck Institute for Informatics)
Long Peng (China Electric Power Research Institute)
Ziyang Wang (South China University of Technology)
Ye Chen (South China University of Technology)
Wenbo Li (The Chinese University of Hong Kong)
Xiao Zhang (Nanjing University of Science and Technology)
Yulun Zhang (Shanghai Jiao Tong University)
Jian Chen (Max Planck Institute for Informatics)