Qffusion: Controllable Portrait Video Editing via Quadrant-Grid Attention Learning

📅 2025-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of jointly achieving controllability, temporal stability, and visual fidelity in portrait video editing, this paper proposes Qffusion, a dual-frame-guided portrait video editing framework built on Stable Diffusion. Its core innovations are the Quadrant-grid Arrangement (QGA) and Quadrant-grid Propagation (QGP) mechanisms, which rearrange dual-image latent codes and fuse four facial-condition embeddings to jointly model appearance and spatiotemporal dynamics, without introducing auxiliary networks or requiring complex retraining. Qffusion enables end-to-end, temporally stable generation of arbitrarily long videos, significantly improving editing controllability and inter-frame coherence. Extensive evaluations across diverse portrait video editing tasks show that Qffusion consistently outperforms existing state-of-the-art methods, striking a strong balance among visual quality, temporal stability, and user-controllable editing.

📝 Abstract
This paper presents Qffusion, a dual-frame-guided framework for portrait video editing. Specifically, we adopt a design principle of "animation for editing": we train Qffusion as a general animation framework from two still reference images, and at inference we apply it to portrait video editing simply by using modified start and end frames as references. Leveraging the generative power of Stable Diffusion, we propose a Quadrant-grid Arrangement (QGA) scheme for latent re-arrangement, which arranges the latent codes of the two reference images and those of four facial conditions into a four-grid layout, respectively. We then fuse the features of these two modalities and use self-attention for both appearance and temporal learning, so that representations at different times are jointly modeled under QGA. Qffusion achieves stable video editing without additional networks or complex training stages; only the input format of Stable Diffusion is modified. Further, we propose a Quadrant-grid Propagation (QGP) inference strategy, which enables stable arbitrary-length video generation by processing reference and condition frames recursively. In extensive experiments, Qffusion consistently outperforms state-of-the-art techniques on portrait video editing.
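The quadrant-grid idea in the abstract can be sketched in minimal form: four latent maps are tiled into a single 2×2 grid, and plain self-attention over all grid positions lets tokens from different frames and conditions attend to each other jointly. The NumPy sketch below illustrates only this general idea; the function names, identity Q/K/V projections, and toy tensor shapes are assumptions for illustration, not the paper's implementation (which operates inside Stable Diffusion's U-Net with learned projections).

```python
import numpy as np

def quadrant_grid_arrange(z_a, z_b, z_c, z_d):
    """Tile four latent maps of shape (C, H, W) into one 2x2 quadrant
    grid of shape (C, 2H, 2W)."""
    top = np.concatenate([z_a, z_b], axis=2)
    bottom = np.concatenate([z_c, z_d], axis=2)
    return np.concatenate([top, bottom], axis=1)

def joint_self_attention(grid):
    """Single-head self-attention over every spatial position of the
    grid, so tokens from all four quadrants are modeled jointly.
    Identity Q/K/V projections keep the sketch dependency-free; a real
    model would use learned projection matrices."""
    C, H, W = grid.shape
    tokens = grid.reshape(C, H * W).T          # (N, C): one token per position
    q = k = v = tokens
    scores = q @ k.T / np.sqrt(C)              # (N, N) scaled dot products
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)    # row-stochastic weights
    out = attn @ v                             # (N, C)
    return out.T.reshape(C, H, W)

# Toy latents: 4 channels, 8x8 spatial resolution each.
rng = np.random.default_rng(0)
latents = [rng.standard_normal((4, 8, 8)) for _ in range(4)]
grid = quadrant_grid_arrange(*latents)
print(grid.shape)   # (4, 16, 16)
out = joint_self_attention(grid)
print(out.shape)    # (4, 16, 16)
```

Because all four quadrants live in one attention window, no extra temporal module is needed: cross-frame dependencies fall out of ordinary self-attention, which matches the paper's claim that only the input format of Stable Diffusion is modified.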
Problem

Research questions and friction points this paper is trying to address.

Controllable Editing
Video Stability
Visual Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

QGA (Quadrant-grid Arrangement)
Stable Diffusion
QGP (Quadrant-grid Propagation)
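The QGP inference strategy, per the abstract, processes reference and condition frames recursively to reach arbitrary length. A plausible reading is chunk-by-chunk generation where the last generated frame becomes the next chunk's start reference. The pure-Python sketch below encodes that reading only; `qgp_inference`, the `animate` callable, and the chunking scheme are assumptions for illustration, not the paper's exact procedure.

```python
from typing import Callable, List, Sequence

Frame = object  # stand-in for an image or latent tensor

def qgp_inference(
    start: Frame,
    end: Frame,
    chunked_conditions: Sequence[Sequence[Frame]],
    animate: Callable[[Frame, Frame, Sequence[Frame]], List[Frame]],
) -> List[Frame]:
    """Hedged sketch of recursive propagation: synthesize a long video
    chunk by chunk. Each chunk is animated between two reference frames;
    the last frame of each chunk becomes the start reference of the next,
    so state propagates recursively and total length is unbounded.
    `animate` stands in for one pass of the dual-frame animation model."""
    video: List[Frame] = [start]
    ref = start
    for conditions in chunked_conditions:     # one condition list per chunk
        chunk = animate(ref, end, conditions)
        video.extend(chunk)
        ref = chunk[-1]                       # recurse on the newest frame
    video.append(end)
    return video

# Demo with a trivial stand-in animator that echoes its conditions.
demo = qgp_inference("S", "E", [["a", "b"], ["c", "d"]],
                     lambda s, e, conds: list(conds))
print(demo)  # ['S', 'a', 'b', 'c', 'd', 'E']
```

Chaining on the freshest generated frame, rather than always on the original references, is what keeps each new chunk consistent with the frames immediately before it.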
Maomao Li
The University of Hong Kong & Tencent AI Lab
Computer Vision, Machine Learning, Artificial Intelligence
Lijian Lin
Tencent ARC Lab
Computer Vision, Visual Tracking, Video Object Detection
Yunfei Liu
International Digital Economy Academy (IDEA)
Ye Zhu
International Digital Economy Academy (IDEA)
Yu Li
International Digital Economy Academy (IDEA)