🤖 AI Summary
Addressing the challenge of jointly achieving controllability, temporal stability, and visual fidelity in portrait video editing, this paper proposes Qffusion—a dual-frame-guided portrait video editing framework built upon Stable Diffusion. Its core innovations are the Quadrant Grid Arrangement (QGA) and Quadrant Grid Propagation (QGP) mechanisms, which rearrange dual-image latent codes and fuse four facial-condition embeddings to jointly model appearance and spatiotemporal dynamics, without introducing auxiliary networks or complex retraining stages. Qffusion enables end-to-end, temporally stable generation of arbitrarily long videos, improving both editing controllability and inter-frame coherence. Extensive evaluations across diverse portrait video editing tasks show that Qffusion consistently outperforms existing state-of-the-art methods, jointly improving visual quality, temporal stability, and user-controllable editing.
📝 Abstract
This paper presents Qffusion, a dual-frame-guided framework for portrait video editing. Specifically, we adopt a design principle of "animation for editing": we train Qffusion as a general animation framework from two still reference images, and at inference we apply it to portrait video editing simply by using modified start and end frames as references. Leveraging the generative power of Stable Diffusion, we propose a Quadrant-grid Arrangement (QGA) scheme for latent re-arrangement, which arranges the latent codes of the two reference images and those of four facial conditions into a four-grid layout, separately. We then fuse the features of these two modalities and apply self-attention for both appearance and temporal learning, so that representations at different times are jointly modeled under QGA. Qffusion achieves stable video editing without additional networks or complex training stages; only the input format of Stable Diffusion is modified. Furthermore, we propose a Quadrant-grid Propagation (QGP) inference strategy, which offers a unique advantage for stable arbitrary-length video generation by processing reference and condition frames recursively. Extensive experiments show that Qffusion consistently outperforms state-of-the-art techniques on portrait video editing.
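The quadrant-grid arrangement described above can be illustrated with a minimal sketch: four latent maps are tiled into a single 2×2 grid so that one self-attention pass can jointly attend across them. This is an assumption-laden toy (numpy arrays of shape `(C, H, W)`, a hypothetical `quadrant_grid_arrange` helper, and an arbitrary quadrant placement), not the paper's actual implementation.

```python
import numpy as np

def quadrant_grid_arrange(top_left, top_right, bottom_left, bottom_right):
    """Tile four latent maps of shape (C, H, W) into one (C, 2H, 2W) grid.

    Which latent (reference frame vs. facial condition) goes in which
    quadrant is an illustrative assumption, not the paper's layout.
    """
    top = np.concatenate([top_left, top_right], axis=2)        # (C, H, 2W)
    bottom = np.concatenate([bottom_left, bottom_right], axis=2)
    return np.concatenate([top, bottom], axis=1)               # (C, 2H, 2W)

# Toy latents: 4 channels, 8x8 spatial resolution.
latents = [np.full((4, 8, 8), i, dtype=np.float32) for i in range(4)]
grid = quadrant_grid_arrange(*latents)
print(grid.shape)  # (4, 16, 16)
```

Because the four latents share one spatial canvas, a plain self-attention layer over the flattened grid tokens attends across reference and condition features simultaneously, which matches the abstract's claim that only Stable Diffusion's input format changes.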