Replace Anyone in Videos

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the challenge of precise local human motion replacement and insertion in videos under complex backgrounds, this paper proposes an image-guided video diffusion model framework. Methodologically: (1) a multi-form mask suppression strategy is designed to prevent shape leakage; (2) a strengthened visual guidance mechanism improves appearance-pose alignment accuracy; (3) a hybrid inpainting encoder preserves fine background details; and (4) a two-stage optimization scheme reduces training difficulty. The framework is compatible with state-of-the-art architectures such as Wan2.1, supporting both pose-guided and image-conditioned video inpainting via 3D-UNet or DiT backbones. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing methods in visual realism, temporal consistency, and motion fidelity. It achieves end-to-end, single-framework human replacement and insertion without requiring post-processing or multi-model pipelines.
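The multi-form mask suppression strategy above can be sketched in miniature: a tight rectangular mask traces the original person's extent and can leak their silhouette to the model, whereas a randomly expanded, jittered mask hides that shape cue. This is an illustrative numpy sketch under assumed conventions (the function name, `dilate` parameter, and bounding-box mask are mine, not the paper's API):

```python
import numpy as np

def make_mask(h, w, bbox, form="box", dilate=16, seed=0):
    """Build a binary inpainting mask over a person bounding box.

    form="box": tight rectangle (risks leaking the subject's original
    shape to the model). form="irregular": each edge is pushed outward
    by a random margin up to `dilate`, so the mask no longer traces the
    body contour -- the intuition behind multi-form mask suppression.
    """
    rng = np.random.default_rng(seed)
    x0, y0, x1, y1 = bbox
    mask = np.zeros((h, w), dtype=np.uint8)
    if form == "irregular":
        # Jitter each edge outward independently; clamp to image bounds.
        x0 = max(0, x0 - int(rng.integers(0, dilate + 1)))
        y0 = max(0, y0 - int(rng.integers(0, dilate + 1)))
        x1 = min(w, x1 + int(rng.integers(0, dilate + 1)))
        y1 = min(h, y1 + int(rng.integers(0, dilate + 1)))
    mask[y0:y1, x0:x1] = 1
    return mask
```

During training, sampling among several such forms (boxes, dilated boxes, free-form blobs) prevents the model from inferring the replaced person's shape from the mask outline alone.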

📝 Abstract
The field of controllable human-centric video generation has witnessed remarkable progress, particularly with the advent of diffusion models. However, achieving precise and localized control over human motion in videos, such as replacing or inserting individuals while preserving desired motion patterns, remains a formidable challenge. In this work, we present the ReplaceAnyone framework, which focuses on localized human replacement and insertion featuring intricate backgrounds. Specifically, we formulate this task as an image-conditioned video inpainting paradigm with pose guidance, utilizing a unified end-to-end video diffusion architecture that facilitates image-conditioned video inpainting within masked regions. To prevent shape leakage and enable granular local control, we introduce diverse mask forms involving both regular and irregular shapes. Furthermore, we implement an enriched visual guidance mechanism to enhance appearance alignment, a hybrid inpainting encoder to further preserve the detailed background information in the masked video, and a two-phase optimization methodology to reduce training difficulty. ReplaceAnyone enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Extensive experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content. The proposed ReplaceAnyone can be seamlessly applied not only to traditional 3D-UNet base models but also to DiT-based video models such as Wan2.1. The code will be available at https://github.com/ali-vilab/UniAnimate-DiT.
Problem

Research questions and friction points this paper is trying to address.

Precise localized human replacement in videos
Maintaining desired motion patterns during replacement
Handling intricate backgrounds in video inpainting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-conditioned video inpainting with pose guidance
Diverse mask forms for granular local control
Two-phase optimization to simplify training
Xiang Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Changxin Gao
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Yuehuan Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Nong Sang
Huazhong University of Science and Technology
Computer Vision and Pattern Recognition