Get In Video: Add Anything You Want to the Video

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces "Get-In-Video Editing," a new paradigm for precisely injecting user-specified real-world object instances into videos under data-scarce conditions while ensuring visual fidelity, spatiotemporal consistency, and physically plausible scene interaction. Methodologically, the authors construct GetIn-1M, a large-scale dataset purpose-built for this task, along with the dedicated benchmark GetInBench; propose GetInVideo, an end-to-end diffusion-transformer framework that jointly models reference images, source videos, and masks; and combine an automated Recognize-Track-Erase data pipeline, 3D full-attention mechanisms, and an instance-level semantic erasure-and-recomposition module. Evaluated on GetInBench, the approach substantially outperforms existing methods in identity preservation, temporal coherence, and physically grounded compositing, advancing personalized video editing toward practical deployment.

📝 Abstract
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage, yet current approaches fundamentally fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions. We formalize this overlooked yet critical editing paradigm as "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos. Addressing the task's dual challenges of severe training-data scarcity and of maintaining spatiotemporal coherence, we introduce three key contributions. First, we develop the GetIn-1M dataset, created through our automated Recognize-Track-Erase pipeline, which sequentially performs video captioning, salient instance identification, object detection, temporal tracking, and instance removal to generate high-quality video editing pairs with comprehensive annotations (reference image, tracking mask, instance prompt). Second, we present GetInVideo, a novel end-to-end framework that leverages a diffusion transformer architecture with 3D full attention to process reference images, condition videos, and masks simultaneously, maintaining temporal coherence, preserving visual identity, and ensuring natural scene interactions when integrating reference objects into videos. Finally, we establish GetInBench, the first comprehensive benchmark for the Get-In-Video Editing scenario, demonstrating our approach's superior performance through extensive evaluations. Our work enables accessible, high-quality incorporation of specific real-world subjects into videos, significantly advancing personalized video editing capabilities.
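
To make the Recognize-Track-Erase stages concrete, here is a minimal Python sketch of how such a data-generation pipeline could be wired together. The stage interfaces (`captioner`, `detector`, `tracker`, `eraser`) and the helpers `pick_salient_instance` and `crop_box` are hypothetical stand-ins, not the paper's actual components:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditingPair:
    """One GetIn-1M-style training example: the erased clip serves as the
    editing source, the original clip as the reconstruction target."""
    reference_image: np.ndarray  # crop of the salient instance
    tracking_mask: np.ndarray    # per-frame instance masks, (T, H, W)
    instance_prompt: str         # text naming the instance
    source_video: np.ndarray     # clip with the instance erased
    target_video: np.ndarray     # original, unedited clip

def recognize_track_erase(
    video: np.ndarray,                        # (T, H, W, C) frames
    captioner: Callable, detector: Callable,
    tracker: Callable, eraser: Callable,
    pick_salient_instance: Callable, crop_box: Callable,
) -> EditingPair:
    # Recognize: caption the clip, name a salient instance, detect it
    # in the first frame.
    caption = captioner(video)
    prompt = pick_salient_instance(caption)
    box = detector(video[0], prompt)

    # Track: propagate the first-frame detection into per-frame masks.
    masks = tracker(video, box)

    # Erase: inpaint the instance away to produce the editing source.
    erased = eraser(video, masks)

    return EditingPair(
        reference_image=crop_box(video[0], box),
        tracking_mask=masks,
        instance_prompt=prompt,
        source_video=erased,
        target_video=video,
    )
```

Because the erased clip and the original come from the same footage, every example carries an exact reconstruction target, which is how such a pipeline can mass-produce supervised editing pairs without manual annotation.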
Problem

Research questions and friction points this paper is trying to address.

Incorporating specific real-world instances into videos
Maintaining spatiotemporal coherence in video editing
Ensuring natural interactions between added elements and scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Recognize-Track-Erase pipeline for dataset creation
Diffusion transformer with 3D full attention for video editing (see the attention sketch after this list)
Comprehensive benchmark for Get-In-Video Editing evaluation
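
As a rough illustration of what "3D full attention" over the three input streams could look like, the PyTorch sketch below flattens reference-image, video, and mask tokens into one sequence and applies unfactorized self-attention, so every video token can attend to every reference and mask token across space and time. All dimensions, the token counts, and the block layout are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class Joint3DAttentionBlock(nn.Module):
    """One transformer block applying full self-attention across the
    concatenated reference/video/mask token sequence, with no
    spatial-temporal factorization."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
            nn.GELU(), nn.Linear(4 * dim, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.norm(tokens)
        attn_out, _ = self.attn(h, h, h)   # full attention over the sequence
        tokens = tokens + attn_out
        return tokens + self.mlp(tokens)

# Illustrative token counts: T frames, HW patch tokens per frame.
B, T, HW, dim = 1, 8, 16, 512
ref_tokens = torch.randn(B, HW, dim)        # reference-image patches
video_tokens = torch.randn(B, T * HW, dim)  # noisy/condition video patches
mask_tokens = torch.randn(B, T * HW, dim)   # tracking-mask patches

# 3D full attention: one flat sequence over space, time, and modality.
seq = torch.cat([ref_tokens, video_tokens, mask_tokens], dim=1)
out = Joint3DAttentionBlock(dim)(seq)
print(out.shape)  # torch.Size([1, 272, 512]) -> 16 + 128 + 128 tokens
```

Attending over the full joint sequence is quadratic in its length, the usual price of skipping spatial/temporal factorization, but it lets reference-image features reach every frame directly, which is consistent with the abstract's claims of identity preservation and temporal coherence.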
👥 Authors
Shaobin Zhuang, Shanghai Jiaotong University (Video Generation, Computer Vision)
Zhipeng Huang, Microsoft Research Asia & University of Science and Technology of China (Multi-Modality, Computer Vision)
Binxin Yang, WeChat, Tencent Inc
Ying Zhang, WeChat, Tencent Inc
Fangyikang Wang, Zhejiang University (Diffusion Models, Optimal Transport, Optimization)
Canmiao Fu, WeChat, Tencent Inc
Chong Sun, Tencent WeChat (Computer Vision)
Zheng-Jun Zha, University of Science and Technology of China
Chen Li, WeChat, Tencent Inc
Yali Wang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory