Get In Video: Add Anything You Want to the Video

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces "Get-In-Video Editing," a new paradigm for precisely injecting user-specified real-world object instances into videos under data-scarce conditions while ensuring visual fidelity, spatiotemporal consistency, and physically plausible scene interaction. Methodologically, the authors construct GetIn-1M, a large-scale dataset purpose-built for this task, along with the dedicated benchmark GetInBench; propose GetInVideo, an end-to-end diffusion-transformer framework that jointly models reference images, source videos, and masks; and combine an automated Recognize-Track-Erase data pipeline, 3D full-attention mechanisms, and an instance-level semantic erasure-and-recomposition module. Evaluated on GetInBench, the approach substantially outperforms existing methods in identity preservation, temporal coherence, and physically grounded compositing, advancing personalized video editing toward practical deployment.

📝 Abstract
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage, yet current approaches fundamentally fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions. We formalize this overlooked yet critical editing paradigm as "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos. Addressing the task's dual challenges of severe training-data scarcity and of maintaining spatiotemporal coherence, we introduce three key contributions. First, we develop the GetIn-1M dataset, created through our automated Recognize-Track-Erase pipeline, which sequentially performs video captioning, salient instance identification, object detection, temporal tracking, and instance removal to generate high-quality video editing pairs with comprehensive annotations (reference image, tracking mask, instance prompt). Second, we present GetInVideo, a novel end-to-end framework that leverages a diffusion transformer architecture with 3D full attention to process reference images, condition videos, and masks simultaneously, maintaining temporal coherence, preserving visual identity, and ensuring natural scene interactions when integrating reference objects into videos. Finally, we establish GetInBench, the first comprehensive benchmark for the Get-In-Video Editing scenario, demonstrating our approach's superior performance through extensive evaluations. Our work enables accessible, high-quality incorporation of specific real-world subjects into videos, significantly advancing personalized video editing capabilities.
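
To make the Recognize-Track-Erase stages concrete, here is a minimal Python sketch of how such a data-generation pipeline could be wired together. The stage interfaces (`captioner`, `detector`, `tracker`, `eraser`) and the helpers `pick_salient_instance` and `crop_box` are hypothetical stand-ins, not the paper's actual components:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditingPair:
    """One GetIn-1M-style training example: the erased clip serves as the
    editing source, the original clip as the reconstruction target."""
    reference_image: np.ndarray  # crop of the salient instance
    tracking_mask: np.ndarray    # per-frame instance masks, (T, H, W)
    instance_prompt: str         # text naming the instance
    source_video: np.ndarray     # clip with the instance erased
    target_video: np.ndarray     # original, unedited clip

def recognize_track_erase(
    video: np.ndarray,                        # (T, H, W, C) frames
    captioner: Callable, detector: Callable,
    tracker: Callable, eraser: Callable,
    pick_salient_instance: Callable, crop_box: Callable,
) -> EditingPair:
    # Recognize: caption the clip, name a salient instance, detect it
    # in the first frame.
    caption = captioner(video)
    prompt = pick_salient_instance(caption)
    box = detector(video[0], prompt)

    # Track: propagate the first-frame detection into per-frame masks.
    masks = tracker(video, box)

    # Erase: inpaint the instance away to produce the editing source.
    erased = eraser(video, masks)

    return EditingPair(
        reference_image=crop_box(video[0], box),
        tracking_mask=masks,
        instance_prompt=prompt,
        source_video=erased,
        target_video=video,
    )
```

Because the erased clip and the original come from the same footage, every example carries an exact reconstruction target, which is how such a pipeline can mass-produce supervised editing pairs without manual annotation.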
Problem

Research questions and friction points this paper is trying to address.

Incorporating specific real-world instances into videos
Maintaining spatiotemporal coherence in video editing
Ensuring natural interactions between added elements and scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Recognize-Track-Erase pipeline for dataset creation
Diffusion transformer with 3D full attention for video editing (see the attention sketch after this list)
Comprehensive benchmark for Get-In-Video Editing evaluation
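
As a rough illustration of what "3D full attention" over the three input streams could look like, the PyTorch sketch below flattens reference-image, video, and mask tokens into one sequence and applies unfactorized self-attention, so every video token can attend to every reference and mask token across space and time. All dimensions, the token counts, and the block layout are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class Joint3DAttentionBlock(nn.Module):
    """One transformer block applying full self-attention across the
    concatenated reference/video/mask token sequence, with no
    spatial-temporal factorization."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
            nn.GELU(), nn.Linear(4 * dim, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.norm(tokens)
        attn_out, _ = self.attn(h, h, h)   # full attention over the sequence
        tokens = tokens + attn_out
        return tokens + self.mlp(tokens)

# Illustrative token counts: T frames, HW patch tokens per frame.
B, T, HW, dim = 1, 8, 16, 512
ref_tokens = torch.randn(B, HW, dim)        # reference-image patches
video_tokens = torch.randn(B, T * HW, dim)  # noisy/condition video patches
mask_tokens = torch.randn(B, T * HW, dim)   # tracking-mask patches

# 3D full attention: one flat sequence over space, time, and modality.
seq = torch.cat([ref_tokens, video_tokens, mask_tokens], dim=1)
out = Joint3DAttentionBlock(dim)(seq)
print(out.shape)  # torch.Size([1, 272, 512]) -> 16 + 128 + 128 tokens
```

Attending over the full joint sequence is quadratic in its length, the usual price of skipping spatial/temporal factorization, but it lets reference-image features reach every frame directly, which is consistent with the abstract's claims of identity preservation and temporal coherence.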
👥 Authors
Shaobin Zhuang, Shanghai Jiaotong University (Video Generation, Computer Vision)
Zhipeng Huang, Microsoft Research Asia & University of Science and Technology of China (Multi-Modality, Computer Vision)
Binxin Yang, WeChat, Tencent Inc
Ying Zhang, WeChat, Tencent Inc
Fangyikang Wang, Zhejiang University (Diffusion Models, Optimal Transport, Optimization)
Canmiao Fu, WeChat, Tencent Inc
Chong Sun, Tencent WeChat (Computer Vision)
Zheng-Jun Zha, University of Science and Technology of China
Chen Li, WeChat, Tencent Inc
Yali Wang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory