Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video diffusion models struggle to control camera and scene dynamics accurately under highly complex motion or severe occlusions, and often suffer structural collapse because they rely on implicit priors. This work proposes an editable generative 3D cache, built with 3D lifting, that explicitly models the complete 3D geometry of foreground entities and decouples geometry from appearance to resolve view ambiguity. A Soft Spatial-Aligned Injection mechanism combined with a lightweight fine-tuning strategy enables disentangled control over camera trajectories and multi-entity motion. A data curation and perturbation pipeline built on masked normal maps removes the need for 3D annotations. Together, these components improve spatiotemporal consistency under large camera movements and heavy occlusions, preventing structural collapse while keeping video generation high-fidelity and controllable.
📝 Abstract
While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. The project website is available at https://jiayi-wu-leo.github.io/real2sam2real
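The abstract names masked normal maps as the cross-modal bridge but does not say how they are computed. As a rough illustration only, the sketch below shows one common way such a map could be formed: estimate surface normals from a depth grid by finite differences, then zero out everything outside a foreground mask. The function name `masked_normal_map`, the depth-based estimate, and the list-of-rows input format are all assumptions made for this example, not details taken from the paper.

```python
import math


def masked_normal_map(depth, mask):
    """Estimate unit normals from a depth grid and zero out the background.

    Hypothetical helper for illustration; not the paper's pipeline.
    depth: H x W list of rows of floats (assumed depth values).
    mask:  H x W list of rows of 0/1, where 1 marks a foreground entity.
    Returns an H x W grid of (nx, ny, nz) tuples.
    """
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            if not mask[y][x]:
                # Background pixels carry no geometry in the masked map.
                row.append((0.0, 0.0, 0.0))
                continue
            # Central differences, clamped to the grid at the border.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            # A surface z = f(x, y) has a normal proportional to
            # (-dz/dx, -dz/dy, 1); normalize it to unit length.
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            row.append(tuple(c / length for c in n))
        normals.append(row)
    return normals
```

Because the result encodes only surface orientation inside the mask, it carries no color or texture, which is one way to read the abstract's point about decoupling geometry from appearance.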
Problem

Research questions and friction points this paper is trying to address.

Video Diffusion Models
3D-aware guidance
camera control
scene dynamics
structural collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative 3D Caches
Video Diffusion Models
3D Lifting
Spatial-Aligned Injection
Spatiotemporal Consistency