🤖 AI Summary
This work introduces a new paradigm for arbitrary spatio-temporal video completion, enabling users to paste image patches at arbitrary spatial locations and timestamps within a video. It unifies diverse controllable generation tasks, including image-to-video generation, inpainting, outpainting, and frame interpolation, under a single framework. The core challenge lies in the temporal ambiguity introduced by causal VAEs, which compress multiple pixel frames into a single latent representation and thereby impede precise frame-level control. To address this, the authors propose a hybrid conditioning strategy that decouples the two axes of control: spatial placement via zero-padding and temporal alignment via Temporal RoPE Interpolation, enabling fine-grained spatio-temporal conditioning without introducing new parameters or fine-tuning the frozen latent video diffusion backbone. Built on the In-Context Conditioning (ICC) paradigm, the method achieves state-of-the-art performance on the newly constructed VideoCanvasBench benchmark, demonstrating both high-fidelity intra-scene reconstruction and strong cross-scene generalization.
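To make the hybrid strategy concrete, below is a minimal sketch (not the authors' code) of the two decoupled controls: spatial placement writes a condition patch into an otherwise zero canvas, and temporal alignment maps its pixel-frame timestamp to a continuous fractional position in the latent sequence for RoPE. The 4x temporal compression ratio and all function names here are illustrative assumptions.

```python
import torch

def place_patch_on_canvas(patch: torch.Tensor, x: int, y: int,
                          height: int, width: int) -> torch.Tensor:
    """Spatial control: zero-pad a (C, h, w) patch into an empty
    (C, H, W) frame at offset (x, y). Illustrative sketch only."""
    c, h, w = patch.shape
    canvas = torch.zeros(c, height, width)
    canvas[:, y:y + h, x:x + w] = patch
    return canvas

def fractional_latent_position(pixel_frame: int, stride: int = 4) -> float:
    """Temporal control: map a pixel-frame timestamp to a continuous
    position in the latent sequence (Temporal RoPE Interpolation).
    Under an assumed 4x temporal compression, a causal VAE collapses
    pixel frames 4..7 into latent index 1; the fractional position
    keeps them distinguishable."""
    return pixel_frame / stride

# A patch conditioned at pixel frame 6 is placed spatially by padding
# and temporally at fractional latent position 1.5 (not integer 1).
frame = place_patch_on_canvas(torch.randn(3, 64, 64), x=32, y=16,
                              height=256, width=256)
pos = fractional_latent_position(6)  # -> 1.5
```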
📝 Abstract
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks, including first-frame image-to-video, inpainting, extension, and interpolation, under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
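Because RoPE encodes position through rotation angles that are linear in the position index, it can be evaluated at non-integer positions directly, which is what lets a frozen backbone accept fractional temporal coordinates with zero new parameters. The sketch below evaluates a generic 1-D RoPE at a fractional position; it is a minimal illustration under assumed conventions (standard base-10000 frequencies, paired feature rotation), not the paper's exact implementation.

```python
import torch

def rope_angles(position: float, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for 1-D RoPE at a (possibly fractional) position.
    Since the angle is linear in `position`, evaluating at e.g. 1.5
    smoothly interpolates between the encodings of positions 1 and 2."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return position * inv_freq  # shape: (dim // 2,)

def apply_rope(x: torch.Tensor, position: float) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (shape: (dim,)) by the
    RoPE angles for the given position."""
    theta = rope_angles(position, x.shape[-1])
    cos, sin = theta.cos(), theta.sin()
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# A condition token for pixel frame 6 under an assumed 4x causal VAE
# is encoded at fractional latent position 6 / 4 = 1.5.
token = torch.randn(64)
encoded = apply_rope(token, position=1.5)
```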