🤖 AI Summary
This work addresses cinematic-quality video background replacement, where a system must simultaneously synthesize a dynamic background aligned with foreground motion, preserve foreground identity, match the appearance of a reference scene, and relight the foreground so that illumination is globally coherent. The authors formulate the task as in-context conditional generation and build on a Diffusion Transformer video backbone that jointly models foreground dynamics and reference background information through bidirectional attention within a single unified architecture. The stated contributions are a method for generating motion-consistent dynamic backgrounds, joint handling of high-fidelity foreground relighting and identity preservation, and a dedicated dataset of roughly 30,000 videos sourced from high-quality films and online videos. The authors report that the approach significantly outperforms existing open-source solutions and commercial APIs, which often produce static backgrounds, inconsistent boundaries, and noticeable compositing artifacts.
📝 Abstract
We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting, and foreground identity preservation, which often results in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source systems and commercial APIs.