DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization

📅 2025-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Unsupervised disentanglement of static appearance and dynamic motion in videos suffers from inter-factor information leakage and reconstruction ambiguity. This paper proposes the first end-to-end video diffusion framework for the task: a sequence encoder separately extracts global static features and frame-wise dynamic features, and a conditional denoising diffusion model then performs high-fidelity decomposition. To integrate diffusion models into explicit static-dynamic factorization, the method introduces three novel components: (i) a noise schedule shared across factors, (ii) a time-varying KL bottleneck that constrains temporal dynamics, and (iii) an orthogonality regularizer that enforces feature independence. On real-world benchmarks, the method achieves state-of-the-art static fidelity and dynamic transferability, with the lowest cross-factor leakage rate and the highest joint swap accuracy, demonstrating superior disentanglement quality and generalization.
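The orthogonality regularizer mentioned above can be illustrated with a minimal sketch: penalize the squared cosine similarity between the global static token and each per-frame dynamic token, pushing the two codes toward independence. The function name, tensor shapes, and the squared-cosine form are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def orthogonality_loss(static_tok, dynamic_toks):
    """Hypothetical orthogonality penalty (shapes and form assumed):
    static_tok:   (B, D)    one global static token per video
    dynamic_toks: (B, T, D) one dynamic token per frame
    Returns the mean squared cosine similarity; 0 when every dynamic
    token is orthogonal to the static token, 1 when fully aligned.
    """
    s = static_tok / np.linalg.norm(static_tok, axis=-1, keepdims=True)
    d = dynamic_toks / np.linalg.norm(dynamic_toks, axis=-1, keepdims=True)
    # cosine of the static token with each frame's dynamic token: (B, T)
    cos = np.einsum('bd,btd->bt', s, d)
    return float((cos ** 2).mean())
```

Adding such a term to the training objective discourages the motion code from re-encoding appearance content that the static token already carries.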

📝 Abstract
Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD's sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
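The time-varying KL bottleneck described in the abstract tightens at early timesteps and relaxes later. One way to realize this is a KL weight that decays over the diffusion timestep; the linear schedule below is an illustrative assumption, since the paper's exact schedule is not given here.

```python
import numpy as np

def kl_weight(t, T, beta_max=1.0, beta_min=0.1):
    """Illustrative time-varying KL weight (exact schedule assumed):
    large at early timesteps (t near 0) to compress static information,
    decaying linearly toward later timesteps to leave capacity for
    frame-wise dynamics. t is the current timestep, T the total count.
    """
    frac = t / (T - 1)  # 0 at the first timestep, 1 at the last
    return beta_max - (beta_max - beta_min) * frac
```

During training, this weight would scale the KL term of the bottleneck at each sampled timestep, so the information budget varies with the denoising stage.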
Problem

Research questions and friction points this paper is trying to address.

Unsupervised disentanglement of static appearance and dynamic motion in video
Addressing information leakage and blurry reconstructions in existing methods
Explicit static-dynamic factorization using a diffusion framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

First end-to-end video diffusion framework
Explicit static-dynamic factorization method
Conditional DDPM decoder with three inductive biases: a shared-noise schedule, a time-varying KL bottleneck, and cross-attention routing of the static token
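The shared-noise bias listed above can be sketched as drawing one noise sample per video and broadcasting it across frames, so every frame is corrupted consistently at a given diffusion step. The function name and shapes are hypothetical; the paper may share noise in a more structured way.

```python
import numpy as np

def shared_noise(batch, frames, shape, rng=None):
    """Sketch of shared noise across frames (details assumed): sample one
    Gaussian noise tensor per video and repeat it for every frame, which
    encourages temporal consistency in the denoising process."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((batch, 1, *shape))
    # broadcast the single per-video sample to all frames
    return np.broadcast_to(eps, (batch, frames, *shape)).copy()
```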