DisCo: World Models with Discrete Camera Motion Control

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing controllable video world models often follow complex, continuous camera motion commands unreliably and imprecisely because their motion representations are entangled. To address this, the work replaces continuous trajectories with discrete action primitives as the control condition, which disentangles the motion representations and improves both separability and controllability. It also introduces DisCoBench, an evaluation benchmark covering short-term, long-horizon, and highly dynamic scenarios, along with an integrated framework featuring discretized camera control, action-conditioned generation, and feature disentanglement optimization. Experiments show that the approach significantly improves motion-following reliability across diverse exploration settings while preserving visual quality and temporal consistency.
📝 Abstract
Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often leads to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.
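The separability argument in the abstract can be illustrated with a toy sketch. This is not DisCo's implementation: the primitive vocabulary, the `quantize` rule, and the one-hot encoding are all assumptions made for illustration, and the abstract's claim concerns learned features rather than raw camera deltas. The sketch only shows why two distinct motions can look nearly identical as continuous vectors yet become fully separable once mapped to discrete primitives.

```python
import math

# Hypothetical primitive vocabulary; the paper's actual action set is not
# specified in this summary.
PRIMITIVES = ["stay", "forward", "backward", "left", "right",
              "turn_left", "turn_right"]


def quantize(dx, dz, dyaw, eps=1e-3):
    """Map one per-frame camera delta to a single primitive.

    Assumed rule: pick the signed component with the largest value.
    Comparing translation and rotation magnitudes directly is a
    simplification; a real system would normalize their units first.
    """
    candidates = {
        "forward": dz, "backward": -dz,
        "right": dx, "left": -dx,
        "turn_left": dyaw, "turn_right": -dyaw,
    }
    name, magnitude = max(candidates.items(), key=lambda kv: kv[1])
    return name if magnitude > eps else "stay"


def one_hot(name):
    """Encode a primitive as a one-hot vector over PRIMITIVES."""
    return [1.0 if p == name else 0.0 for p in PRIMITIVES]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two different motions: mostly strafing right vs. mostly moving forward.
strafe = (1.0, 0.9, 0.0)   # (dx, dz, dyaw)
advance = (0.9, 1.0, 0.0)

continuous_similarity = cosine(strafe, advance)
discrete_similarity = cosine(one_hot(quantize(*strafe)),
                             one_hot(quantize(*advance)))

print(f"continuous: {continuous_similarity:.4f}")
print(f"discrete:   {discrete_similarity:.4f}")
```

In this toy setting the continuous vectors are almost parallel, while the one-hot codes for "right" and "forward" are orthogonal. The trade-off is that quantization discards the mixed component of each motion, so a sequence of primitives would be needed to express compound movements.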
Problem

Research questions and friction points this paper is trying to address.

controllable video generation
action representation entanglement
camera motion control
world models
action controllability
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete action primitives
controllable video generation
world models
action representation disentanglement
camera motion control