🤖 AI Summary
Existing video generation models struggle to explicitly control underlying dynamics, often producing trajectories that are unsafe, non-smooth, or physically inconsistent. This work proposes a novel framework that integrates classical optimal control during inference: it first extracts a compact, task-relevant world state from the input, then performs geometric planning on a continuous manifold under physical constraints to generate an optimal trajectory, which subsequently conditions video rendering. By unifying 3D geometric structure with task-specific physical laws, this approach represents the first integration of optimal control into video generation. Experiments demonstrate significant improvements in both dynamic plausibility and visual quality across diverse tasks, including image-to-video synthesis, video dynamics editing, and counterfactual generation.
📝 Abstract
Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively controlling or optimizing the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, non-smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with more desirable dynamics, demonstrating strong potential in multiple tasks, including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.
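The abstract does not specify the planner itself, so the following is only a rough illustration of what "planning a trajectory under constraints" can look like, not OptiWorld's method. It assumes a 2D world state, a single circular obstacle, and a simple elastic-band cost (squared segment lengths plus a soft penetration penalty) minimized by plain gradient descent. The function name `plan_trajectory` and all of its parameters are hypothetical.

```python
import numpy as np

def plan_trajectory(start, goal, obstacle_center, obstacle_radius,
                    n_points=50, iters=2000, lr=0.01,
                    w_smooth=1.0, w_obstacle=10.0):
    """Toy planner (assumption, not the paper's algorithm).

    Minimizes  w_smooth * sum ||x[i+1] - x[i]||^2
             + w_obstacle * sum max(0, r - ||x[i] - c||)^2
    over the interior waypoints, keeping both endpoints fixed.
    """
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    center = np.asarray(obstacle_center, dtype=float)

    # Initialize with a straight line from start to goal.
    ts = np.linspace(0.0, 1.0, n_points)[:, None]
    path = (1.0 - ts) * start + ts * goal

    for _ in range(iters):
        grad = np.zeros_like(path)

        # Gradient of the smoothness term for interior waypoints.
        grad[1:-1] += 2.0 * w_smooth * (
            2.0 * path[1:-1] - path[:-2] - path[2:]
        )

        # Gradient of the soft obstacle penalty: push waypoints that lie
        # inside the obstacle radially outward.
        offset = path - center
        dist = np.maximum(np.linalg.norm(offset, axis=1, keepdims=True), 1e-9)
        penetration = np.maximum(0.0, obstacle_radius - dist)
        grad += -2.0 * w_obstacle * penetration * offset / dist

        # Endpoints are boundary conditions and must not move.
        grad[0] = 0.0
        grad[-1] = 0.0

        path -= lr * grad

    return path

# Example: the obstacle sits slightly off the straight line so that the
# initial path has a well-defined direction in which to be pushed away.
path = plan_trajectory(start=(0.0, 0.0), goal=(1.0, 0.0),
                       obstacle_center=(0.5, 0.05), obstacle_radius=0.2)
```

In a pipeline of the kind the abstract describes, a trajectory such as `path` would then serve as the conditioning signal for the video renderer. The soft penalty is a deliberate simplification here: it is easy to differentiate, but it only discourages constraint violations rather than strictly forbidding them.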