🤖 AI Summary
Existing open-source video world model frameworks are limited to causal autoregressive architectures, which struggle to deliver high generation quality, long-horizon stability, and real-time controllability at the same time. This work proposes BiWM, presented as the first open-source, full-stack framework for video world models under the bidirectional autoregressive paradigm. It adds action and camera control through a two-stage training strategy: control fine-tuning followed by few-step Distribution Matching Distillation (DMD). The bidirectional paradigm benefits from self-correcting error propagation, and BiWM pairs it with pluggable history compression (FramePack-style and PackForcing-style) and an optional NVFP4 4-bit training/inference pipeline. Training converges within a few hundred steps on 8×H200 GPUs. A single recipe covers backbones from 1.3B to 22B parameters (Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B), and the framework retains real-world camera control in settings where minWM loses controllability. GAN and forward-KL objectives are added to counter DMD's mode-seeking degradation and preserve scene dynamics.
📝 Abstract
Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
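The contrast between DMD's mode-seeking behavior and a mass-covering forward-KL objective can be illustrated on a toy problem. The sketch below is not BiWM code: it assumes a one-dimensional bimodal target, a single-Gaussian model family, and a brute-force grid search, all chosen purely for illustration. It treats the reverse KL, KL(q || p), as a stand-in for the DMD-style objective and the forward KL, KL(p || q), as the mass-covering one.

```python
import math

def log_normalize(log_w):
    """Turn unnormalized log-weights into log-probabilities (log-sum-exp)."""
    m = max(log_w)
    log_z = m + math.log(sum(math.exp(v - m) for v in log_w))
    return [v - log_z for v in log_w]

def log_gaussian(xs, mu, sigma):
    """Log-probabilities of a Gaussian discretized on the grid xs."""
    return log_normalize([-0.5 * ((x - mu) / sigma) ** 2 for x in xs])

def kl(log_a, log_b):
    """KL(a || b) for two discrete distributions given as log-probabilities."""
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b))

# Toy bimodal "data" distribution p: equal-weight modes at -3 and +3 (assumed).
xs = [-8.0 + 0.1 * i for i in range(161)]
log_p = log_normalize([
    math.log(math.exp(-0.5 * ((x + 3.0) / 0.5) ** 2)
             + math.exp(-0.5 * ((x - 3.0) / 0.5) ** 2))
    for x in xs
])

def fit(objective):
    """Grid-search the single Gaussian q(mu, sigma) that minimizes objective."""
    best = None
    for i in range(33):          # mu in [-4, 4], step 0.25
        for j in range(38):      # sigma in [0.3, 4.0], step 0.1
            mu, sigma = -4.0 + 0.25 * i, (3 + j) / 10
            score = objective(log_gaussian(xs, mu, sigma))
            if best is None or score < best[0]:
                best = (score, mu, sigma)
    return best[1], best[2]

mu_f, sigma_f = fit(lambda log_q: kl(log_p, log_q))  # forward KL(p || q)
mu_r, sigma_r = fit(lambda log_q: kl(log_q, log_p))  # reverse KL(q || p)

print(f"forward KL fit: mu={mu_f:+.2f}, sigma={sigma_f:.1f}")
print(f"reverse KL fit: mu={mu_r:+.2f}, sigma={sigma_r:.1f}")
```

In this toy setting the reverse-KL fit collapses onto one of the two modes with a narrow spread, while the forward-KL fit sits between the modes with a wide spread that covers both. This is the intuition behind pairing a mode-seeking distillation loss with a mass-covering term; how BiWM weights or schedules its GAN and forward-KL objectives is not specified in the abstract.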