BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

📅 2026-06-08
🤖 AI Summary
Existing open-source video world models are built on causal autoregressive architectures and struggle to combine high generation quality, long-horizon stability, and real-time controllability. This work proposes BiWM, described as the first full-stack open-source framework for interactive video world models under the bidirectional autoregressive paradigm. It adds action and camera control through a two-stage training strategy: control fine-tuning followed by few-step Distribution Matching Distillation (DMD), which converges within a few hundred steps on 8×H200 GPUs. The bidirectional autoregressive design benefits from self-correcting error propagation, and the framework adds pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, an optional NVFP4 4-bit training/inference pipeline, and GAN and forward-KL objectives to counter DMD's mode-seeking degradation. A single recipe spans backbones from 1.3B to 22B parameters (Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, LTX-2.3-22B), and BiWM retains real-world camera control in settings where the causal baseline minWM loses controllability.
📝 Abstract
Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
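The contrast the abstract draws between DMD's mode-seeking behavior and a mass-covering forward-KL term can be illustrated with a toy example. The sketch below is not BiWM's loss or training code. It assumes a 1-D bimodal "data" distribution on a discrete grid and two hypothetical candidate "generator" distributions, and compares the reverse KL (generator to data, the mode-seeking direction) with the forward KL (data to generator, the mass-covering direction). All names and numbers are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normalize(weights):
    """Rescale non-negative weights so they sum to 1 on the grid."""
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """Discrete KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Symmetric grid over [-6, 6] (assumed range for this toy problem).
xs = [-6.0 + 0.05 * i for i in range(241)]

# Toy bimodal "data" distribution: two equally weighted modes at -2 and +2.
p = normalize([0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)
               for x in xs])

# Hypothetical generator A: collapses onto a single mode of the data.
q_single_mode = normalize([normal_pdf(x, 2.0, 0.5) for x in xs])

# Hypothetical generator B: one wide Gaussian, moment-matched to the mixture
# (mean 0, variance 0.25 + 4 = 4.25), so it covers both modes.
q_covering = normalize([normal_pdf(x, 0.0, math.sqrt(4.25)) for x in xs])

# Forward KL(p || q): penalizes q for missing mass where the data has it.
forward_kl = {"single_mode": kl(p, q_single_mode),
              "covering": kl(p, q_covering)}

# Reverse KL(q || p): penalizes q for placing mass where the data has little.
reverse_kl = {"single_mode": kl(q_single_mode, p),
              "covering": kl(q_covering, p)}

print("forward KL:", forward_kl)
print("reverse KL:", reverse_kl)
```

In this toy setting the reverse KL prefers the collapsed single-mode candidate, while the forward KL prefers the candidate that covers both modes. That is the general intuition behind pairing a mode-seeking objective with a mass-covering one; how BiWM weights or implements these terms is not specified in the abstract.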
Problem

Research questions and friction points this paper is trying to address.

video world models
bidirectional autoregression
interactivity
open-source framework
controllability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Autoregression
Distribution Matching Distillation
Interactive Video World Models
History Compression
4-bit Training