🤖 AI Summary
Current driving video generation methods face three key bottlenecks: reliance on computationally intensive large models, poor architectural interpretability, and absence of open-source implementations. This paper introduces the first fully open-source video generation system tailored for autonomous driving, built upon the BDD100K dataset. It modularly integrates and fine-tunes publicly available pre-trained components—namely, an image tokenizer, a world model, and a video decoder. By decoupling and evaluating these three core modules, the work delivers reproducible design insights. The end-to-end pipeline exclusively employs open models and data, enabling efficient training and inference on academic-grade GPUs. At 256×256 resolution and 4 fps, the system achieves high-fidelity, single-frame-latency video generation. This advances efficiency, transparency, and reproducibility, establishing a new benchmark for autonomous driving simulation and world modeling research.
📝 Abstract
Recent video generation systems that predict realistic automotive driving scenes from short video inputs assign tokenization, future-state prediction (world model), and video decoding to dedicated models. These approaches often rely on large models that demand significant training resources, offer limited insight into design choices, and lack publicly available code and datasets. In this work, we address these deficiencies and present OpenViGA, an open video generation system for automotive driving scenes. Our contributions are as follows. First, unlike several earlier works for video generation, such as GAIA-1, we provide a deep analysis of the three components of our system through separate quantitative and qualitative evaluation: image tokenizer, world model, and video decoder. Second, we build purely upon powerful pre-trained open-source models from various domains, which we fine-tune on publicly available automotive data (BDD100K) using GPU hardware at academic scale. Third, we build a coherent video generation system by streamlining the interfaces of our components. Fourth, due to the public availability of the underlying models and data, we allow full reproducibility. Finally, we publish our code and models on GitHub. For an image size of 256×256 at 4 fps, we are able to predict realistic driving scene videos frame by frame with only one frame of algorithmic latency.
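The tokenizer → world model → decoder split described above can be sketched as a frame-by-frame rollout loop. The sketch below is purely illustrative: the class names, tensor shapes, and interfaces are assumptions chosen to mirror the paper's 256×256 setting, not the released OpenViGA implementation. It shows why the pipeline has only one frame of algorithmic latency: each output frame depends only on frames already observed or predicted.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three OpenViGA modules; names and
# interfaces are illustrative, not the authors' released code.
class ImageTokenizer(nn.Module):
    """Encodes a 256x256 RGB frame into a grid of discrete tokens."""
    def __init__(self, num_tokens=64, vocab_size=1024):
        super().__init__()
        self.num_tokens = num_tokens
        self.vocab_size = vocab_size

    def encode(self, frame):                       # frame: (B, 3, 256, 256)
        b = frame.shape[0]
        # Placeholder: a real tokenizer would use a learned VQ encoder.
        return torch.randint(0, self.vocab_size, (b, self.num_tokens))

class WorldModel(nn.Module):
    """Predicts the token grid of the next frame from token history."""
    def predict_next(self, token_history):         # (B, T, N) -> (B, N)
        # Placeholder dynamics: repeat the last frame's tokens.
        return token_history[:, -1]

class VideoDecoder(nn.Module):
    """Decodes a predicted token grid back into an RGB frame."""
    def decode(self, tokens):                      # (B, N) -> (B, 3, 256, 256)
        b = tokens.shape[0]
        return torch.zeros(b, 3, 256, 256)

def generate(context_frames, horizon=8):
    """Frame-by-frame rollout: tokenize the context once, then
    alternate world-model prediction and decoding. Each new frame is
    available as soon as it is predicted (one frame of latency)."""
    tok, wm, dec = ImageTokenizer(), WorldModel(), VideoDecoder()
    history = torch.stack([tok.encode(f) for f in context_frames], dim=1)
    outputs = []
    for _ in range(horizon):
        next_tokens = wm.predict_next(history)     # (B, N)
        outputs.append(dec.decode(next_tokens))    # emit frame immediately
        history = torch.cat([history, next_tokens.unsqueeze(1)], dim=1)
    return outputs
```

Decoupling the modules this way is what lets each component be swapped, fine-tuned, and evaluated independently, as the paper's per-module analysis does.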