$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of modeling 4D dynamic scenes and poor generalization to rare edge cases in autonomous driving, this paper proposes an occupancy-based world model. Methodologically, it introduces a two-level tokenization scheme: an intra-scene tokenizer that applies multi-scale residual quantization to hierarchically compress 3D scenes into compact tokens, and an inter-scene tokenizer that residually aggregates temporal dependencies to capture 4D dynamics. Furthermore, a controllable encoder-decoder architecture is designed, integrating transformation-matrix prediction with conditional decoding to improve generation controllability and temporal consistency. Evaluated on 4D occupancy forecasting, the method achieves state-of-the-art performance, improving mIoU by 25.1% and IoU by 36.9%. It requires only 2.9 GB of GPU memory during training and runs at 37.0 FPS at inference, enabling efficient real-time deployment.

📝 Abstract
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available at https://github.com/lzzzzzm/II-World.
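One plausible reading of the inter-scene tokenizer's "residually aggregates temporal dependencies across timesteps" is to keep the first frame's tokens and then encode only frame-to-frame token residuals, whose running sum recovers the latest scene. This is a minimal illustrative sketch, not the paper's implementation; the `aggregate_inter_scene` helper, token shapes, and NumPy setting are all assumptions.

```python
import numpy as np

def aggregate_inter_scene(token_history):
    # Keep the first frame's tokens, then accumulate only frame-to-frame
    # residuals. The running aggregate always equals the most recent frame,
    # while the residuals isolate the scene dynamics between timesteps.
    agg = token_history[0].copy()
    residuals = []
    for prev, cur in zip(token_history[:-1], token_history[1:]):
        r = cur - prev        # inter-frame residual (the dynamics)
        residuals.append(r)
        agg = agg + r         # running sum of residuals on top of frame 0
    return agg, residuals

rng = np.random.default_rng(0)
history = [rng.normal(size=(16, 4)) for _ in range(4)]  # 4 timesteps of scene tokens
agg, residuals = aggregate_inter_scene(history)
```

By construction the aggregate reproduces the newest frame exactly, so only the compact residual stream needs to be modeled over time.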
Problem

Research questions and friction points this paper is trying to address.

Efficient tokenization of complex 3D scenes for forecasting
Dynamic 4D scene evolution prediction for autonomous driving
Balancing spatial detail preservation and temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intra-scene tokenizer uses multi-scale residual quantization
Inter-scene tokenizer aggregates temporal dependencies residually
Encoder-decoder architecture enables high-level scene generation control
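The multi-scale residual quantization idea behind the intra-scene tokenizer can be sketched as follows: each scale quantizes the residual left by the previous, coarser scale, so the reconstruction is a running sum of quantized residuals. This is a hedged toy sketch; the `nearest_code` helper, codebook sizes, and feature shapes are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def nearest_code(x, codebook):
    # Replace each row of x with its nearest codebook entry (L2 distance).
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

def multiscale_residual_quantize(feats, codebooks):
    # Each scale quantizes what the previous, coarser scale left behind;
    # the reconstruction is the running sum of quantized residuals.
    residual = feats
    recon = np.zeros_like(feats)
    for cb in codebooks:                # ordered coarse -> fine
        q = nearest_code(residual, cb)
        recon = recon + q
        residual = residual - q         # handed to the next scale
    return recon, residual

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))                          # 64 scene features, dim 8
codebooks = [rng.normal(size=(32, 8)) for _ in range(3)]  # one codebook per scale
recon, residual = multiscale_residual_quantize(feats, codebooks)
```

The attraction of the residual design is that finer scales only have to encode what coarser scales missed, which keeps each codebook small while preserving spatial detail.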
Zhimin Liao
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Ping Wei
Fudan University
Multimedia security, Image synthesis
Ruijie Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Shuaijia Chen
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Haoxuan Wang
PhD, University of Illinois Chicago
Machine Learning Efficiency
Ziyang Ren
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University