🤖 AI Summary
This work addresses the limitations of existing discrete visual tokenization methods, which prioritize image generation but struggle to simultaneously support semantic understanding and geometric modeling required for autonomous driving. The authors propose a jointly supervised discrete tokenizer that leverages frozen DINO features to guide representation learning, while integrating supervision from RGB reconstruction, depth estimation, and relative pose between adjacent frames. This approach yields a compact token representation that preserves visual fidelity and captures driving-relevant geometric structure. Innovatively combining representation guidance with geometric augmentation, the method employs multi-codebook vector quantization to stabilize multi-objective training, enabling both high-quality generation and efficient planning within a unified token space. Evaluated on NAVSIM, the proposed approach outperforms baselines in reconstruction fidelity, representation consistency, and planning performance, while significantly improving generation quality under matched settings.
📝 Abstract
Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.