NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the weak visual constraints and representational redundancy of scene tokens in perception-free end-to-end autonomous driving, where those tokens are supervised solely by the planning objective. The authors propose Neural Token Reconstruction (NTR), a framework that applies a self-distilled masked latent reconstruction objective at the scene-token bottleneck: masked patch-level image features are reconstructed using only the compact scene tokens as memory, which pushes the tokens to retain richer visual information. Semantic priors derived from foundation-model annotations bias the reconstruction targets toward driving-relevant structures, and all auxiliary reconstruction components are removed at inference, so the deployed planner is unchanged. On Waymo E2E and NavSim1&2, NTR reports state-of-the-art results (8.0461 RFS on Waymo E2E; 94.1 PDMS / 90.9 EPDMS on NavSim1&2), and the learned scene tokens show lower pairwise redundancy and higher effective rank.
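To make the redundancy and effective-rank claims concrete, here is a minimal sketch of how such quantities are commonly measured on a matrix of scene tokens. The definitions are assumptions, since the summary does not state which ones the authors use: effective rank is taken as the exponential of the entropy of the normalized singular values, and pairwise redundancy as the mean absolute off-diagonal cosine similarity. The function names are hypothetical.

```python
import numpy as np

def effective_rank(tokens, eps=1e-12):
    """Assumed definition: exp(entropy) of the normalized singular values.

    tokens: (K, D) array, one row per scene token.
    """
    s = np.linalg.svd(tokens, compute_uv=False)
    p = s / (s.sum() + eps)                  # singular values as a distribution
    entropy = -np.sum(p * np.log(p + eps))   # Shannon entropy in nats
    return float(np.exp(entropy))

def pairwise_redundancy(tokens, eps=1e-12):
    """Assumed definition: mean |cosine similarity| over distinct token pairs."""
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    sim = normed @ normed.T                  # (K, K) cosine similarities
    off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())

# Illustrative inputs: orthogonal tokens versus identical tokens.
orthogonal = np.eye(4)
collapsed = np.ones((4, 4))
print(effective_rank(orthogonal), pairwise_redundancy(orthogonal))
print(effective_rank(collapsed), pairwise_redundancy(collapsed))
```

Under these assumed definitions, a set of mutually orthogonal tokens has effective rank close to the number of tokens and redundancy close to zero, while tokens that have collapsed onto a single direction have effective rank close to one and redundancy close to one.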

📝 Abstract

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.
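The bottleneck objective described in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the single-head cross-attention decoder, the mean-squared-error loss, the learned mask token, and the optional per-patch weights standing in for the semantic prior are all assumptions made for illustration, and every name below is hypothetical. The one property taken from the abstract is that the masked-patch queries read only from the scene tokens, so any reconstruction signal has to pass through them.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_reconstruction_loss(scene_tokens, teacher_patches, pos_embed,
                               mask, mask_token, weights=None):
    """Hypothetical masked latent reconstruction loss (forward pass only).

    scene_tokens:    (K, D) compact tokens, the only keys/values (memory).
    teacher_patches: (N, D) target patch features, treated as constants.
    pos_embed:       (N, D) positional embeddings for patch locations.
    mask:            (N,) bool, True where a patch is masked.
    mask_token:      (D,) shared query vector for masked positions.
    weights:         optional (N,) non-negative per-patch weights, an
                     assumed stand-in for a semantic prior.
    """
    d = scene_tokens.shape[1]
    queries = mask_token + pos_embed[mask]                  # (M, D)
    attn = softmax(queries @ scene_tokens.T / np.sqrt(d))   # (M, K)
    recon = attn @ scene_tokens                             # (M, D)
    per_patch = ((recon - teacher_patches[mask]) ** 2).mean(axis=1)
    if weights is None:
        return float(per_patch.mean())
    w = weights[mask]
    return float((w * per_patch).sum() / (w.sum() + 1e-12))

# Illustrative shapes only: 16 patches, 4 scene tokens, 8 feature dims.
rng = np.random.default_rng(0)
scene = rng.normal(size=(4, 8))
teacher = rng.normal(size=(16, 8))
pos = rng.normal(size=(16, 8))
mask = rng.random(16) < 0.5
mask[0] = True  # ensure at least one patch is masked
print(masked_reconstruction_loss(scene, teacher, pos, mask, np.zeros(8)))
```

In a training setup along these lines, the teacher features would come from a frozen or slowly updated copy of the encoder (the self-distillation part), and the decoder would be discarded at inference, which matches the abstract's statement that the deployed planner is left unchanged.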

Problem

Research questions and friction points this paper is trying to address.

scene token bottleneck

end-to-end driving

visual representation learning

perception-free autonomous driving

token reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural Token Reconstruction

Scene Token Bottleneck

Masked Latent Reconstruction