🤖 AI Summary
This work addresses the computational burden of long-horizon vision-action models in autonomous driving, where long context leads to high computational cost and existing heuristic token compression methods, being decoupled from planning objectives, often discard decision-critical information. To resolve this, the authors propose COMPACT-VA, a framework that explicitly aligns token compression with driving planning for the first time. It employs a conditional VQ-VAE to jointly encode historical observations and planning intent distilled from future trajectories, learning a bounded working memory representation that is co-optimized with an end-to-end policy. Evaluated in highly dynamic scenarios under comparable token budgets, COMPACT-VA improves success rates by over 6% (reaching 68.3%); in closed-loop evaluation it achieves a 3.3× inference speedup and a 2.7× reduction in memory usage relative to uncompressed processing.
📝 Abstract
Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches such as linear transformers and external memory try to make the context lightweight, token compression is the most compatible with the architecture because it requires no backbone modifications. Yet existing compression methods adopt rule-based heuristics such as temporal decay that are decoupled from planning, risking the loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on a conditional VQ-VAE that compresses extended context into bounded representations. Compression is conditioned on both the historical trajectory and a learned planning intent: during training, the posterior encoder distills this intent from future trajectories, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, so that planning retains decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and design behavioral metrics accordingly. Under comparable token budgets, we achieve a $>$6% improvement in success rate (reaching 68.3%), with consistent gains across metrics. Ablations validate the effectiveness of the planning-aligned coupling. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and a 2.7× memory reduction over uncompressed processing.
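To make the idea of a bounded, conditioned, vector-quantized memory more concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes that context tokens are plain float vectors, that conditioning is a simple additive shift by an intent vector, and that compression is chunked mean pooling followed by a nearest-neighbor codebook lookup. The names `nearest_code`, `compress_context`, `intent`, and `num_slots` are hypothetical. In the framework described above, the encoders and codebook are learned, and the intent would come from the posterior encoder during training and from the prior encoder at inference.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def compress_context(tokens, condition, codebook, num_slots):
    """Compress a variable-length token list into num_slots codebook indices.

    Assumption: mean pooling plus an additive condition stands in for a
    learned conditional encoder; only the quantization step is VQ-like.
    """
    n = len(tokens)
    if n < num_slots:
        raise ValueError("need at least one token per memory slot")
    dim = len(condition)
    indices = []
    for s in range(num_slots):
        # Integer arithmetic keeps every chunk non-empty and covers all tokens.
        start = s * n // num_slots
        end = (s + 1) * n // num_slots
        group = tokens[start:end]
        pooled = [
            sum(t[d] for t in group) / len(group) + condition[d]
            for d in range(dim)
        ]
        indices.append(nearest_code(pooled, codebook))
    return indices


# Toy data: random vectors stand in for visual tokens and learned codes.
random.seed(0)
dim, num_slots = 4, 8
codebook = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(16)]
intent = [0.1, -0.2, 0.0, 0.3]  # hypothetical planning-intent vector
short_ctx = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(50)]
long_ctx = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(200)]

short_mem = compress_context(short_ctx, intent, codebook, num_slots)
long_mem = compress_context(long_ctx, intent, codebook, num_slots)
print(len(short_mem), len(long_mem))  # memory size is fixed at num_slots
```

The point of the sketch is only that the memory size stays at `num_slots` regardless of how long the history grows, which is what lets a downstream policy run on a fixed token budget.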