🤖 AI Summary
This work addresses the challenge of balancing multimodal behavior generation and real-time inference in end-to-end autonomous driving models, particularly the high latency induced by iterative denoising in diffusion-based approaches. The authors propose CLEAR, a framework that integrates single-step latent-space conditional drift with large language model–driven semantic reasoning. Building upon the Drive-JEPA vision encoder, CLEAR replaces multi-step denoising with single-step conditional generation in the VAE latent space, employing a tunable conditioning coefficient to balance diversity and accuracy. Concurrently, a fine-tuned Qwen 3.5 0.8B model extracts scene-aware hidden states that guide an adaptive scheduler and a cross-attention trajectory scorer. Requiring neither geometric annotations nor iterative sampling, CLEAR achieves a state-of-the-art PDMS of 93.7 on NAVSIM v1, demonstrating the feasibility of efficient, high-fidelity multimodal motion planning.
📝 Abstract
End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $α$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.
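To make the control flow described in the abstract concrete, the following is a minimal, standard-library-only sketch of the three stages: scheme selection, single-step candidate generation, and attention-based scoring. Everything in it is an illustrative assumption rather than the authors' implementation: the `SCHEMES` values, the linear scheduler, the interpolation used to apply the conditioning coefficient, the shared 4-dimensional vectors, and all function names are hypothetical stand-ins chosen only to show how $α$ and $N$ could flow from the hidden state into generation and selection.

```python
import math
import random

# Hypothetical discrete schemes of (alpha, N). Values are illustrative only.
SCHEMES = [(0.9, 4), (0.6, 8), (0.3, 16)]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_scheme(hidden, scheme_weights):
    """Assumed scheduler: one linear logit per scheme, pick the largest."""
    logits = [dot(w, hidden) for w in scheme_weights]
    best = max(range(len(SCHEMES)), key=logits.__getitem__)
    return SCHEMES[best]


def single_step_candidates(cond_latent, alpha, n, rng):
    """Draw n latents in one shot, with no iterative denoising loop.

    Assumption: alpha interpolates between the conditioning latent
    (alpha = 1, no diversity) and Gaussian noise (alpha = 0).
    """
    return [
        [alpha * c + (1.0 - alpha) * rng.gauss(0.0, 1.0) for c in cond_latent]
        for _ in range(n)
    ]


def score_candidates(hidden, candidates):
    """Scaled dot-product attention of one scene query over the candidates.

    Returns the index of the highest-weighted candidate and the weights.
    """
    scale = math.sqrt(len(hidden))
    logits = [dot(hidden, c) / scale for c in candidates]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    best = max(range(len(candidates)), key=weights.__getitem__)
    return best, weights


# Toy usage with made-up 4-dimensional vectors.
rng = random.Random(0)
hidden = [0.2, -0.1, 0.5, 0.3]            # stand-in for an LLM hidden state
scheme_weights = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
cond_latent = [0.1, 0.4, -0.2, 0.0]       # stand-in for a VAE latent

alpha, n = select_scheme(hidden, scheme_weights)
candidates = single_step_candidates(cond_latent, alpha, n, rng)
best_index, weights = score_candidates(hidden, candidates)
```

In this toy setup the scheduler trades sample count against diversity: a larger `alpha` keeps candidates close to the conditioning latent, so fewer samples are needed, while a smaller `alpha` spreads them out and is paired with a larger `n`. Whether CLEAR pairs the two quantities this way is not stated in the abstract.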