🤖 AI Summary
To resolve the trade-off between the low geometric fidelity of grid-based (raster) representations and the instability of graph-based representations in online high-definition (HD) map generation, this paper proposes DiffSemanticFusion, a framework that fuses semantic raster bird's-eye-view (BEV) encoding with diffusion-based online HD map modeling. The semantic raster BEV provides a learnable, vision-friendly scene encoding, while a lightweight map diffusion module improves the stability and expressiveness of graph-structured map representations under sparse or noisy observations. The framework supports both multimodal trajectory prediction and planning-oriented end-to-end driving. On nuScenes, integrating it with an online-HD-map-informed QCNet yields a 5.1% improvement on the prediction task; on NAVSIM, it achieves state-of-the-art results, including a 15% performance gain in NavHard scenarios. The map diffusion module can also be plugged into other vector-based approaches to improve their performance.
📝 Abstract
Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion -- a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online-HD-map-informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at https://github.com/SunZhigang7/DiffSemanticFusion.
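To make the "map diffusion" idea concrete, the sketch below illustrates one common way such a module can be formulated: a DDPM-style forward process that corrupts the vertices of a map polyline with Gaussian noise, and an ancestral reverse process that denoises them step by step. This is a minimal toy illustration under assumed choices (linear beta schedule, a stand-in oracle denoiser in place of the trained, BEV-conditioned network); it is not the paper's actual architecture.

```python
import numpy as np

# DDPM-style sketch of refining noisy map polyline vertices.
# The schedule values and the oracle denoiser are illustrative
# assumptions, not DiffSemanticFusion's actual module.

def make_schedule(T=50, beta_start=1e-4, beta_end=0.05):
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    # Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def p_sample_step(xt, t, eps_hat, betas, alphas, alpha_bars, rng):
    # One reverse (denoising) step given predicted noise eps_hat.
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()

# Toy "map": 20 lane-centerline vertices in BEV (x, y), in metres.
x0 = np.stack([np.linspace(0.0, 50.0, 20), np.zeros(20)], axis=1)
xT = q_sample(x0, len(betas) - 1, alpha_bars, rng)

# Oracle denoiser: recovers the exact noise from x_t and x0. In a real
# module this would be a network conditioned on BEV scene features.
xt = xT
for t in reversed(range(len(betas))):
    eps_hat = (xt - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])
    xt = p_sample_step(xt, t, eps_hat, betas, alphas, alpha_bars, rng)
refined = xt
```

With the oracle denoiser the reverse chain recovers the clean polyline; with a trained, imperfectly accurate denoiser the same loop yields a refined estimate whose error shrinks as the steps proceed, which is the mechanism a map diffusion module would exploit to stabilize noisy online map vertices.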