LiAuto-GeoX: Efficient Grounded Driving Transformer

📅 2026-06-04

🤖 AI Summary

This work addresses the demand for real-time, high-fidelity dense 3D scene reconstruction in autonomous driving by proposing a lightweight geometric model that grounds surround-view images with sparse LiDAR priors. A high-capacity Transformer-based teacher is trained first and then compressed into a compact 155M-parameter student suitable for onboard deployment, using mask-guided depth-aware distillation and relative-pose relational distillation to preserve fine-grained geometric structure and cross-view consistency. The student runs at 220 FPS on KITTI and, on downstream tasks, reaches 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction, supporting its use as a scalable foundational geometric representation.

📝 Abstract

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, using sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then distill this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations show that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. These results demonstrate that efficient dense 3D reconstruction can move beyond its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.
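
To make the two distillation terms named in the abstract more concrete, here is a minimal NumPy sketch of what such losses could look like. It is not the paper's implementation: the abstract gives no formulas, so everything here is an assumption for illustration, including the function names, the log-depth L1 form, the per-pixel weight mask, and the idea that teacher and student each expose one 4x4 camera-to-world pose per view.

```python
import numpy as np


def masked_depth_distill_loss(student_depth, teacher_depth, mask, eps=1e-6):
    """Mask-weighted L1 distance between student and teacher log-depth.

    Assumption: `mask` holds non-negative per-pixel weights that are larger
    in geometrically informative regions (e.g. edges, LiDAR-covered pixels).
    """
    s = np.log(np.clip(student_depth, eps, None))
    t = np.log(np.clip(teacher_depth, eps, None))
    w = mask.astype(np.float64)
    return float((w * np.abs(s - t)).sum() / (w.sum() + eps))


def relative_poses(poses):
    """All pairwise relative transforms T_ij = inv(T_i) @ T_j.

    poses: array of shape (N, 4, 4); returns shape (N, N, 4, 4).
    """
    inv = np.linalg.inv(poses)
    return np.einsum("iab,jbc->ijac", inv, poses)


def relative_pose_distill_loss(student_poses, teacher_poses):
    """Match the student's pairwise view relations to the teacher's.

    Hypothetical form: mean translation error plus mean geodesic
    rotation error (in radians) over all view pairs.
    """
    rs = relative_poses(student_poses)
    rt = relative_poses(teacher_poses)

    # Translation difference between corresponding relative transforms.
    t_err = np.linalg.norm(rs[..., :3, 3] - rt[..., :3, 3], axis=-1)

    # Rotation difference: angle of R_s^T @ R_t for every view pair.
    r_delta = np.einsum("ijba,ijbc->ijac", rs[..., :3, :3], rt[..., :3, :3])
    cos = (np.trace(r_delta, axis1=-2, axis2=-1) - 1.0) / 2.0
    r_err = np.arccos(np.clip(cos, -1.0, 1.0))

    return float(t_err.mean() + r_err.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy depth maps: the student is a noisy copy of the teacher.
    teacher_d = rng.uniform(1.0, 80.0, size=(4, 6))
    student_d = teacher_d * rng.uniform(0.9, 1.1, size=(4, 6))
    weights = np.ones_like(teacher_d)
    print("depth loss:", masked_depth_distill_loss(student_d, teacher_d, weights))

    # Toy poses: identity rotations with different translations.
    teacher_p = np.tile(np.eye(4), (3, 1, 1))
    teacher_p[:, :3, 3] = rng.normal(size=(3, 3))
    student_p = teacher_p.copy()
    student_p[:, :3, 3] += 0.05 * rng.normal(size=(3, 3))
    print("pose loss:", relative_pose_distill_loss(student_p, teacher_p))
```

Working in relative rather than absolute poses means the loss depends only on how views relate to one another, which is one plausible reading of "pose-induced geometric relations"; the actual method may weight or parameterize these terms differently.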

Problem

Research questions and friction points this paper is trying to address.

dense 3D reconstruction

autonomous driving

real-time efficiency

geometric fidelity

onboard representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

grounded driving transformer

geometry-preserving distillation

sparse LiDAR priors