🤖 AI Summary
To address the lack of calibration-free dense geometric perception models that generalize across camera configurations for autonomous driving, this paper introduces the first vision-based geometric Transformer tailored for driving scenarios. The method reconstructs globally consistent, metric-accurate 3D point cloud maps directly from pose-agnostic multi-view image sequences in an end-to-end manner, without requiring camera intrinsics, extrinsics, or external sensor alignment. Key innovations include: (1) eliminating explicit geometric priors and camera model dependencies; (2) unifying intra-frame local, cross-view spatial, and inter-frame temporal attention mechanisms; and (3) jointly decoding ego-centric point clouds and ego-poses from DINO-derived features. Evaluated on nuScenes and the Waymo Open Dataset, the approach achieves state-of-the-art performance in dense geometric reconstruction, demonstrating calibration-free operation, high geometric fidelity, and robust compatibility across diverse camera configurations.
📝 Abstract
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose the Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate frame of the first frame, along with the ego pose of each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models in a variety of scenarios. Code is available at https://github.com/wzzheng/DVGT.
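The alternating attention scheme described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): it assumes patch features are arranged in a `(frames, views, patches, dim)` tensor and applies plain scaled dot-product self-attention over a different grouping of tokens at each of the three stages, which is the core reshaping idea behind intra-view, cross-view, and cross-frame attention.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over the token axis.
    x: array of shape (..., tokens, dim)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ x

def alternating_block(tokens):
    """One hypothetical alternating-attention block.
    tokens: (frames, views, patches, dim) patch features."""
    F, V, P, D = tokens.shape
    # 1) Intra-view local attention: patches attend within each image.
    x = self_attention(tokens.reshape(F * V, P, D)).reshape(F, V, P, D)
    # 2) Cross-view spatial attention: all patches of one frame attend jointly.
    x = self_attention(x.reshape(F, V * P, D)).reshape(F, V, P, D)
    # 3) Cross-frame temporal attention: the same patch position attends
    #    across the frame (time) axis.
    x = np.transpose(x, (1, 2, 0, 3)).reshape(V * P, F, D)
    x = self_attention(x).reshape(V, P, F, D)
    return np.transpose(x, (2, 0, 1, 3))

# Toy input: 3 frames, 6 cameras, 16 patches per image, 32-dim features.
feats = np.random.default_rng(0).standard_normal((3, 6, 16, 32))
out = alternating_block(feats)
print(out.shape)  # (3, 6, 16, 32): shape is preserved through the block
```

In a real Transformer each stage would use multi-head attention with learned projections and residual connections; the sketch keeps only the token regrouping, which is what distinguishes the three attention types.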