DVGT: Driving Visual Geometry Transformer

📅 2025-12-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of calibration-free dense geometric perception models that generalize across camera configurations for autonomous driving, this paper introduces the first vision-based geometric Transformer tailored to driving scenarios. The method reconstructs globally consistent, metric-accurate 3D point cloud maps directly from pose-agnostic multi-view image sequences in an end-to-end manner, without requiring camera intrinsics, extrinsics, or external sensor alignment. Key innovations include: (1) eliminating explicit geometric priors and camera-model dependencies; (2) unifying intra-frame local, cross-view spatial, and inter-frame temporal attention mechanisms; and (3) jointly decoding ego-centric point clouds and ego pose from DINO-derived features. Evaluated on nuScenes and the Waymo Open Dataset, the approach achieves state-of-the-art performance in dense geometric reconstruction, demonstrating calibration-free operation, high geometric fidelity, and robust compatibility across diverse camera configurations.

๐Ÿ“ Abstract
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that adapts to different scenarios and camera configurations is still lacking. To bridge this gap, we propose the Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image with a DINO backbone, then apply alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Multiple heads then decode a global point map in the ego coordinate frame of the first frame, along with the ego pose of each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models across a range of scenarios. Code is available at https://github.com/wzzheng/DVGT.
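The alternating attention scheme in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a token grid of shape (frames, views, patches, dim) from a DINO-style backbone, uses bare single-head attention with no learned projections or residuals, and simply reindexes the grid so each attention pass mixes along one axis at a time.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    # tokens: (..., n, d) — simplified self-attention over the last-but-one axis,
    # with no learned Q/K/V projections (for illustration only)
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def alternating_block(x):
    # x: (T frames, V camera views, P patch tokens, d feature dim)
    # 1) intra-view local attention: patches attend within each (frame, view)
    x = attention(x)                    # batch dims (T, V), sequence dim P
    # 2) cross-view spatial attention: views attend at each (frame, patch)
    x = np.swapaxes(x, 1, 2)            # (T, P, V, d)
    x = attention(x)
    x = np.swapaxes(x, 1, 2)            # back to (T, V, P, d)
    # 3) cross-frame temporal attention: frames attend at each (view, patch)
    x = np.moveaxis(x, 0, 2)            # (V, P, T, d)
    x = attention(x)
    x = np.moveaxis(x, 2, 0)            # back to (T, V, P, d)
    return x

tokens = np.random.randn(4, 6, 16, 32)  # 4 frames, 6 cameras, 16 patches, dim 32
out = alternating_block(tokens)
print(out.shape)  # (4, 6, 16, 32)
```

Each pass keeps the grid shape unchanged, so blocks of this form can be stacked; a real trunk would add projections, multiple heads, MLPs, and residual connections.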
Problem

Research questions and friction points this paper is trying to address.

Reconstructs 3D geometry from unposed multi-view images
Adapts to arbitrary camera setups without geometric priors
Predicts metric-scaled maps directly from visual sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer architecture for 3D geometry reconstruction
Attention mechanisms across views and frames
Metric-scaled prediction without camera parameters
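The decoding step described above (a point map in the first frame's ego coordinates, plus per-frame ego poses) can be sketched as two simple regression heads. The head shapes, the mean-pooling, and the 6-DoF axis-angle pose parameterization are assumptions for illustration; the paper's actual head design is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, P, d = 4, 6, 16, 32            # frames, cameras, patch tokens, feature dim

# hypothetical linear head weights (stand-ins for learned parameters)
W_pts = rng.standard_normal((d, 3)) * 0.01   # token -> metric xyz
W_pose = rng.standard_normal((d, 6)) * 0.01  # frame feature -> 6-DoF pose

def decode(tokens):
    # tokens: (T, V, P, d) fused features from the alternating-attention trunk
    # point head: one metric 3D point per patch token, expressed in the
    # ego coordinate frame of the first frame
    points = tokens @ W_pts              # (T, V, P, 3)
    # pose head: mean-pool each frame's tokens, regress a 6-DoF ego pose
    # (3 translation + 3 axis-angle rotation) relative to the first frame
    frame_feat = tokens.mean(axis=(1, 2))  # (T, d)
    poses = frame_feat @ W_pose            # (T, 6)
    return points, poses

pts, poses = decode(rng.standard_normal((T, V, P, d)))
print(pts.shape, poses.shape)  # (4, 6, 16, 3) (4, 6)
```

Because the point head regresses metric coordinates directly, no per-sequence scale alignment with an external sensor is needed at inference, which matches the calibration-free claim above.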