CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry

๐Ÿ“… 2025-08-01
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Unsupervised monocular visual odometry (VO) suffers from pose estimation drift due to dynamic objects violating the static scene assumption. Existing uncertainty modeling is confined to single frames, limiting reliable dynamic region detection. To address this, we propose a temporal uncertainty propagation and fusion frameworkโ€”the first to model cross-frame uncertainty in unsupervised VO. Our method jointly optimizes depth, pose, and pixel-wise uncertainty, probabilistically fusing uncertainties from target and reference frames within a visual Transformer architecture. It enables end-to-end self-supervised learning without ground-truth supervision. Evaluated on KITTI and nuScenes, our approach significantly outperforms state-of-the-art unsupervised methods, especially under high-speed dynamic conditions. Ablation studies confirm that temporal uncertainty modeling is critical for suppressing dynamic regions and improving robustness.

Technology Category

Application Category

๐Ÿ“ Abstract
Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Consequently, experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.
Problem

Research questions and friction points this paper is trying to address.

Dynamic objects disrupt static scene assumptions in VO
Single-frame uncertainty modeling ignores temporal uncertainties
Need robust uncertainty propagation for dynamic scene accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines target and reference frame uncertainty
Uses probabilistic formulation for robust masking
Leverages vision transformers for multi-task learning
๐Ÿ”Ž Similar Papers
No similar papers found.