🤖 AI Summary
To address cross-view depth inconsistency in self-supervised surround-view depth estimation, this paper proposes a geometry-guided cylindrical projection method: multi-view images are explicitly mapped onto a shared unit cylindrical surface to establish pixel-level spatial correspondences across views. A non-learnable spatial attention mechanism, driven by the camera intrinsic and extrinsic parameters, generates positional maps used for cross-image feature aggregation and explicit depth-consistency regularization. Depth prediction is jointly optimized within a self-supervised framework. The approach requires no additional annotations and improves both depth estimation accuracy and cross-view geometric consistency. On the DDAD and nuScenes benchmarks, it reduces depth RMSE by 8.2% and 6.7%, respectively, and decreases cross-view depth discrepancy by 32%, outperforming existing state-of-the-art methods.
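The non-learnable, geometry-driven attention described above can be sketched as a fixed softmax over pairwise distances between pixel positions on the shared cylinder. This is a minimal illustration, not the paper's implementation: the function name, the squared-Euclidean distance (which ignores azimuth wrap-around), and the temperature `tau` are all assumptions.

```python
import numpy as np

def geometric_attention(features, positions, tau=0.1):
    """Aggregate per-pixel features across views by cylinder distance.

    features:  (N, C) flattened features gathered from all views
    positions: (N, 2) (azimuth, height) of each pixel on the unit cylinder
    tau:       temperature; smaller values make aggregation more local

    The weights are a fixed function of geometry (softmax over negative
    squared distance on the cylinder), so nothing here is learned.
    Note: plain Euclidean distance is used for brevity; a faithful version
    would handle the azimuth wrap-around at +/- pi.
    """
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)  # (N, N)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # each row sums to 1
    return w @ features                           # (N, C) aggregated features
```

Because the weights depend only on calibration-derived positions, the aggregation adds no learnable parameters and can be precomputed once per rig.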
📝 Abstract
Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet most existing methods produce depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image, and the resulting 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, in which each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, yielding a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves both the cross-image consistency of depth estimates and the overall depth accuracy compared to state-of-the-art methods.
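The back-projection and cylinder-projection steps in the abstract can be sketched as follows. This is a minimal sketch under assumed conventions (world z as the cylinder axis, an (azimuth, height) parameterization, and the function names shown); the paper does not specify these details.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift a per-image depth map to 3D world points.

    depth:        (H, W) metric depth per pixel
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                       # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)                 # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]                # world coordinates

def cylinder_position_map(pts, shape):
    """Project world points onto a shared unit cylinder (axis = world z).

    Returns an (H, W, 2) position map assigning each pixel its
    (azimuth, height) on the cylinder. Points from all cameras share this
    parameterization, so nearby 3D points get nearby positions regardless
    of which image they came from.
    """
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    r = np.sqrt(x**2 + y**2) + 1e-8       # distance from the cylinder axis
    theta = np.arctan2(y, x)              # azimuth in (-pi, pi]
    h = z / r                             # height after rescaling to radius 1
    return np.stack([theta, h], axis=-1).reshape(*shape, 2)
```

Running both functions per camera yields the per-image position maps; pixels from different images that observe the same scene region land close together on the cylinder, which is what makes cross-image feature aggregation possible.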