An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

career value

181K/year

🤖 AI Summary

Existing monocular 360° indoor depth estimation methods prioritize pixel-level accuracy, leading to over-smoothed corners, structural distortions, and high sensitivity to noise. To address these limitations, this paper proposes an end-to-end spherical depth estimation framework that explicitly incorporates room geometry priors. We design a background depth parsing strategy that explicitly models wall intersections and enforces right-angle constraints. A layout-guided background segmentation fusion mechanism is introduced, leveraging a shared encoder and multi-task decoder to jointly optimize layout estimation, semantic segmentation, and depth prediction. Furthermore, multi-scale feature extraction combined with segmentation-mask-weighted feature fusion enhances boundary precision and structural consistency. Extensive experiments on Stanford2D3D, Matterport3D, and Structured3D demonstrate significant improvements over state-of-the-art methods: absolute depth error is reduced by 12.3%, while corner structural integrity and depth map sharpness are substantially enhanced.

Technology Category

Application Category

📝 Abstract

Predicting spherical pixel depth from monocular $360^{circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, causing oversmoothed room corners and noise sensitivity. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates those information into the depth estimation process through background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed methods can achieve significantly superior performance than current open-source methods. Our code is available at https://github.com/emiyaning/RGCNet.

Problem

Research questions and friction points this paper is trying to address.

Estimating depth from monocular indoor panoramas using geometry constraints

Addressing oversmoothed corners and noise in existing depth estimation methods

Integrating layout prediction and background segmentation for improved depth accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses room layout prediction for geometry constraints

Integrates background segmentation into depth estimation

Employs shared encoder with task-specific decoders

🔎 Similar Papers

F3Loc: Fusion and Filtering for Floorplan Localization