🤖 AI Summary
Existing omnidirectional depth estimation methods rely on reference views or heuristic fusion strategies, which struggle to effectively model multi-view geometric relationships and exhibit limited robustness to occlusions, partial overlaps, and baseline variations. This work proposes FreeOmniMVS, a reference-free omnidirectional multi-view stereo framework that explicitly models correlation volumes across all camera pairs via a View-pair Correlation Transformer (VCT). By integrating a lightweight attention mechanism to adaptively fuse correlation vectors, the method achieves globally consistent and visibility-aware depth estimation. All views participate equally in the matching process, eliminating reliance on a designated reference view. FreeOmniMVS significantly outperforms existing approaches across multiple benchmarks, demonstrating superior robustness and adaptability to varying scales and challenging viewing configurations.
📝 Abstract
Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost volumes or perform reference-centric stereo matching on rectified views. However, these methods do not explicitly exploit the geometric relationships between multiple views, leaving them less capable of capturing global dependencies, reasoning about visibility, or handling scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, built on multi-view consistency maximization. The highlight of FreeOmniMVS is that it aggregates pair-wise correlations into a robust, visibility-aware, and global consensus, making it tolerant to occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize a scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.
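The core mechanism the abstract describes, correlating every view pair and then attention-weighting the pairwise correlation vectors into a single reference-free consensus, can be sketched in a few lines of NumPy. This is a minimal illustrative toy, not the paper's actual VCT: the per-pixel feature shapes, the dot-product correlation, and the `score_w` scoring weights are all assumptions introduced here for clarity (a real implementation would use learned transformer layers over full correlation volumes).

```python
import numpy as np
from itertools import combinations

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_pairwise_correlations(feats, score_w):
    """Toy reference-free fusion for one pixel.

    feats:   (V, D, C) per-view features sampled at D depth hypotheses.
    score_w: (D,) hypothetical scoring weights (stand-in for a learned scorer).
    Returns a fused (D,) correlation vector over the depth hypotheses.
    """
    V = feats.shape[0]
    pairs = list(combinations(range(V), 2))
    # One correlation vector per unordered view pair:
    # feature dot product at each depth hypothesis, normalized by channels.
    corr = np.stack([(feats[i] * feats[j]).sum(-1) / feats.shape[-1]
                     for i, j in pairs])          # (P, D)
    # Lightweight attention over pairs: score, softmax, weighted sum.
    # Unreliable pairs (e.g. occluded views) would receive low weight.
    weights = softmax(corr @ score_w)             # (P,)
    return weights @ corr                         # (D,)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 16))   # 4 views, 8 hypotheses, 16 channels
score_w = rng.standard_normal(8)
fused = fuse_pairwise_correlations(feats, score_w)
```

Because every view enters only through the symmetric set of pairs, the fused vector is invariant to the ordering of the cameras, which is the sense in which no view is designated as the reference.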