AI Summary
Existing mobile stereo matching methods struggle to balance high accuracy with low computational cost: 3D convolutions are computationally expensive, while direct adoption of 2D convolutions often leads to edge blurring, detail loss, and erroneous matches in texture-poor regions. To address this, we propose BANet, a lightweight, purely 2D-convolution-based architecture. Our method introduces a novel scale-aware spatial attention module that enables separate yet collaborative aggregation of fine-grained details and smooth cost volumes. Additionally, we design a bilateral cost volume aggregation paradigm to overcome the matching limitations of 2D convolutions in weak-texture areas. As a result, BANet significantly improves edge sharpness and structural fidelity. On KITTI 2015, BANet-2D achieves a 35.3% accuracy gain over MobileStereoNet-2D while enabling faster inference. BANet-3D sets a new state-of-the-art accuracy among GPU-real-time stereo methods.
Abstract
State-of-the-art stereo matching methods typically rely on costly 3D convolutions to aggregate a full cost volume, and their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation, however, often results in edge blurring, detail loss, and mismatches in textureless regions. Complex operations such as deformable convolutions and iterative warping can partially alleviate this issue, but they are not mobile-friendly, which limits deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregation accordingly, and finally fuse both to obtain the final disparity map. Additionally, to accurately identify high-frequency detailed regions and low-frequency smooth or textureless regions, we propose a new scale-aware spatial attention module. Experimental results demonstrate that BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3% higher accuracy than MobileStereoNet-2D on the KITTI 2015 leaderboard, with faster runtime on mobile devices. The extended 3D version, BANet-3D, achieves the highest accuracy among all real-time methods on high-end GPUs. Code: https://github.com/gangweiX/BANet.
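To make the bilateral aggregation idea concrete, here is a minimal NumPy sketch of the abstract's pipeline: an attention map splits the cost volume into a detailed part and a smooth part, each is aggregated separately, and the two are fused before a winner-takes-all disparity readout. The function name, the attention input, and the box-filter "smooth aggregation" are illustrative stand-ins, not the paper's learned 2D-convolution networks.

```python
import numpy as np

def bilateral_aggregate(cost_volume, attention, smooth_kernel=5):
    """Illustrative sketch of bilateral cost-volume aggregation.

    cost_volume: (D, H, W) matching costs (lower = better match).
    attention:   (H, W) map in [0, 1]; high near edges/fine detail,
                 low in smooth or textureless regions.
    """
    # Split the full volume: detailed costs stay sharp, smooth costs
    # will be heavily aggregated to bridge textureless regions.
    detailed = cost_volume * attention
    smooth = cost_volume * (1.0 - attention)

    # Smooth aggregation: a large box filter propagates matching
    # evidence spatially (stand-in for learned 2D convolutions).
    k = smooth_kernel
    pad = k // 2
    padded = np.pad(smooth, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    agg = np.zeros_like(smooth)
    for dy in range(k):
        for dx in range(k):
            agg += padded[:, dy:dy + smooth.shape[1], dx:dx + smooth.shape[2]]
    agg /= k * k

    # Fuse both branches, then take the per-pixel best disparity.
    fused = detailed + agg
    return fused.argmin(axis=0)
```

In the actual BANet, the attention map comes from the proposed scale-aware spatial attention module and both aggregation branches are learned; this sketch only shows how the split-aggregate-fuse structure yields sharp costs near edges and smoothed costs in weak-texture areas.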