🤖 AI Summary
To address the degradation of camera-radar fusion performance caused by image depth ambiguity in back-projection-based BEV transformation, this paper proposes CRAB, a novel framework that (1) explicitly constrains image depth distributions using high-precision sparse depth priors from radar, thereby mitigating depth ambiguity during inverse projection; and (2) introduces a radar-context-enhanced cross-attention mechanism that achieves fine-grained alignment and fusion of image features with radar occupancy information directly in BEV space. CRAB integrates inverse projection, view-specific feature aggregation, and spatially adaptive radar fusion into a single end-to-end trainable architecture for high-fidelity BEV representation learning. Evaluated on nuScenes, CRAB achieves 62.4% NDS and 54.0% mAP, setting a new state of the art among back-projection-based camera-radar fusion methods for 3D detection and semantic segmentation.
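The core idea above, using sparse but precise radar returns to constrain a dense but unreliable image depth distribution, can be sketched as follows. This is an illustrative toy in NumPy, not the paper's implementation: the function name `radar_constrained_depth`, the reweighting scheme, and the `alpha` knob are all assumptions for the sake of the example.

```python
import numpy as np

def radar_constrained_depth(img_depth_logits, radar_occupancy, alpha=1.0):
    """Sharpen a per-pixel image depth distribution with sparse radar evidence.

    img_depth_logits : (D,) unnormalized depth-bin scores from the image branch
    radar_occupancy  : (D,) occupancy over the same depth bins
                       (1 near a radar return, 0 elsewhere)
    alpha            : strength of the radar prior (hypothetical knob)
    """
    # Dense but unreliable depth distribution from the image (softmax).
    p_img = np.exp(img_depth_logits - img_depth_logits.max())
    p_img /= p_img.sum()
    # Upweight bins supported by radar; the additive 1.0 keeps bins without
    # radar evidence alive, since radar is sparse rather than exhaustive.
    w = 1.0 + alpha * radar_occupancy
    p = p_img * w
    return p / p.sum()

# Toy example: the image depth is ambiguous between bins 3 and 7;
# a radar return near bin 7 resolves the ambiguity.
logits = np.zeros(10)
logits[3] = logits[7] = 2.0
occ = np.zeros(10)
occ[7] = 1.0
p = radar_constrained_depth(logits, occ, alpha=4.0)
print(p.argmax())  # bin 7 wins after the radar constraint
```

In the toy example, the two candidate bins start with equal image probability; the radar hit breaks the tie, which is exactly the kind of depth disambiguation along a camera ray the summary describes.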
📄 Abstract
Recently, camera-radar fusion-based 3D object detection methods in bird's eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. To address these limitations, we propose CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), a novel camera-radar fusion-based 3D object detection and segmentation model that uses backward projection and leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective-view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance comprehension of the 3D scene. On the nuScenes dataset, our approach achieves state-of-the-art performance among backward projection-based camera-radar fusion methods, with 62.4% NDS and 54.0% mAP in 3D object detection.
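The spatial cross-attention with a radar-context feature map can be pictured as a BEV query attending jointly over sampled image features and a radar feature at the same BEV cell. The sketch below is a minimal single-query, single-head version in NumPy; the function name, the plain dot-product attention, and the idea of appending the radar feature as an extra key/value are assumptions for illustration (the actual model likely uses multi-head deformable attention over multiple cameras).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def radar_context_cross_attention(bev_query, img_feats, radar_feat):
    """One BEV query attends over image features plus a radar context feature.

    bev_query  : (C,)   embedding of one BEV query
    img_feats  : (K, C) image features sampled at the query's projected locations
    radar_feat : (C,)   radar context feature at the same BEV cell
    """
    # Append the radar context feature as an additional key/value, so the
    # query can draw on radar occupancy cues alongside image appearance.
    keys = np.vstack([img_feats, radar_feat[None, :]])
    attn = softmax(keys @ bev_query / np.sqrt(len(bev_query)))
    return attn @ keys  # attention-weighted aggregation into the BEV query

rng = np.random.default_rng(0)
C, K = 8, 4
out = radar_context_cross_attention(
    rng.normal(size=C), rng.normal(size=(K, C)), rng.normal(size=C)
)
print(out.shape)  # (8,)
```

Treating the radar feature as one more attended-to token is only one way to realize "spatial cross-attention with a radar context feature map"; the point of the sketch is that the BEV query fuses both modalities in a single attention step.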