🤖 AI Summary
Radar hits are distributed non-uniformly over object surfaces, which makes it difficult to match radar point clouds to camera detections. To address this, the paper introduces what it describes as the first explicit physical-prior model of radar hit distributions. It proposes a radar response convolution mechanism: monocular detection outputs are used to predict class-adaptive conditional hit probability distributions, which then serve as spatially adaptive, learnable convolution kernels for localized radar point matching and confidence refinement. Unlike conventional end-to-end black-box fusion, the approach explicitly incorporates geometric and physical priors into the fusion pipeline. On the nuScenes dataset, the method achieves state-of-the-art radar-camera fusion performance for 3D object detection, with particularly notable improvements on small and distant objects.
📝 Abstract
Radar hits reflect from points both on the boundary of and inside object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods account for this only implicitly, with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at https://github.com/longyunf/riccardo.
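The second step, using a hit distribution as a kernel to score nearby positions, can be illustrated with a minimal sketch. This is not the authors' implementation: the bird's-eye-view grid, the function names `rasterize_points` and `matching_scores`, and the hand-written 3x3 kernel are all assumptions made for illustration. In the paper the distribution is predicted by a learned model conditioned on the monocular detection, and the scores are further refined by a fusion stage that is not shown here.

```python
import numpy as np

def rasterize_points(points_xy, center_xy, half_extent, cell):
    """Count radar hits per BEV cell in a square window around center_xy.

    Hypothetical helper: the grid layout (row = y, column = x) is an
    assumption of this sketch, not something specified in the abstract.
    """
    n = int(round(2 * half_extent / cell))
    grid = np.zeros((n, n), dtype=float)
    pts = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    rel = (pts - np.asarray(center_xy, dtype=float) + half_extent) / cell
    idx = np.floor(rel).astype(int)
    inside = np.all((idx >= 0) & (idx < n), axis=1)
    # np.add.at accumulates correctly when several points share a cell.
    np.add.at(grid, (idx[inside, 1], idx[inside, 0]), 1.0)
    return grid

def matching_scores(radar_grid, kernel):
    """Cross-correlate an odd-sized kernel with the radar grid.

    scores[i, j] is the kernel-weighted sum of radar hits when the
    kernel centre (the candidate object position) is placed at cell (i, j).
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(radar_grid, ((ph, ph), (pw, pw)))
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Stand-in for a predicted hit distribution (assumed values, not learned).
kernel = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.6, 0.1],
                   [0.0, 0.1, 0.0]])

# Two radar points; the second lies outside the 5 m x 5 m window and is ignored.
grid = rasterize_points([(0.1, 0.1), (10.0, 10.0)], center_xy=(0.0, 0.0),
                        half_extent=2.5, cell=1.0)
scores = matching_scores(grid, kernel)
best = np.unravel_index(scores.argmax(), scores.shape)
```

Each entry of `scores` measures how well the observed radar hits agree with the assumed hit distribution when the object is hypothesized at that cell, so `best` is the candidate position most consistent with the radar evidence in this toy setting.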