AI Summary
Monocular camera-based 3D object detection suffers from depth ambiguity and degrades under adverse environmental conditions. While radar is robust to such conditions, its sparse, low-resolution point clouds hinder its direct use in detection tasks. To address this, the authors propose InstaRadar, an instance segmentation-guided radar point densification method that uses semantic masks to increase radar point density and align radar data with image semantics. They further integrate a pre-trained RCDPT module into the BEVDepth framework, replacing its original depth estimation component and enabling, for the first time, joint optimization with explicit radar-guided depth supervision. Experiments show that InstaRadar achieves state-of-the-art performance in radar-guided depth estimation and significantly improves 3D detection accuracy within the BEVDepth pipeline, validating the efficacy of radar-informed depth estimation.
Abstract
Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar offers complementary advantages, including resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates effective radar-camera fusion with improved preprocessing and depth estimation strategies. We propose an end-to-end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation-guided expansion method that leverages pre-trained segmentation masks to increase radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state-of-the-art results in radar-guided depth estimation, demonstrating its effectiveness in generating high-quality depth features. Second, we integrate the pre-trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar-enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Together, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the value of explicit depth supervision in 3D object detection. The framework still trails radar-camera fusion models that extract BEV features directly from radar, since radar here serves only as depth guidance rather than as an independent feature stream; this gap leaves clear room for improvement. Future work will extend InstaRadar to point cloud-like representations and integrate a dedicated radar branch with temporal cues for enhanced BEV fusion.
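The core idea of instance segmentation-guided expansion can be sketched as follows: project radar returns into the image, then spread each return's depth across the instance mask it lands in, so that sparse radar evidence becomes a denser, semantically aligned depth map. This is a minimal illustration of the concept, not the paper's implementation; the function name, the median-depth-per-instance rule, and the NumPy-only representation are all assumptions.

```python
import numpy as np

def densify_radar_with_masks(radar_uv, radar_depth, instance_masks, img_shape):
    """Spread projected radar depths across instance masks (illustrative sketch).

    radar_uv:       (N, 2) integer (u, v) pixel coordinates of projected radar points
    radar_depth:    (N,) metric depths of those points
    instance_masks: list of (H, W) boolean masks from a pre-trained segmenter
    Returns a dense (H, W) depth map, 0 where no radar evidence exists.
    """
    H, W = img_shape
    dense = np.zeros((H, W), dtype=np.float32)
    for mask in instance_masks:
        # Radar points whose projection lands inside this instance mask
        inside = mask[radar_uv[:, 1], radar_uv[:, 0]]
        if not inside.any():
            continue
        # Assumption: assign the instance the median depth of its radar hits
        dense[mask] = np.median(radar_depth[inside])
    # Preserve sparse returns that fall outside every mask
    unassigned = dense[radar_uv[:, 1], radar_uv[:, 0]] == 0
    dense[radar_uv[unassigned, 1], radar_uv[unassigned, 0]] = radar_depth[unassigned]
    return dense
```

In a full pipeline, the resulting dense map would serve as the radar-guided depth input to a module such as RCDPT rather than being used directly for detection.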