RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion

📅 2025-09-22
🤖 AI Summary
Current radar-camera fusion 3D detectors underperform LiDAR-based methods and struggle to model uncertainties arising from moving objects and inherent modality-specific errors. To address these challenges, we propose RCTDistill—the first cross-modal knowledge distillation framework for radar-camera fusion that explicitly leverages temporal information. Our method introduces three novel distillation modules: Range-Azimuth Knowledge Distillation (RAKD) to mitigate radar range/azimuth biases; Temporal Knowledge Distillation (TKD) to align dynamic object trajectories across frames; and Region-Decoupled Knowledge Distillation (RDKD) to disentangle ambiguous cross-modal features. Key technical innovations include temporal BEV alignment, region-wise relational decoupling, and LiDAR-guided feature distillation. Evaluated on nuScenes and VoD benchmarks, RCTDistill achieves state-of-the-art 3D detection accuracy while operating at 26.2 FPS—the highest inference speed among all radar-camera 3D detectors to date.
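The "temporal BEV alignment" mentioned above can be illustrated with a minimal NumPy sketch, assuming a planar ego-motion model: a historical bird's-eye-view (BEV) feature map is resampled into the current ego frame before distillation. The function name, grid resolution, and nearest-neighbor resampling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def warp_bev(feat, dx, dy, yaw, cell=0.5):
    # Illustrative temporal BEV alignment (not the paper's code):
    # resample a historical BEV feature map of shape (C, H, W) into
    # the current ego frame given a planar ego motion
    # (dx, dy in meters, yaw in radians).
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Grid cell centers in metric coordinates, origin at map center.
    mx = (xs - W / 2 + 0.5) * cell
    my = (ys - H / 2 + 0.5) * cell
    # Inverse transform: where each current cell was in the past frame.
    c, s = np.cos(-yaw), np.sin(-yaw)
    px = c * (mx - dx) - s * (my - dy)
    py = s * (mx - dx) + c * (my - dy)
    # Nearest-neighbor lookup; cells that fall outside the map become zero.
    ix = np.round(px / cell + W / 2 - 0.5).astype(int)
    iy = np.round(py / cell + H / 2 - 0.5).astype(int)
    valid = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
    out = np.zeros_like(feat)
    out[:, ys[valid], xs[valid]] = feat[:, iy[valid], ix[valid]]
    return out
```

With zero motion the map is returned unchanged; a translation of one cell width shifts the features by one column, with vacated cells zero-filled.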

📝 Abstract
Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with current LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to differentiate foreground and background features. RCTDistill achieves state-of-the-art radar-camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.
Problem

Research questions and friction points this paper is trying to address.

Closing the performance gap between radar-camera 3D object detectors and LiDAR-based methods
Modeling uncertainties from object motion and modality-specific sensor errors during fusion
Improving temporal alignment and feature discrimination in cross-modal knowledge distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Range-Azimuth Knowledge Distillation (RAKD) to refine BEV features degraded by radar range/azimuth errors
Temporal Knowledge Distillation (TKD) to align historical BEV features of dynamic objects with the current frame
Region-Decoupled Knowledge Distillation (RDKD) to sharpen foreground/background feature discrimination
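The distillation objectives listed above can be sketched in a few lines of NumPy, under stated assumptions: RAKD- and TKD-style losses are shown here as plain feature imitation (MSE) between student radar-camera and teacher LiDAR BEV features, and the RDKD-style loss as relational distillation computed separately over foreground and background cells. The function names, the cosine-similarity relation, and the hard foreground mask are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def feature_mse(student, teacher):
    # Feature-imitation loss: mean squared error between student
    # (radar-camera) and teacher (LiDAR) BEV feature maps of shape (C, H, W).
    # For a TKD-style loss, `student` would be a history frame warped
    # into the current ego frame first.
    return float(np.mean((student - teacher) ** 2))

def relational_loss(student, teacher, fg_mask):
    # Region-decoupled relational distillation (illustrative): match the
    # pairwise cosine-similarity structure of BEV cell features, computed
    # separately for foreground and background regions so the two are
    # not conflated. fg_mask has shape (H, W), dtype bool.
    def pairwise_cos(x):
        x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        return x @ x.T

    loss = 0.0
    for mask in (fg_mask, ~fg_mask):
        # Gather per-cell feature vectors (N_cells, C) for this region.
        s = student.reshape(student.shape[0], -1).T[mask.ravel()]
        t = teacher.reshape(teacher.shape[0], -1).T[mask.ravel()]
        if len(s) > 0:
            loss += float(np.mean((pairwise_cos(s) - pairwise_cos(t)) ** 2))
    return loss
```

In a full training loop the three terms would be weighted and summed with the detection loss; both sketches reduce to zero when student and teacher features coincide.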