🤖 AI Summary
To address the limited robustness of 3D perception for autonomous driving under adverse weather and low-texture conditions, this paper proposes the first omnidirectional perception framework that integrates multi-view cameras with 4D imaging radar, unifying 3D object detection and semantic occupancy prediction. Methodologically, we introduce a coarse voxel query generator, a dual-branch temporal encoder, and a cross-modal BEV–voxel fusion module. These components combine geometry-guided voxel initialization, Transformer-based refinement, parallel spatio-temporal modeling, and attention-driven feature alignment, and are jointly optimized via multi-task learning. Our approach achieves state-of-the-art performance on OmniHD-Scenes, VoD, and TJ4DRadSet, significantly improving both the accuracy and the robustness of 3D perception in challenging environments.
📝 Abstract
3D object detection and occupancy prediction are critical tasks in autonomous driving and have attracted significant attention. Although recent vision-based methods show promise, they struggle under adverse conditions. Integrating cameras with next-generation 4D imaging radar to achieve unified multi-task perception is therefore highly valuable, yet research in this domain remains limited. In this paper, we propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction, enabling comprehensive environmental perception. Specifically, we introduce a novel Coarse Voxel Queries Generator that integrates geometric priors from 4D radar with semantic features from images to initialize voxel queries, establishing a robust foundation for subsequent Transformer-based refinement. To leverage temporal information, we design a Dual-Branch Temporal Encoder that processes multi-modal temporal features in parallel across BEV and voxel spaces, enabling comprehensive spatio-temporal representation learning. Furthermore, we propose a Cross-Modal BEV-Voxel Fusion module that adaptively fuses complementary features through attention mechanisms while employing auxiliary tasks to enhance feature quality. Extensive experiments on the OmniHD-Scenes, View-of-Delft (VoD), and TJ4DRadSet datasets demonstrate that Doracamom achieves state-of-the-art performance on both tasks, establishing new benchmarks for multi-modal 3D perception. Code and models will be publicly available.
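To illustrate the idea behind attention-driven adaptive fusion of complementary modalities, the following is a minimal, self-contained sketch. It is not the paper's implementation: the per-voxel feature vectors, the scalar attention logits, and the function names are all hypothetical; a real model would predict the attention logits from the features themselves and operate on dense tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(cam_feat, radar_feat, cam_logit, radar_logit):
    """Adaptively blend camera and radar features for one voxel.

    cam_feat, radar_feat: per-voxel feature vectors (lists of floats).
    cam_logit, radar_logit: hypothetical attention logits; in practice
    these would be produced by a learned scoring network.
    """
    w_cam, w_radar = softmax([cam_logit, radar_logit])
    return [w_cam * c + w_radar * r for c, r in zip(cam_feat, radar_feat)]

# Example: a higher radar logit (e.g., camera degraded by fog) shifts the
# fused feature toward the radar branch.
fused = fuse_features([1.0, 0.0], [0.0, 1.0], cam_logit=0.0, radar_logit=2.0)
```

Because the fusion weights come from a softmax, they always sum to one, so the fused feature stays a convex combination of the two modality features regardless of how confident either branch is.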