🤖 AI Summary
To address the limited robustness of 3D perception for autonomous driving under adverse weather and low-texture conditions, this paper proposes the first omnidirectional perception framework that integrates multi-view cameras with 4D imaging radar, unifying 3D object detection and semantic occupancy prediction. Methodologically, we introduce a coarse voxel query generator, a dual-branch temporal encoder, and a cross-modal BEV–voxel fusion module. These components combine geometry-guided voxel initialization, Transformer-based refinement, parallel spatio-temporal modeling, and attention-driven feature alignment, and are jointly optimized via multi-task learning. Our approach achieves state-of-the-art performance on OmniHD-Scenes, VoD, and TJ4DRadSet, significantly improving both the accuracy and the robustness of 3D perception in challenging environments.
📝 Abstract
3D object detection and occupancy prediction are critical tasks in autonomous driving and have attracted significant attention. Although recent vision-based methods show promise, they struggle under adverse conditions. Integrating cameras with next-generation 4D imaging radar to achieve unified multi-task perception is therefore highly valuable, yet research in this domain remains limited. In this paper, we propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction, enabling comprehensive environmental perception. Specifically, we introduce a novel Coarse Voxel Queries Generator that integrates geometric priors from 4D radar with semantic features from images to initialize voxel queries, establishing a robust foundation for subsequent Transformer-based refinement. To leverage temporal information, we design a Dual-Branch Temporal Encoder that processes multi-modal temporal features in parallel across BEV and voxel spaces, enabling comprehensive spatio-temporal representation learning. Furthermore, we propose a Cross-Modal BEV-Voxel Fusion module that adaptively fuses complementary features through attention mechanisms while employing auxiliary tasks to enhance feature quality. Extensive experiments on the OmniHD-Scenes, View-of-Delft (VoD), and TJ4DRadSet datasets demonstrate that Doracamom achieves state-of-the-art performance on both tasks, establishing new benchmarks for multi-modal 3D perception. Code and models will be publicly available.
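To illustrate the idea behind attention-driven adaptive fusion of complementary modalities, the following is a minimal, self-contained sketch. It is not the paper's implementation: the per-voxel feature vectors, the scalar attention logits, and the function names are all hypothetical; a real model would predict the attention logits from the features themselves and operate on dense tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(cam_feat, radar_feat, cam_logit, radar_logit):
    """Adaptively blend camera and radar features for one voxel.

    cam_feat, radar_feat: per-voxel feature vectors (lists of floats).
    cam_logit, radar_logit: hypothetical attention logits; in practice
    these would be produced by a learned scoring network.
    """
    w_cam, w_radar = softmax([cam_logit, radar_logit])
    return [w_cam * c + w_radar * r for c, r in zip(cam_feat, radar_feat)]

# Example: a higher radar logit (e.g., camera degraded by fog) shifts the
# fused feature toward the radar branch.
fused = fuse_features([1.0, 0.0], [0.0, 1.0], cam_logit=0.0, radar_logit=2.0)
```

Because the fusion weights come from a softmax, they always sum to one, so the fused feature stays a convex combination of the two modality features regardless of how confident either branch is.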