🤖 AI Summary
Weak reflectance discrimination and limited feature representation in LiDAR point clouds constrain 3D object detection performance. To address this, we introduce depth priors, generated by the vision foundation model DepthAnything, into LiDAR point feature enhancement for the first time. We propose a point-level depth prior fusion module that embeds the predicted depth as an auxiliary attribute of each raw point. We further design a voxel-point dual-path RoI feature extraction network with a bidirectional gated fusion mechanism to jointly model global semantics and local geometric structure. On the KITTI dataset, our method achieves significant improvements in 3D detection accuracy, notably increasing the Car class AP₄₀ by 2.1%, demonstrating the effectiveness and generalization potential of cross-modal geometric prior transfer for LiDAR-based perception.
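The point-level depth prior fusion described above can be pictured as projecting each LiDAR point into the image, sampling the monocular depth map there, and appending that value as a fifth point attribute. The sketch below illustrates this idea under assumed conventions (nearest-pixel sampling, a `K` intrinsics matrix and a `T_cam_from_lidar` extrinsics matrix); the paper's actual module may differ in projection and sampling details.

```python
import numpy as np

def fuse_depth_prior(points, depth_map, K, T_cam_from_lidar):
    """Append a monocular depth prior as an extra per-point attribute.

    points:          (N, 4) array of [x, y, z, reflectance] in the LiDAR frame.
    depth_map:       (H, W) dense depth predicted from the paired RGB image
                     (e.g. by DepthAnything).
    K:               (3, 3) camera intrinsics.
    T_cam_from_lidar:(4, 4) LiDAR-to-camera extrinsics.

    Returns an (M, 5) array [x, y, z, reflectance, depth_prior] containing
    only the points that project inside the image.
    """
    n = points.shape[0]
    xyz1 = np.hstack([points[:, :3], np.ones((n, 1))])   # homogeneous coords
    cam = (T_cam_from_lidar @ xyz1.T).T[:, :3]           # points in camera frame
    in_front = cam[:, 2] > 0                             # keep points ahead of camera
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                        # perspective divide -> pixels
    h, w = depth_map.shape
    u = np.round(uv[:, 0]).astype(int)                   # nearest-pixel sampling
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    prior = depth_map[v[valid], u[valid]][:, None]       # sampled depth prior
    return np.hstack([points[valid], prior])             # enriched 5-channel points
```

In a detection pipeline, the returned 5-channel points would simply replace the raw 4-channel input of the point feature extractor.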
📝 Abstract
Recent advances in foundation models have opened up new possibilities for enhancing 3D perception. In particular, DepthAnything provides dense and reliable geometric priors from monocular RGB images, which can complement sparse LiDAR data in autonomous driving scenarios. However, such priors remain underutilized in LiDAR-based 3D object detection. In this paper, we address the limited expressiveness of raw LiDAR point features, particularly the weak discriminative capability of the reflectance attribute, by introducing depth priors predicted by DepthAnything. These priors are fused with the original LiDAR attributes to enrich each point's representation. To exploit the enhanced point features, we propose a point-wise feature extraction module, followed by a dual-path RoI feature extraction framework comprising a voxel-based branch for global semantic context and a point-based branch for fine-grained structural details. To integrate these complementary RoI features effectively, we introduce a bidirectional gated RoI feature fusion module that balances global and local cues. Extensive experiments on the KITTI benchmark show that our method consistently improves detection accuracy, demonstrating the value of incorporating visual foundation model priors into LiDAR-based 3D object detection.
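The bidirectional gated fusion of the two RoI branches can be sketched as each branch learning a sigmoid gate that controls how much of the *other* branch's features it absorbs. The gating form, residual combination, and weight shapes below are our assumptions for illustration; the abstract does not specify the exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_roi_fusion(f_voxel, f_point, W_v, W_p):
    """Bidirectional gated fusion of voxel- and point-branch RoI features.

    f_voxel, f_point: (C,) RoI feature vectors from the two branches.
    W_v, W_p:         (C, 2C) learned gate weights (hypothetical shapes).

    Each gate is predicted from the concatenated features, so both branches
    jointly decide how global (voxel) and local (point) cues are exchanged.
    """
    cat = np.concatenate([f_voxel, f_point])     # (2C,) joint descriptor
    g_v = sigmoid(W_v @ cat)                     # gate on point -> voxel flow
    g_p = sigmoid(W_p @ cat)                     # gate on voxel -> point flow
    f_v_new = f_voxel + g_v * f_point            # inject local cues into voxel path
    f_p_new = f_point + g_p * f_voxel            # inject global cues into point path
    return np.concatenate([f_v_new, f_p_new])    # fused RoI descriptor for the head
```

With zero-initialized gate weights, both gates start at 0.5, so each branch initially receives an even blend of the other's features before training adjusts the balance.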