π€ AI Summary
In 3D object detection, sparse convolutions struggle to capture long-range dependencies, while Transformers incur prohibitive computational overhead and are ill-suited for irregular, sparse voxel layouts. To address this, we propose the Sparse Focal Modulation (SFM) moduleβa novel architecture enabling linear-complexity joint modeling of both short- and long-range contextual information directly in the sparse domain. SFM integrates a hierarchical sparse convolutional backbone, sparse feature encoding, and an end-to-end differentiable detection head, jointly optimized for efficiency and expressiveness. Evaluated on mainstream autonomous driving LiDAR benchmarks (e.g., nuScenes, Waymo), SFM achieves state-of-the-art detection accuracy, with significant mAP gains. Moreover, it improves inference speed by 32% and reduces memory consumption by 41% compared to prior sparse-transformer hybrids, demonstrating strong scalability for real-time, large-scale deployment.
π Abstract
We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. While traditional sparse convolution techniques efficiently capture local structures, they struggle with modeling long-range relationships. However, capturing long-range dependencies is fundamental for 3D object detection. In contrast, transformers are designed to capture these long-range dependencies through attention mechanisms. But, they come with high computational costs, due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.