SFMNet: Sparse Focal Modulation for 3D Object Detection

📅 2025-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In 3D object detection, sparse convolutions struggle to capture long-range dependencies, while Transformers incur high computational overhead and are ill-suited to irregular, sparse voxel layouts. To address this, we propose the Sparse Focal Modulation (SFM) module, which jointly models short- and long-range context with linear complexity directly in the sparse domain. The resulting detector, SFMNet, combines a hierarchical sparse convolutional backbone, sparse feature encoding, and an end-to-end differentiable detection head. Evaluated on autonomous driving LiDAR benchmarks (e.g., nuScenes, Waymo), SFMNet achieves state-of-the-art detection accuracy, with significant mAP gains. It also improves inference speed by 32% and reduces memory consumption by 41% compared to prior sparse-transformer hybrids, which indicates that it scales to real-time, large-scale deployment.

📝 Abstract

We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. Traditional sparse convolutions capture local structures efficiently but struggle to model long-range relationships, which are fundamental for 3D object detection. Transformers, in contrast, capture long-range dependencies through attention, but they incur high computational costs because of their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well suited to large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.
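
The focal-modulation pattern the abstract refers to can be sketched in a few lines. This is a toy NumPy illustration of the general idea (a per-voxel query multiplied by a gated sum of multi-scale context), not the SFMNet implementation. The function names, the `params` weight dictionary, the `radii` values, the use of neighbor averaging in place of learned sparse convolutions, and ReLU as the nonlinearity are all assumptions made for illustration.

```python
import itertools
import numpy as np


def neighbor_mean(coords, feats, radius):
    """Average features over the active voxels inside a cubic window.

    Output is produced only at active voxels; empty space is never visited.
    This stands in for a learned sparse convolution (an assumption).
    """
    keys = [tuple(int(v) for v in c) for c in coords]
    index = {k: i for i, k in enumerate(keys)}
    offsets = list(itertools.product(range(-radius, radius + 1), repeat=3))
    out = np.zeros_like(feats)
    for i, (x, y, z) in enumerate(keys):
        hits = []
        for dx, dy, dz in offsets:
            j = index.get((x + dx, y + dy, z + dz))
            if j is not None:
                hits.append(j)
        out[i] = feats[hits].mean(axis=0)  # the voxel itself is always a hit
    return out


def sparse_focal_modulation(coords, feats, params, radii=(1, 2)):
    """Hypothetical sketch: query * projection(gated multi-level context)."""
    query = feats @ params["Wq"]    # per-voxel query, shape (N, C)
    context = feats @ params["Wz"]  # context stream, shape (N, C)
    gates = feats @ params["Wg"]    # one gate per level plus one global gate
    modulator = np.zeros_like(context)
    for level, radius in enumerate(radii):
        # Each level aggregates the previous level's output, so the
        # effective receptive field grows with depth.
        context = np.maximum(neighbor_mean(coords, context, radius), 0.0)
        modulator += gates[:, level:level + 1] * context
    # Scene-wide summary shared by all voxels (the long-range term).
    global_context = context.mean(axis=0, keepdims=True)
    modulator += gates[:, -1:] * global_context
    return query * (modulator @ params["Wh"])


rng = np.random.default_rng(0)
coords = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5], [9, 9, 9]])
feats = rng.normal(size=(5, 4))
params = {
    "Wq": rng.normal(size=(4, 4)),
    "Wz": rng.normal(size=(4, 4)),
    "Wg": rng.normal(size=(4, 3)),  # two local levels + one global gate
    "Wh": rng.normal(size=(4, 4)),
}
out = sparse_focal_modulation(coords, feats, params)
print(out.shape)
```

In this sketch each level visits at most `(2 * radius + 1) ** 3` offsets per active voxel, so for fixed radii the work grows in proportion to the number of active voxels and never involves a voxel-to-voxel attention matrix.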

Problem

Research questions and friction points this paper is trying to address.

Efficiently model long-range dependencies in 3D object detection.

Reduce computational costs of transformers in sparse 3D scenes.

Improve detection performance in large-scale LiDAR environments.
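
To make the cost contrast behind these points concrete, the snippet below counts interactions for full attention versus a fixed-window hierarchy. The voxel counts and window sizes are illustrative assumptions, not figures from the paper, and the window sizes are upper bounds, since a sparse scene has fewer active neighbors than a full cube.

```python
def attention_pairs(n_voxels):
    """Pairwise query-key interactions for full attention over n voxels."""
    return n_voxels * n_voxels


def local_hierarchy_ops(n_voxels, window_sizes):
    """Upper bound on neighbor visits when each level uses a fixed window."""
    return n_voxels * sum(window_sizes)


# Hypothetical setup: 3x3x3 and 5x5x5 windows at two levels.
windows = (27, 125)
for n in (10_000, 20_000):
    print(n, attention_pairs(n), local_hierarchy_ops(n, windows))
```

Doubling the number of voxels quadruples the attention count but only doubles the windowed count, which is the sense in which the second approach is linear.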

Innovation

Methods, ideas, or system contributions that make the work stand out.

SFMNet combines sparse convolutions with long-range dependency modeling.

Sparse Focal Modulation integrates short- and long-range contexts efficiently.

Hierarchical sparse convolution design reduces computational complexity.
