FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving

📅 2024-08-13
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
To address efficiency and accuracy bottlenecks in camera-LiDAR cross-modal fusion for autonomous driving, where image patches are dense in pixel space and carry ambiguous depth, this paper proposes a lightweight fusion framework based on sparse Transformers. The method unifies bidirectional geometric mapping (image-to-3D and point-cloud-to-2D) to eliminate redundant dense projection operations. It introduces three key sparse-fusion design choices: (1) adaptive neighborhood attention, (2) joint sparse feature encoding with modality-specific flattened tokenizers, and (3) a micro-architecture-optimized sparse Transformer. Together, these components establish an end-to-end sparse fusion paradigm. Evaluated on the nuScenes validation set, the framework achieves 73.7 NDS at 10.1 FPS (PyTorch implementation), outperforming state-of-the-art sparse Transformer-based methods including UniTR, CMT, and SparseFusion.
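As a concrete illustration of the point-cloud-to-2D half of the bidirectional mapping, here is a minimal sketch of projecting LiDAR points into the image plane with a pinhole camera model. The function name, tensor shapes, and validity mask below are illustrative assumptions, not FlatFusion's actual interface.

```python
import torch

def project_lidar_to_image(points, lidar2cam, intrinsics, hw):
    """Hypothetical sketch: points (N, 3) xyz in the LiDAR frame,
    lidar2cam (4, 4) extrinsics, intrinsics (3, 3) camera matrix,
    hw = (H, W) image size."""
    N = points.shape[0]
    homo = torch.cat([points, points.new_ones(N, 1)], dim=1)  # (N, 4) homogeneous
    cam = (lidar2cam @ homo.T).T[:, :3]                       # points in camera frame
    depth = cam[:, 2].clamp(min=1e-5)                         # avoid division by zero
    uv = (intrinsics @ cam.T).T                               # (N, 3) pinhole projection
    uv = uv[:, :2] / depth.unsqueeze(1)                       # perspective divide
    H, W = hw
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < H)                # in front of camera, inside image
    return uv, depth, valid
```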

📝 Abstract
The integration of data from diverse sensor modalities (e.g., camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, however, image patches are dense in pixel space and carry ambiguous depth, which necessitates additional design considerations for effective fusion. In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single-modal tokenizers, and the micro-structure of the Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set at 10.1 FPS with PyTorch.
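The "attention neighbor grouping" axis of the design space can be pictured as partitioning the flattened sparse tokens into equal-size groups of spatial neighbors so that local attention runs in parallel. Below is a minimal sketch under assumed details; the window size, group size, and sort key are hypothetical choices, not the paper's exact scheme:

```python
import torch

def group_sparse_tokens(coords, feats, window=8, group=64):
    """coords: (N, 2) integer BEV coordinates; feats: (N, C) token features."""
    win = coords // window                        # window each token falls into
    key = win[:, 0] * 100000 + win[:, 1]          # flatten window id (assumes < 100000 windows per axis)
    order = torch.argsort(key)
    feats, coords = feats[order], coords[order]   # spatial neighbors become contiguous
    pad = (-feats.shape[0]) % group               # pad so the token count divides the group size
    feats = torch.cat([feats, feats.new_zeros(pad, feats.shape[1])])
    return feats.view(-1, group, feats.shape[1])  # (num_groups, group, C) for local attention
```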
Problem

Research questions and friction points this paper is trying to address.

Effective sparse-format fusion of camera and LiDAR data for autonomous driving
Optimizing Transformer-based design choices for sparse multi-modal integration
Improving accuracy and inference speed over existing sparse fusion methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based sparse camera-LiDAR fusion framework (FlatFusion)
Image-to-3D and LiDAR-to-2D mapping strategies
Attention neighbor grouping and modality-specific tokenizers, combined in a sparse fusion block (a sketch follows this list)
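To make the combination of these pieces concrete, here is an illustrative sketch of one pre-norm Transformer block operating on the grouped joint sequence of camera and LiDAR tokens. The class name, layer sizes, and pre-norm micro-structure are assumptions made for clarity, not the paper's verified architecture.

```python
import torch
import torch.nn as nn

class SparseFusionBlock(nn.Module):
    """Hypothetical pre-norm attention + MLP block over grouped sparse tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):
        # tokens: (num_groups, group_size, dim), mixing camera and LiDAR tokens
        # that were tokenized separately and then grouped by shared 3D position.
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]  # local self-attention
        return tokens + self.mlp(self.norm2(tokens))                 # position-wise MLP
```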