MapFusion: A Novel BEV Feature Fusion Network for Multi-modal Map Construction

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak cross-modal interaction, misaligned features, and coarse-grained fusion in multi-modal (camera/LiDAR) bird’s-eye-view (BEV) map construction, this paper proposes two novel modules: the Cross-modal Interaction Transform (CIT) and Dual Dynamic Fusion (DDF). CIT employs self-attention to achieve fine-grained, pixel-level BEV feature alignment across modalities; DDF introduces a dynamic gating mechanism to adaptively weight and fuse modality-specific discriminative features. The overall architecture is lightweight, modular, and plug-and-play, supporting end-to-end multi-modal encoding-decoding. Evaluated on nuScenes, the method achieves absolute improvements of 3.6% in HD map construction accuracy and 6.2% in BEV semantic segmentation mIoU over prior state-of-the-art methods. These results demonstrate that deep, structured cross-modal interaction substantially enhances the representational capacity for high-definition mapping.
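Below is a minimal PyTorch sketch of the two ideas in this summary, assuming standard (B, C, H, W) BEV feature maps; the class names, token layout, and hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class CrossModalInteractionTransform(nn.Module):
    """CIT sketch: joint self-attention over camera and LiDAR BEV tokens."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads for multi-head attention
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        b, c, h, w = cam_bev.shape
        cam_tok = cam_bev.flatten(2).transpose(1, 2)      # (B, H*W, C)
        lidar_tok = lidar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = torch.cat([cam_tok, lidar_tok], dim=1)   # (B, 2*H*W, C)
        # Every camera BEV cell can attend to every LiDAR BEV cell and vice versa.
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)             # residual + norm
        cam_tok, lidar_tok = tokens.split(h * w, dim=1)
        cam_out = cam_tok.transpose(1, 2).reshape(b, c, h, w)
        lidar_out = lidar_tok.transpose(1, 2).reshape(b, c, h, w)
        return cam_out, lidar_out


class DualDynamicFusion(nn.Module):
    """DDF sketch: dynamic gate that adaptively weights the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        # Per-pixel, per-channel weights in [0, 1], predicted from both modalities.
        w = self.gate(torch.cat([cam_bev, lidar_bev], dim=1))
        fused = w * cam_bev + (1.0 - w) * lidar_bev       # adaptive modality mix
        return self.out_conv(fused)
```

In this reading, CIT lets the two BEV feature spaces interact through self-attention, while DDF predicts a gate that decides how much each modality contributes at every BEV location. Note that naive full self-attention over 2·H·W tokens is memory-heavy at typical BEV resolutions, so this is a conceptual sketch of the interaction rather than an efficiency-faithful reproduction.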

📝 Abstract
The map construction task plays a vital role in providing the precise and comprehensive static environmental information essential for autonomous driving systems. The primary sensors are cameras and LiDAR, with configurations ranging from camera-only and LiDAR-only to camera-LiDAR fusion, depending on cost-performance considerations. While fusion-based methods typically perform best, existing approaches often neglect modality interaction and rely on simple fusion strategies, which suffer from misalignment and information loss. To address these issues, we propose MapFusion, a novel multi-modal Bird's-Eye View (BEV) feature fusion method for map construction. Specifically, to solve the semantic misalignment problem between camera and LiDAR BEV features, we introduce the Cross-modal Interaction Transform (CIT) module, enabling interaction between the two BEV feature spaces and enhancing feature representation through a self-attention mechanism. Additionally, we propose an effective Dual Dynamic Fusion (DDF) module that adaptively selects valuable information from different modalities, taking full advantage of the complementary information across modalities. Moreover, MapFusion is designed to be simple and plug-and-play, easily integrated into existing pipelines. We evaluate MapFusion on two map construction tasks, High-definition (HD) map construction and BEV map segmentation, to show its versatility and effectiveness. Compared with the state-of-the-art methods, MapFusion achieves 3.6% and 6.2% absolute improvements on the HD map construction and BEV map segmentation tasks on the nuScenes dataset, respectively, demonstrating the superiority of our approach.
Problem

Research questions and friction points this paper is trying to address.

How to improve the accuracy of multi-modal (camera-LiDAR) map construction
How to resolve semantic misalignment between camera and LiDAR BEV features
How to adaptively fuse valuable information from different modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Interaction Transform module
Dual Dynamic Fusion module
Plug-and-play BEV feature fusion (see the integration sketch after this list)
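To illustrate the plug-and-play claim, the hypothetical wiring below drops the CIT/DDF sketches from the summary section between a pipeline's existing BEV encoders and its map decoder; camera_bev_encoder, lidar_bev_encoder, and map_decoder are placeholders for whatever modules a host pipeline already provides, not APIs from the paper's code.

```python
# Hypothetical plug-and-play wiring; reuses the CrossModalInteractionTransform
# and DualDynamicFusion sketches defined earlier in this page.
import torch.nn as nn


class MapFusionPipeline(nn.Module):
    """Illustrative end-to-end flow: encoders -> CIT -> DDF -> map decoder."""

    def __init__(self, camera_bev_encoder, lidar_bev_encoder, map_decoder, channels: int = 256):
        super().__init__()
        self.camera_bev_encoder = camera_bev_encoder  # images -> (B, C, H, W) BEV features
        self.lidar_bev_encoder = lidar_bev_encoder    # points -> (B, C, H, W) BEV features
        self.cit = CrossModalInteractionTransform(channels)
        self.ddf = DualDynamicFusion(channels)
        self.map_decoder = map_decoder                # fused BEV -> HD map / segmentation output

    def forward(self, images, points):
        cam_bev = self.camera_bev_encoder(images)
        lidar_bev = self.lidar_bev_encoder(points)
        cam_bev, lidar_bev = self.cit(cam_bev, lidar_bev)  # align the two BEV spaces
        fused_bev = self.ddf(cam_bev, lidar_bev)           # adaptively fuse the modalities
        return self.map_decoder(fused_bev)
```

Because the fusion block only consumes and produces BEV feature maps, both encoders and the decoder can stay untouched, which is what the plug-and-play claim amounts to.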
🔎 Similar Papers
No similar papers found.
👥 Authors

Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
Vision and Language

Yunfeng Diao
Assistant Professor, Hefei University of Technology
Adversarial Robustness, Computer Vision, AI Security

Mengchuan Wei
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Yifan Yang
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Peng Hao
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Rong Yin
Associate Researcher, Institute of Information Engineering, Chinese Academy of Sciences
LLM, Graph Representation Learning, Statistical Learning Theory

Hui Zhang
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Weiming Li
Principal Engineer, Samsung Electronics
Computer Vision, Augmented Reality, Computational Imaging and Display

Shu Zhao
Pennsylvania State University, State College, 16801, Pennsylvania, United States

Yu Liu
Department of Biomedical Engineering, Hefei University of Technology, Shushan District, 230009, Hefei, China