MapFusion: A Novel BEV Feature Fusion Network for Multi-modal Map Construction

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak cross-modal interaction, misaligned features, and coarse-grained fusion in multi-modal (camera/LiDAR) bird’s-eye-view (BEV) map construction, this paper proposes two novel modules: the Cross-modal Interaction Transform (CIT) and Dual Dynamic Fusion (DDF). CIT employs self-attention to achieve fine-grained, pixel-level BEV feature alignment across modalities; DDF introduces a dynamic gating mechanism to adaptively weight and fuse modality-specific discriminative features. The overall architecture is lightweight, modular, and plug-and-play, supporting end-to-end multi-modal encoding-decoding. Evaluated on nuScenes, the method achieves absolute improvements of 3.6% in HD map construction accuracy and 6.2% in BEV semantic segmentation mIoU over prior state-of-the-art methods. These results demonstrate that deep, structured cross-modal interaction substantially enhances the representational capacity for high-definition mapping.
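Below is a minimal PyTorch sketch of the two ideas in this summary, assuming standard (B, C, H, W) BEV feature maps; the class names, token layout, and hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class CrossModalInteractionTransform(nn.Module):
    """CIT sketch: joint self-attention over camera and LiDAR BEV tokens."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads for multi-head attention
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        b, c, h, w = cam_bev.shape
        cam_tok = cam_bev.flatten(2).transpose(1, 2)      # (B, H*W, C)
        lidar_tok = lidar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = torch.cat([cam_tok, lidar_tok], dim=1)   # (B, 2*H*W, C)
        # Every camera BEV cell can attend to every LiDAR BEV cell and vice versa.
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)             # residual + norm
        cam_tok, lidar_tok = tokens.split(h * w, dim=1)
        cam_out = cam_tok.transpose(1, 2).reshape(b, c, h, w)
        lidar_out = lidar_tok.transpose(1, 2).reshape(b, c, h, w)
        return cam_out, lidar_out


class DualDynamicFusion(nn.Module):
    """DDF sketch: dynamic gate that adaptively weights the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        # Per-pixel, per-channel weights in [0, 1], predicted from both modalities.
        w = self.gate(torch.cat([cam_bev, lidar_bev], dim=1))
        fused = w * cam_bev + (1.0 - w) * lidar_bev       # adaptive modality mix
        return self.out_conv(fused)
```

In this reading, CIT lets the two BEV feature spaces interact through self-attention, while DDF predicts a gate that decides how much each modality contributes at every BEV location. Note that naive full self-attention over 2·H·W tokens is memory-heavy at typical BEV resolutions, so this is a conceptual sketch of the interaction rather than an efficiency-faithful reproduction.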

📝 Abstract
The map construction task plays a vital role in providing the precise and comprehensive static environmental information essential for autonomous driving systems. The primary sensors are cameras and LiDAR, with configurations ranging from camera-only and LiDAR-only to camera-LiDAR fusion, depending on cost-performance considerations. While fusion-based methods typically perform best, existing approaches often neglect modality interaction and rely on simple fusion strategies, which suffer from misalignment and information loss. To address these issues, we propose MapFusion, a novel multi-modal Bird's-Eye View (BEV) feature fusion method for map construction. Specifically, to solve the semantic misalignment problem between camera and LiDAR BEV features, we introduce the Cross-modal Interaction Transform (CIT) module, enabling interaction between the two BEV feature spaces and enhancing feature representation through a self-attention mechanism. Additionally, we propose an effective Dual Dynamic Fusion (DDF) module that adaptively selects valuable information from different modalities, taking full advantage of the complementary information across modalities. Moreover, MapFusion is designed to be simple and plug-and-play, easily integrated into existing pipelines. We evaluate MapFusion on two map construction tasks, High-definition (HD) map construction and BEV map segmentation, to show its versatility and effectiveness. Compared with the state-of-the-art methods, MapFusion achieves 3.6% and 6.2% absolute improvements on the HD map construction and BEV map segmentation tasks on the nuScenes dataset, respectively, demonstrating the superiority of our approach.
Problem

Research questions and friction points this paper is trying to address.

How to improve the accuracy of multi-modal (camera-LiDAR) map construction
How to resolve semantic misalignment between camera and LiDAR BEV features
How to adaptively fuse valuable information from different modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Interaction Transform module
Dual Dynamic Fusion module
Plug-and-play BEV feature fusion (see the integration sketch after this list)
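To illustrate the plug-and-play claim, the hypothetical wiring below drops the CIT/DDF sketches from the summary section between a pipeline's existing BEV encoders and its map decoder; camera_bev_encoder, lidar_bev_encoder, and map_decoder are placeholders for whatever modules a host pipeline already provides, not APIs from the paper's code.

```python
# Hypothetical plug-and-play wiring; reuses the CrossModalInteractionTransform
# and DualDynamicFusion sketches defined earlier in this page.
import torch.nn as nn


class MapFusionPipeline(nn.Module):
    """Illustrative end-to-end flow: encoders -> CIT -> DDF -> map decoder."""

    def __init__(self, camera_bev_encoder, lidar_bev_encoder, map_decoder, channels: int = 256):
        super().__init__()
        self.camera_bev_encoder = camera_bev_encoder  # images -> (B, C, H, W) BEV features
        self.lidar_bev_encoder = lidar_bev_encoder    # points -> (B, C, H, W) BEV features
        self.cit = CrossModalInteractionTransform(channels)
        self.ddf = DualDynamicFusion(channels)
        self.map_decoder = map_decoder                # fused BEV -> HD map / segmentation output

    def forward(self, images, points):
        cam_bev = self.camera_bev_encoder(images)
        lidar_bev = self.lidar_bev_encoder(points)
        cam_bev, lidar_bev = self.cit(cam_bev, lidar_bev)  # align the two BEV spaces
        fused_bev = self.ddf(cam_bev, lidar_bev)           # adaptively fuse the modalities
        return self.map_decoder(fused_bev)
```

Because the fusion block only consumes and produces BEV feature maps, both encoders and the decoder can stay untouched, which is what the plug-and-play claim amounts to.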
🔎 Similar Papers
No similar papers found.
👥 Authors

Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
Vision and Language

Yunfeng Diao
Assistant Professor, Hefei University of Technology
Adversarial Robustness, Computer Vision, AI Security

Mengchuan Wei
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Yifan Yang
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Peng Hao
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Rong Yin
Associate Researcher, Institute of Information Engineering, Chinese Academy of Sciences
LLM, Graph Representation Learning, Statistical Learning Theory

Hui Zhang
Samsung R&D Institute China–Beijing, No. 12, Taiyangong Middle Road, 100028, Beijing, China

Weiming Li
Principal Engineer, Samsung Electronics
Computer Vision, Augmented Reality, Computational Imaging and Display

Shu Zhao
Pennsylvania State University, State College, 16801, Pennsylvania, United States

Yu Liu
Department of Biomedical Engineering, Hefei University of Technology, Shushan District, 230009, Hefei, China