M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

📅 2025-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address task isolation and optimization conflicts in autonomous driving multi-task perception, this paper proposes an end-to-end unified framework jointly modeling 3D object detection, semantic segmentation, and voxel occupancy prediction. Methodologically, it introduces two novel components: (i) Modality-Adaptive Feature Integration (MAFI) for robust cross-modal fusion, and (ii) Task-oriented Channel Scaling (TCS) for dynamic feature prioritization. It further designs task-specific query initialization and a shared BEV-based decoder leveraging query-token interactions, compatible with both Transformer and Mamba architectures. Evaluated on the nuScenes benchmark, the approach achieves state-of-the-art multi-task performance: all constituent tasks surpass their respective single-task baselines in accuracy, while inference latency is significantly reduced and cross-task perceptual consistency is markedly improved.
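The summary above describes task-specific queries interacting with shared BEV features through query-token interactions. As a rough illustration only, the PyTorch sketch below shows what one such shared decoder layer could look like; the paper's actual layer design, dimensions, and its Mamba variant are not given here, so all names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class SharedDecoderLayerSketch(nn.Module):
    """Hypothetical sketch of one shared decoder layer (not the paper's exact design).

    Task-specific queries (detection/segmentation or occupancy) cross-attend
    to flattened BEV tokens; the layer weights are shared across tasks.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, bev_tokens: torch.Tensor) -> torch.Tensor:
        # queries: (B, Nq, C) task-specific queries; bev_tokens: (B, H*W, C) flattened BEV map.
        attn_out, _ = self.cross_attn(queries, bev_tokens, bev_tokens)
        queries = self.norm1(queries + attn_out)
        queries = self.norm2(queries + self.ffn(queries))
        return queries
```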

📝 Abstract
The perception system for autonomous driving generally needs to handle multiple diverse sub-tasks. However, current algorithms typically tackle individual sub-tasks separately, which leads to low efficiency when aiming to obtain full-perception results. Some multi-task learning methods try to unify multiple tasks with one model, but do not resolve the conflicts in multi-task learning. In this paper, we introduce M3Net, a novel multimodal and multi-task network that simultaneously tackles detection, segmentation, and 3D occupancy prediction for autonomous driving and achieves superior performance to single-task models. M3Net takes multimodal data as input and handles multiple tasks via query-token interactions. To enhance the integration of multi-modal features for multi-task learning, we first propose the Modality-Adaptive Feature Integration (MAFI) module, which enables each single-modality feature to predict channel-wise attention weights for its high-performing tasks. Based on the integrated features, we then develop task-specific query initialization strategies to accommodate the needs of detection/segmentation and 3D occupancy prediction. Leveraging the properly initialized queries, a shared decoder transforms queries and BEV features layer by layer, facilitating multi-task learning. Furthermore, we propose a Task-oriented Channel Scaling (TCS) module in the decoder to mitigate conflicts between optimizing for different tasks. Additionally, our proposed multi-task querying and TCS module support both Transformer-based and Mamba-based decoders, demonstrating their flexibility across architectures. M3Net achieves state-of-the-art multi-task learning performance on the nuScenes benchmark.
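As a hedged illustration of how the MAFI idea might be realized, the sketch below lets each single-modality BEV feature predict channel-wise attention weights that gate the fused feature, in the spirit of squeeze-and-excitation; the paper's exact layer composition is not given here, so the structure and names below are assumptions.

```python
import torch
import torch.nn as nn

def channel_gate(channels: int) -> nn.Sequential:
    # Squeeze-and-excitation-style branch: global pool -> 1x1 conv -> sigmoid.
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Conv2d(channels, channels, kernel_size=1),
        nn.Sigmoid(),
    )

class MAFISketch(nn.Module):
    """Hypothetical sketch of Modality-Adaptive Feature Integration (MAFI)."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.lidar_gate = channel_gate(channels)   # weights predicted from LiDAR features
        self.camera_gate = channel_gate(channels)  # weights predicted from camera features

    def forward(self, lidar_bev: torch.Tensor, camera_bev: torch.Tensor) -> torch.Tensor:
        # lidar_bev, camera_bev: (B, C, H, W) BEV feature maps of the two modalities.
        fused = self.fuse(torch.cat([lidar_bev, camera_bev], dim=1))
        # Each modality re-weights the fused channels for the tasks it is strong at.
        return fused * self.lidar_gate(lidar_bev) + fused * self.camera_gate(camera_bev)
```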
Problem

Research questions and friction points this paper is trying to address.

Handling multiple diverse perception tasks in autonomous driving efficiently
Resolving conflicts in multi-task learning for 3D detection and segmentation
Integrating multimodal data for superior multi-task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal multi-task network for autonomous driving
Modality-Adaptive Feature Integration module
Task-oriented Channel Scaling module (see the sketch after this list)
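As flagged in the list above, here is a minimal sketch of what Task-oriented Channel Scaling could reduce to: one learnable per-channel scale vector per task, applied to the shared decoder features so that each task can emphasize different channels. The class and variable names are mine, not the paper's.

```python
import torch
import torch.nn as nn

class TCSSketch(nn.Module):
    """Hypothetical sketch of Task-oriented Channel Scaling (TCS)."""

    def __init__(self, channels: int, num_tasks: int):
        super().__init__()
        # Start at ones so every task initially sees the unscaled shared features.
        self.scales = nn.Parameter(torch.ones(num_tasks, channels))

    def forward(self, features: torch.Tensor, task_id: int) -> torch.Tensor:
        # features: (B, N, C) query features inside a shared decoder layer.
        return features * self.scales[task_id]
```

Because the scaling is a per-task element-wise operation on channels, the same module would slot into either a Transformer-based or a Mamba-based decoder, which matches the flexibility claim in the abstract.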
👥 Authors
Xuesong Chen (MMLab, The Chinese University of Hong Kong)
Shaoshuai Shi (Didi Chuxing; Max Planck Institute for Informatics)
Tao Ma (MMLab, The Chinese University of Hong Kong)
Jingqiu Zhou (The Chinese University of Hong Kong)
Simon See (NVIDIA)
Ka Chun Cheung (NVIDIA AI Technology Center)
Hongsheng Li (MMLab, The Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence)