DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection

📅 2025-11-13
🏛️ IEEE Transactions on Circuits and Systems for Video Technology
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of existing 3D object detectors in autonomous driving—particularly for distant, small-scale, and occluded objects—this paper proposes DGFusion, a dual-guidance fusion framework. Unlike prevailing single-guidance multimodal approaches, DGFusion introduces a Difficulty-aware Instance Pair Matcher (DIPM) that establishes bidirectional guidance between point clouds and images (point cloud → image and image → point cloud), enabling modality-complementary feature interaction at the instance level. DIPM dynamically matches cross-modal features based on difficulty estimation, while dedicated dual-guided fusion modules improve feature alignment accuracy. Evaluated on the nuScenes benchmark, DGFusion achieves notable improvements with a lightweight design: +1.0% mAP, +0.8% NDS, and +1.3% average recall, with the gains concentrated on challenging instances. This work proposes a new paradigm for safe and reliable multimodal 3D perception.

📝 Abstract
As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on a dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of either single-guided paradigm. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the dual-guided modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard-instance detection across ego-distance, size, visibility, and small-scale training scenarios.
Problem

Research questions and friction points this paper is trying to address.

Detecting distant, small, or occluded objects in autonomous driving
Addressing the limitations of single-guided multimodal 3D detection
Improving robustness in hard-instance detection scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-guided fusion paradigm for multi-modal detection
Difficulty-aware Instance Pair Matcher for cross-modal feature matching
Dual-guided modules exploiting both easy and hard instance pairs
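The paper does not publish its matching algorithm in this summary, but the idea of difficulty-aware instance pairing can be illustrated with a minimal, hypothetical sketch: LiDAR- and image-derived instances are greedily paired by projected-center distance, and each pair is bucketed as "easy" or "hard" using a toy difficulty heuristic (smaller and more occluded → harder). All names, the heuristic, and the threshold below are illustrative assumptions, not DGFusion's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    center: tuple        # projected 2D center (x, y) in the image plane
    size: float          # apparent size proxy, e.g. 2D box area in pixels
    visibility: float    # visible fraction of the object, in [0, 1]

def difficulty(inst: Instance) -> float:
    # Toy heuristic (assumption): occlusion and small apparent size raise difficulty.
    return (1.0 - inst.visibility) + 1.0 / max(inst.size, 1e-6)

def match_pairs(lidar_insts, image_insts, hard_thresh=1.0, max_dist=50.0):
    """Greedily pair each LiDAR instance with its nearest unused image
    instance, then split pairs into easy/hard buckets by difficulty."""
    easy, hard, used = [], [], set()
    for i, li in enumerate(lidar_insts):
        best_j, best_d = None, max_dist
        for j, im in enumerate(image_insts):
            if j in used:
                continue
            d = ((li.center[0] - im.center[0]) ** 2 +
                 (li.center[1] - im.center[1]) ** 2) ** 0.5
            if d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            continue  # no image instance close enough; leave unmatched
        used.add(best_j)
        score = max(difficulty(li), difficulty(image_insts[best_j]))
        (hard if score >= hard_thresh else easy).append((i, best_j))
    return easy, hard

# A large, fully visible object pairs as "easy"; a tiny, occluded one as "hard".
lidar = [Instance((10, 10), 100.0, 0.9), Instance((50, 50), 2.0, 0.2)]
image = [Instance((11, 9), 90.0, 0.95), Instance((49, 52), 2.5, 0.3)]
easy, hard = match_pairs(lidar, image)
print(easy, hard)  # [(0, 0)] [(1, 1)]
```

In the paper's dual-guided setting, the two buckets would then be routed to different fusion paths, so each modality can guide the other on the instances it perceives more reliably.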
👥 Authors
Feiyang Jia
Beijing Jiaotong University
Caiyan Jia
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing 100044, China
Ailin Liu
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing 100044, China
Shaoqing Xu
University of Macau, BUAA, Xiaomi EV
3D Computer Vision, 3D Generation, Vision and Language Model, End2End, World Model
Qiming Xia
Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China, Fujian 361005, China
Lin Liu
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing 100044, China
Lei Yang
School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore
Yan Gong
State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin 150001, China
Ziying Song
Beijing Jiaotong University
Object Detection, Computer Vision, Deep Learning