🤖 AI Summary
To address “fusion degradation”—a critical issue in multimodal object detection (MMOD) arising from weakened unimodal representation capacity—this paper proposes the M²D-LIF framework, which rethinks MMOD from a unimodal learning perspective. We introduce the first linear-probe-based quantitative evaluation method to assess unimodal representation capability. To strengthen unimodal feature learning, we design Mono-Modality Distillation (M²D), a novel distillation mechanism that enhances modality-specific feature discriminability. Furthermore, we propose Local Illumination-aware Fusion (LIF), a lightweight and robust RGB–IR feature fusion strategy that adaptively integrates complementary cues under varying illumination conditions. Evaluated on three mainstream MMOD benchmarks, M²D-LIF significantly mitigates fusion degradation and achieves comprehensive gains over existing state-of-the-art methods in both accuracy and efficiency.
📝 Abstract
Multi-Modal Object Detection (MMOD), owing to its strong adaptability to complex environments, has been widely deployed in real-world applications. Extensive research has been devoted to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, these works neglect the mono-modality insufficient learning problem, i.e., the decreased feature extraction capability of each modality under multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Accordingly, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates sufficient learning of each mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
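The linear probing evaluation mentioned above can be illustrated with a minimal sketch: freeze a backbone, train only a linear classifier on its features, and read the probe's accuracy as a proxy for representation capability. The code below is a hypothetical toy version (not the paper's implementation) that uses synthetic Gaussian features in place of real frozen backbone outputs; the function name `linear_probe_accuracy` and the closed-form least-squares fit are illustrative choices.

```python
import numpy as np

def linear_probe_accuracy(feats, labels):
    """Fit a least-squares linear classifier on frozen features and
    return its accuracy — a simple proxy for feature separability."""
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append bias column
    Y = np.eye(labels.max() + 1)[labels]                  # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)             # closed-form linear fit
    preds = (X @ W).argmax(axis=1)
    return (preds == labels).mean()

rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)

# Well-separated synthetic "frozen features": a strong unimodal representation.
strong = np.vstack([rng.normal(0.0, 1.0, (100, 16)),
                    rng.normal(5.0, 1.0, (100, 16))])
# Heavily overlapping features: what a degraded unimodal branch might produce.
weak = np.vstack([rng.normal(0.0, 1.0, (100, 16)),
                  rng.normal(0.2, 1.0, (100, 16))])

acc_strong = linear_probe_accuracy(strong, labels)
acc_weak = linear_probe_accuracy(weak, labels)
```

In this toy setting, the probe separates the well-clustered features almost perfectly while scoring much lower on the overlapping ones, mirroring how a drop in probe accuracy would signal insufficient mono-modality learning.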