🤖 AI Summary
To address “fusion degradation”—a critical issue in multimodal object detection (MMOD) arising from weakened unimodal representation capacity—this paper proposes the M²D-LIF framework, which rethinks MMOD from a unimodal learning perspective. We introduce the first linear-probe-based quantitative evaluation method to assess unimodal representation capability. To strengthen unimodal feature learning, we design Mono-Modality Distillation (M²D), a novel distillation mechanism that enhances modality-specific feature discriminability. Furthermore, we propose Local Illumination-aware Fusion (LIF), a lightweight and robust RGB–IR feature fusion strategy that adaptively integrates complementary cues under varying illumination conditions. Evaluated on three mainstream MMOD benchmarks, M²D-LIF significantly mitigates fusion degradation and achieves comprehensive gains over existing state-of-the-art methods in both accuracy and efficiency.
📝 Abstract
Multi-Modal Object Detection (MMOD), owing to its strong adaptability to complex environments, has been widely deployed in real-world applications. Extensive research has been devoted to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, these works neglect the mono-modality insufficient learning problem, i.e., the decreased feature extraction capability of each modality under multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Accordingly, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates sufficient learning of each mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
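The linear probing evaluation mentioned above can be illustrated with a minimal sketch: freeze a backbone, train only a linear classifier on its features, and read the probe's accuracy as a proxy for representation capability. The code below is a hypothetical toy version (not the paper's implementation) that uses synthetic Gaussian features in place of real frozen backbone outputs; the function name `linear_probe_accuracy` and the closed-form least-squares fit are illustrative choices.

```python
import numpy as np

def linear_probe_accuracy(feats, labels):
    """Fit a least-squares linear classifier on frozen features and
    return its accuracy — a simple proxy for feature separability."""
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append bias column
    Y = np.eye(labels.max() + 1)[labels]                  # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)             # closed-form linear fit
    preds = (X @ W).argmax(axis=1)
    return (preds == labels).mean()

rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)

# Well-separated synthetic "frozen features": a strong unimodal representation.
strong = np.vstack([rng.normal(0.0, 1.0, (100, 16)),
                    rng.normal(5.0, 1.0, (100, 16))])
# Heavily overlapping features: what a degraded unimodal branch might produce.
weak = np.vstack([rng.normal(0.0, 1.0, (100, 16)),
                  rng.normal(0.2, 1.0, (100, 16))])

acc_strong = linear_probe_accuracy(strong, labels)
acc_weak = linear_probe_accuracy(weak, labels)
```

In this toy setting, the probe separates the well-clustered features almost perfectly while scoring much lower on the overlapping ones, mirroring how a drop in probe accuracy would signal insufficient mono-modality learning.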