🤖 AI Summary
In multimodal learning, imbalanced modality missing rates cause inconsistent learning progress and representation degradation, creating a vicious cycle of performance deterioration. Existing methods primarily address dataset-level balancing while neglecting sample-level dynamic variations in modality utility and the intrinsic decline in feature quality. To tackle this, we propose MCE, a general-purpose framework featuring two novel modules: (1) Learning Capability Enhancement, which dynamically adjusts per-sample learning progress across modalities via multi-level factors; and (2) Representation Capability Enhancement, which improves the semantic richness and robustness of features through subset prediction and cross-modal completion tasks. Integrated with multimodal fusion and contrastive learning, MCE effectively breaks the performance-degradation cycle. Extensive experiments demonstrate that MCE consistently outperforms state-of-the-art methods across four benchmarks under diverse missing-rate configurations. The code is publicly available.
📝 Abstract
Multi-modal learning has made significant advances across diverse pattern recognition applications. However, handling missing modalities, especially under imbalanced missing rates, remains a major challenge. This imbalance triggers a vicious cycle: modalities with higher missing rates receive fewer updates, leading to inconsistent learning progress and representational degradation that further diminishes their contribution. Existing methods typically focus on global dataset-level balancing, often overlooking critical sample-level variations in modality utility and the underlying issue of degraded feature quality. We propose Modality Capability Enhancement (MCE) to tackle these limitations. MCE includes two synergistic components: i) Learning Capability Enhancement (LCE), which introduces multi-level factors to dynamically balance modality-specific learning progress, and ii) Representation Capability Enhancement (RCE), which improves feature semantics and robustness through subset prediction and cross-modal completion tasks. Comprehensive evaluations on four multi-modal benchmarks show that MCE consistently outperforms state-of-the-art methods under various missing configurations. The journal preprint version is now available at https://doi.org/10.1016/j.patcog.2025.112591. Our code is available at https://github.com/byzhaoAI/MCE.
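The LCE idea of dynamically balancing modality-specific learning progress at the sample level can be illustrated with a minimal sketch. This is not the paper's exact formulation (the multi-level factors are not specified here); it only assumes a simple heuristic in which, for each sample, observed modalities with higher recent loss (i.e., slower learning progress) receive proportionally larger loss weights, so that modalities with higher missing rates are not starved of updates. The function name and signature are hypothetical.

```python
def modality_weights(sample_losses, present, eps=1e-8):
    """Hypothetical sample-level reweighting (illustrative only, not MCE's
    actual multi-level factors).

    sample_losses: dict mapping modality name -> current loss for one sample.
    present:       dict mapping modality name -> bool, whether that modality
                   is observed (not missing) for this sample.
    Returns a dict of weights over observed modalities that sums to 1,
    giving larger weight to modalities with higher loss.
    """
    observed = {m: l for m, l in sample_losses.items() if present[m]}
    total = sum(observed.values()) + eps
    # Weight each observed modality by its share of the total loss, so the
    # slower-learning (higher-loss) modality is emphasized in this update.
    return {m: l / total for m, l in observed.items()}


# Example: for a sample where video is missing, weights are computed only
# over audio and text, and the lagging audio modality is upweighted.
w = modality_weights(
    {"audio": 0.9, "text": 0.3, "video": 0.6},
    {"audio": True, "text": True, "video": False},
)
```

The per-sample (rather than dataset-level) granularity is the key point: two samples with different missing patterns get different weightings in the same batch.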