🤖 AI Summary
This work addresses the significant performance degradation of multimodal models under partially missing modalities, which stems from implicit modality preferences caused by imbalanced inter-modal learning during training. The study is the first to identify and quantify modality dominance relationships in the frequency domain and introduces a plug-and-play Multimodal Weight Allocation Module (MWAM). Guided by a Frequency Ratio Metric (FRM), MWAM dynamically adjusts the contribution of each modality branch, promoting balanced joint learning. This lightweight mechanism is highly generalizable and can be seamlessly integrated into both CNN and Vision Transformer (ViT) architectures. Extensive experiments show that it consistently improves robustness to missing modalities across diverse tasks and modality combinations, and further boosts the performance of existing state-of-the-art methods.
📝 Abstract
Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process: the model develops an implicit preference for certain modalities, leaving others under-optimized. We propose a simple yet effective method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) that quantifies modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose the Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, including CNNs and ViTs, and delivers consistent performance gains across a wide range of tasks and modality combinations. These gains extend beyond the base model: MWAM also further improves existing state-of-the-art methods for the missing modality problem.
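The abstract does not give the exact form of FRM or the weight-allocation rule, so the following is only a minimal, hypothetical sketch of the general idea: score each modality branch by a frequency-domain ratio of its features, then softmax-weight the branches so that the assumed under-optimized (lower-scoring) modality receives a larger training contribution. The `cutoff` split between low- and high-frequency energy and the sign convention are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def frequency_ratio_metric(features, cutoff=0.25):
    """Hypothetical FRM: ratio of low- to high-frequency energy in a
    1-D feature vector. Illustrative only; the paper's exact metric
    is not specified in the abstract."""
    spec = np.abs(np.fft.rfft(features, axis=-1)) ** 2  # power spectrum
    k = max(1, int(cutoff * spec.shape[-1]))            # low-band width
    low = spec[..., :k].sum()
    high = spec[..., k:].sum() + 1e-8                   # avoid divide-by-zero
    return low / high

def allocate_weights(frms):
    """Assumed re-balancing rule: softmax over negated FRM scores, so
    the branch with the smaller FRM (treated as under-optimized here)
    gets the larger weight."""
    scores = -np.asarray(frms, dtype=float)
    scores = scores - scores.max()                      # numerical stability
    w = np.exp(scores)
    return w / w.sum()
```

For example, `allocate_weights([2.0, 1.0])` returns weights that sum to one, with the second (lower-FRM) branch weighted more heavily; in a real training loop these weights would scale each branch's loss or gradient contribution.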