🤖 AI Summary
This paper identifies a systematic modality imbalance problem at the decision level in multimodal learning: even when representation learning is well-balanced (e.g., via large-scale pretraining and optimization), models exhibit significant bias toward weak modalities—such as audio—during fusion. This bias arises intrinsically from geometric disparities in feature-space structure and decision-weight distributions, rather than merely from optimization dynamics; uncalibrated modality-wise output aggregation further exacerbates weight skew and suppresses weak-modality contributions. To address this, we propose a decision-level adaptive weighting mechanism. Evaluated on CREMAD and Kinetic-Sounds, our method demonstrably improves weak-modality participation and overall generalization. Experiments confirm that optimizing representations alone fails to mitigate this imbalance, whereas our approach achieves substantial gains. The work establishes a new paradigm for multimodal fusion architecture design, emphasizing decision-level calibration over representation-level balance.
📝 Abstract
Multimodal learning integrates information from different modalities to enhance model performance, yet it often suffers from modality imbalance, where dominant modalities overshadow weaker ones during joint optimization. This paper reveals that such imbalance occurs not only during representation learning but also manifests significantly at the decision layer. Experiments on audio-visual datasets (CREMAD and Kinetic-Sounds) show that even after extensive pretraining and balanced optimization, models still exhibit systematic bias toward certain modalities, such as audio. Further analysis demonstrates that this bias originates from intrinsic disparities in feature-space and decision-weight distributions rather than from optimization dynamics alone. We argue that aggregating uncalibrated modality outputs at the fusion stage leads to biased decision-layer weighting, hindering weaker modalities from contributing effectively. To address this, we propose that future multimodal systems incorporate adaptive weight allocation mechanisms at the decision layer, enabling relatively balanced contributions according to the capabilities of each modality.
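The abstract does not specify the exact form of the proposed adaptive weight allocation. As one illustrative sketch (not the paper's actual method), a simple decision-level scheme can weight each modality's output by its prediction confidence, e.g. inverse entropy, so that a weak modality is down-weighted per sample rather than suppressed outright; the function name and weighting rule below are assumptions for illustration:

```python
import numpy as np

def adaptive_decision_fusion(logits_by_modality, temperature=1.0, eps=1e-12):
    """Hypothetical decision-level fusion sketch.

    Converts each modality's logits to probabilities, then weights each
    modality by the inverse of its prediction entropy (normalized across
    modalities), so more confident modalities contribute more while weak
    modalities still participate. This is an illustrative stand-in for
    the adaptive weighting the abstract advocates, not the paper's method.
    """
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z / temperature)
        return e / e.sum(axis=-1, keepdims=True)

    # Per-modality class probabilities, shape (batch, num_classes) each.
    probs = [softmax(np.asarray(l, dtype=float)) for l in logits_by_modality]
    # Prediction entropy per sample (lower entropy => more confident).
    entropies = [-(p * np.log(p + eps)).sum(axis=-1) for p in probs]
    # Inverse-entropy weights, normalized over modalities: shape (M, batch).
    raw = np.stack([1.0 / (h + eps) for h in entropies])
    weights = raw / raw.sum(axis=0, keepdims=True)
    # Weighted sum of modality probabilities -> fused prediction.
    fused = sum(w[..., None] * p for w, p in zip(weights, probs))
    return fused, weights

# Example: a confident audio head and an uncertain visual head.
audio_logits = [[4.0, 0.0, 0.0]]
visual_logits = [[0.1, 0.0, 0.0]]
fused, weights = adaptive_decision_fusion([audio_logits, visual_logits])
```

Here the confident (low-entropy) audio modality receives the larger weight for that sample, but the visual modality's probabilities still influence the fused output, which is the qualitative behavior the abstract argues a calibrated decision layer should provide.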