🤖 AI Summary
Existing reinforcement learning post-training methods fail to differentiate among multimodal inputs, leading to high policy-gradient variance, slow convergence, and poor robustness to missing modalities or distribution shifts. To address this, this work proposes MAPLE-bench, the first task-level benchmark that annotates the minimal modality combination each task requires, and introduces MAPO, a modality-aware policy optimization framework that integrates hierarchical batching, adaptive weighting, and curriculum scheduling based on signal-combination difficulty. Experiments on MAPLE-bench demonstrate that the proposed approach reduces the accuracy gap between unimodal and multimodal settings by 30.24%, accelerates convergence by a factor of 3.18, and maintains stable performance across diverse modality-missing scenarios.
📝 Abstract
Multimodal language models now integrate text, audio, and video for unified reasoning. Yet existing RL post-training pipelines treat all input signals as equally relevant, ignoring which modalities each task actually requires. This modality-blind training inflates policy-gradient variance, slows convergence, and degrades robustness to real-world distribution shifts where signals may be missing, added, or reweighted. We introduce MAPLE, a complete modality-aware post-training and learning ecosystem comprising: (1) MAPLE-bench, the first benchmark explicitly annotating the minimal signal combination required per task; (2) MAPO, a modality-aware policy optimization framework that stratifies batches by modality requirement to reduce gradient variance from heterogeneous group advantages; (3) adaptive weighting and curriculum scheduling that balance and prioritize harder signal combinations. Systematic analysis across loss aggregation, clipping, sampling, and curriculum design establishes MAPO's optimal training strategy. Adaptive weighting and curriculum-focused learning further boost performance across signal combinations. MAPLE narrows the unimodal/multimodal accuracy gap by 30.24%, converges 3.18x faster, and maintains stability across all modality combinations under realistic reduced-signal access. MAPLE constitutes a complete recipe for deployment-ready multimodal RL post-training.
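To make the batch-stratification and adaptive-weighting ideas concrete, here is a minimal sketch in plain Python. It is not the paper's implementation; all names (`stratify_by_modality`, `adaptive_weights`, the `"modalities"` field) are hypothetical, and the weighting rule (inverse accuracy, normalized) is one plausible reading of "prioritize harder signal combinations."

```python
from collections import defaultdict

def stratify_by_modality(samples):
    """Group samples by their annotated minimal modality combination,
    so each gradient step compares advantages only within one stratum
    (avoiding variance from mixing heterogeneous groups).
    Hypothetical data layout: each sample carries a 'modalities' tuple."""
    strata = defaultdict(list)
    for s in samples:
        strata[tuple(sorted(s["modalities"]))].append(s)
    return dict(strata)

def adaptive_weights(strata, accuracy):
    """Upweight harder modality combinations (lower observed accuracy).
    'accuracy' maps a combination to its current success rate in [0, 1];
    unseen combinations default to 0.5. Weights are normalized to sum to 1."""
    raw = {k: 1.0 - accuracy.get(k, 0.5) for k in strata}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# Usage: three tasks with different minimal signal requirements.
samples = [
    {"modalities": ("text",), "task_id": 0},
    {"modalities": ("text", "audio"), "task_id": 1},
    {"modalities": ("text", "video"), "task_id": 2},
]
strata = stratify_by_modality(samples)
weights = adaptive_weights(strata, {("audio", "text"): 0.3, ("text",): 0.9})
# The low-accuracy ("audio", "text") stratum receives the largest weight,
# so a curriculum sampler would draw from it more often.
```

In a full pipeline, these weights would drive stratum sampling probabilities inside the RL loop, with the accuracy table refreshed each evaluation round so the curriculum shifts as combinations are mastered.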