🤖 AI Summary
In multi-modal domain generalization (MMDG), weight averaging (WA) suffers from modality bias and degraded generalization due to disparate optimization speeds across modalities. To address this, we propose Modality-Balanced Collaborative Distillation (MBCD). MBCD mitigates convergence imbalance via adaptive modality dropout, enforces gradient consistency to improve cross-modal optimization coordination, and integrates WA into a multi-branch collaborative distillation pipeline that jointly models cross-modal knowledge transfer and flatness-aware optimization. Its core innovation lies in unifying distillation and ensemble learning through WA, thereby guiding the model toward flatter, more generalizable optima. Extensive experiments on multiple MMDG benchmarks demonstrate that MBCD significantly improves cross-domain accuracy and robustness over state-of-the-art methods.
📝 Abstract
Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities cause WA to overfit to faster-converging modalities in early stages, suppressing the contribution of slower yet complementary ones. This hinders effective modality fusion and skews the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
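To make the two core mechanics concrete, here is a minimal, framework-free sketch of (a) an exponential-moving-average form of the WA teacher update and (b) adaptive modality dropout that drops faster-converging modalities more often. Everything here is illustrative: the function names, the EMA formulation (the paper may use simple iterate averaging instead), and the per-modality `convergence` scores used as a dropout schedule are all assumptions, not the authors' exact algorithm.

```python
def wa_update(teacher, student, alpha=0.99):
    """Update a weight-averaged (EMA-style) teacher from student weights.

    Weights are plain lists of floats here for illustration; `alpha`
    controls how slowly the teacher tracks the student (a higher value
    gives a smoother, flatter average of the student's trajectory).
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def adaptive_modality_dropout(features, convergence, rng):
    """Drop faster-converging modalities more often to curb modality bias.

    `features` maps modality name -> feature vector; `convergence` maps
    modality name -> a score in [0, 1] estimating how far along that
    modality's optimization is (a hypothetical proxy -- the paper's exact
    dropout schedule is not specified here). A modality survives with
    probability (1 - convergence), so dominant modalities are suppressed.
    """
    kept = {m: f for m, f in features.items() if rng.random() > convergence[m]}
    # Never drop every modality: fall back to the least-converged one.
    if not kept:
        m = min(convergence, key=convergence.get)
        kept = {m: features[m]}
    return kept
```

In a full training loop, the student would be trained on the surviving modalities, the WA teacher refreshed via `wa_update`, and the teacher's fused predictions distilled back into each uni-modal branch.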