When Model Merging Breaks Routing: Training-Free Calibration for MoE

📅 2026-06-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the sensitivity of Mixture-of-Experts (MoE) models to the parameter perturbations introduced by model merging, which can cause routing breakdown (the merged router fails to dispatch tokens to suitable experts) and severe performance degradation. The authors identify this failure mode and introduce Hessian-Aware Router Calibration (HARC), a training-free framework that uses second-order curvature information to realign the router after merging. The calibration admits a closed-form solution, which is computed with a matrix-free conjugate gradient method. On mathematical reasoning and code generation tasks, HARC substantially improves a range of MoE merging baselines and mitigates routing breakdown without additional training.
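To make the sensitivity concrete, here is a minimal sketch in plain Python. The router weights, the token, and the perturbation `delta` are made-up toy values standing in for a merge-induced shift; none of them come from the paper. The sketch only illustrates that softmax probabilities move smoothly under a small weight change while the discrete Top-k selection can flip.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_logits(weights, token):
    # One weight row per expert; logit = dot(row, token).
    return [sum(w * x for w, x in zip(row, token)) for row in weights]

def top_k(logits, k):
    # Indices of the k largest logits, highest first.
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

# Hypothetical 4-expert router over 2-dimensional tokens (toy values).
router = [[1.00, 0.5], [0.60, 0.3], [0.58, 0.3], [0.00, 0.1]]
# Hypothetical small perturbation, standing in for the effect of merging.
delta = [[0.0, 0.0], [0.0, 0.0], [0.03, 0.0], [0.0, 0.0]]
perturbed = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(router, delta)]

token = [1.0, 1.0]
before = router_logits(router, token)
after = router_logits(perturbed, token)

selected_before = top_k(before, k=2)
selected_after = top_k(after, k=2)
max_prob_shift = max(abs(p - q) for p, q in zip(softmax(before), softmax(after)))

print("Top-2 before:", selected_before)
print("Top-2 after: ", selected_after)
print("Largest softmax probability change:", round(max_prob_shift, 4))
```

In this toy setup, experts 1 and 2 are nearly tied, so a shift of 0.03 in a single weight swaps which of them is selected even though no routing probability changes by more than about 0.01. Because fine-tuned experts specialize, a swap like this sends the token to a different expert entirely.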
📝 Abstract
Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
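The abstract names a matrix-free conjugate gradient method but does not spell out the linear system being solved. The sketch below therefore shows only the generic technique: solving H x = b when H is available solely through a Hessian-vector product function. The function `hvp`, the matrix `A`, the damping term `lam`, and the right-hand side `b` are illustrative assumptions, not HARC's actual objective or code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    # Solves H x = b for symmetric positive-definite H,
    # touching H only through hvp(v) = H v (never forming H).
    x = [0.0] * len(b)
    r = list(b)            # residual b - H x, with x = 0
    p = list(r)            # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        hp = hvp(p)
        alpha = rs / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical stand-in for a curvature matrix: H = A^T A + lam * I.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lam = 0.1

def hvp(v):
    # Computes (A^T A + lam * I) v as two matrix-vector products.
    Av = [dot(row, v) for row in A]
    AtAv = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(len(v))]
    return [g + lam * vj for g, vj in zip(AtAv, v)]

b = [1.0, 2.0]
x = conjugate_gradient(hvp, b)
residual = max(abs(h - bi) for h, bi in zip(hvp(x), b))
print("solution:", x)
print("max residual:", residual)
```

The appeal of the matrix-free form is that memory scales with the size of a vector, not of the full Hessian. In a real model, `hvp` would typically be computed by automatic differentiation instead of the explicit products used here.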
Problem

Research questions and friction points this paper is trying to address.

Model Merging
Mixture-of-Experts
Routing Breakdown
MoE
Router Calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Merging
Mixture-of-Experts
Routing Breakdown
Hessian-Aware Calibration
Training-Free