🤖 AI Summary
This work addresses the performance degradation that Mixture-of-Experts (MoE) models suffer under quantization, which often stems from routing instability: small perturbations change which experts are selected. To mitigate this, the authors propose VSRAQ, a post-training quantization method that treats routing consistency as an explicit quantization objective. VSRAQ jointly optimizes value alignment, which matches routing logits or scores, and structural alignment, which preserves expert ranking and top-k decision boundaries. The approach reduces computation-path deviation without adding inference overhead. Evaluated on recent MoE foundation models, VSRAQ improves expert-selection consistency and outperforms baselines that rely on reconstruction alone as well as router-aware baselines.
📝 Abstract
Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), an MoE-specific post-training quantization objective that keeps expert selection after quantization consistent with pre-quantization behavior. VSRAQ combines two complementary objectives: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead, and it can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
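The abstract does not give the exact form of the two objectives, so the following is only a minimal sketch of how value alignment and structure alignment could be expressed, not the authors' implementation. It assumes mean squared error on router logits for the value term and a hinge penalty at the top-$k$ boundary for the structure term. The function names, the `margin` parameter, and the weighting factor `lam` are all hypothetical.

```python
import numpy as np

def topk_set(logits, k):
    """Indices of the k largest router logits for one token."""
    return set(np.argsort(-logits, kind="stable")[:k].tolist())

def value_alignment_loss(fp_logits, q_logits):
    """Assumed value term: MSE between pre- and post-quantization router logits."""
    return float(np.mean((fp_logits - q_logits) ** 2))

def structure_alignment_loss(fp_logits, q_logits, k, margin=0.1):
    """Assumed structure term: hinge penalty at the top-k decision boundary.

    Experts selected before quantization should, after quantization, score at
    least `margin` above every expert that was not selected.
    """
    total = 0.0
    for fp, q in zip(fp_logits, q_logits):
        order = np.argsort(-fp, kind="stable")
        selected, rejected = order[:k], order[k:]
        # Weakest selected expert vs. strongest rejected expert, in quantized scores.
        gap = q[selected].min() - q[rejected].max()
        total += max(0.0, margin - gap)
    return total / len(fp_logits)

def selection_consistency(fp_logits, q_logits, k):
    """Fraction of tokens whose top-k expert set is unchanged by quantization."""
    same = [topk_set(fp, k) == topk_set(q, k) for fp, q in zip(fp_logits, q_logits)]
    return float(np.mean(same))

# Toy router logits for 2 tokens and 4 experts (made-up numbers).
fp_logits = np.array([[2.0, 1.0, 0.95, -1.0],
                      [0.5, 3.0, -0.2, 1.5]])
# Small perturbations standing in for quantization error.
q_logits = np.array([[1.9, 0.9, 1.0, -1.1],
                     [0.4, 3.1, -0.1, 1.4]])

k = 2
lam = 1.0  # hypothetical weight between the two terms
value = value_alignment_loss(fp_logits, q_logits)
structure = structure_alignment_loss(fp_logits, q_logits, k)
print("value loss:", value)
print("structure loss:", structure)
print("combined:", value + lam * structure)
print("top-k consistency:", selection_consistency(fp_logits, q_logits, k))
```

In this toy data the first token's second and third experts are nearly tied, so a perturbation of about 0.1 swaps them across the top-2 boundary even though the logit error is small. That is the kind of case where a value-only term gives little signal and a boundary-aware term penalizes the change directly.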