MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in binarizing Mixture-of-Experts (MoE) large language models (namely cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing instability) and presents the first post-training binarization framework tailored to MoE architectures. The proposed method applies a joint SVD across experts to eliminate redundancy, combines local Hessian information with global loss gradients to construct a task-aware importance metric, and introduces an input null-space error constraint to preserve routing stability, all without incurring additional storage overhead. Evaluated on models such as Qwen3-30B-A3B, the approach reduces perplexity by 52.2%, improves zero-shot accuracy by 43.4%, achieves over 2× faster inference, and significantly shortens quantization time.
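As background for the summary above, post-training weight binarization typically replaces each weight with its sign plus a per-row floating-point scale. A minimal NumPy sketch of this generic 1-bit scheme (this illustrates standard binarization, not MoBiE's specific method; `binarize` is an illustrative name):

```python
import numpy as np

def binarize(W):
    """Generic 1-bit quantization: W ~= alpha * sign(W).

    Per row, alpha = mean(|W|) minimizes the Frobenius error
    ||W - alpha * sign(W)||_F, so each weight costs 1 bit plus
    one shared scale per row.
    """
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)
    return alpha * np.sign(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W_bin = binarize(W)

# Relative reconstruction error of the 1-bit approximation.
err = np.linalg.norm(W - W_bin) / np.linalg.norm(W)
```

Methods built on this idea differ mainly in how they pick which weights deserve finer treatment; MoBiE's contribution, per the summary, is making that importance estimate task-aware by mixing local Hessian information with global gradients.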
📝 Abstract
Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: (1) a joint SVD decomposition to reduce cross-expert redundancy; (2) integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; (3) an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.
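The cross-expert redundancy idea in innovation (1) can be illustrated generically: if MoE experts share structure, stacking their weight matrices and taking one truncated SVD yields a common basis that all experts can reuse, with each expert keeping only small per-expert coefficients. A minimal sketch under that assumption (the function name, rank, and shapes are illustrative and not MoBiE's actual decomposition):

```python
import numpy as np

def shared_basis(experts, rank):
    """Factor a list of expert weight matrices through one shared
    right-singular basis obtained from a joint (stacked) SVD."""
    stacked = np.concatenate(experts, axis=0)          # (E * out, in)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    V = Vt[:rank].T                                    # (in, rank), shared
    coeffs = [W @ V for W in experts]                  # per-expert (out, rank)
    return V, coeffs

# Toy experts: small perturbations of a common rank-8 base matrix,
# mimicking redundancy across experts.
rng = np.random.default_rng(1)
base = rng.normal(size=(16, 8)) @ rng.normal(size=(8, 32))
experts = [base + 0.01 * rng.normal(size=(16, 32)) for _ in range(4)]

V, coeffs = shared_basis(experts, rank=8)
recon = [C @ V.T for C in coeffs]
rel_err = max(np.linalg.norm(W - R) / np.linalg.norm(W)
              for W, R in zip(experts, recon))
```

When the experts genuinely overlap, the reconstruction error through the shared rank-8 basis stays near the perturbation level, which is why a joint decomposition can remove redundancy without per-expert storage growth.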
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
binary quantization
post-training quantization
routing shift
cross-expert redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
weight binarization
post-training quantization
routing distortion
efficiency optimization
Zhixiong Zhao
Houmo AI, Nanyang Technological University
Zukang Xu
Houmo AI
Zhixuan Chen
Houmo AI
Dawei Yang
Houmo AI