MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in binarizing Mixture-of-Experts (MoE) large language models (namely cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing instability) and presents the first post-training binarization framework tailored to MoE architectures. The proposed method applies a joint SVD across experts to eliminate redundancy, combines local Hessian information with global loss gradients to construct a task-aware importance metric, and introduces an input null-space error constraint to preserve routing stability, all without incurring additional storage overhead. Evaluated on models such as Qwen3-30B-A3B, the approach reduces perplexity by 52.2%, improves zero-shot accuracy by 43.4%, achieves over 2× faster inference, and significantly shortens quantization time.
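As background for the summary above, post-training weight binarization typically replaces each weight with its sign plus a per-row floating-point scale. A minimal NumPy sketch of this generic 1-bit scheme (this illustrates standard binarization, not MoBiE's specific method; `binarize` is an illustrative name):

```python
import numpy as np

def binarize(W):
    """Generic 1-bit quantization: W ~= alpha * sign(W).

    Per row, alpha = mean(|W|) minimizes the Frobenius error
    ||W - alpha * sign(W)||_F, so each weight costs 1 bit plus
    one shared scale per row.
    """
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)
    return alpha * np.sign(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W_bin = binarize(W)

# Relative reconstruction error of the 1-bit approximation.
err = np.linalg.norm(W - W_bin) / np.linalg.norm(W)
```

Methods built on this idea differ mainly in how they pick which weights deserve finer treatment; MoBiE's contribution, per the summary, is making that importance estimate task-aware by mixing local Hessian information with global gradients.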
📝 Abstract
Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: (1) a joint SVD decomposition to reduce cross-expert redundancy; (2) integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; (3) an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.
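The cross-expert redundancy idea in innovation (1) can be illustrated generically: if MoE experts share structure, stacking their weight matrices and taking one truncated SVD yields a common basis that all experts can reuse, with each expert keeping only small per-expert coefficients. A minimal sketch under that assumption (the function name, rank, and shapes are illustrative and not MoBiE's actual decomposition):

```python
import numpy as np

def shared_basis(experts, rank):
    """Factor a list of expert weight matrices through one shared
    right-singular basis obtained from a joint (stacked) SVD."""
    stacked = np.concatenate(experts, axis=0)          # (E * out, in)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    V = Vt[:rank].T                                    # (in, rank), shared
    coeffs = [W @ V for W in experts]                  # per-expert (out, rank)
    return V, coeffs

# Toy experts: small perturbations of a common rank-8 base matrix,
# mimicking redundancy across experts.
rng = np.random.default_rng(1)
base = rng.normal(size=(16, 8)) @ rng.normal(size=(8, 32))
experts = [base + 0.01 * rng.normal(size=(16, 32)) for _ in range(4)]

V, coeffs = shared_basis(experts, rank=8)
recon = [C @ V.T for C in coeffs]
rel_err = max(np.linalg.norm(W - R) / np.linalg.norm(W)
              for W, R in zip(experts, recon))
```

When the experts genuinely overlap, the reconstruction error through the shared rank-8 basis stays near the perturbation level, which is why a joint decomposition can remove redundancy without per-expert storage growth.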
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
binary quantization
post-training quantization
routing shift
cross-expert redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
weight binarization
post-training quantization
routing distortion
efficiency optimization
Zhixiong Zhao
Houmo AI, Nanyang Technological University
Zukang Xu
Houmo AI
Zhixuan Chen
Houmo AI
Dawei Yang
Houmo AI