CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the mismatch between expert-parameter growth and performance gains in Mixture-of-Experts (MoE) large language models, this paper proposes CAMERA, the first cross-matrix joint compression framework built on "micro-experts" as atomic units. Through decoding-stage contribution analysis, the authors observe substantial performance heterogeneity among micro-experts; leveraging this insight, they design two training-free, structured compression methods, CAMERA-P (pruning) and CAMERA-Q (mixed-precision quantization), supporting quantization down to 2 bits. On nine downstream tasks, CAMERA-P outperforms strong baselines across 20%–60% pruning ratios, and CAMERA-Q achieves significantly higher accuracy at 2-bit precision than existing matrix- or channel-level quantization schemes. A complete micro-expert analysis of Qwen2-57B-A14B takes under five minutes on a single NVIDIA A100-40GB GPU. CAMERA thus enables efficient, scalable, and hardware-friendly MoE compression without retraining.

📝 Abstract
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still face challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing the micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization scheme designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level approaches. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and storage overheads in MoE models
Identifying and pruning redundant micro-experts efficiently
Improving performance under aggressive quantization for MoE models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Micro-expert as finer-grained compression unit
Training-free framework for redundancy analysis
Structured pruning and mixed-precision quantization
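The micro-expert view can be illustrated with a small sketch. Assuming a standard SwiGLU expert FFN (the paper targets MoE experts of this shape), intermediate channel i, i.e. column i of the gate and up projections together with row i of the down projection, forms one cross-matrix unit. The saliency score below (mean calibration activation scaled by down-projection norm) is a hypothetical stand-in for the paper's contribution analysis, not its actual criterion; all function names are illustrative.

```python
import numpy as np

def silu(z):
    # SiLU activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def expert_ffn(x, Wg, Wu, Wd):
    # SwiGLU expert: each intermediate channel i is one "micro-expert",
    # spanning column i of Wg and Wu and row i of Wd.
    return (silu(x @ Wg) * (x @ Wu)) @ Wd

def micro_expert_scores(X, Wg, Wu, Wd):
    # Hypothetical saliency proxy: mean |intermediate activation| on a
    # calibration batch, scaled by each channel's down-projection norm.
    act = np.abs(silu(X @ Wg) * (X @ Wu)).mean(axis=0)
    return act * np.linalg.norm(Wd, axis=1)

def prune_micro_experts(Wg, Wu, Wd, scores, ratio):
    # Structured pruning: drop the lowest-scoring channels jointly
    # from all three matrices, keeping shapes consistent.
    keep = np.sort(np.argsort(scores)[int(len(scores) * ratio):])
    return Wg[:, keep], Wu[:, keep], Wd[keep, :]

rng = np.random.default_rng(0)
d, m = 16, 64                          # hidden size, intermediate size
Wg, Wu = rng.normal(size=(d, m)), rng.normal(size=(d, m))
Wd = rng.normal(size=(m, d))
X = rng.normal(size=(32, d))           # calibration batch

scores = micro_expert_scores(X, Wg, Wu, Wd)
Wg2, Wu2, Wd2 = prune_micro_experts(Wg, Wu, Wd, scores, ratio=0.4)
print(Wg2.shape, Wd2.shape)            # 40% of micro-experts removed jointly
```

Because a micro-expert is removed from all three matrices at once, the pruned expert remains a dense, dimensionally consistent FFN, which is what makes the compression structured and hardware-friendly.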
Yuzhuang Xu
Tsinghua University
Natural Language Processing · Efficient AI · Machine Learning
Xu Han
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Yuanchi Zhang
Tencent Inc, Beijing, China
Yixuan Wang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin, China
Yijun Liu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, Harbin, China
Shiyu Ji
University of California, Santa Barbara
Information Retrieval · Privacy · Security
Qingfu Zhu
Harbin Institute of Technology
NLP · Code LLM
Wanxiang Che
Professor at Harbin Institute of Technology
Natural Language Processing