Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes a highly area-efficient Mixture-of-Experts (MoE)–Processing-in-Memory (PIM) co-design architecture to address key challenges in deploying MoE models on PIM systems, including excessive peripheral circuit area overhead, load imbalance, and routing bottlenecks. The proposed architecture synergistically optimizes area, performance, and energy efficiency through crossbar-level resource sharing, expert grouping scheduling, and a gated-output (GO) caching mechanism. Experimental results demonstrate that the design improves area efficiency of the MoE component by 2.2×. When generating eight tokens, it achieves 4.2× higher performance and 10.1× better energy efficiency compared to baseline approaches, attaining an overall performance density of 15.6 GOPS/W/mm².
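The expert grouping scheduling mentioned above can be illustrated with a minimal sketch. This is not the paper's algorithm (which is not specified here); it assumes a simple greedy load-balancing heuristic: since crossbars in one group share a single set of peripheral circuits and must take turns, assigning heavily-loaded experts to the currently lightest group reduces contention. The function `group_experts` and its signature are hypothetical.

```python
# Hypothetical sketch of expert grouping for peripheral-circuit sharing.
# Greedy heuristic (an assumption, not the paper's stated method):
# assign heaviest experts first to the group with the smallest total load.
def group_experts(loads, num_groups):
    """loads: per-expert activation counts; returns expert-id groups."""
    groups = [[] for _ in range(num_groups)]
    totals = [0] * num_groups
    for eid, load in sorted(enumerate(loads), key=lambda x: -x[1]):
        g = totals.index(min(totals))  # lightest group so far
        groups[g].append(eid)
        totals[g] += load
    return groups

# Example: four experts with uneven loads, two shared peripheral groups.
groups = group_experts([5, 1, 4, 2], 2)
```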

Technology Category

Application Category

๐Ÿ“ Abstract
Mixture-of-Experts (MoE) layers activate a subset of model weights, dubbed experts, to improve model performance. MoE is particularly promising for deployment on process-in-memory (PIM) architectures, because PIM can naturally fit experts separately and provide great benefits for energy efficiency. However, PIM chips often suffer from large area overhead, especially in the peripheral circuits. In this paper, we propose an area-efficient in-memory computing architecture for MoE transformers. First, to reduce area, we propose a crossbar-level multiplexing strategy that exploits MoE sparsity: experts are deployed on crossbars and multiple crossbars share the same peripheral circuits. Second, we propose expert grouping and group-wise scheduling methods to alleviate the load imbalance and contention overhead caused by sharing. In addition, to address the problem that the expert choice router requires access to all hidden states during generation, we propose a gate-output (GO)cache to store necessary results and bypass expensive additional computation. Experiments show that our approaches improve the area efficiency of the MoE part by up to 2.2x compared to a SOTA architecture. During generation, the cache improves performance and energy efficiency by 4.2x and 10.1x, respectively, compared to the baseline when generating 8 tokens. The total performance density achieves 15.6 GOPS/W/mm2. The code is open source at https://github.com/superstarghy/MoEwithPIM.
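The gate-output cache idea in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: an expert-choice router would otherwise need gating scores over all past hidden states at every generation step, so caching each new token's scores lets later steps reuse them instead of recomputing. The class `GOCache` and its interface are hypothetical.

```python
# Hypothetical sketch of a gate-output (GO) cache for expert-choice routing
# during generation. Assumption: gating scores are a linear projection of
# the hidden state; only the new token's scores are computed per step.
import numpy as np

class GOCache:
    """Stores per-token gate outputs so generation steps can reuse them."""
    def __init__(self):
        self.cached = []  # list of (num_experts,) score vectors

    def gate_scores(self, hidden_state, router_weights):
        # Compute gating scores for the new token only...
        scores = hidden_state @ router_weights
        self.cached.append(scores)
        # ...and return scores for the full sequence from the cache,
        # bypassing recomputation over earlier hidden states.
        return np.stack(self.cached)

# Example: hidden dim 16, 4 experts, generate 3 tokens.
rng = np.random.default_rng(0)
router = rng.standard_normal((16, 4))
cache = GOCache()
for _ in range(3):
    all_scores = cache.gate_scores(rng.standard_normal(16), router)
```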
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Process-in-Memory
Area Overhead
In-Memory Computing
Peripheral Circuits
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-memory computing
Mixture-of-Experts
crossbar multiplexing
expert caching
area efficiency
Hanyuan Gao
Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, USA
Xiaoxuan Yang
University of Virginia
In-Memory Computing
Computer-Aided Design
Machine Learning Acceleration