Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work proposes a highly area-efficient Mixture-of-Experts (MoE)–Processing-in-Memory (PIM) co-design architecture to address key challenges in deploying MoE models on PIM systems, including excessive peripheral circuit area overhead, load imbalance, and routing bottlenecks. The proposed architecture synergistically optimizes area, performance, and energy efficiency through crossbar-level resource sharing, expert grouping scheduling, and a gated-output (GO) caching mechanism. Experimental results demonstrate that the design improves area efficiency of the MoE component by 2.2×. When generating eight tokens, it achieves 4.2× higher performance and 10.1× better energy efficiency compared to baseline approaches, attaining an overall performance density of 15.6 GOPS/W/mm².
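The expert grouping scheduling mentioned above can be illustrated with a minimal sketch. This is not the paper's algorithm (which is not specified here); it assumes a simple greedy load-balancing heuristic: since crossbars in one group share a single set of peripheral circuits and must take turns, assigning heavily-loaded experts to the currently lightest group reduces contention. The function `group_experts` and its signature are hypothetical.

```python
# Hypothetical sketch of expert grouping for peripheral-circuit sharing.
# Greedy heuristic (an assumption, not the paper's stated method):
# assign heaviest experts first to the group with the smallest total load.
def group_experts(loads, num_groups):
    """loads: per-expert activation counts; returns expert-id groups."""
    groups = [[] for _ in range(num_groups)]
    totals = [0] * num_groups
    for eid, load in sorted(enumerate(loads), key=lambda x: -x[1]):
        g = totals.index(min(totals))  # lightest group so far
        groups[g].append(eid)
        totals[g] += load
    return groups

# Example: four experts with uneven loads, two shared peripheral groups.
groups = group_experts([5, 1, 4, 2], 2)
```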

Technology Category

Application Category

๐Ÿ“ Abstract
Mixture-of-Experts (MoE) layers activate a subset of model weights, dubbed experts, to improve model performance. MoE is particularly promising for deployment on process-in-memory (PIM) architectures, because PIM can naturally fit experts separately and provide great benefits for energy efficiency. However, PIM chips often suffer from large area overhead, especially in the peripheral circuits. In this paper, we propose an area-efficient in-memory computing architecture for MoE transformers. First, to reduce area, we propose a crossbar-level multiplexing strategy that exploits MoE sparsity: experts are deployed on crossbars and multiple crossbars share the same peripheral circuits. Second, we propose expert grouping and group-wise scheduling methods to alleviate the load imbalance and contention overhead caused by sharing. In addition, to address the problem that the expert choice router requires access to all hidden states during generation, we propose a gate-output (GO)cache to store necessary results and bypass expensive additional computation. Experiments show that our approaches improve the area efficiency of the MoE part by up to 2.2x compared to a SOTA architecture. During generation, the cache improves performance and energy efficiency by 4.2x and 10.1x, respectively, compared to the baseline when generating 8 tokens. The total performance density achieves 15.6 GOPS/W/mm2. The code is open source at https://github.com/superstarghy/MoEwithPIM.
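The gate-output cache idea in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: an expert-choice router would otherwise need gating scores over all past hidden states at every generation step, so caching each new token's scores lets later steps reuse them instead of recomputing. The class `GOCache` and its interface are hypothetical.

```python
# Hypothetical sketch of a gate-output (GO) cache for expert-choice routing
# during generation. Assumption: gating scores are a linear projection of
# the hidden state; only the new token's scores are computed per step.
import numpy as np

class GOCache:
    """Stores per-token gate outputs so generation steps can reuse them."""
    def __init__(self):
        self.cached = []  # list of (num_experts,) score vectors

    def gate_scores(self, hidden_state, router_weights):
        # Compute gating scores for the new token only...
        scores = hidden_state @ router_weights
        self.cached.append(scores)
        # ...and return scores for the full sequence from the cache,
        # bypassing recomputation over earlier hidden states.
        return np.stack(self.cached)

# Example: hidden dim 16, 4 experts, generate 3 tokens.
rng = np.random.default_rng(0)
router = rng.standard_normal((16, 4))
cache = GOCache()
for _ in range(3):
    all_scores = cache.gate_scores(rng.standard_normal(16), router)
```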
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Process-in-Memory
Area Overhead
In-Memory Computing
Peripheral Circuits
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-memory computing
Mixture-of-Experts
crossbar multiplexing
expert caching
area efficiency
Hanyuan Gao
Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, USA
Xiaoxuan Yang
University of Virginia
In-Memory Computing
Computer-Aided Design
Machine Learning Acceleration