CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts

๐Ÿ“… 2024-10-21
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 2
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address insufficient inter-expert knowledge sharing and high sensitivity to routing accuracy in Mixture-of-Experts (MoE) models, this paper proposes CartesianMoE, a novel MoE architecture that, for the first time, introduces the Cartesian product into the routing mechanism, enabling robust knowledge coordination via multiplicative fusion of expert representations. Inspired by collective matrix factorization, CartesianMoE further combines shared representation learning with a sparse activation design to improve representational consistency and routing fault tolerance. Evaluated across multiple large language model (LLM) benchmarks, CartesianMoE reduces average perplexity, improves downstream task performance by 2.3%, and boosts routing robustness by 37% compared to baseline methods. It also outperforms both Top-K MoE and shared-expert baselines in training and inference efficiency.
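The summary does not include an implementation, but the routing idea can be sketched. The snippet below is a minimal, hypothetical PyTorch sketch assuming that each routable expert is the composition of one sub-expert drawn from each of two small sets, so the effective expert pool is their Cartesian product. The class and parameter names (CartesianProductMoE, n_a, n_b, top_k) are invented for illustration, and the gating and fusion details are not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CartesianProductMoE(nn.Module):
    """Hypothetical sketch: the routable expert pool is the Cartesian product
    of two small sets of sub-experts (n_a * n_b composite experts built from
    only n_a + n_b parameter blocks). Not the authors' reference code."""

    def __init__(self, d_model=512, d_hidden=1024, n_a=4, n_b=4, top_k=2):
        super().__init__()
        # Set A: first-stage sub-experts; Set B: second-stage sub-experts.
        self.experts_a = nn.ModuleList([nn.Linear(d_model, d_hidden) for _ in range(n_a)])
        self.experts_b = nn.ModuleList([nn.Linear(d_hidden, d_model) for _ in range(n_b)])
        # A single router scores all n_a * n_b composite experts.
        self.router = nn.Linear(d_model, n_a * n_b)
        self.n_a, self.n_b, self.top_k = n_a, n_b, top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # (num_tokens, n_a * n_b)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                            # token-by-token for clarity, not speed
            for k in range(self.top_k):
                e = idx[t, k].item()
                i, j = divmod(e, self.n_b)                    # composite expert (i, j) = B_j after A_i
                hidden = F.gelu(self.experts_a[i](x[t]))      # first-stage sub-expert A_i, shared across row i
                out[t] += weights[t, k] * self.experts_b[j](hidden)
        return out

# Example: 16 routable experts built from only 8 sub-expert blocks.
layer = CartesianProductMoE(n_a=4, n_b=4)
y = layer(torch.randn(8, 512))
```

Because every composite expert in row i reuses the same first-stage sub-expert A_i (and every expert in column j reuses B_j), a routing mistake within a row or column still lands on partially shared parameters, which is one way to read the robustness claim above.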

๐Ÿ“ Abstract
Large language models (LLMs) have recently attracted much attention from the community due to their remarkable performance on all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities but also significantly increases the computational cost. Mixture-of-Experts (MoE) models address this by allowing the model size to grow without substantially raising training or inference costs. Yet MoE models face challenges regarding knowledge sharing among experts, making their performance somewhat sensitive to routing accuracy. To tackle this, previous works introduced shared experts and combined their outputs with those of the top-$K$ routed experts in an "addition" manner. In this paper, inspired by collective matrix factorization, which learns knowledge shared among data, we propose CartesianMoE, which implements more effective knowledge sharing among experts in a more "multiplication"-like manner. Extensive experimental results indicate that CartesianMoE outperforms previous MoE models for building LLMs, in terms of both perplexity and downstream task performance. We also find that CartesianMoE achieves better expert routing robustness.
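To make the "addition" vs. "multiplication" contrast concrete, the equations below restate the two schemes schematically. The first line is the standard shared-expert MoE output; the second is one plausible reading of a Cartesian-product expert pool, where each routed expert composes sub-experts from two smaller sets. The symbols ($E_s$, $E_i$, $g_i$, $\mathcal{E}^A$, $\mathcal{E}^B$) are illustrative notation rather than the paper's own.

```latex
% Shared-expert MoE ("addition"): a shared expert added to the top-K routed experts.
y(x) \;=\; E_s(x) \;+\; \sum_{i \in \mathrm{TopK}(x)} g_i(x)\, E_i(x)

% Cartesian-product pool ("multiplication"): experts are pairs (i, j) from two sub-expert sets,
% so the pool size is |\mathcal{E}^A| \cdot |\mathcal{E}^B| and sub-experts are shared across pairs.
E_{(i,j)}(x) \;=\; \bigl(E^{B}_{j} \circ E^{A}_{i}\bigr)(x), \qquad
y(x) \;=\; \sum_{(i,j) \in \mathrm{TopK}(x)} g_{(i,j)}(x)\, E_{(i,j)}(x)
```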
Problem

Research questions and friction points this paper is trying to address.

Enhance knowledge sharing in MoE models
Improve expert routing robustness
Reduce computational complexity in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cartesian Product Routing
Mixture-of-Experts model
Collective matrix factorization (background sketch below)
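As background for the last item above, collective matrix factorization jointly factorizes related matrices so that one latent factor is shared between them; that shared factor is the "knowledge shared among data" the abstract alludes to. The objective below is the standard textbook form, not an equation from the paper.

```latex
% Two related matrices X_1 (entities A x B) and X_2 (entities B x C) are factorized jointly;
% the factor V for entity set B is shared, so structure learned from X_1 also shapes X_2.
\min_{U,\,V,\,W}\;
  \bigl\lVert X_1 - U V^{\top} \bigr\rVert_F^2
  \;+\;
  \bigl\lVert X_2 - V W^{\top} \bigr\rVert_F^2
```

In the MoE analogy, sub-experts reused by many composite experts play the role of the shared factor $V$.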
๐Ÿ”Ž Similar Papers
No similar papers found.
👥 Authors

Zhenpeng Su
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Xing Wu
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Zijia Lin
Tsinghua University
Information Retrieval, Computer Vision, Natural Language Processing, Machine Learning

Yizhe Xiong
Tsinghua University
Transfer Learning, Computer Vision, Large Language Models

Minxuan Lv
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Guangyuan Ma
Chinese Academy of Sciences
Information Retrieval

Hui Chen
Tsinghua University

Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Guiguang Ding
Tsinghua University
Computer Vision, Multimedia Retrieval