MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address scalability and inference-efficiency bottlenecks in large language models (LLMs), this paper proposes MoxE, a novel architecture integrating extended LSTM (xLSTM) with a sparse Mixture of Experts (MoE). Its core innovation is an entropy-aware dynamic routing mechanism that assigns tokens to specialized experts based on the entropy of the router's distribution over experts, enabling balanced modeling of both high- and low-frequency tokens: rare, high-entropy tokens are preferentially routed to mLSTM blocks, the matrix-memory variant within xLSTM. The authors additionally design entropy-regularization and group-balanced auxiliary losses to improve fairness of expert utilization and generalization. Experiments demonstrate that, at equal parameter count, MoxE achieves a 37% speedup in inference latency, reduces memory footprint by 42%, and significantly lowers language-modeling perplexity compared to strong baselines.
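The entropy-aware routing described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function name, the top-k gating scheme, and the use of router-distribution entropy as the rarity signal are assumptions based on the summary.

```python
import numpy as np

def entropy_aware_route(router_logits: np.ndarray, top_k: int = 2):
    """Hypothetical entropy-aware top-k routing sketch.

    router_logits: (num_tokens, num_experts) scores from a learned router.
    Returns top-k expert indices, renormalized gate weights, and the
    per-token routing entropy that MoxE-style routing could use to steer
    uncertain (often rare) tokens toward mLSTM experts.
    """
    # Numerically stable softmax over the expert dimension.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)      # (T, E)
    # Shannon entropy of each token's expert distribution; a high value
    # marks an uncertain token.
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1)         # (T,)
    # Top-k expert selection with gate weights renormalized over the k picks.
    expert_idx = np.argsort(probs, axis=-1)[:, ::-1][:, :top_k]    # (T, k)
    gates = np.take_along_axis(probs, expert_idx, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return expert_idx, gates, entropy
```

In such a scheme, the entropy value could then gate which expert group (mLSTM vs. sLSTM blocks) a token is dispatched to.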

📝 Abstract
This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and group-wise balancing losses, ensuring robust performance and efficient training. Theoretical analysis and empirical evaluations rigorously demonstrate that MoxE achieves significant efficiency gains and enhanced effectiveness compared to existing approaches, marking a notable advancement in scalable LLM architectures.
Problem

Research questions and friction points this paper is trying to address.

Addresses scalability and efficiency in large language models
Introduces entropy-based routing for balanced resource utilization
Enhances generalization with auxiliary losses for robust performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines xLSTM with Mixture of Experts for efficiency
Uses entropy-based routing for dynamic token allocation
Introduces auxiliary losses to enhance model generalization