xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

📅 2025-03-17

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

Large language models (LLMs) suffer from slow inference and high energy consumption, particularly at scale. Method: This work introduces the first 7-billion-parameter xLSTM model—the largest xLSTM to date—leveraging a gated recurrent unit design that achieves linear-time complexity and constant memory footprint, eliminating the long-range modeling bottlenecks of conventional RNNs without relying on attention mechanisms for efficient autoregressive decoding. Contribution/Results: The 7B-xLSTM matches the performance of comparably sized Transformer models on mathematical reasoning, code generation, and complex reasoning benchmarks. Relative to Llama-2 7B and Mamba-2 7B, it achieves 2.3× and 1.8× higher inference throughput, respectively, while reducing energy consumption by 40%. This work redefines the efficiency frontier for 7B-scale LLMs and establishes a new paradigm for deploying high-performance LLMs under resource-constrained or high-throughput scenarios.

Technology Category

Application Category

📝 Abstract

Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.

Problem

Research questions and friction points this paper is trying to address.

Improve inference speed and efficiency of LLMs

Scale xLSTM-based LLMs for larger models

Compare xLSTM 7B with other LLMs on performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

xLSTM architecture for linear compute scaling

7B-parameter model optimized for fast inference

Outperforms Llama and Mamba in speed and efficiency

🔎 Similar Papers

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling