Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

πŸ“… 2025-02-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the low inference efficiency, limited batch-processing capability, and high training-data requirements of large language models (LLMs) on resource-constrained devices, this paper introduces the Llamba family of efficient recurrent language models (Llamba-1B/3B/8B), distilled from Llama-3.x into the Mamba state-space architecture. Using MOHAWK, a cross-architecture knowledge-distillation framework (Bick et al., 2024), the models are trained with less than 0.1% of the data typically used for models of similar size while maintaining comparable benchmark performance. Compared with Transformer-based models, they achieve higher inference throughput and support significantly larger batch sizes, and an optimized implementation targets edge devices such as smartphones, improving the tradeoff between speed, memory efficiency, and quality.

πŸ“ Abstract
We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
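Cross-architecture distillation of this kind typically includes a stage that matches the student's output distribution to the teacher's. Below is a minimal, illustrative sketch of a temperature-scaled KL distillation loss for a single token position; the logit values and the two-model setup are hypothetical, and MOHAWK's full procedure additionally aligns internal representations across the two architectures, which this sketch does not show.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the vocabulary for one token position.

    The student (recurrent model) is trained to minimize this quantity,
    pulling its predictive distribution toward the teacher's.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# Hypothetical logits over a toy 4-word vocabulary.
teacher = [2.0, 0.5, -1.0, 0.1]   # e.g., from the Transformer teacher
student = [1.5, 0.8, -0.5, 0.0]   # e.g., from the recurrent student
print(distill_kl(teacher, teacher))  # identical distributions -> 0.0
print(distill_kl(teacher, student))  # mismatch -> positive loss
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among unlikely tokens, a standard choice in knowledge distillation.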
Problem

Research questions and friction points this paper is trying to address.

Low inference throughput and limited batch sizes of Transformer-based models
High training-data requirements for building capable language models
Poor suitability of existing LLMs for resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Llamba family of recurrent language models (1B/3B/8B) distilled from Llama-3.x into Mamba
Cross-architecture distillation via MOHAWK using less than 0.1% of typical training data
Optimized implementation for resource-constrained devices such as smartphones and edge platforms
πŸ”Ž Similar Papers
No similar papers found.