Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing

πŸ“… 2025-02-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the low inference efficiency, limited batch-processing capability, and high training-data requirements of large language models (LLMs) on resource-constrained devices, this paper introduces the Llamba family of efficient recurrent language models (Llamba-1B/3B/8B), distilled from Llama-3.x into the Mamba state-space architecture. Using MOHAWK, a cross-architecture knowledge-distillation framework (Bick et al., 2024), the models are trained with less than 0.1% of the data typically used for models of similar size while maintaining comparable benchmark performance. Compared with Transformer-based models, they achieve higher inference throughput and support significantly larger batch sizes, and an optimized implementation targets edge devices such as smartphones, improving the tradeoff between speed, memory efficiency, and quality.

πŸ“ Abstract
We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
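Cross-architecture distillation of this kind typically includes a stage that matches the student's output distribution to the teacher's. Below is a minimal, illustrative sketch of a temperature-scaled KL distillation loss for a single token position; the logit values and the two-model setup are hypothetical, and MOHAWK's full procedure additionally aligns internal representations across the two architectures, which this sketch does not show.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the vocabulary for one token position.

    The student (recurrent model) is trained to minimize this quantity,
    pulling its predictive distribution toward the teacher's.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# Hypothetical logits over a toy 4-word vocabulary.
teacher = [2.0, 0.5, -1.0, 0.1]   # e.g., from the Transformer teacher
student = [1.5, 0.8, -0.5, 0.0]   # e.g., from the recurrent student
print(distill_kl(teacher, teacher))  # identical distributions -> 0.0
print(distill_kl(teacher, student))  # mismatch -> positive loss
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among unlikely tokens, a standard choice in knowledge distillation.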
Problem

Research questions and friction points this paper is trying to address.

Low inference throughput and limited batch sizes of Transformer-based models
High training-data requirements for building capable language models
Poor suitability of existing LLMs for resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Llamba family of recurrent language models (1B/3B/8B) distilled from Llama-3.x into Mamba
Cross-architecture distillation via MOHAWK using less than 0.1% of typical training data
Optimized implementation for resource-constrained devices such as smartphones and edge platforms
πŸ”Ž Similar Papers
No similar papers found.