🤖 AI Summary
This work addresses the challenge of distilling knowledge from Transformer-based large language models (LLMs) into recurrent architectures (e.g., xLSTM, Mamba), a challenge that arises from their structural divergence from the self-attention mechanism. We propose Distil-xLSTM, the first compact language model built entirely on a recurrent architecture that explicitly models and distills the parameterization of Transformer attention. Our method leverages xLSTM's sequential mixing capability to learn a compact representation of attention behavior from an LLM via knowledge distillation, removing the conventional requirement of architectural homogeneity between teacher and student. The key contribution is the first interpretable, parameterized approximation of the attention mechanism by a purely recurrent model. Experiments demonstrate that Distil-xLSTM achieves performance competitive with larger models under minimal training overhead while exhibiting superior computational efficiency and scalability compared to attention-based counterparts, validating the feasibility and effectiveness of recurrent models in learning attention representations.
📝 Abstract
The current era of Natural Language Processing (NLP) is dominated by Transformer models. However, novel architectures relying on recurrent mechanisms, such as xLSTM and Mamba, have been proposed as alternatives to attention-based models. Although their computation differs from the attention mechanism, these recurrent models yield good results and sometimes even outperform state-of-the-art attention-based models. In this work, we propose Distil-xLSTM, an xLSTM-based Small Language Model (SLM) trained by distilling knowledge from a Large Language Model (LLM), which shows promising results while being compute- and scale-efficient. Our Distil-xLSTM focuses on approximating a Transformer-based model's attention parametrization using its recurrent sequence-mixing components and achieves good results with minimal training.
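The abstract does not spell out the training objective in this excerpt, but cross-architecture distillation of this kind is typically driven by matching the student's output distribution to the teacher's temperature-softened logits. The sketch below (NumPy, with hypothetical function names; the exact loss used by Distil-xLSTM may differ) shows the standard soft-label distillation term:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the vocabulary axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the softened teacher distribution ("soft labels")
    # and the softened student distribution, averaged over positions and
    # rescaled by T^2 so gradients stay comparable across temperatures.
    # Shapes: (num_positions, vocab_size).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()
```

Because the loss depends only on the two models' output logits, the teacher being attention-based and the student being recurrent is immaterial, which is what lets a distillation setup sidestep the usual architectural-homogeneity requirement.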