Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from high inference costs, and adapting them to novel attention architectures typically requires expensive retraining. Method: This paper introduces MHA2MLA, the first data-efficient fine-tuning method that migrates pretrained models (e.g., Llama2-7B) to DeepSeek's Multi-Head Latent Attention (MLA) without full pretraining. It integrates partial-RoPE pruning with a joint key-value SVD-based low-rank approximation, converting the architecture using only 0.3%–0.6% of the original training data while natively supporting KV cache quantization. Contribution/Results: Experiments demonstrate a 92.19% reduction in KV cache size with only a 0.5% drop in LongBench performance, significantly improving inference efficiency and deployment cost-effectiveness.
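The summary above mentions partial-RoPE pruning: rotary position encoding is kept only on the query/key dimensions that contribute most to attention scores, and removed elsewhere. A minimal sketch of the mechanics (not the paper's implementation; the contribution-based selection of which pairs to keep is assumed to happen upstream, and `rope_rotate` is a hypothetical helper):

```python
import numpy as np

def rope_rotate(x, positions, dims_with_rope):
    """Apply RoPE only on selected 2D dimension pairs (partial-RoPE sketch).

    x: (seq, d) slice of queries or keys.
    positions: (seq,) token positions.
    dims_with_rope: indices of dimension *pairs* that keep rotary encoding;
    all other pairs are left untouched (i.e., no positional rotation).
    """
    seq, d = x.shape
    half = d // 2
    out = x.copy()
    # standard RoPE frequency schedule
    inv_freq = 1.0 / (10000 ** (np.arange(half) / half))
    for p in dims_with_rope:  # rotate only the kept pairs
        theta = positions * inv_freq[p]
        cos, sin = np.cos(theta), np.sin(theta)
        a, b = x[:, 2 * p], x[:, 2 * p + 1]
        out[:, 2 * p] = a * cos - b * sin
        out[:, 2 * p + 1] = a * sin + b * cos
    return out
```

Pairs outside `dims_with_rope` carry no positional signal, which is what makes the remaining (position-free) dimensions compatible with MLA's latent compression.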

📝 Abstract
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
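The abstract's second component, joint SVD of the pretrained key and value projections, can be sketched as follows: concatenate the two weight matrices, truncate the SVD, and split the result into a shared down-projection (the latent) plus per-matrix up-projections. This is a minimal illustration under assumed shapes, not the paper's exact factorization:

```python
import numpy as np

def joint_kv_svd(W_k, W_v, rank):
    """Jointly factor key/value projections into a shared low-rank latent.

    W_k, W_v: (d_model, d_head) pretrained projection matrices.
    Returns W_down (d_model, rank) plus U_k, U_v (rank, d_head) such that
    W_k ~= W_down @ U_k and W_v ~= W_down @ U_v.
    """
    d_head = W_k.shape[1]
    W = np.concatenate([W_k, W_v], axis=1)         # (d_model, 2*d_head)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_down = U[:, :rank]                           # shared latent projection
    up = np.diag(S[:rank]) @ Vt[:rank]             # (rank, 2*d_head)
    return W_down, up[:, :d_head], up[:, d_head:]
```

Because both K and V are reconstructed from the same `rank`-dimensional latent per token, the cache stores only that latent instead of full keys and values, which is the source of the KV cache reduction reported above.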
Problem

Research questions and friction points this paper is trying to address.

Efficient inference in LLMs
Transition from MHA to MLA
Reduce KV cache size significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-head Latent Attention compression
Data-efficient fine-tuning method
Joint SVD approximations for keys and values
Tao Ji
Renmin University of China
Bin Guo
East China Normal University
Yuanbin Wu
East China Normal University
Qipeng Guo
Fudan University
Lixing Shen
Hikvision Inc
Zhan Chen
Georgia Southern University
Xipeng Qiu
Fudan University
Qi Zhang
Fudan University
Tao Gui
Fudan University