Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from high inference costs, and adapting them to novel attention architectures typically requires expensive retraining. Method: This paper introduces MHA2MLA, the first data-efficient fine-tuning method that migrates pretrained models (e.g., Llama2-7B) to DeepSeek's Multi-Head Latent Attention (MLA) without full pretraining. It integrates partial-RoPE pruning with a joint key-value SVD-based low-rank approximation, converting the architecture using only 0.3%–0.6% of the original training data while natively supporting KV cache quantization. Contribution/Results: Experiments demonstrate a 92.19% reduction in KV cache size with only a 0.5% drop in LongBench performance, significantly improving inference efficiency and deployment cost-effectiveness.
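The summary above mentions partial-RoPE pruning: rotary position encoding is kept only on the query/key dimensions that contribute most to attention scores, and removed elsewhere. A minimal sketch of the mechanics (not the paper's implementation; the contribution-based selection of which pairs to keep is assumed to happen upstream, and `rope_rotate` is a hypothetical helper):

```python
import numpy as np

def rope_rotate(x, positions, dims_with_rope):
    """Apply RoPE only on selected 2D dimension pairs (partial-RoPE sketch).

    x: (seq, d) slice of queries or keys.
    positions: (seq,) token positions.
    dims_with_rope: indices of dimension *pairs* that keep rotary encoding;
    all other pairs are left untouched (i.e., no positional rotation).
    """
    seq, d = x.shape
    half = d // 2
    out = x.copy()
    # standard RoPE frequency schedule
    inv_freq = 1.0 / (10000 ** (np.arange(half) / half))
    for p in dims_with_rope:  # rotate only the kept pairs
        theta = positions * inv_freq[p]
        cos, sin = np.cos(theta), np.sin(theta)
        a, b = x[:, 2 * p], x[:, 2 * p + 1]
        out[:, 2 * p] = a * cos - b * sin
        out[:, 2 * p + 1] = a * sin + b * cos
    return out
```

Pairs outside `dims_with_rope` carry no positional signal, which is what makes the remaining (position-free) dimensions compatible with MLA's latent compression.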

📝 Abstract
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
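The abstract's second component, joint SVD of the pretrained key and value projections, can be sketched as follows: concatenate the two weight matrices, truncate the SVD, and split the result into a shared down-projection (the latent) plus per-matrix up-projections. This is a minimal illustration under assumed shapes, not the paper's exact factorization:

```python
import numpy as np

def joint_kv_svd(W_k, W_v, rank):
    """Jointly factor key/value projections into a shared low-rank latent.

    W_k, W_v: (d_model, d_head) pretrained projection matrices.
    Returns W_down (d_model, rank) plus U_k, U_v (rank, d_head) such that
    W_k ~= W_down @ U_k and W_v ~= W_down @ U_v.
    """
    d_head = W_k.shape[1]
    W = np.concatenate([W_k, W_v], axis=1)         # (d_model, 2*d_head)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_down = U[:, :rank]                           # shared latent projection
    up = np.diag(S[:rank]) @ Vt[:rank]             # (rank, 2*d_head)
    return W_down, up[:, :d_head], up[:, d_head:]
```

Because both K and V are reconstructed from the same `rank`-dimensional latent per token, the cache stores only that latent instead of full keys and values, which is the source of the KV cache reduction reported above.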
Problem

Research questions and friction points this paper is trying to address.

Efficient inference in LLMs
Transition from MHA to MLA
Reduce KV cache size significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-head Latent Attention compression
Data-efficient fine-tuning method
Joint SVD approximations for keys and values
Tao Ji
Renmin University of China
Bin Guo
East China Normal University
Yuanbin Wu
East China Normal University
Qipeng Guo
Fudan University
Lixing Shen
Hikvision Inc
Zhan Chen
Georgia Southern University
Xipeng Qiu
Fudan University
Qi Zhang
Fudan University
Tao Gui
Fudan University