A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive storage and computational overhead of linear layers in large language models (LLMs) deployed on resource-constrained hardware such as FPGAs, this work proposes a co-design methodology integrating Tensor Train Decomposition (TTD) with a Group Vector Systolic Array (GVSA). TTD is innovatively applied for low-rank reconstruction of linear layers, while GVSA is a custom hardware architecture featuring DSP-sharing parallel processing units and optimized dataflow scheduling. This synergy enables high compression ratios without compromising inference efficiency. Evaluated on ChatGLM3-6B and LLaMA2-7B, the approach achieves end-to-end parameter compression ratios of 1.94× and 1.60×, respectively, and reduces first-token latency by 1.45× and 1.57×. The solution delivers a scalable, software-hardware co-optimized framework for efficient LLM deployment on edge FPGA platforms.

📝 Abstract
Large language models (LLMs) are both storage-intensive and computation-intensive, posing significant challenges when deployed on resource-constrained hardware. As linear layers are the main resource-consuming parts of LLMs, this paper develops a tensor-train decomposition (TTD) for LLMs together with a hardware implementation on FPGA. TTD compression is applied to the linear layers of the ChatGLM3-6B and LLaMA2-7B models, achieving whole-network compression ratios (CRs) of 1.94× and 1.60×, respectively. The compressed LLMs are further implemented on FPGA hardware within a highly efficient group vector systolic array (GVSA) architecture, which has DSP-shared parallel vector PEs for TTD inference as well as optimized data communication in the accelerator. Experimental results show that the corresponding TTD-based LLM accelerator implemented on FPGA achieves 1.45× and 1.57× reductions in first-token delay for the ChatGLM3-6B and LLaMA2-7B models, respectively.
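The paper does not spell out its exact reshaping scheme or TT-ranks here, but the core idea of TTD compression of a linear layer can be illustrated with the standard TT-SVD algorithm: the weight matrix is reshaped into a higher-order tensor (pairing output and input factor dimensions per mode) and factored into a chain of small cores via sequential truncated SVDs. The sketch below is a minimal NumPy illustration under those assumptions; `tt_decompose`, `tt_reconstruct`, and the chosen shapes are hypothetical names for this example, not the authors' code.

```python
import numpy as np

def tt_decompose(W, in_shape, out_shape, max_rank):
    """TT-SVD of a weight matrix W of shape (prod(out_shape), prod(in_shape)).

    Mode k of the tensor pairs out_shape[k] with in_shape[k] (the usual
    TT-matrix layout); ranks are capped at max_rank (truncation = lossy).
    """
    d = len(in_shape)
    T = W.reshape(*out_shape, *in_shape)
    # Interleave axes: (o0, i0, o1, i1, ...), then merge each (o_k, i_k) pair.
    perm = [ax for pair in zip(range(d), range(d, 2 * d)) for ax in pair]
    modes = [o * i for o, i in zip(out_shape, in_shape)]
    C = T.transpose(perm).reshape(modes)
    cores, r_prev = [], 1
    for k in range(d - 1):
        C = C.reshape(r_prev * modes[k], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, S.size)
        cores.append(U[:, :r].reshape(r_prev, modes[k], r))
        C = S[:r, None] * Vt[:r]       # carry the remainder to the next mode
        r_prev = r
    cores.append(C.reshape(r_prev, modes[-1], 1))
    return cores

def tt_reconstruct(cores, in_shape, out_shape):
    """Contract the TT cores back into a dense weight matrix."""
    d = len(in_shape)
    full = cores[0].reshape(cores[0].shape[1], -1)
    for core in cores[1:]:
        r_prev, n, r = core.shape
        full = (full @ core.reshape(r_prev, n * r)).reshape(-1, r)
    # Undo the (o_k, i_k) interleaving to recover the matrix layout.
    full = full.reshape([s for o, i in zip(out_shape, in_shape) for s in (o, i)])
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    return full.transpose(perm).reshape(
        int(np.prod(out_shape)), int(np.prod(in_shape)))

# Toy 64x64 "linear layer"; at full TT-rank the factorization is exact,
# while smaller max_rank trades accuracy for fewer parameters.
rng = np.random.default_rng(0)
in_shape, out_shape = (4, 4, 4), (4, 4, 4)
W = rng.standard_normal((64, 64))
cores = tt_decompose(W, in_shape, out_shape, max_rank=16)
W_hat = tt_reconstruct(cores, in_shape, out_shape)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
tt_params = sum(c.size for c in cores)
print(f"relative error: {rel_err:.2e}, TT params: {tt_params} vs dense {W.size}")
```

Note that at full rank the TT format can hold more parameters than the dense matrix; the compression ratios reported in the paper come from truncating the TT-ranks well below full rank, which is where the accuracy/compression trade-off lives.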
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Constraints
FPGA Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor Train Decomposition
Large Language Models Compression
FPGA Acceleration
Sixiao Huang, Tintin Wang, Ang Li, Ao Shen (Purdue University; machine learning system and architecture), Kai Li, Keyao Jiang, Mingqiang Huang, Hao Yu