Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational complexity and excessive GPU memory consumption of conventional attention mechanisms in long-sequence modeling, this paper proposes an efficient inference optimization framework tailored for DCU hardware. Methodologically, it introduces (1) Opt-GQA—a novel attention mechanism integrating grouped-query attention (GQA) with ALiBi positional bias to reduce computational overhead; (2) DCU-specific GPU kernels and a paged memory management strategy to enhance hardware parallelism and memory utilization; and (3) gradient-calibrated GPTQ post-training quantization to jointly optimize accuracy and efficiency. Integrated into the vLLM system, the framework achieves a 32% improvement in long-sequence inference throughput and a 41% reduction in peak GPU memory usage, significantly strengthening the deployment capability of large language models for long-context scenarios.

📝 Abstract
In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post-Training Quantization (GPTQ) scheme that combines the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.
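The grouping idea described in the abstract can be sketched as follows: query heads are partitioned into groups that share a single key-value head, and an ALiBi bias (a per-head slope times the token distance) replaces positional embeddings. This is a minimal NumPy illustration under assumed head counts and the slope schedule from the ALiBi paper, not the paper's DCU-optimized implementation; all names are illustrative.

```python
import numpy as np

def alibi_slopes(n_heads):
    # Geometric slope schedule 2^(-8/n), 2^(-16/n), ... from the ALiBi paper
    return np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def gqa_alibi_attention(q, k, v, n_kv_heads):
    """Grouped-query attention with ALiBi bias (toy, single sequence).

    q: (H, T, d) query heads; k, v: (H_kv, T, d) shared KV heads, H % H_kv == 0.
    """
    H, T, d = q.shape
    group = H // n_kv_heads
    slopes = alibi_slopes(H)
    # Causal ALiBi term: 0 on the diagonal, -slope * distance below, -inf above
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]          # j - i
    causal = np.where(dist <= 0, dist.astype(float), -np.inf)
    out = np.empty_like(q)
    for h in range(H):
        kv = h // group                    # query head h reuses KV head kv
        scores = q[h] @ k[kv].T / np.sqrt(d) + slopes[h] * causal
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))   # softmax
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With 8 query heads and 2 KV heads, the KV cache shrinks 4x relative to MHA, which is the memory saving GQA trades against a small accuracy cost.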
Problem

Research questions and friction points this paper is trying to address.

Reduces computational complexity in deep learning attention mechanisms
Minimizes memory fragmentation and enhances memory utilization
Optimizes GPU kernels for faster attention computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines GQA with paging memory management
Optimizes GPU kernels for reduced latency
Integrates ALiBi for long-sequence efficiency
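The paging idea behind the second bullet can be illustrated with a toy KV cache in the style of vLLM's PagedAttention: logical token positions are mapped through a block table to fixed-size physical blocks, so memory is allocated in small units rather than one contiguous buffer per sequence. This is a hedged sketch with assumed block and head sizes, not the paper's DCU kernel.

```python
import numpy as np

BLOCK = 4   # tokens per physical block (illustrative size)
D = 8       # per-token KV vector width (illustrative)

class PagedKVCache:
    """Toy paged KV cache: a block table maps logical positions to
    fixed-size physical blocks, reducing fragmentation from growth."""

    def __init__(self):
        self.pool = []          # physical blocks, each of shape (BLOCK, D)
        self.block_table = []   # logical block index -> index into pool
        self.length = 0         # number of tokens stored so far

    def append(self, kv_vec):
        if self.length % BLOCK == 0:        # current block full: allocate one
            self.pool.append(np.zeros((BLOCK, D)))
            self.block_table.append(len(self.pool) - 1)
        blk = self.block_table[self.length // BLOCK]
        self.pool[blk][self.length % BLOCK] = kv_vec
        self.length += 1

    def gather(self):
        # Reassemble the logical sequence for an attention read
        rows = [self.pool[self.block_table[i // BLOCK]][i % BLOCK]
                for i in range(self.length)]
        return np.stack(rows)
```

Because blocks need not be contiguous, freed blocks from finished sequences can be reused immediately, which is the source of the memory-utilization gains the paper reports.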
Jie Kong
School of Computer Science and Engineering, Shandong University of Science and Technology
Junxiang Zhang
School of Computer Science and Engineering, Shandong University of Science and Technology
Jiheng Xu
School of Computer Science and Engineering, Shandong University of Science and Technology
Yalong Li
School of Computer Science and Engineering, Shandong University of Science and Technology
Shouhua Zhang
Faculty of Information Technology and Electrical Engineering, University of Oulu
Jiehan Zhou
Shandong University of Science and Technology
Topics: Industrial Large Models, Digital Twins, Industry 5.0, Internet of Things, Cloud Computing
Yuhai Liu
Dawning Information Industry Co., Ltd.
Peng Liang
School of Computer Science, Wuhan University
Topics: Software Engineering, Software Architecture, Empirical Software Engineering
Quan Zhang
School of Computer Science and Engineering, Southwest Petroleum University
Luohan Jiang
School of Computer Science and Engineering, Shandong University of Science and Technology