Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

📅 2026-06-08
🤖 AI Summary
Existing pruning methods for large language models lack a unified, hardware-aware evaluation framework, which makes their real acceleration potential hard to assess. This work proposes a GEMM-centric taxonomy of pruning strategies and builds a standardized benchmark covering static and dynamic approaches along the depth and width dimensions. Using Pareto frontier analysis, the framework systematically evaluates the trade-off between inference speedup and model quality. The results show that static depth pruning stays closest to its theoretical acceleration ceiling in memory-bound scenarios. As the tolerated quality loss grows, the best strategy during prefill shifts from static depth to dynamic depth and then to static width. Together, these results give what the authors describe as the first unified view of the practical performance limits and applicable scenarios of mainstream pruning techniques.
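As a rough illustration of the Pareto frontier analysis mentioned above, the sketch below keeps only the configurations that no other configuration beats on both quality loss and speedup. It is not the paper's implementation. The `Point` type, the `pareto_frontier` helper, and every number are hypothetical placeholders rather than results from the benchmark.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """One pruning configuration (hypothetical structure, not from the paper)."""
    name: str
    quality_loss: float  # relative quality loss in percent; lower is better
    speedup: float       # measured speedup over the dense model; higher is better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated in (quality_loss, speedup).

    A point is dominated when another point has quality loss that is no
    higher and speedup that is no lower, with at least one strictly better.
    """
    # Sort by ascending loss; break ties by putting the faster point first.
    ordered = sorted(points, key=lambda p: (p.quality_loss, -p.speedup))
    frontier: List[Point] = []
    best_speedup = float("-inf")
    for p in ordered:
        # Keep a point only if it is faster than every point with lower
        # (or equal) quality loss seen so far.
        if p.speedup > best_speedup:
            frontier.append(p)
            best_speedup = p.speedup
    return frontier


# Placeholder numbers for illustration only; NOT measurements from the paper.
candidates = [
    Point("cfg_a", 1.0, 1.2),
    Point("cfg_b", 2.0, 1.1),  # dominated by cfg_a (more loss, less speedup)
    Point("cfg_c", 3.0, 1.5),
    Point("cfg_d", 3.0, 1.4),  # dominated by cfg_c (same loss, less speedup)
]
frontier_names = [p.name for p in pareto_frontier(candidates)]
```

Sorting first means the sweep runs in O(n log n), which is comfortable even for a dense grid of pruning ratios.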
📝 Abstract
Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0%-4%), to dynamic depth at moderate loss (5%-16%), and finally to static width pruning at higher loss levels (17%-26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim
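To make the M, N, and K framing concrete, the sketch below counts FLOPs for a single GEMM of shape (M x K) times (K x N) and shows how shrinking one logical dimension changes the theoretical speedup. It is a minimal illustration, not the paper's benchmark. The mapping in the comments (M as token count, K as input features, N as output features of a linear layer) is an assumption for illustration, and the `Gemm` class and `shrink` helper are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gemm:
    """Logical shape of C[M, N] = A[M, K] @ B[K, N] (hypothetical helper)."""
    m: int  # rows of the activation matrix, assumed here to be the token count
    n: int  # output features of the layer (assumption for illustration)
    k: int  # reduction dimension, i.e. input features (assumption)

    def flops(self) -> int:
        # One multiply and one add per (m, n, k) triple.
        return 2 * self.m * self.n * self.k


def shrink(gemm: Gemm, dim: str, keep_ratio: float) -> Gemm:
    """Return a copy of `gemm` with one dimension scaled by `keep_ratio`."""
    if dim not in ("m", "n", "k"):
        raise ValueError(f"unknown GEMM dimension: {dim!r}")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    new_size = max(1, int(getattr(gemm, dim) * keep_ratio))
    return replace(gemm, **{dim: new_size})


# A dense linear layer applied to 1024 tokens (sizes chosen arbitrarily).
dense = Gemm(m=1024, n=4096, k=4096)

# Dropping half the tokens shrinks M; dropping half the output features shrinks N.
fewer_tokens = shrink(dense, "m", 0.5)
narrower = shrink(dense, "n", 0.5)

# FLOP-based upper bound on speedup; real kernels may fall short of this.
theoretical_speedup = dense.flops() / fewer_tokens.flops()
```

The FLOP ratio is only a ceiling. As the abstract notes, realized speedups depend heavily on hardware and kernel implementations, which is the gap the benchmark sets out to measure.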
Problem

Research questions and friction points this paper is trying to address.

LLM pruning
inference acceleration
GEMM
benchmarking
Pareto frontier
Innovation

Methods, ideas, or system contributions that make the work stand out.

GEMM-centric taxonomy
LLM pruning
inference acceleration
Pareto frontier
benchmarking framework
Authors
Haozhe Hu
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Hao Wu
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Anhao Zhao
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Department of Computing, The Hong Kong Polytechnic University
Longwei Ding
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Peiran Yin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Yunpu Ma
Ludwig Maximilian University of Munich
Foundation Models, Agentic AI, Temporal Knowledge Graph, Quantum AI
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning