🤖 AI Summary
Existing LLM inference performance prediction methods rely either on hardware-specific benchmarks or on black-box ML models, and therefore generalize poorly. This paper proposes LIFE, a lightweight, hardware- and benchmark-agnostic analytical framework for fine-grained latency and throughput prediction across heterogeneous platforms (e.g., CPU, NPU, iGPU, GPU) and optimization techniques (e.g., quantization, KV cache compression, LoRA). LIFE employs modular, operator-level modeling that incorporates hardware-specific TOPS and memory-bandwidth parameters to build interpretable predictive models for three key metrics: first-token latency, per-token latency, and tokens-per-second (TPS). Evaluated on Llama2-7B variants across AMD Ryzen CPUs, NPUs, iGPUs, and NVIDIA V100 GPUs, LIFE achieves an average prediction error below 8%. It is the first framework to enable accurate, portable, and interpretable inference performance estimation without platform-specific profiling or training data, significantly accelerating cross-platform LLM deployment.
📝 Abstract
Large language models (LLMs) are increasingly deployed as local agents on personal devices with CPUs, NPUs, and integrated GPUs. However, forecasting inference performance on such heterogeneous devices remains challenging due to their dynamic compute and memory demands. Existing approaches rely on GPU benchmarking or machine-learning-based latency predictors, which are often hardware-specific and lack generalizability. To this end, we introduce LIFE, a lightweight framework composed of modular, configurable analytical models of operators that characterize LLM inference workloads in a hardware- and dataset-agnostic manner. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attention mechanisms, and operator fusion, on performance metrics including time-to-first-token (TTFT), time-per-output-token (TPOT), and tokens-per-second (TPS). LIFE enables performance forecasting using only hardware specifications, such as TOPS and memory bandwidth, without requiring extensive dataset benchmarking. We validate LIFE's forecasts with Llama2-7B variants running on AMD Ryzen CPUs, NPUs, and iGPUs as well as NVIDIA V100 GPUs, demonstrating its utility in forecasting LLM performance through the lens of system efficiency to enable efficient LLM deployment across different hardware platforms.
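To make the idea concrete, here is a minimal sketch of the kind of operator-level, roofline-style estimate the abstract describes: each operator's latency is bounded by the slower of its compute time (from peak TOPS) and its memory time (from peak bandwidth). This is a hypothetical illustration, not LIFE's actual model; the class and function names and the 2-FLOPs-per-parameter decode approximation are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    """Peak hardware specifications, as used for forecasting."""
    tops: float            # peak compute throughput, tera-ops/s
    bandwidth_gbps: float  # peak memory bandwidth, GB/s


def op_latency_s(flops: float, bytes_moved: float, hw: Hardware) -> float:
    """Roofline bound: an operator is limited by compute or memory,
    whichever is slower."""
    compute_s = flops / (hw.tops * 1e12)
    memory_s = bytes_moved / (hw.bandwidth_gbps * 1e9)
    return max(compute_s, memory_s)


def decode_tpot_s(n_params: float, bytes_per_param: float,
                  hw: Hardware) -> float:
    """Rough per-output-token latency for a weight-streaming decode step:
    ~2 FLOPs per parameter, every weight byte read once per token.
    (Ignores KV cache traffic and activation memory for brevity.)"""
    flops = 2.0 * n_params
    bytes_moved = n_params * bytes_per_param
    return op_latency_s(flops, bytes_moved, hw)


# Example: a 7B-parameter model in FP16 (2 bytes/param) on a device with
# 112 TOPS and 900 GB/s; decode is memory-bound, so TPS tracks bandwidth.
hw = Hardware(tops=112.0, bandwidth_gbps=900.0)
tpot = decode_tpot_s(7e9, 2.0, hw)
tps = 1.0 / tpot
```

Such a model makes the effect of optimizations directly interpretable: quantizing to 4 bits shrinks `bytes_per_param`, which proportionally reduces TPOT whenever the decode step is memory-bound.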