On Evaluating Performance of LLM Inference Serving Systems

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies pervasive anti-patterns in evaluating LLM inference serving systems, including unfair baseline comparisons, unrepresentative experimental setups, and narrow metric design, that undermine both scientific rigor and practical utility. The authors systematically characterize these anti-patterns along three dimensions (Baseline Fairness, Evaluation Setup, and Metric Design) and explain why they are uniquely damaging for LLM inference: its dual-phase prefill/decode execution, its highly heterogeneous workloads, and its strict real-time constraints for interactive use. From this analysis they derive an actionable evaluation checklist and demonstrate it in a case study of speculative decoding, whose bursty, non-uniform token generation is easily misread under averaged metrics. The work aims to make evaluations of LLM inference systems reproducible and comparable, bridging the gap between academic benchmarks and production practice.

📝 Abstract
The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns -- such as inadequate baseline comparisons that conflate engineering effort with algorithmic novelty, workload selections that fail to represent production scenarios, and metric normalizations that hide substantial performance variability like generation stalls -- lead to misleading conclusions. To address these challenges, we provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns in favor of robust LLM inference evaluation. To demonstrate the practical application of our framework, we present a case study analyzing speculative decoding, a technique whose bursty, non-uniform token generation is easily misinterpreted when evaluated using approaches characteristic of these anti-patterns. Our work establishes a rigorous foundation for evaluation methodology, enabling meaningful comparisons, ensuring reproducible results, and ultimately accelerating genuine progress in LLM inference systems by moving beyond common anti-patterns to align evaluation with real-world requirements.
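The abstract's point about metric normalization hiding generation stalls can be illustrated with a small sketch (the timestamps are hypothetical, not data from the paper): two requests that finish the same number of tokens in the same wall-clock time have identical mean time-between-tokens (TBT), yet one of them stalls for 1.8 seconds mid-stream. Averaging erases exactly the behavior an interactive user notices.

```python
# Hypothetical token-arrival timestamps (seconds) for two requests that
# each emit 10 tokens in 2.0 s, so their mean TBT is identical -- but one
# suffers a 1.8 s generation stall mid-stream.
smooth = [0.2 * i for i in range(1, 11)]                 # steady 200 ms/token
stalled = [0.02 * i for i in range(1, 9)] + [1.96, 2.0]  # fast burst, long stall

def tbt_stats(arrivals):
    """Return (mean inter-token gap, worst inter-token gap) for a request
    that starts at t=0 and emits tokens at the given timestamps."""
    gaps = [b - a for a, b in zip([0.0] + arrivals, arrivals)]
    return sum(gaps) / len(gaps), max(gaps)

for name, ts in [("smooth", smooth), ("stalled", stalled)]:
    mean_tbt, max_gap = tbt_stats(ts)
    print(f"{name}: mean TBT = {mean_tbt:.3f}s, max gap = {max_gap:.3f}s")
```

Both streams report a mean TBT of 0.2 s, but their worst inter-token gaps differ by a factor of nine, which is why the paper argues for fine-grained temporal metrics (tail gaps, stall counts) over normalized averages.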
Problem

Research questions and friction points this paper is trying to address.

Identifies flaws in LLM inference evaluation methodologies
Highlights anti-patterns in Baseline Fairness, Evaluation Setup, and Metric Design
Proposes framework to ensure robust, reproducible performance evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies evaluation anti-patterns in LLM systems
Proposes a checklist for robust LLM evaluation
Demonstrates the framework via a speculative decoding case study
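The speculative decoding case study hinges on bursty token arrival: each draft-and-verify cycle releases a variable-length run of accepted draft tokens plus one target-model token all at once. A toy simulation (all parameters below are illustrative assumptions, not measurements from the paper) shows how per-token averages can look better than a plain autoregressive baseline even though the user-visible pauses between bursts are longer:

```python
import random

random.seed(0)
STEP_LATENCY = 0.30   # assumed cost of one draft+verify cycle (seconds)
BASELINE_TBT = 0.12   # assumed plain autoregressive per-token latency

def speculative_arrivals(n_steps, max_draft=4):
    """Toy model: each verify step accepts 0..max_draft draft tokens and
    releases them together with one target token at the same instant."""
    t, arrivals = 0.0, []
    for _ in range(n_steps):
        t += STEP_LATENCY
        accepted = random.randint(0, max_draft)
        arrivals += [t] * (accepted + 1)   # the whole burst lands at once
    return arrivals

arrivals = speculative_arrivals(200)
gaps = [b - a for a, b in zip([0.0] + arrivals, arrivals)]
mean_tbt = arrivals[-1] / len(arrivals)
print(f"tokens: {len(arrivals)}, mean TBT: {mean_tbt:.3f}s "
      f"(baseline {BASELINE_TBT:.2f}s), max gap: {max(gaps):.2f}s")
```

Under these assumed numbers the speculative stream wins on mean TBT, yet every pause between bursts lasts the full 0.30 s verify cycle, more than twice the baseline's steady 0.12 s cadence. An evaluation that reports only normalized throughput would miss this, which is exactly the misinterpretation the case study targets.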