🤖 AI Summary
To address the challenge of robust performance modeling and uncertainty quantification for large language model (LLM) inference systems under dynamic workloads, heterogeneous hardware, and multidimensional configuration spaces, this paper proposes ALA (Analytical with Learning Augmentation), a hybrid framework integrating analytical modeling and machine learning. ALA introduces a novel uncertainty estimation mechanism grounded in vector-space similarity, jointly leveraging an analytical throughput model and supervised learning, while employing simulated annealing to optimize the error-prediction submodel. Evaluated across diverse architectures, models, and batch sizes, ALA generalizes well to unseen configurations, achieving low median prediction error. Moreover, it delivers statistically principled, interpretable predictions with controllable confidence, enabling adaptive scheduling and cost-aware LLM inference deployment.
📝 Abstract
Large Language Model (LLM) inference systems present significant challenges in statistical performance characterization due to dynamic workload variations, diverse hardware architectures, and complex interactions between model size, batch processing, and throughput requirements. Accurate statistical characterization enables better workload scheduling, adaptive resource provisioning, and cost-aware inference optimization, making it crucial for improving efficiency in large-scale AI deployments. Traditional analytical models provide explainability but cannot cover the vast diversity of real-world workloads, making it impossible to benchmark every scenario in advance. Machine learning (ML) approaches effectively predict performance for non-benchmarked cases but struggle when extrapolating beyond their observed training space. To address these limitations for LLM inference systems, we propose an Analytical with Learning Augmentation (ALA) framework that bridges analytical modeling with ML for robust statistical prediction and uncertainty estimation in LLM inference workloads. Our method employs an analytical throughput model with parameters estimated for benchmarked workloads, then extends to unobserved configurations using ML predictions. We enhance this with simulated annealing, which exploits subsets of workload data-point combinations to develop an error predictor. Finally, we quantify uncertainty based on vector-space similarity between new and observed workloads to ensure robust generalization. Through extensive experimentation on diverse LLM inference workloads, we demonstrate that our framework achieves low median errors while maintaining adaptability to new inference scenarios.
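The similarity-based uncertainty idea can be illustrated with a minimal sketch: encode each workload configuration as a feature vector and score a new workload's prediction confidence by its closest cosine similarity to any benchmarked workload. The feature set, the log-scaling, and the function names below are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def workload_vector(model_params_b, batch_size, seq_len, gpu_mem_gb):
    """Encode a workload configuration as a feature vector.
    Log-scaling the raw features is an assumption for illustration;
    the paper does not prescribe this particular encoding."""
    return np.log1p(np.array([model_params_b, batch_size, seq_len, gpu_mem_gb],
                             dtype=float))

def similarity_confidence(new_cfg, observed_cfgs):
    """Confidence proxy for a new workload: the maximum cosine similarity
    between its vector and any benchmarked workload's vector."""
    v = workload_vector(*new_cfg)
    best = 0.0
    for cfg in observed_cfgs:
        u = workload_vector(*cfg)
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        best = max(best, cos)
    return best

# Hypothetical benchmarked configs: (params in B, batch, seq len, GPU mem GB)
observed = [(7, 8, 2048, 40), (13, 16, 4096, 80)]

# A config near the benchmarked set vs. one far outside it
conf_near = similarity_confidence((7, 16, 2048, 40), observed)
conf_far = similarity_confidence((70, 256, 32768, 640), observed)
assert conf_far < conf_near  # dissimilar workloads receive lower confidence
```

A prediction for a low-similarity workload would then be reported with wider uncertainty bounds, which is what enables the controllable-confidence behavior the abstract describes.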