🤖 AI Summary
To address the challenge of robust performance modeling and uncertainty quantification for large language model (LLM) inference systems under dynamic workloads, heterogeneous hardware, and multidimensional configuration spaces, this paper proposes ALA (Analytical with Learning Augmentation), a hybrid framework integrating analytical modeling and machine learning. ALA introduces a novel uncertainty estimation mechanism grounded in vector-space similarity, jointly leveraging an analytical throughput model and supervised learning, while employing simulated annealing to optimize the error-prediction submodel. Evaluated across diverse architectures, models, and batch sizes, ALA generalizes well to unseen configurations, achieving low median prediction error. Moreover, it delivers statistically principled, interpretable predictions with controllable confidence, enabling adaptive scheduling and cost-aware LLM inference deployment.
📝 Abstract
Large Language Model (LLM) inference systems present significant challenges in statistical performance characterization due to dynamic workload variations, diverse hardware architectures, and complex interactions between model size, batch processing, and throughput requirements. Accurate statistical characterization enables better workload scheduling, adaptive resource provisioning, and cost-aware inference optimization, making it crucial for improving efficiency in large-scale AI deployments. Traditional analytical models provide explainability but cannot cover the vast diversity of real-world workloads, making it impossible to benchmark every scenario in advance. Machine learning (ML) approaches effectively predict performance for non-benchmarked cases but struggle when extrapolating beyond their observed training space. To address these limitations for LLM inference systems, we propose an Analytical with Learning Augmentation (ALA) framework that bridges analytical modeling with ML for robust statistical prediction and uncertainty estimation in LLM inference workloads. Our method employs an analytical throughput model with parameters estimated for benchmarked workloads, then extends to unobserved configurations using ML predictions. We enhance this with simulated annealing, which exploits subsets of workload data-point combinations to develop an error predictor. Finally, we quantify uncertainty based on vector-space similarity between new and observed workloads to ensure robust generalization. Through extensive experimentation on diverse LLM inference workloads, we demonstrate that our framework achieves low median errors while maintaining adaptability to new inference scenarios.
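The similarity-based uncertainty idea can be illustrated with a minimal sketch: encode each workload configuration as a feature vector and score a new workload's prediction confidence by its closest cosine similarity to any benchmarked workload. The feature set, the log-scaling, and the function names below are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def workload_vector(model_params_b, batch_size, seq_len, gpu_mem_gb):
    """Encode a workload configuration as a feature vector.
    Log-scaling the raw features is an assumption for illustration;
    the paper does not prescribe this particular encoding."""
    return np.log1p(np.array([model_params_b, batch_size, seq_len, gpu_mem_gb],
                             dtype=float))

def similarity_confidence(new_cfg, observed_cfgs):
    """Confidence proxy for a new workload: the maximum cosine similarity
    between its vector and any benchmarked workload's vector."""
    v = workload_vector(*new_cfg)
    best = 0.0
    for cfg in observed_cfgs:
        u = workload_vector(*cfg)
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        best = max(best, cos)
    return best

# Hypothetical benchmarked configs: (params in B, batch, seq len, GPU mem GB)
observed = [(7, 8, 2048, 40), (13, 16, 4096, 80)]

# A config near the benchmarked set vs. one far outside it
conf_near = similarity_confidence((7, 16, 2048, 40), observed)
conf_far = similarity_confidence((70, 256, 32768, 640), observed)
assert conf_far < conf_near  # dissimilar workloads receive lower confidence
```

A prediction for a low-similarity workload would then be reported with wider uncertainty bounds, which is what enables the controllable-confidence behavior the abstract describes.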