🤖 AI Summary
Existing AI inference scaling approaches rely on one-dimensional (fixed-step) or two-dimensional (e.g., performance–computation trade-off) strategies, failing to jointly optimize accuracy, cost, and latency while neglecting real-world deployment constraints.
Method: This paper is the first to formulate inference scaling as a three-dimensional multi-objective optimization problem and proposes a constraint-aware joint calibration framework that enables environment-adaptive selection of inference policies within a unified decision space. We integrate Monte Carlo simulation with four representative multi-objective optimization algorithms and conduct a systematic evaluation across nine simulated large language models and three canonical deployment scenarios.
Contribution/Results: Knee-point optimization consistently achieves the best overall balance among accuracy, cost, and latency, while accuracy-maximization remains preferable in accuracy-critical regimes, significantly enhancing deployment flexibility and practicality. Our work establishes both theoretical foundations and empirical methodologies for deployment-aware AI inference optimization.
📝 Abstract
AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to account for cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods for this 3D multi-objective optimization (MOO) problem. Framing inference scaling as an MOO problem exposes a feasible space that 1D and 2D formulations fail to capture, enabling environment-adaptive selection of the inference scaling parameter k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
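To make the 3D trade-off concrete, here is a minimal sketch of knee-point selection over a Pareto front of (accuracy, cost, latency) points indexed by the scaling parameter k. The accuracy, cost, and latency models below are hypothetical (a diminishing-returns accuracy curve with linearly growing cost and latency), and "knee point" is operationalized as the point nearest the utopia point after per-objective normalization, one common definition; the paper's actual algorithms and simulation setup may differ.

```python
import math

def pareto_front(points):
    """Keep non-dominated points: higher accuracy, lower cost, lower latency is better."""
    front = []
    for i, (a, c, l) in enumerate(points):
        dominated = any(
            a2 >= a and c2 <= c and l2 <= l and (a2, c2, l2) != (a, c, l)
            for j, (a2, c2, l2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((a, c, l))
    return front

def knee_point(front):
    """Normalize each objective to [0, 1], then pick the point closest to the
    utopia point (max accuracy, min cost, min latency)."""
    accs = [p[0] for p in front]
    costs = [p[1] for p in front]
    lats = [p[2] for p in front]
    def norm(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)
    best, best_d = None, float("inf")
    for a, c, l in front:
        d = math.sqrt(
            (1.0 - norm(a, min(accs), max(accs))) ** 2
            + norm(c, min(costs), max(costs)) ** 2
            + norm(l, min(lats), max(lats)) ** 2
        )
        if d < best_d:
            best, best_d = (a, c, l), d
    return best

# Hypothetical models: accuracy saturates in k; cost ($/query) and latency (s) grow linearly.
points = [(1 - 0.5 * math.exp(-0.4 * k), 0.01 * k, 0.3 * k) for k in range(1, 17)]
front = pareto_front(points)
print(knee_point(front))
```

With these toy models every k is Pareto-optimal (accuracy, cost, and latency all increase monotonically with k), so the knee point is what singles out a balanced k; swapping in an accuracy-maximization rule would instead always pick the largest k on the front.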