🤖 AI Summary
Existing AI inference scaling approaches rely on one-dimensional (fixed-step) or two-dimensional (e.g., performance–computation trade-off) strategies, failing to jointly optimize accuracy, cost, and latency while neglecting real-world deployment constraints.
Method: This paper is the first to formulate inference scaling as a three-dimensional multi-objective optimization problem and proposes a constraint-aware joint calibration framework that enables environment-adaptive selection of inference policies within a unified decision space. We integrate Monte Carlo simulation with four representative multi-objective optimization algorithms and conduct a systematic evaluation across nine simulated large language models and three canonical deployment scenarios.
Contribution/Results: Knee-point optimization consistently achieves the best overall balance among accuracy, cost, and latency, while accuracy-maximization remains preferable in accuracy-critical regimes, significantly enhancing deployment flexibility and practicality. Our work establishes both theoretical foundations and empirical methodologies for deployment-aware AI inference optimization.
📝 Abstract
AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to account for cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods for this 3D multi-objective optimization (MOO) problem. Framing inference scaling as an MOO problem exposes a feasible space that 1D and 2D formulations fail to capture, enabling environment-adaptive selection of the inference scaling parameter k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
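To make the 3D trade-off concrete, here is a minimal sketch of knee-point selection over a Pareto front of (accuracy, cost, latency) points indexed by the scaling parameter k. The accuracy, cost, and latency models below are hypothetical (a diminishing-returns accuracy curve with linearly growing cost and latency), and "knee point" is operationalized as the point nearest the utopia point after per-objective normalization, one common definition; the paper's actual algorithms and simulation setup may differ.

```python
import math

def pareto_front(points):
    """Keep non-dominated points: higher accuracy, lower cost, lower latency is better."""
    front = []
    for i, (a, c, l) in enumerate(points):
        dominated = any(
            a2 >= a and c2 <= c and l2 <= l and (a2, c2, l2) != (a, c, l)
            for j, (a2, c2, l2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((a, c, l))
    return front

def knee_point(front):
    """Normalize each objective to [0, 1], then pick the point closest to the
    utopia point (max accuracy, min cost, min latency)."""
    accs = [p[0] for p in front]
    costs = [p[1] for p in front]
    lats = [p[2] for p in front]
    def norm(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)
    best, best_d = None, float("inf")
    for a, c, l in front:
        d = math.sqrt(
            (1.0 - norm(a, min(accs), max(accs))) ** 2
            + norm(c, min(costs), max(costs)) ** 2
            + norm(l, min(lats), max(lats)) ** 2
        )
        if d < best_d:
            best, best_d = (a, c, l), d
    return best

# Hypothetical models: accuracy saturates in k; cost ($/query) and latency (s) grow linearly.
points = [(1 - 0.5 * math.exp(-0.4 * k), 0.01 * k, 0.3 * k) for k in range(1, 17)]
front = pareto_front(points)
print(knee_point(front))
```

With these toy models every k is Pareto-optimal (accuracy, cost, and latency all increase monotonically with k), so the knee point is what singles out a balanced k; swapping in an accuracy-maximization rule would instead always pick the largest k on the front.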