CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

📅 2026-06-11

🤖 AI Summary

This work addresses two problems: low resource utilization in cloud data centers, caused by conservative over-provisioning, and the lack of systematic evaluation of how time-series forecasting models affect downstream scheduling decisions. The authors propose CloudCons, which they describe as the first end-to-end evaluation benchmark for cloud resource consolidation. It covers statistical models, deep learning models, and time-series foundation models, and it jointly assesses prediction accuracy and optimization performance on real-world workloads from Huawei Cloud, Azure, and Google Borg. The analysis reveals an inconsistency between forecasting accuracy and scheduling utility: despite their strong zero-shot prediction capabilities, foundation models do not necessarily improve decision outcomes. To bridge this gap, the work analyzes the choice of predictive quantile as the lever that trades resource efficiency against service reliability, and provides actionable guidelines for calibrating that choice.
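The gap between forecasting accuracy and scheduling utility can be illustrated with a toy forecast-then-optimize loop. The sketch below is not the CloudCons pipeline: it assumes a single resource dimension, a fixed host capacity, and first-fit decreasing as a stand-in packing heuristic. All workload names and numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def mean_absolute_error(pred: Dict[str, float], actual: Dict[str, float]) -> float:
    """Average absolute gap between predicted and actual peak demand."""
    return sum(abs(pred[k] - actual[k]) for k in actual) / len(actual)


def first_fit_decreasing(pred: Dict[str, float], capacity: float) -> List[List[str]]:
    """Pack workloads onto hosts by predicted peak, largest first."""
    hosts: List[List[str]] = []   # workload names placed on each host
    free: List[float] = []        # remaining predicted headroom per host
    for name, demand in sorted(pred.items(), key=lambda kv: kv[1], reverse=True):
        if demand > capacity:
            raise ValueError(f"{name} does not fit on any host")
        for i, room in enumerate(free):
            if demand <= room + 1e-9:
                hosts[i].append(name)
                free[i] -= demand
                break
        else:  # no existing host had room, so open a new one
            hosts.append([name])
            free.append(capacity - demand)
    return hosts


def overloaded_hosts(hosts: List[List[str]], actual: Dict[str, float], capacity: float) -> int:
    """Count hosts whose actual total demand exceeds capacity."""
    return sum(1 for h in hosts if sum(actual[n] for n in h) > capacity)


# Hypothetical data: forecast_a is closer to the truth but underestimates "c";
# forecast_b is less accurate but errs on the safe side.
actual = {"a": 6.0, "b": 5.0, "c": 5.0, "d": 3.0}
forecast_a = {"a": 6.0, "b": 5.0, "c": 4.0, "d": 3.0}
forecast_b = {"a": 7.0, "b": 6.0, "c": 6.0, "d": 4.0}

for label, forecast in (("A", forecast_a), ("B", forecast_b)):
    placement = first_fit_decreasing(forecast, capacity=10.0)
    print(label,
          "MAE:", mean_absolute_error(forecast, actual),
          "hosts:", len(placement),
          "overloaded:", overloaded_hosts(placement, actual, capacity=10.0))
```

In this toy setting the forecast with the lower error packs onto fewer hosts but overloads one of them, while the less accurate forecast uses more hosts and causes no overload. This is the kind of mismatch an end-to-end benchmark is meant to expose.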

📝 Abstract

Because providers over-provision conservatively to guarantee service reliability, resource utilization in cloud data centers remains low. To mitigate this, the forecast-then-optimize paradigm has emerged, which improves consolidation by anticipating future demand. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these models remains unverified, leaving their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models in the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. We also systematically analyze how the selection of predictive quantiles acts as a critical lever, and we provide actionable guidelines for calibrating this selection to balance resource efficiency against service reliability, offering insights for real-world deployment decisions.
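The role of the predictive quantile as a lever can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's calibration procedure: it substitutes an empirical nearest-rank quantile of past demand for a model's predicted quantile, reserves that amount of capacity, and then measures utilization and violation rate on later demand. The function names and sample values are hypothetical.

```python
import math
from typing import List, Tuple


def nearest_rank_quantile(samples: List[float], q: float) -> float:
    """Empirical quantile by the nearest-rank method, for 0 < q <= 1."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    # round() guards against floating-point noise such as 0.7 * 10 = 7.000000000000001
    rank = max(1, math.ceil(round(q * len(ordered), 9)))
    return ordered[rank - 1]


def evaluate_reservation(reserved: float, demand: List[float]) -> Tuple[float, float]:
    """Return (utilization, violation_rate) for a fixed reservation.

    Utilization is served demand divided by reserved capacity; a violation is
    any time step where demand exceeds the reservation.
    """
    served = sum(min(d, reserved) for d in demand)
    violations = sum(1 for d in demand if d > reserved)
    return served / (reserved * len(demand)), violations / len(demand)


# Hypothetical demand traces (arbitrary units).
history = [2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 8.0, 10.0]
future = [3.0, 4.0, 5.0, 6.0, 9.0]

for q in (0.5, 0.8, 1.0):
    reserved = nearest_rank_quantile(history, q)
    utilization, violation_rate = evaluate_reservation(reserved, future)
    print(f"q={q}: reserve={reserved}, "
          f"utilization={utilization:.2f}, violations={violation_rate:.2f}")
```

On these toy numbers, raising the quantile lowers the violation rate and also lowers utilization, which is the efficiency versus reliability trade-off the abstract describes. Where the right operating point lies depends on the workload and the reliability target.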

Problem

Research questions and friction points this paper is trying to address.

cloud resource consolidation

forecasting models

decision utility

benchmark

time series foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

cloud resource consolidation

forecast-then-optimize

time series foundation models