🤖 AI Summary
This work addresses a challenge for AI data centers with fixed site-level power capacity: cooling energy demand rises sharply during high-temperature periods, which reduces the power available for compute. To overcome this limitation, the authors propose ComputeAmp, a framework that, for the first time, enables adaptive co-optimization of non-evaporative cooling systems, battery energy storage, and computational workloads. By dynamically coordinating cooling, energy storage, and computing in both time and space under a fixed power budget, ComputeAmp goes beyond traditional static provisioning strategies. The approach expands usable compute capacity under local constraints on both electricity and water resources while reducing the site power capacity left unused because of cooling load peaks.
📝 Abstract
The deployment of artificial intelligence is increasingly constrained by limited site-level power capacity, which must support both compute systems and non-compute systems (primarily cooling) at all times. Cooling power demand, especially in non-evaporative cooling systems, can increase substantially with ambient temperature in the summer, producing recurring periods of elevated cooling power that often last for multiple hours per day. Maximizing compute capacity under a limited site-level power budget is therefore an important planning and operational challenge. Sizing the compute system conservatively based on peak cooling power can leave part of the site-level power capacity underutilized when the cooling power is below its peak, particularly in cooler months. On the other hand, sizing the compute system aggressively based on low cooling power can cause the total site-level power demand to exceed the site-level power capacity during hot summer days. This paper proposes ComputeAmp (Compute Amplifier), a framework that maximizes the compute capacity by jointly and dynamically leveraging cooling, battery energy storage, and computing-based adaptation. We discuss the opportunities and limitations of ComputeAmp and illustrate its potential to significantly expand usable compute capacity within local power and water resource limits. We also present a problem formulation for ComputeAmp and highlight a few algorithmic and operational challenges.
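
The sizing trade-off described above can be illustrated with a minimal Python sketch. It is not the paper's formulation: it assumes, purely for illustration, that cooling power equals an hourly overhead ratio times compute power, and every function name and number below is hypothetical.

```python
# Illustrative sketch only; the cooling model (cooling = overhead * compute)
# and all names and numbers are assumptions, not taken from the paper.

def conservative_compute_mw(site_cap_mw, overhead):
    """Largest compute size that never exceeds the cap, sized at peak cooling overhead."""
    return site_cap_mw / (1.0 + max(overhead))


def aggressive_compute_mw(site_cap_mw, overhead):
    """Compute size that just fits the cap at the lowest cooling overhead."""
    return site_cap_mw / (1.0 + min(overhead))


def unused_energy_mwh(site_cap_mw, compute_mw, overhead, hours_per_step=1.0):
    """Site capacity left idle over the horizon (energy below the cap)."""
    return sum(
        max(0.0, site_cap_mw - compute_mw * (1.0 + o)) * hours_per_step
        for o in overhead
    )


def excess_energy_mwh(site_cap_mw, compute_mw, overhead, hours_per_step=1.0):
    """Energy above the cap that storage, cooling, or workload adaptation must absorb."""
    return sum(
        max(0.0, compute_mw * (1.0 + o) - site_cap_mw) * hours_per_step
        for o in overhead
    )


if __name__ == "__main__":
    site_cap_mw = 100.0                 # hypothetical site-level power capacity
    overhead = [0.1] * 18 + [0.3] * 6   # hypothetical day: 6 hot hours of high overhead

    small = conservative_compute_mw(site_cap_mw, overhead)
    large = aggressive_compute_mw(site_cap_mw, overhead)

    print(f"conservative compute: {small:.1f} MW, "
          f"unused: {unused_energy_mwh(site_cap_mw, small, overhead):.1f} MWh/day")
    print(f"aggressive compute:   {large:.1f} MW, "
          f"excess: {excess_energy_mwh(site_cap_mw, large, overhead):.1f} MWh/day")
```

Under these assumed numbers, the conservative size leaves capacity idle during the cooler hours, while the aggressive size exceeds the cap during the hot hours. That gap is the one ComputeAmp aims to close by coordinating cooling, battery storage, and computing-based adaptation.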