Coordinated Cooling and Compute Management for AI Datacenters
By: Nardos Belay Abera, Yize Chen
AI datacenters are being deployed at large scale to support the training and serving of power-intensive large language models (LLMs). The extensive computation and cooling these datacenters require raise concerns about their energy use and carbon emissions. Although recent work has examined the energy efficiency of LLM inference, most prior research focuses on optimizing compute-side scheduling without considering thermal objectives or constraints. Since GPU-intensive inference generates substantial heat that can degrade datacenter performance, ignoring thermal effects can increase total energy consumption and reduce the efficiency of LLM serving. To fill this gap, we profile the characteristics of GPU servers under varying cooling conditions and AI jobs, and develop a joint cooling and computing modeling approach for AI datacenters. Built on these workload and thermal dynamics models, we propose a novel hierarchical control framework that co-optimizes computing and thermal management by identifying the optimal GPU parallelism, frequency (DVFS), and cooling control knobs. Using real Azure inference traces and detailed GPU profiling, our framework balances serving latency and thermal constraints while significantly improving the energy efficiency of AI datacenters.
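To make the coupling between the compute and cooling knobs concrete, the sketch below illustrates the kind of joint optimization the abstract describes. Everything here is an illustrative assumption rather than the paper's calibrated models or its hierarchical controller: the cubic DVFS power curve, the latency model, the thermal constants, the SLO and temperature limits, and all names (gpu_power_w, latency_ms, cooling_frac, and so on) are hypothetical, and a flat grid search stands in for the hierarchical control framework.

```python
import itertools

# All constants below are assumed for illustration, not taken from the paper.
AMBIENT_C = 25.0      # inlet air temperature (deg C)
THERMAL_RES = 0.08    # thermal resistance (deg C per W)
T_MAX_C = 83.0        # thermal throttling threshold (deg C)
SLO_MS = 200.0        # per-token latency SLO (ms)

def gpu_power_w(freq_ghz: float) -> float:
    """Assumed static power plus cubic dynamic-power scaling with DVFS frequency."""
    return 80.0 + 250.0 * freq_ghz ** 3

def latency_ms(freq_ghz: float, parallelism: int) -> float:
    """Assumed serving latency: compute time shrinks with frequency and
    tensor-parallel degree, while communication overhead grows with parallelism."""
    return 120.0 / (freq_ghz * parallelism) + 5.0 * parallelism

def steady_temp_c(per_gpu_power_w: float, cooling_frac: float) -> float:
    """Steady-state die temperature; stronger cooling lowers the effective
    thermal resistance (a simple first-order assumption)."""
    return AMBIENT_C + per_gpu_power_w * THERMAL_RES / max(cooling_frac, 0.1)

def cooling_power_w(cooling_frac: float) -> float:
    """Assumed fan/chiller power grows cubically with cooling effort."""
    return 400.0 * cooling_frac ** 3

def co_optimize():
    """Exhaustive search over the three control knobs named in the abstract:
    GPU parallelism, DVFS frequency, and cooling effort. The real framework is
    hierarchical; this flat search only illustrates the coupled trade-off."""
    best = None
    for par, freq, cool in itertools.product(
        (1, 2, 4, 8),                  # tensor-parallel degree
        (0.9, 1.1, 1.3, 1.5, 1.7),     # SM clock (GHz)
        (0.2, 0.4, 0.6, 0.8, 1.0),     # cooling effort (fraction of max)
    ):
        compute_w = par * gpu_power_w(freq)
        temp = steady_temp_c(gpu_power_w(freq), cool)
        lat = latency_ms(freq, par)
        if temp > T_MAX_C or lat > SLO_MS:
            continue  # drop configurations that violate thermal or latency limits
        total_w = compute_w + cooling_power_w(cool)
        if best is None or total_w < best[0]:
            best = (total_w, par, freq, cool, temp, lat)
    return best

if __name__ == "__main__":
    total_w, par, freq, cool, temp, lat = co_optimize()
    print(f"parallelism={par}, freq={freq} GHz, cooling={cool:.0%}: "
          f"{total_w:.0f} W total, {temp:.1f} C, {lat:.0f} ms")
```

Even in this toy form, the search surfaces the trade-off the abstract targets: raising the clock or parallelism cuts latency but forces more cooling power, so the energy-optimal point depends on the thermal model and the latency SLO jointly rather than on compute-side scheduling alone.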