GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
By: Qunyou Liu, Darong Huang, Marina Zapater, and more
Potential Business Impact:
Saves energy when computers think big thoughts.
Large Language Models (LLMs) are becoming the backbone of modern cloud services, yet their inference costs are dominated by GPU energy. Unlike traditional GPU workloads, LLM inference has two stages with different characteristics: the prefill phase, which is latency sensitive and scales quadratically with prompt length, and the decode phase, which progresses token by token with unpredictable length. Current GPU power governors (for example, NVIDIA's default) overlook this asymmetry and treat both stages uniformly. The result is mismatched voltage and frequency settings, head-of-line blocking, and excessive energy use. We introduce GreenLLM, an SLO-aware serving framework that minimizes GPU energy by explicitly separating prefill and decode control. At ingress, requests are routed into length-based queues so short prompts avoid head-of-line blocking and time to first token (TTFT) improves. For prefill, GreenLLM collects short traces on a GPU node, fits compact latency-power models over streaming-multiprocessor (SM) frequency, and solves a queueing-aware optimization to select energy-minimal clocks per class. During decode, a lightweight dual-loop controller tracks throughput (tokens per second) and adjusts frequency with hysteretic, fine-grained steps to hold tail time between tokens (TBT) within target bounds. Across Alibaba and Azure trace replays, GreenLLM reduces total energy by up to 34 percent versus the default DVFS baseline, with no loss of throughput and with less than 3.5 percent additional SLO violations.
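The decode-phase mechanism is described concretely enough to illustrate. Below is a minimal Python sketch of what such a dual-loop, hysteretic frequency controller could look like. It is not the authors' implementation: the frequency table, window size, threshold fractions, and the DecodeFreqController class are all illustrative assumptions, and nvidia-smi's clock-locking command stands in for whatever actuator the paper uses.

```python
import subprocess
from collections import deque

# Illustrative SM-clock candidates (MHz); the paper's actual frequency
# table, GPU model, and SLO targets are not given in the abstract.
FREQ_STEPS_MHZ = [810, 960, 1110, 1260, 1410]

class DecodeFreqController:
    """Sketch of a dual-loop, hysteretic decode controller.

    Outer loop: track per-token latency (TBT) over a sliding window.
    Inner loop: step the SM clock up or down one level at a time, with
    separate raise/lower thresholds (hysteresis) so the clock does not
    oscillate around the tail-TBT target.
    """

    def __init__(self, tbt_slo_ms: float, window: int = 256):
        self.tbt_slo_ms = tbt_slo_ms
        self.raise_at = 0.95 * tbt_slo_ms  # step up when tail TBT nears the SLO
        self.lower_at = 0.70 * tbt_slo_ms  # step down only with ample slack
        self.tbt_samples = deque(maxlen=window)
        self.level = len(FREQ_STEPS_MHZ) - 1  # start at the maximum clock

    def record_token(self, tbt_ms: float) -> None:
        """Feed one observed time-between-tokens sample (outer loop)."""
        self.tbt_samples.append(tbt_ms)

    def _tail_tbt(self, q: float = 0.99) -> float:
        xs = sorted(self.tbt_samples)
        return xs[min(int(q * len(xs)), len(xs) - 1)]

    def tick(self) -> None:
        """Periodic control step (inner loop): one hysteretic clock move."""
        if len(self.tbt_samples) < self.tbt_samples.maxlen:
            return  # wait for a full window before acting
        tail = self._tail_tbt()
        if tail > self.raise_at and self.level < len(FREQ_STEPS_MHZ) - 1:
            self.level += 1
        elif tail < self.lower_at and self.level > 0:
            self.level -= 1
        else:
            return  # inside the hysteresis band: hold the current clock
        self._apply(FREQ_STEPS_MHZ[self.level])

    @staticmethod
    def _apply(mhz: int) -> None:
        # nvidia-smi can lock the GPU clock to a range (requires root);
        # used here purely as a stand-in actuator.
        subprocess.run(
            ["nvidia-smi", f"--lock-gpu-clocks={mhz},{mhz}"], check=False
        )
```

The two thresholds are the point of the hysteresis: raising the clock is triggered aggressively (at 95 percent of the SLO) while lowering it requires substantial slack (70 percent), so a tail-TBT reading hovering near the target produces at most one frequency change rather than a step up and down every control period.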
Similar Papers
VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
Distributed, Parallel, and Cluster Computing
Cuts computer brain energy use for chats.
Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency
Operating Systems
Makes phones run AI faster and use less power.