VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
By: Jiahuan Yu, Aryan Taneja, Junfeng Lin and more
Potential Business Impact:
Cuts computer brain energy use for chats.
Modern Large Language Model (LLM) serving systems increasingly support interactive applications, like real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance over multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.
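To make the two components named in the abstract more concrete, the following is a minimal sketch of the general idea only. It is not the VoltanaLLM implementation. The clock levels, the hysteresis margin, the toy latency model, and all names (`Instance`, `control_step`, `predicted_latency_ms`, `route`) are assumptions introduced for illustration. The controller raises the GPU clock when measured latency exceeds the SLO and lowers it when there is ample slack; the router sends a request to the lowest-frequency instance that is still predicted to meet the SLO.

```python
from dataclasses import dataclass

# Hypothetical GPU clock levels in MHz; real hardware exposes its own set.
FREQ_LEVELS_MHZ = [1005, 1200, 1410]


@dataclass
class Instance:
    """A prefill or decode instance running at one of the clock levels."""
    name: str
    level: int = len(FREQ_LEVELS_MHZ) - 1  # index into FREQ_LEVELS_MHZ
    queued_tokens: int = 0

    @property
    def freq_mhz(self) -> int:
        return FREQ_LEVELS_MHZ[self.level]


def control_step(inst: Instance, latency_ms: float, slo_ms: float,
                 low_margin: float = 0.7) -> int:
    """One feedback step: step the clock up on an SLO violation, step it
    down when latency is well below the SLO, otherwise hold (hysteresis)."""
    if latency_ms > slo_ms:
        inst.level = min(inst.level + 1, len(FREQ_LEVELS_MHZ) - 1)
    elif latency_ms < low_margin * slo_ms:
        inst.level = max(inst.level - 1, 0)
    return inst.freq_mhz


def predicted_latency_ms(inst: Instance, new_tokens: int) -> float:
    """Toy latency model (an assumption): latency grows with queued work
    and shrinks in proportion to clock frequency."""
    work = inst.queued_tokens + new_tokens
    return 1000.0 * work / (inst.freq_mhz * 10.0)


def route(instances: list[Instance], new_tokens: int, slo_ms: float) -> Instance:
    """Pick the lowest-frequency instance predicted to meet the SLO;
    if none qualifies, fall back to the lowest predicted latency."""
    feasible = [i for i in instances
                if predicted_latency_ms(i, new_tokens) <= slo_ms]
    if feasible:
        chosen = min(feasible, key=lambda i: i.freq_mhz)
    else:
        chosen = min(instances,
                     key=lambda i: predicted_latency_ms(i, new_tokens))
    chosen.queued_tokens += new_tokens
    return chosen
```

In this sketch the controller and router are coupled only through each instance's current clock level: the controller changes the level over time, and the router reads it when estimating latency. A real system would replace the toy latency model with measured or learned predictions per phase.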
Similar Papers
GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
Performance
Saves energy when computers think big thoughts.
Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency
Operating Systems
Makes phones run AI faster and use less power.