BeLLMan: Controlling LLM Congestion
By: Tella Rajashekhar Reddy, Atharva Deshmukh, Karan Tandon, and more
Potential Business Impact:
Makes AI faster and reduces its power use.
Large language model (LLM) applications are blind to the infrastructure underneath them and generate tokens autoregressively, indifferent to system load, risking inflated inference latency and a poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust its output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan keeps inference latency under control (up to 8X lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
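The abstract describes a closed-loop design: the infrastructure measures congestion and progressively signals the application to shorten its outputs. Below is a minimal sketch of one way such a signal could be turned into a token budget; the function name, thresholds, and linear control law are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of a load-aware output-length controller in the
# spirit of beLLMan. Names, thresholds, and the linear interpolation
# are assumptions for illustration, not the paper's control law.

def max_tokens_for_load(queue_depth: int,
                        low: int = 4,
                        high: int = 32,
                        full_budget: int = 1024,
                        min_budget: int = 128) -> int:
    """Map the current request queue depth to an output-token budget.

    Under light load the application keeps its full budget; under heavy
    load the budget is clamped to a minimum; in between it shrinks
    linearly, so the signal tightens progressively as congestion grows.
    """
    if queue_depth <= low:
        return full_budget
    if queue_depth >= high:
        return min_budget
    frac = (queue_depth - low) / (high - low)
    return round(full_budget - frac * (full_budget - min_budget))


if __name__ == "__main__":
    # Budget decreases smoothly as the queue builds up.
    for depth in (0, 8, 16, 24, 40):
        print(depth, max_tokens_for_load(depth))
```

For a summarization workload like the one evaluated, the application would pass this budget as the maximum output length on each generation call, trading summary length for lower latency and energy during congestion.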
Similar Papers
Congestion Control System Optimization with Large Language Models
Networking and Internet Architecture
AI makes the internet faster by fixing traffic jams.
xLLM Technical Report
Distributed, Parallel, and Cluster Computing
Makes smart computer programs run much faster.
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
Distributed, Parallel, and Cluster Computing
Lets computers solve hard scheduling puzzles described in words.