BeLLMan: Controlling LLM Congestion
By: Tella Rajashekhar Reddy, Atharva Deshmukh, Karan Tandon, and more
Potential Business Impact:
Makes AI faster and reduces its power use.
Large language model (LLM) applications are blind to the infrastructure underneath them and generate tokens autoregressively, indifferent to system load, risking inflated inference latency and a poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust its output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan keeps inference latency under control (up to 8X lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
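The abstract describes a closed-loop design: the infrastructure measures congestion and progressively signals the application to shorten its outputs. Below is a minimal sketch of one way such a signal could be turned into a token budget; the function name, thresholds, and linear control law are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of a load-aware output-length controller in the
# spirit of beLLMan. Names, thresholds, and the linear interpolation
# are assumptions for illustration, not the paper's control law.

def max_tokens_for_load(queue_depth: int,
                        low: int = 4,
                        high: int = 32,
                        full_budget: int = 1024,
                        min_budget: int = 128) -> int:
    """Map the current request queue depth to an output-token budget.

    Under light load the application keeps its full budget; under heavy
    load the budget is clamped to a minimum; in between it shrinks
    linearly, so the signal tightens progressively as congestion grows.
    """
    if queue_depth <= low:
        return full_budget
    if queue_depth >= high:
        return min_budget
    frac = (queue_depth - low) / (high - low)
    return round(full_budget - frac * (full_budget - min_budget))


if __name__ == "__main__":
    # Budget decreases smoothly as the queue builds up.
    for depth in (0, 8, 16, 24, 40):
        print(depth, max_tokens_for_load(depth))
```

For a summarization workload like the one evaluated, the application would pass this budget as the maximum output length on each generation call, trading summary length for lower latency and energy during congestion.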
Similar Papers
Congestion Control System Optimization with Large Language Models
Networking and Internet Architecture
AI makes the internet faster by fixing traffic jams.
xLLM Technical Report
Distributed, Parallel, and Cluster Computing
Makes smart computer programs run much faster.
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
Distributed, Parallel, and Cluster Computing
Lets computers solve hard scheduling puzzles described in words.