Scaling Up Throughput-oriented LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management
By: Thanh Son Phung, Douglas Thain
Potential Business Impact:
Lets computers finish big jobs faster and more cheaply.
The widespread growth of LLM development increasingly demands more computational power than clusters can supply. Traditional LLM applications require large static resource allocations, forcing users either to wait in long job queues and accept delayed progress, or to buy expensive hardware, which exacerbates the demand-supply problem. However, not all LLM applications are latency-sensitive; many can instead be executed in a throughput-oriented way. This throughput orientation allows a dynamic allocation that opportunistically pools available resources over time, avoiding both long queues and expensive GPU purchases. Effectively utilizing opportunistic resources nevertheless brings numerous challenges. Our solution, pervasive context management, exploits the common computational context in LLM applications and provides mechanisms and policies that allow seamless context reuse on opportunistic resources. Our evaluation shows that an LLM application with pervasive context management on opportunistic resources reduces its execution time by 98.1%.
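The expensive part of each inference task, loading the tokenizer and model weights onto a GPU, is a shared computational context that can be kept resident on a worker and reused by every subsequent task dispatched to it. Below is a minimal sketch of that reuse pattern, assuming a Hugging Face-style transformers stack; the cache layout and names such as run_inference_task are illustrative assumptions, not the paper's actual mechanisms or API.

```python
# Illustrative sketch only: context reuse on an opportunistic GPU worker.
# Assumes the Hugging Face `transformers` and `torch` libraries; function
# and variable names are hypothetical, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Worker-process cache: the costly "context" (tokenizer + weights on the GPU)
# is built at most once per model per worker, then reused by every task
# routed to this worker while the opportunistic node remains available.
_CONTEXT_CACHE = {}

def _get_context(model_name: str):
    """Load the model context once and keep it resident for later tasks."""
    ctx = _CONTEXT_CACHE.get(model_name)
    if ctx is None:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16
        ).to("cuda")
        model.eval()
        ctx = (tokenizer, model)
        _CONTEXT_CACHE[model_name] = ctx
    return ctx

def run_inference_task(model_name: str, prompt: str, max_new_tokens: int = 64) -> str:
    """One throughput-oriented task: only the first task on this worker
    pays the model-loading cost; later tasks reuse the cached context."""
    tokenizer, model = _get_context(model_name)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Many small tasks land on the same worker; the context load is amortized.
    for prompt in ["Summarize: ...", "Translate: ...", "Classify: ..."]:
        print(run_inference_task("gpt2", prompt))
```

On opportunistic resources the cache lives only as long as the worker does: when a node is reclaimed the context is lost, and the first task on a newly joined heterogeneous node pays the load cost again. The paper's mechanisms and policies go beyond this sketch, which only shows why amortizing the context load across tasks matters for throughput-oriented workloads.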