Efficient Function-as-a-Service for Large Language Models with TIDAL
By: Weihao Cui, Ziyi Xu, Han Zhao, and more
Potential Business Impact:
Makes AI functions on serverless platforms start up much faster.
Large Language Model (LLM) applications have emerged as a prominent use case for Function-as-a-Service (FaaS) due to their high computational demands and sporadic invocation patterns. However, serving LLM functions within FaaS frameworks suffers from significant GPU-side cold-start latency. A fundamental approach involves leveraging a template with function state saved on GPUs to bypass the cold start for new invocations. Yet, this approach struggles with the high GPU footprint, dynamic initialization behaviors, and lazy GPU kernel loading inherent in LLM functions, primarily due to a lack of insight into the underlying execution details. In this paper, we introduce TIDAL, an optimized FaaS framework for LLM applications that achieves fast startup by tracing fine-grained execution paths. Using the traced execution details, TIDAL generates adaptive function templates that break the startup barriers of LLM functions. Extensive evaluations demonstrate that TIDAL reduces cold-start latency by $1.79\times$ to $2.11\times$ and improves the $95\%$-ile time-to-first-token by $76.0\%$, surpassing state-of-the-art methods.
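To make the template approach the abstract builds on concrete, here is a minimal sketch, assuming a PyTorch/Transformers serving stack. It caches fully initialized GPU state per model and triggers lazy CUDA kernel loading once with a warmup pass, so later invocations skip the expensive cold path. The cache and helper names (`_TEMPLATE_CACHE`, `get_function_template`, `invoke`) are hypothetical illustrations, not TIDAL's actual API; TIDAL's trace-driven, adaptive templates go well beyond this static caching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_TEMPLATE_CACHE = {}  # model_id -> (model, tokenizer) kept resident on the GPU

def get_function_template(model_id: str):
    """Cold path: build the template once; warm path: return it immediately."""
    if model_id not in _TEMPLATE_CACHE:
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.float16
        ).to("cuda")
        model.eval()
        # One dummy forward pass forces lazily loaded CUDA kernels to be
        # loaded now rather than on the first real invocation.
        warmup = tokenizer("warmup", return_tensors="pt").to("cuda")
        with torch.no_grad():
            model(**warmup)
        _TEMPLATE_CACHE[model_id] = (model, tokenizer)
    return _TEMPLATE_CACHE[model_id]

def invoke(model_id: str, prompt: str) -> str:
    """Serve an invocation from the GPU-resident template, avoiding a cold start."""
    model, tokenizer = get_function_template(model_id)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A static cache like this illustrates why the abstract calls the approach costly: every cached template pins a full model's weights in GPU memory, which is the high-footprint problem TIDAL's adaptive templates are designed to address.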
Similar Papers
Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
Distributed, Parallel, and Cluster Computing
Makes AI models load much faster for users.
Code once, Run Green: Automated Green Code Translation in Serverless Computing
Distributed, Parallel, and Cluster Computing
Makes computer code use less power automatically.
Transformer-Based Model for Cold Start Mitigation in FaaS Architecture
Distributed, Parallel, and Cluster Computing
Makes slow-starting programs launch faster.