Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
By: Wenbin Zhu, Zhaoyan Shen, Zili Shao, and more
Potential Business Impact:
Makes AI models load much faster for users.
Serverless Large Language Models (LLMs) have emerged as a cost-effective way to deploy AI services, enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck: it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates serverless LLM loading through efficient GPU memory reuse. By retaining model parameters in otherwise unused GPU memory, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: a unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. Together, these techniques address the critical challenges of inefficient memory usage and the cold-start problem on serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2× faster loading and reduces cold-start Time-To-First-Token (TTFT) by 23–55% over state-of-the-art methods.
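The first component, a unified GPU memory pool with tensor-level parameter sharing, is the heart of the loading speedup. The minimal sketch below illustrates the idea only and is not Tangram's actual implementation; `GpuTensorPool`, its content-hash keying, and all other names are hypothetical assumptions. The point it demonstrates: a model load only transfers tensors that are not already resident in the pool, so models that share parameters (e.g., fine-tuned variants of one base model) skip most of the host-to-device copy.

```python
import hashlib
import torch


class GpuTensorPool:
    """Toy tensor-level parameter pool (hypothetical; not Tangram's API).

    Tensors are keyed by a content hash, so identical parameters are
    kept on the GPU once and reused by every later model load.
    """

    def __init__(self, device=None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.pool = {}  # content hash -> device-resident tensor

    @staticmethod
    def _key(cpu_tensor):
        # Hash raw bytes; a real system would use a cheaper fingerprint.
        return hashlib.sha256(cpu_tensor.contiguous().numpy().tobytes()).hexdigest()

    def get_or_load(self, cpu_tensor):
        """Return a device tensor, transferring only on a pool miss."""
        key = self._key(cpu_tensor)
        hit = key in self.pool
        if not hit:
            # Cache miss: pay the host-to-device transfer exactly once.
            self.pool[key] = cpu_tensor.to(self.device)
        return self.pool[key], hit


def load_model_params(pool, state_dict):
    """Materialize a model's parameters on GPU, reusing pooled tensors."""
    gpu_params, hits = {}, 0
    for name, tensor in state_dict.items():
        gpu_params[name], hit = pool.get_or_load(tensor)
        hits += hit
    print(f"reused {hits}/{len(state_dict)} tensors from the pool")
    return gpu_params


if __name__ == "__main__":
    pool = GpuTensorPool()
    base = {"w1": torch.randn(1024, 1024), "w2": torch.randn(1024, 1024)}
    # A second "model" sharing w1 with the base: only its w2 is transferred.
    variant = {"w1": base["w1"].clone(), "w2": torch.randn(1024, 1024)}
    load_model_params(pool, base)     # reused 0/2 tensors
    load_model_params(pool, variant)  # reused 1/2 tensors
```

Content-hash keying (rather than keying by model and tensor name) is what makes the sharing tensor-level: any two models whose weights are byte-identical deduplicate automatically, which is also the property the GPU-affinity-aware scheduler can exploit by routing a cold start to the GPU that already holds the most of that model's tensors.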
Similar Papers
10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training
Distributed, Parallel, and Cluster Computing
Makes AI learn much faster and cheaper.
AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism
Distributed, Parallel, and Cluster Computing
Keeps AI running smoothly even if a part breaks.
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
Distributed, Parallel, and Cluster Computing
Makes AI models run cheaper and faster.