FailSafe: High-performance Resilient Serving

Published: November 18, 2025 | arXiv ID: 2511.14116v1

By: Ziyi Xu, Zhiqiang Xie, Swapnil Gandhi, and more

BigTech Affiliations: Stanford University

Potential Business Impact:

Keeps AI running smoothly even if parts break.

Business Areas:

Cloud Computing, Internet Services, Software

Tensor parallelism (TP) enables large language models (LLMs) to scale inference efficiently across multiple GPUs, but its tight coupling makes systems fragile: a single GPU failure can halt execution, trigger costly KVCache recomputation, and introduce long-term compute and memory imbalance. We present FailSafe, a fault-tolerant TP serving system that sustains high performance under irregular GPU availability. FailSafe introduces three techniques to balance computation and memory across GPUs: (1) Cyclic KVCache Placement for uniform memory utilization, (2) Hybrid Attention combining tensor- and data-parallel attention to eliminate stragglers, and (3) Fine-Grained Load-Aware Routing to dynamically balance requests. It further employs proactive KVCache backup and on-demand weight recovery to avoid expensive recomputation and redundant data transfers. We implement these techniques in a lightweight serving engine compatible with existing LLM infrastructures. Evaluated on an 8xH100 DGX system with real-world fault traces and representative workloads, FailSafe achieves up to 2x higher throughput and two orders of magnitude lower recovery latency compared to standard fault handling approaches. Even with up to three GPU failures, FailSafe sustains high throughput and balanced utilization, demonstrating robust and efficient LLM serving under dynamic and unreliable hardware conditions.
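The abstract names Cyclic KVCache Placement but does not spell out the mechanism. The sketch below illustrates one plausible reading only, and it is not the authors' implementation: when the number of KV heads does not divide evenly across the surviving GPUs, a fixed head-to-GPU map overloads the same GPU on every layer, while rotating the map by one GPU per layer spreads the extra heads evenly. The function names, the rotation rule, and the example sizes (28 layers, 8 KV heads, 7 GPUs after one failure) are all assumptions made for illustration.

```python
from collections import Counter


def fixed_placement(num_layers, num_heads, num_gpus):
    """Hypothetical baseline: head h lands on GPU h % num_gpus on every layer."""
    return [[h % num_gpus for h in range(num_heads)] for _ in range(num_layers)]


def cyclic_placement(num_layers, num_heads, num_gpus):
    """Assumed cyclic rule: shift the head-to-GPU map by one GPU per layer."""
    return [
        [(h + layer) % num_gpus for h in range(num_heads)]
        for layer in range(num_layers)
    ]


def heads_per_gpu(placement, num_gpus):
    """Count how many (layer, head) KV shards each GPU stores in total."""
    counts = Counter(gpu for layer in placement for gpu in layer)
    return [counts[g] for g in range(num_gpus)]


# Illustrative sizes only: 8 KV heads spread over 7 GPUs (one of 8 has failed).
NUM_LAYERS, NUM_HEADS, NUM_GPUS = 28, 8, 7

fixed_load = heads_per_gpu(
    fixed_placement(NUM_LAYERS, NUM_HEADS, NUM_GPUS), NUM_GPUS
)
cyclic_load = heads_per_gpu(
    cyclic_placement(NUM_LAYERS, NUM_HEADS, NUM_GPUS), NUM_GPUS
)

print("fixed :", fixed_load)   # GPU 0 holds twice as many shards as the rest
print("cyclic:", cyclic_load)  # every GPU holds the same number of shards
```

With these sizes the fixed map leaves GPU 0 holding 56 shards against 28 on each of the others, so GPU 0's memory caps the usable KVCache capacity. The rotated map gives every GPU 32 shards. The balance is exact here only because the layer count is a multiple of the GPU count; otherwise the loads differ by at most one layer's worth of extra heads.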

AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism

Distributed, Parallel, and Cluster Computing

Keeps AI running smoothly even if a part breaks.

5 Nov 2025 0

89%

Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training

Distributed, Parallel, and Cluster Computing

Fixes AI training when computer parts break.

8 Apr 2025 0

87%

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Distributed, Parallel, and Cluster Computing

Makes AI answer questions faster and more often.

20 Sep 2025 1

Country of Origin

🇨🇳 🇺🇸 United States, China

Page Count

13 pages
