AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism
By: Wendong Xu, Chujie Chen, He Xiao, and more
Potential Business Impact:
Keeps AI running smoothly even if a part breaks.
Large Language Model (LLM) inference services demand exceptionally high availability and low latency, yet multi-GPU Tensor Parallelism (TP) makes them vulnerable to single-GPU failures. We present AnchorTP, a state-preserving elastic TP framework for fast recovery. It (i) enables Elastic Tensor Parallelism (ETP) with unequal-width partitioning over any number of GPUs and compatibility with Mixture-of-Experts (MoE), and (ii) preserves model parameters and KV caches in GPU memory via a daemon decoupled from the inference process. To minimize downtime, we propose a bandwidth-aware planner based on a Continuous Minimal Migration (CMM) algorithm that minimizes reload bytes under a byte-cost dominance assumption, and an execution scheduler that pipelines P2P transfers with reloads. These components jointly restore service quickly with minimal data movement and without changing service interfaces. In typical failure scenarios, AnchorTP reduces Time to First Success (TFS) by up to 11x and Time to Peak (TTP) by up to 59% versus restart-and-reload.
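To make the recovery idea concrete, here is a minimal Python sketch of the two mechanisms the abstract names: unequal-width head partitioning over any number of GPUs, and a CMM-style plan that moves only the bytes not already resident on a survivor. This is an illustrative toy, not the paper's implementation; the function names, the greedy contiguous re-split, and the BYTES_PER_HEAD constant are all assumptions made for this example.

def split_heads(num_heads: int, num_gpus: int) -> list[tuple[int, int]]:
    """Contiguous, near-equal head ranges; widths may differ by one
    (the 'unequal-width partitioning' the abstract allows)."""
    base, extra = divmod(num_heads, num_gpus)
    ranges, start = [], 0
    for g in range(num_gpus):
        width = base + (1 if g < extra else 0)
        ranges.append((start, start + width))
        start += width
    return ranges

def overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Number of head indices shared by two half-open ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

# Illustrative per-head weight size: K/V projections, head_dim 128,
# hidden 4096, fp16 (2 bytes). Purely an assumption for the demo.
BYTES_PER_HEAD = 2 * 128 * 4096 * 2

def migration_plan(num_heads: int, old_gpus: list[int], failed: int):
    """Toy CMM-style planner: re-split heads contiguously over the
    survivors in their original order, then count only the head ranges
    NOT already resident as bytes to move (via P2P or reload)."""
    survivors = [g for g in old_gpus if g != failed]
    old = dict(zip(old_gpus, split_heads(num_heads, len(old_gpus))))
    new = dict(zip(survivors, split_heads(num_heads, len(survivors))))
    moved_heads = 0
    for g in survivors:
        keep = overlap(old[g], new[g])                # heads already in place
        moved_heads += (new[g][1] - new[g][0]) - keep  # heads to fetch
    return new, moved_heads * BYTES_PER_HEAD

if __name__ == "__main__":
    # 32 heads on 4 GPUs; GPU 2 fails.
    new_layout, bytes_moved = migration_plan(32, [0, 1, 2, 3], failed=2)
    print(new_layout)   # {0: (0, 11), 1: (11, 22), 3: (22, 32)}
    print(bytes_moved)  # 11 of 32 heads move: far less than a full reload

The design intuition the sketch captures is that keeping the new partitions contiguous and in survivor order maximizes their overlap with the old layout, so most weights stay where they already are; only the boundary ranges are fetched, which is the spirit of minimizing reload bytes under the paper's byte-cost dominance assumption.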
Similar Papers
FailSafe: High-performance Resilient Serving
Distributed, Parallel, and Cluster Computing
Keeps AI running smoothly even if parts break.
Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training
Distributed, Parallel, and Cluster Computing
Fixes AI training when computer parts break.
Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
Machine Learning (CS)
Makes AI answers the same every time.