Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
By: Cong Li, Yuzhe Yang, Xuegui Zheng, and more
Potential Business Impact:
Makes AI answer questions about long documents much faster and lets one server cluster handle more requests at once.
With the advancement of large language models (LLMs), their context windows have rapidly expanded. To meet the diverse demands of varying-length requests in online services, existing state-of-the-art systems tune the sequence parallelism (SP) allocation. However, current dynamic SP allocation lacks the flexibility to (1) support stage-specific parallelism requirements in LLM inference, (2) mitigate the global latency degradation caused by excessive SP allocation, and (3) exploit the resource fragments arising from SP size variation. To tackle these problems, we propose Chunkwise Dynamic Sequence Parallelism (CDSP), a fine-grained parallelism strategy that assigns SP sizes across intra-request token segments. Based on CDSP, we build Tetris, an LLM serving system that (1) efficiently integrates CDSP into a disaggregated cluster to satisfy parallelism heterogeneity, (2) dynamically regulates SP size expansion based on real-time load conditions, and (3) adaptively explores chunking plans to utilize fragmented resources while meeting per-request demands. Compared with state-of-the-art systems, Tetris achieves up to 4.35× lower time-to-first-token (TTFT) under maximum sustainable loads, reduces median time-between-tokens (TBT) by up to 40.1%, and increases the maximum request capacity by up to 45%.
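To make the core idea concrete, below is a minimal sketch in Python of how a chunkwise SP planner might work. All names here (plan_chunked_sp, ChunkPlan, free_gpus, and the specific throttling rule) are hypothetical illustrations under stated assumptions, not the paper's actual algorithm: a long prompt is split into token chunks, each chunk gets its own SP size, expansion is throttled under high load, and leftover GPU "fragments" are still used at a reduced SP size.

```python
# Illustrative sketch of chunkwise dynamic sequence parallelism (CDSP)
# as summarized in the abstract. Names and heuristics are hypothetical;
# the paper's real planner is not reproduced here.
from dataclasses import dataclass


@dataclass
class ChunkPlan:
    start: int    # first token index of the chunk
    length: int   # number of tokens in the chunk
    sp_size: int  # GPUs assigned to process this chunk in parallel


def plan_chunked_sp(prompt_len: int, free_gpus: int, load: float,
                    chunk_tokens: int = 4096, max_sp: int = 8) -> list[ChunkPlan]:
    """Assign a per-chunk SP size for one request.

    - Larger SP sizes speed up long prefill chunks (lower TTFT) but
      consume more GPUs; under high load, expansion is capped so other
      requests are not starved (the "global latency degradation" the
      abstract warns about).
    - A leftover group of GPUs smaller than max_sp (a "fragment") can
      still serve chunks at a reduced SP size instead of idling.
    """
    plans: list[ChunkPlan] = []
    start = 0
    while start < prompt_len:
        length = min(chunk_tokens, prompt_len - start)
        # Throttle SP expansion when the cluster is busy (load in [0, 1]).
        load_cap = max(1, int(max_sp * (1.0 - load)))
        # Fit into whatever fragment of GPUs is currently free.
        sp_size = max(1, min(load_cap, free_gpus))
        plans.append(ChunkPlan(start, length, sp_size))
        start += length
    return plans


if __name__ == "__main__":
    # A 100K-token prompt, 6 free GPUs, moderate cluster load.
    for p in plan_chunked_sp(prompt_len=100_000, free_gpus=6, load=0.5):
        print(p)
```

The key contrast with request-level dynamic SP is visible in the loop: the SP size is re-decided per chunk rather than once per request, which is what lets a planner react to load changes mid-request and pack fragments left behind by other requests' SP size choices.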
Similar Papers
Efficient Long Context Fine-tuning with Chunk Flow
Distributed, Parallel, and Cluster Computing
Makes AI learn faster with longer texts.
Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Distributed, Parallel, and Cluster Computing
Makes AI understand long texts faster and cheaper.
Scaling Generative Recommendations with Context Parallelism on Hierarchical Sequential Transducers
Information Retrieval
Lets recommendation systems remember more user history.