
A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

Published: November 26, 2025 | arXiv ID: 2511.20982v1

By: Junhan Liao, Minxian Xu, Wanyi Zheng and more

Potential Business Impact:

Makes AI answer questions faster and more reliably.

Business Areas:

Content Delivery Network, Content and Publishing

To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the heterogeneity of LLM workloads causes a producer-consumer imbalance between the two instance types in such a disaggregated architecture. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource allocation mismatches due to mixed-length requests under high concurrency. Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches, respectively), DOPD improves overall system goodput by up to 1.5×, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while using fewer additional resources.
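The abstract does not spell out how the P/D ratio is computed, so the following is only a minimal sketch of the general idea of load-proportional allocation, not DOPD's actual algorithm. It assumes a fixed pool of instances, a monitored token rate for each stage, and a known per-instance throughput for each stage; the function name `split_instances` and all of its parameters are hypothetical.

```python
def split_instances(total_instances, prefill_tokens_per_s, decode_tokens_per_s,
                    prefill_capacity, decode_capacity):
    """Split a fixed pool into (prefill, decoding) instance counts.

    Hypothetical helper: the counts are proportional to how many instances
    each stage would need to keep up with its observed load. This is an
    illustrative assumption, not the method described in the paper.
    """
    if total_instances < 2:
        raise ValueError("need at least one instance of each type")
    if prefill_capacity <= 0 or decode_capacity <= 0:
        raise ValueError("per-instance capacities must be positive")

    # Instances each stage would need to match its observed token rate.
    prefill_need = prefill_tokens_per_s / prefill_capacity
    decode_need = decode_tokens_per_s / decode_capacity
    total_need = prefill_need + decode_need

    if total_need == 0:
        # No load observed: fall back to an even split.
        n_prefill = total_instances // 2
    else:
        n_prefill = round(total_instances * prefill_need / total_need)

    # Always keep at least one instance of each type alive.
    n_prefill = min(max(n_prefill, 1), total_instances - 1)
    return n_prefill, total_instances - n_prefill


if __name__ == "__main__":
    # Prompt-heavy load: prefill needs 3 instances' worth, decoding needs 2.
    print(split_instances(8, 60_000, 4_000, 20_000, 2_000))
    # Generation-heavy load: decoding dominates.
    print(split_instances(8, 20_000, 12_000, 20_000, 2_000))
```

Re-running such a function on each monitoring window would shift instances toward whichever stage is the current bottleneck. A real system would also have to account for the cost of switching an instance's role and for KV-cache transfer, which this sketch ignores.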

DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

Distributed, Parallel, and Cluster Computing

Makes AI answer questions faster and more reliably.

26 Nov 2025 1

92%

Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving

Distributed, Parallel, and Cluster Computing

Boosts AI chat speed by 77% for balanced delays

4 Aug 2025 2

91%

OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving

Distributed, Parallel, and Cluster Computing

Makes AI answer questions faster and cheaper.

26 Nov 2025 1


Repos / Data Links

github.com

Page Count

14 pages
