OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
By: Siyu Wu, Zihan Tang, Yuting Zeng, and more
Potential Business Impact:
Makes AI answer questions faster and cheaper.
Large Language Models (LLMs) are increasingly deployed in both latency-sensitive online services and cost-sensitive offline workloads. Co-locating these workloads on shared serving instances can improve resource utilization, but directly applying this approach to Prefill/Decode (P/D) disaggregated systems introduces severe load imbalance, as fluctuating request mixes alter the intrinsic P/D ratio. Existing dynamic adjustment techniques cannot keep up with the bursty traffic patterns of online services. We propose a latency-constrained disaggregated architecture that separates cluster resources into latency-strict and latency-relaxed pools based on task latency requirements. This design enables flexible placement of offline decode tasks, mitigating P/D imbalance while preserving online performance. To fully exploit this flexibility, we propose (1) a bottleneck-based scheduler guided by a Roofline-based performance model, and (2) a fast preemption mechanism that strictly enforces Service Level Objectives (SLOs) for online requests. Experiments on real-world traces show that, compared to existing offline serving approaches, our method improves offline throughput by up to 3x while maintaining online request SLOs.
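To make the Roofline-guided bottleneck classification concrete, here is a minimal Python sketch, not the paper's implementation: the types (`GpuSpec`, `Batch`), the routing function, and the hardware numbers are all illustrative assumptions. It shows the basic idea of comparing a batch's arithmetic intensity against the GPU's ridge point and then placing offline decode work, which is typically memory-bound, into a latency-relaxed pool.

```python
# Minimal sketch (illustrative only): Roofline-style bottleneck check plus
# a pool-routing decision. GPU numbers and all names are assumptions.
from dataclasses import dataclass

@dataclass
class GpuSpec:
    peak_flops: float       # peak compute, FLOP/s
    mem_bandwidth: float    # memory bandwidth, bytes/s

@dataclass
class Batch:
    flops: float            # total FLOPs for this step
    bytes_moved: float      # weights + KV-cache bytes read/written
    latency_strict: bool    # True for online (SLO-bound) requests

def attainable_flops(gpu: GpuSpec, batch: Batch) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    arithmetic_intensity = batch.flops / batch.bytes_moved   # FLOP per byte
    return min(gpu.peak_flops, gpu.mem_bandwidth * arithmetic_intensity)

def is_memory_bound(gpu: GpuSpec, batch: Batch) -> bool:
    ridge_point = gpu.peak_flops / gpu.mem_bandwidth         # FLOP per byte
    return (batch.flops / batch.bytes_moved) < ridge_point

def route(batch: Batch, gpu: GpuSpec) -> str:
    """Keep latency-strict work in the strict pool; pack offline decode
    (usually memory-bound) into the latency-relaxed pool."""
    if batch.latency_strict:
        return "latency-strict pool"
    bound = "memory-bound" if is_memory_bound(gpu, batch) else "compute-bound"
    return f"latency-relaxed pool ({bound})"

if __name__ == "__main__":
    gpu = GpuSpec(peak_flops=312e12, mem_bandwidth=2.0e12)   # rough A100-class numbers
    offline_decode = Batch(flops=5e12, bytes_moved=80e9, latency_strict=False)
    print(route(offline_decode, gpu))   # -> latency-relaxed pool (memory-bound)
```

In this sketch the scheduler only needs the ratio of FLOPs to bytes moved per step; decode batches with small arithmetic intensity fall left of the ridge point and can be packed onto latency-relaxed instances without competing for compute with latency-strict prefill work.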
Similar Papers
DOPO: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
Distributed, Parallel, and Cluster Computing
Makes AI answer questions faster and more reliably.
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Distributed, Parallel, and Cluster Computing
Makes AI models run faster and cheaper.