ESS: An Offload-Centric Latent-Cache Management Architecture for DeepSeek-V3.2-Exp
By: Xinhang Chen, Chao Zhang, Jiahuan He, and more
Potential Business Impact:
Makes AI understand long stories faster and cheaper.
DeepSeek-V3.2-Exp introduces a sparse attention mechanism that significantly reduces inference latency in long-context scenarios. Although overall throughput has improved greatly, the Decode stage of PD disaggregation remains a major bottleneck. This bottleneck primarily stems from the conflict between the linear growth of the Latent-Cache with sequence length and limited GPU memory capacity, which constrains the feasible batch size and thereby suppresses Decode-stage throughput. To address this challenge, we propose ESS (Extended Sparse Server), an offload-centric system design tailored for DeepSeek-V3.2-Exp. ESS selectively offloads the Latent-Cache to CPU memory while preserving latency-critical components on GPU. By freeing up GPU memory, ESS effectively decouples batch-size scaling from GPU memory constraints. This design significantly improves Decode-stage throughput, thereby reducing deployment costs in real-world settings. Our high-fidelity simulations show that ESS delivers a 69.4% throughput improvement at 32K context length and up to a 123% throughput improvement at 128K, demonstrating its effectiveness for large-context inference workloads. These results highlight ESS as a practical and scalable solution for long-context LLM serving.
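The core idea, keeping the full Latent-Cache in host memory and staging only the tokens needed by the sparse attention step back onto the GPU, can be illustrated with a minimal sketch. This is not the authors' implementation; the class and method names (LatentCacheManager, append, gather_selected) and the pinned-buffer layout are hypothetical assumptions made for illustration only.

```python
import torch


class LatentCacheManager:
    """Hypothetical helper illustrating offload-centric cache management:
    the per-request Latent-Cache lives in pinned CPU memory, and only the
    tokens selected by the sparse-attention index are copied to the GPU
    at each decode step."""

    def __init__(self, max_tokens: int, latent_dim: int, device: str):
        self.device = device
        self.use_cuda = device.startswith("cuda")
        # Pre-allocated pinned host buffer so host-to-device copies can be async.
        self.cpu_cache = torch.empty(max_tokens, latent_dim,
                                     pin_memory=self.use_cuda)
        self.length = 0
        # Separate stream lets cache transfers overlap with decode compute.
        self.copy_stream = torch.cuda.Stream() if self.use_cuda else None

    def append(self, latent: torch.Tensor) -> None:
        """Offload newly produced latent vectors (shape [t, d]) to CPU memory."""
        t = latent.shape[0]
        self.cpu_cache[self.length:self.length + t].copy_(latent, non_blocking=True)
        self.length += t

    def gather_selected(self, token_idx: torch.Tensor) -> torch.Tensor:
        """Copy only the tokens chosen by the sparse-attention index to the GPU."""
        selected = self.cpu_cache[:self.length].index_select(0, token_idx.cpu())
        if not self.use_cuda:
            return selected
        with torch.cuda.stream(self.copy_stream):
            gpu_view = selected.to(self.device, non_blocking=True)
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        return gpu_view


if __name__ == "__main__":
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    mgr = LatentCacheManager(max_tokens=1024, latent_dim=512, device=dev)
    for _ in range(8):                      # simulate 8 decode steps of one request
        mgr.append(torch.randn(1, 512))     # each new latent is offloaded to CPU
    idx = torch.tensor([0, 3, 7])           # tokens picked by the sparse index
    print(mgr.gather_selected(idx).shape)   # -> torch.Size([3, 512])
```

Because the GPU only holds the small, per-step selection rather than the full sequence-length cache, the feasible batch size is no longer bounded by GPU memory, which is the mechanism behind the reported Decode-stage throughput gains.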
Similar Papers
Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster on computers.
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Machine Learning (CS)
Makes smart AI run faster on less powerful computers.
OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
Distributed, Parallel, and Cluster Computing
Makes AI answer questions faster and cheaper.