DuoServe-MoE: Dual-Phase Expert Prefetch and Cache Scheduling for Efficient MoE LLM Inference
By: Yuning Zhang, Grant Pinkert, Nan Yang, and more
Potential Business Impact:
Makes AI faster and uses less computer memory.
Large Language Models (LLMs) have demonstrated impressive performance across a wide range of deep learning tasks. Mixture of Experts (MoE) further enhances their capabilities by increasing model width through sparsely activated expert branches, which keeps inference computation efficient. However, the large number of expert weights introduces significant GPU memory pressure, especially in resource-constrained environments such as single-GPU servers. More importantly, MoE inference consists of two fundamentally different stages: a prefill stage where most experts are activated densely, and a decode stage where only a few experts are triggered sparsely. Treating these stages with a uniform scheduling strategy often leads to suboptimal latency and memory usage. To address this, we propose DuoServe-MoE, an inference serving system that explicitly separates the prefill and decode stages and applies a tailored expert scheduling strategy to each. In the prefill stage, DuoServe-MoE uses a two-stream CUDA pipeline that overlaps expert weight prefetching with the computation of non-MoE layers, limiting expert residency in GPU memory. In the decode stage, a lightweight layer-level predictor trained offline from activation traces prefetches only the experts most likely to be activated, without requiring any changes to the model. Experiments on 4-bit Mixtral-8x7B and 8x22B models show that DuoServe-MoE reduces end-to-end latency by 1.42x to 7.54x while keeping peak memory usage at only 15 percent of the full model size.
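To make the prefill-stage pipeline concrete, here is a minimal sketch, assuming a PyTorch-style serving loop: a dedicated copy stream prefetches one layer's expert weights from pinned host memory while the default stream runs that layer's non-MoE (attention) computation. The names `attention_block`, `moe_block`, and the shared expert buffer are illustrative assumptions, not DuoServe-MoE's actual API.

```python
# Sketch of a two-stream prefill pipeline: overlap expert-weight H2D copies
# with non-MoE compute, keeping only one layer's experts resident at a time.
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to weight transfers

def prefill_layer(hidden, layer, expert_weights_cpu, gpu_expert_buf):
    """Run one transformer layer, overlapping expert prefetch with attention."""
    # Do not overwrite the shared GPU buffer until the previous layer's MoE
    # block (running on the default stream) has finished reading it.
    copy_stream.wait_stream(torch.cuda.current_stream())

    # 1) Issue async copies of this layer's expert weights on the copy stream.
    #    non_blocking=True only overlaps with compute if the CPU tensors are pinned.
    with torch.cuda.stream(copy_stream):
        for name, cpu_weight in expert_weights_cpu.items():
            gpu_expert_buf[name].copy_(cpu_weight, non_blocking=True)

    # 2) Meanwhile, the default stream computes the non-MoE part of the layer.
    hidden = layer.attention_block(hidden)

    # 3) The MoE block must not start before its expert weights have arrived.
    torch.cuda.current_stream().wait_stream(copy_stream)
    hidden = layer.moe_block(hidden, gpu_expert_buf)

    # Because gpu_expert_buf is reused layer after layer, expert residency in
    # GPU memory stays bounded well below the full model size.
    return hidden
```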
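For the decode stage, the following is a minimal sketch of the predictor-driven prefetch idea: a tiny per-layer model, trained offline on recorded expert-activation traces, scores which experts the next token is likely to route to, and only those experts are fetched. The linear-probe architecture and the `expert_cache` interface are assumptions for illustration; the paper specifies only a lightweight layer-level predictor learned from traces.

```python
# Sketch of decode-stage expert prediction and selective prefetching.
import torch
import torch.nn as nn

class ExpertActivationPredictor(nn.Module):
    """Per-layer multi-label probe: hidden state -> per-expert activation score."""
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_experts)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # Independent per-expert probabilities (multi-label, not softmax),
        # trained offline with binary cross-entropy against activation traces.
        return torch.sigmoid(self.scorer(hidden_state))

def prefetch_likely_experts(predictor, hidden_state, expert_cache, k=2):
    """Prefetch only the k experts most likely to fire for the next token.

    hidden_state: 1-D hidden vector of the current token at this layer.
    expert_cache: hypothetical cache object tracking which experts are on GPU.
    """
    with torch.no_grad():
        scores = predictor(hidden_state)
    likely = torch.topk(scores, k=k).indices.tolist()
    for expert_id in likely:
        if expert_id not in expert_cache.resident:    # skip already-cached experts
            expert_cache.load_to_gpu(expert_id)       # async host-to-device copy
    return likely
```

In a serving loop, the predictor for layer l+1 could be invoked as soon as layer l's hidden state is available, so the selective copies overlap with the remaining per-token compute rather than sitting on the critical path.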
Similar Papers
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Distributed, Parallel, and Cluster Computing
Lets small computers run big AI models.
BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
Machine Learning (CS)
Runs big AI models faster without needing more computer memory.
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.