Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
By: Zhibin Wang, Zhonghui Zhang, Yuhang Zhou, and more
Potential Business Impact:
Makes large AI models generate text much faster on memory-limited computers.
Recent advances in Mixture-of-Experts (MoE) models have significantly increased both their parameter scale and their performance. Numerous offloading techniques have been proposed to work around the GPU memory limitations of MoE inference. However, due to the I/O bottleneck and the sparse computation pattern of MoE models, existing offloading techniques still suffer from low hardware utilization. To fully utilize hardware resources, we propose SpecMoEOff, which employs speculative decoding to enlarge the workload of each expert. SpecMoEOff orchestrates the GPU and CPU through both theoretical and empirical roofline analysis. In addition, we develop a dedicated CPU chunked-attention verification kernel to fit speculative decoding to offloading scenarios while minimizing the additional overhead introduced by draft models. SpecMoEOff further integrates an optimizer that automatically tunes the hyperparameters of speculative decoding for a given hardware configuration and workload. Experimental results show that SpecMoEOff achieves up to a 2.5x improvement in decode throughput over state-of-the-art MoE offloading techniques.
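The core idea of combining speculative decoding with offloading is that each expensive expert-weight transfer can be amortized over the several drafted tokens verified in a single target-model pass, which matters when decode is I/O-bound. The sketch below is not the paper's actual optimizer or roofline model; the function names (choose_draft_length and friends), the standard speculative-decoding acceptance formula, and all hardware numbers are illustrative assumptions used to show how a roofline-style cost model could pick a draft length.

```python
# Hypothetical sketch (not SpecMoEOff's implementation): choose a speculative
# draft length k that maximizes expected decode throughput under a simple
# roofline-style cost model for offloaded MoE experts.

def expected_accepted_tokens(k: int, alpha: float) -> float:
    """Expected tokens produced per verification step with draft length k and
    per-token acceptance rate alpha (standard speculative-decoding estimate)."""
    if alpha >= 1.0:
        return k + 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_time_s(k, expert_bytes, pcie_gbps, verify_flops_per_token,
                gpu_tflops, draft_s_per_token):
    """Roofline-style estimate: one verification step is bounded by either the
    expert-weight transfer (I/O) or GPU compute, plus the draft model's time."""
    io_s = expert_bytes / (pcie_gbps * 1e9)                      # weights moved once per step
    compute_s = (k + 1) * verify_flops_per_token / (gpu_tflops * 1e12)
    return max(io_s, compute_s) + k * draft_s_per_token


def choose_draft_length(alpha, expert_bytes, pcie_gbps,
                        verify_flops_per_token, gpu_tflops,
                        draft_s_per_token, k_max=16):
    """Sweep candidate draft lengths and return the throughput-optimal one."""
    best_k, best_tps = 0, 0.0
    for k in range(k_max + 1):
        tps = expected_accepted_tokens(k, alpha) / step_time_s(
            k, expert_bytes, pcie_gbps, verify_flops_per_token,
            gpu_tflops, draft_s_per_token)
        if tps > best_tps:
            best_k, best_tps = k, tps
    return best_k, best_tps


# Example with made-up numbers: 8 GB of offloaded expert weights per step over
# a 25 GB/s link and a 70% acceptance rate -- I/O dominates, so the model
# favors longer drafts that amortize each transfer over more verified tokens.
print(choose_draft_length(alpha=0.7, expert_bytes=8e9, pcie_gbps=25,
                          verify_flops_per_token=2e10, gpu_tflops=100,
                          draft_s_per_token=2e-3))
```

In this toy model the I/O term is independent of the draft length while the expected output grows with it, which is why speculation raises throughput until draft-model overhead and falling acceptance catch up; the paper's actual optimizer tunes these hyperparameters from measured hardware and workload characteristics rather than fixed constants.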
Similar Papers
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Machine Learning (CS)
Makes smart AI run faster on less powerful computers.
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Distributed, Parallel, and Cluster Computing
Lets small computers run big AI models.