Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
By: Zehao Fan, Zhenyu Liu, Yunzhen Liu, and more
Potential Business Impact:
Runs large AI models faster when their weights no longer fit in GPU memory.
Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed GPU memory capacity. In that regime, weights must be offloaded to external memory, and fetching them on demand incurs costly, repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier and executing cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems, which are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pinning hot experts in GPU-side HBM and mapping the remainder to CXL-NDP. To fit within NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1-4 bit) based on prefill-stage activation statistics. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device data movement. Evaluation on a GPU-NDP system shows that our approach achieves up to an 8.7-fold decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
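The placement-plus-quantization policy described in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the per-expert prefill token counts, the number of HBM slots, and the 4/2/1-bit thresholds are all hypothetical. It simply ranks experts by prefill activation frequency, pins the hottest ones in GPU HBM at full precision, and assigns progressively lower bitwidths to colder experts destined for CXL-NDP.

```python
# Minimal sketch (assumed names and thresholds, not the paper's system):
# use prefill-stage activation counts to decide expert placement and bitwidth.

from dataclasses import dataclass


@dataclass
class ExpertPlan:
    expert_id: int
    device: str   # "hbm" (pinned on GPU) or "cxl-ndp" (offloaded)
    bits: int     # weight bitwidth: 16 when pinned, 1-4 when offloaded


def plan_experts(prefill_token_counts: dict, hbm_slots: int) -> list:
    """Rank experts by prefill activation frequency, pin the top `hbm_slots`
    in HBM, and map the rest to CXL-NDP with bitwidths that shrink as the
    expert gets colder (illustrative 4/2/1-bit buckets)."""
    ranked = sorted(prefill_token_counts.items(),
                    key=lambda kv: kv[1], reverse=True)
    total = sum(prefill_token_counts.values()) or 1

    plans = []
    for rank, (eid, count) in enumerate(ranked):
        share = count / total
        if rank < hbm_slots:                  # hot: keep on GPU at full precision
            plans.append(ExpertPlan(eid, "hbm", 16))
        elif share > 0.05:                    # warm: 4-bit expert on CXL-NDP
            plans.append(ExpertPlan(eid, "cxl-ndp", 4))
        elif share > 0.01:                    # cool: 2-bit expert on CXL-NDP
            plans.append(ExpertPlan(eid, "cxl-ndp", 2))
        else:                                 # cold: 1-bit expert on CXL-NDP
            plans.append(ExpertPlan(eid, "cxl-ndp", 1))
    return plans


if __name__ == "__main__":
    # Hypothetical prefill statistics: expert id -> tokens routed to it.
    counts = {0: 900, 1: 400, 2: 120, 3: 60, 4: 15, 5: 5}
    for plan in plan_experts(counts, hbm_slots=2):
        print(plan)
```

In a real system the plan would be recomputed per request from the prefill pass and used to stage weights before decoding begins; the thresholds here stand in for whatever policy the paper's quantizer actually applies.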
Similar Papers
BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
Machine Learning (CS)
Lets AI learn more without needing more computer memory.
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Distributed, Parallel, and Cluster Computing
Lets small computers run big AI models.
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Machine Learning (CS)
Makes smart AI run faster on less powerful computers.