Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
By: Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, and others
Potential Business Impact:
Makes LLM serving faster and cheaper by letting tokens in a batch share experts that are already loaded in memory, with no retraining required.
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, where the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes, because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE decode latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use a batch-aware routing scheme in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens in the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach reduces MoE-layer decode latency by $39\%$ and $15\%$, respectively.
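The sketch below illustrates what such batch-aware re-routing could look like; it is a minimal illustration under assumptions, not the paper's actual algorithm. The function name `batch_aware_reroute`, the tolerance parameter `tau`, the greedy substitution rule, and the heuristic of treating each token's top-1 expert as "crucial" are all hypothetical choices made here for clarity.

```python
# Hypothetical sketch (not the paper's exact algorithm): tokens piggyback on
# experts that other tokens in the batch have already forced into memory.
import numpy as np

def batch_aware_reroute(scores: np.ndarray, top_k: int, tau: float = 0.8):
    """scores: (batch, num_experts) router scores for one MoE layer.
    top_k:  number of experts each token would normally activate.
    tau:    hypothetical tolerance -- an already-loaded expert may replace an
            original choice only if its score is >= tau * the original score."""
    batch, _ = scores.shape
    # Original routing: each token's top_k experts by score.
    original = [list(map(int, np.argsort(-scores[i])[:top_k])) for i in range(batch)]

    # Experts that are top-1 for some token must be loaded regardless,
    # so other tokens can piggyback on them at no extra memory-traffic cost.
    loaded = {experts[0] for experts in original}

    rerouted = []
    for i, experts in enumerate(original):
        new_experts = [experts[0]]             # always keep the token's top-1 expert
        for e in experts[1:]:
            if e in loaded:
                new_experts.append(e)          # already loaded: keep as-is
                continue
            # Try to piggyback on a loaded expert whose score is close enough.
            candidates = [c for c in loaded if c not in new_experts
                          and scores[i, c] >= tau * scores[i, e]]
            if candidates:
                new_experts.append(max(candidates, key=lambda c: scores[i, c]))
            else:
                new_experts.append(e)          # no good substitute: load it
                loaded.add(e)                  # later tokens may piggyback on it
        rerouted.append(new_experts)

    n_before = len({e for ex in original for e in ex})
    n_after = len({e for ex in rerouted for e in ex})
    return rerouted, n_before, n_after

# Toy usage: a batch of 16 tokens, 128 experts, top-8 routing.
rng = np.random.default_rng(0)
scores = rng.random((16, 128))
_, n_before, n_after = batch_aware_reroute(scores, top_k=8)
print(n_before, n_after)   # n_after <= n_before: fewer distinct experts to load
```

Because re-routed tokens only ever reuse experts that some token in the batch loads anyway, the set of distinct experts touched per decode step can only shrink, which is the source of the latency reduction in the memory-bound regime.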
Similar Papers
Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
Distributed, Parallel, and Cluster Computing
Speeds up MoE serving in the memory-bound regime by balancing activated experts rather than tokens.
MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices
Machine Learning (CS)
Speeds up MoE inference on edge devices by predicting expert activations with a learned model.
Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs
Machine Learning (CS)
Improves MoE-LLM performance and inference speed via smarter expert routing, without retraining.