From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing

Published: September 29, 2025 | arXiv ID: 2510.03293v1

By: Rana Shahout, Colin Cai, Yilun Du, and more

Potential Business Impact:

Makes AI faster and cheaper by sharing work.

Business Areas:

Laser Hardware, Science and Engineering

Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden to inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate's score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, translating into lower latency and higher throughput, while keeping accuracy changes negligible.
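
The routing idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not say how LASER measures the shape of the score distribution, so the function `route_token`, the `peak_threshold` test on top-k probability mass, and the `mass` cutoff used to widen the candidate set are all hypothetical choices made here for illustration.

```python
import math


def softmax(scores):
    """Convert raw gate scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, loads, k=2, peak_threshold=0.5, mass=0.9):
    """Pick k experts for one token and update `loads` in place.

    Assumed heuristic (not taken from the paper):
    - if the top-k experts hold at least `peak_threshold` of the
      probability mass, the gate has a clear preference: keep the top-k;
    - otherwise widen the candidate set to the smallest score-ranked
      prefix covering `mass`, and pick the k least-loaded candidates.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_k = ranked[:k]

    if sum(probs[i] for i in top_k) >= peak_threshold:
        # Clear preference: route to the strongest experts.
        chosen = top_k
    else:
        # Flat scores: broaden the set of viable experts.
        candidates, cumulative = [], 0.0
        for i in ranked:
            candidates.append(i)
            cumulative += probs[i]
            if cumulative >= mass and len(candidates) >= k:
                break
        # Prefer lightly loaded experts; break ties by higher score.
        chosen = sorted(candidates, key=lambda i: (loads[i], -probs[i]))[:k]

    for i in chosen:
        loads[i] += 1
    return chosen


# Peaked scores: expert 0 is kept even though it is the busiest.
loads = [10, 0, 3, 5]
peaked = route_token([5.0, 0.0, 0.0, 0.0], loads, k=1)

# Near-uniform scores: the token goes to the least-loaded expert instead.
flat = route_token([0.1, 0.0, 0.0, 0.0], loads, k=1)
```

Because the sketch reads only gate scores and a per-expert load counter, it needs no change to model weights, which matches the plug-and-play, inference-time property the abstract describes. A real system would track load per device and per batch rather than with a simple counter.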

Load Balancing Mixture of Experts with Similarity Preserving Routers

Machine Learning (CS)

Makes AI learn faster and smarter.

16 Jun 2025 2

91%

Stable-MoE: Lyapunov-based Token Routing for Distributed Mixture-of-Experts Training over Edge Networks

Distributed, Parallel, and Cluster Computing

Makes smart devices learn faster with less power.

7 Dec 2025 0

90%

Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

Machine Learning (CS)

Makes AI faster by sharing computer brain parts.

4 Nov 2025 1

Page Count

18 pages
