MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts
By: Yushu Zhao, Yubin Qin, Yang Wang, and more
Potential Business Impact:
Makes smart computer programs run much faster.
Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. It halves the number of experts used for unimportant tokens to accelerate inference, while retaining the full expert set for important tokens to preserve model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
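The sketch below illustrates the general big-little routing idea described in the abstract: tokens judged unimportant are served by half as many experts as important tokens, so fewer expert weights need to be moved over the CPU-GPU link. It is a minimal illustration, not the paper's actual algorithm; the importance criterion (router confidence here), the function name, and the threshold are assumptions made for the example.

```python
import torch

def big_little_routing(router_logits, k_big=8, importance_threshold=0.5):
    """Hypothetical sketch of big-little expert selection.

    router_logits: [num_tokens, num_experts] gating scores.
    Important tokens keep the full top-k ("big") expert set; unimportant
    tokens fall back to half as many experts ("little"), reducing the
    volume of expert weights that must be fetched into GPU memory.
    """
    probs = torch.softmax(router_logits, dim=-1)
    # Use the router's own confidence as a stand-in importance signal;
    # MoBiLE's actual token-importance criterion may differ.
    top_prob, _ = probs.max(dim=-1)
    is_important = top_prob >= importance_threshold

    k_little = max(1, k_big // 2)
    experts_big = torch.topk(probs, k_big, dim=-1).indices   # [T, k_big]
    experts_little = experts_big[:, :k_little]               # [T, k_big // 2]

    selected = []
    for t in range(router_logits.size(0)):
        if is_important[t]:
            selected.append(experts_big[t].tolist())
        else:
            selected.append(experts_little[t].tolist())
    return selected, is_important

# Example: 4 tokens routed over 64 fine-grained experts
logits = torch.randn(4, 64)
experts, important = big_little_routing(logits)
print(important, [len(e) for e in experts])
```

In this toy version, unimportant tokens simply keep the highest-ranked half of their top-k experts; the paper additionally describes a fallback and prefetching mechanism for switching between the little and big configurations, which is not modeled here.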
Similar Papers
BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
Machine Learning (CS)
Lets AI learn more without needing more computer memory.
Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
Machine Learning (CS)
Makes AI models run faster and cheaper.
Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster on computers.