MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
By: Krishna Teja Chitty-Venkata, Sylvia Howland, Golara Azar, and more
Potential Business Impact:
Makes big AI models faster and cheaper to run by testing the best ways to use their many expert parts.
Mixture of Experts (MoE) models have enabled the scaling of Large Language Models (LLMs) and Vision Language Models (VLMs) to massive parameter counts while maintaining computational efficiency. However, MoEs introduce several inference-time challenges, including load imbalance across experts and the additional computational overhead of routing. To address these challenges and fully harness the benefits of MoE, a systematic evaluation of hardware acceleration techniques is essential. We present MoE-Inference-Bench, a comprehensive study that evaluates MoE performance across diverse scenarios. We analyze the impact of batch size, sequence length, and critical MoE hyperparameters, such as FFN dimension and number of experts, on throughput. We evaluate several optimization techniques on NVIDIA H100 GPUs, including pruning, fused MoE operations, speculative decoding, quantization, and various parallelization strategies. Our evaluation covers MoEs from the Mixtral, DeepSeek, OLMoE, and Qwen families. The results reveal performance differences across configurations and provide insights for the efficient deployment of MoEs.
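To make the MoE-specific knobs in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of an MoE feed-forward layer with top-k routing. The hyperparameters (num_experts, ffn_dim, top_k) and the batch sweep mirror the axes the benchmark varies; the class name, parameter names, and the toy sweep are illustrative assumptions, not the paper's benchmark code or any library's actual API.

```python
# Illustrative sketch only: a per-token top-k routed MoE FFN layer.
# Names and shapes are assumptions for explanation, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Each expert is an independent two-layer FFN of width ffn_dim.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                              # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize the selected scores
        out = torch.zeros_like(x)
        # Naive dispatch loop; fused MoE kernels replace exactly this pattern.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Toy sweep over batch size at a fixed sequence length, analogous to the
# benchmark's batch-size / sequence-length axes (sizes here are arbitrary).
if __name__ == "__main__":
    layer = MoEFeedForward(hidden_dim=512, ffn_dim=2048, num_experts=8, top_k=2)
    for batch in (1, 8, 32):
        tokens = torch.randn(batch * 128, 512)               # batch x sequence length 128
        with torch.no_grad():
            _ = layer(tokens)
```

The nested dispatch loop makes the two inference costs the abstract highlights visible: every token pays for routing, and an uneven token-to-expert assignment leaves some experts idle while others become the bottleneck.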
Similar Papers
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster and smoother.
Faster MoE LLM Inference for Extremely Large Models
Computation and Language
Makes AI faster by using fewer parts.
MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Computer Vision and Pattern Recognition
Makes AI understand pictures and words better, faster.