Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving
By: Ziming Liu, Boyu Tian, Guoteng Wang, and more
Potential Business Impact:
Makes smart computer programs run faster and more reliably.
Mixture-of-Experts (MoE) models challenge serving infrastructures with dynamic, sparse expert utilization, causing instability on conventional systems designed for dense architectures. We propose EaaS, a novel serving system for efficient, scalable, and robust MoE deployment. Our system disaggregates MoE modules into independent, stateless services. This design enables fine-grained resource scaling and provides inherent fault tolerance by decoupling compute units. The architecture is powered by a high-performance, CPU-free peer-to-peer communication library that ensures minimal overhead and high throughput. Experiments confirm that EaaS achieves performance comparable to monolithic systems while providing robust fault tolerance and strong scalability. It incurs less than a 2% throughput reduction under simulated hardware failures that would halt monolithic architectures, and saves up to 37.5% of computing resources through dynamic, fine-grained adaptation to serving traffic, demonstrating strong resilience for large-scale MoE deployment in production.
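The disaggregated design described above can be illustrated with a minimal sketch: a top-k gating router dispatches each token to independently callable, stateless experts. This is not EaaS's actual API; the expert functions, names, and routing logic here are purely illustrative assumptions, with experts simulated as local pure functions rather than networked services.

```python
import numpy as np

# Hypothetical sketch of MoE-style routing to stateless expert services.
# In a disaggregated system like EaaS, each expert would be an independent
# service endpoint; here they are simulated as pure functions.

def make_expert(weight: float):
    """A stateless 'expert': a pure function of its input, no retained state."""
    return lambda x: weight * x

NUM_EXPERTS, TOP_K = 4, 2
experts = [make_expert(w) for w in (0.5, 1.0, 1.5, 2.0)]

def route(token: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Dispatch a token to its top-k experts and mix their outputs."""
    top = np.argsort(gate_logits)[-TOP_K:]   # indices of the top-k gates
    weights = np.exp(gate_logits[top])
    weights /= weights.sum()                 # softmax over the selected gates
    # Because experts hold no state, any replica can serve a call; on a
    # failure, the router could re-dispatch to another replica of the expert.
    return sum(w * experts[i](token) for w, i in zip(weights, top))

token = np.ones(3)
out = route(token, np.array([0.1, 2.0, 0.3, 1.5]))
```

Statelessness is what makes the fault-tolerance claim plausible: since an expert call carries all the context it needs, a crashed expert instance can be replaced or re-routed without coordinating any in-flight state.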
Similar Papers
Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
Distributed, Parallel, and Cluster Computing
Makes AI models run much faster and cheaper.
ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models
Distributed, Parallel, and Cluster Computing
Lets big AI models grow and shrink instantly.
Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
Distributed, Parallel, and Cluster Computing
Saves money running smart computer programs.