Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
By: Wentao Liu, Yuhao Hu, Ruiting Zhou, and more
Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well suited to deploying MoE models with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching, and these costs are difficult to mitigate via simple model partitioning because expert activation is input-dependent. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold-start latency by 47% compared to state-of-the-art baselines.
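The abstract names the Longest Processing Time (LPT) heuristic as part of the joint memory and replica optimization. As a rough illustration only, the minimal Python sketch below applies the standard LPT greedy rule to a hypothetical set of per-expert load estimates spread across a fixed number of serverless workers; the load model and names (lpt_assign, expert_loads) are assumptions for illustration, not the paper's actual formulation.

import heapq

def lpt_assign(expert_loads: dict[str, float], num_workers: int) -> list[list[str]]:
    """Standard Longest Processing Time (LPT) greedy placement.

    expert_loads: hypothetical per-expert load estimates, e.g. activation
        frequency times average CPU latency (an assumed cost model, not
        the one used in the paper).
    num_workers: number of serverless functions to spread experts over.
    Returns the list of experts assigned to each worker.
    """
    # Min-heap of (current total load, worker index).
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment: list[list[str]] = [[] for _ in range(num_workers)]

    # Place experts in descending order of load onto the least-loaded worker.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, w = heapq.heappop(heap)
        assignment[w].append(expert)
        heapq.heappush(heap, (total + load, w))
    return assignment

if __name__ == "__main__":
    # Hypothetical activation-frequency-weighted loads for eight experts.
    loads = {f"expert_{i}": l
             for i, l in enumerate([9.0, 7.5, 6.0, 4.0, 3.5, 2.0, 1.5, 1.0])}
    print(lpt_assign(loads, num_workers=3))

LPT's well-known makespan guarantee on identical machines is one reason it is a common choice for balancing replica load; the paper additionally couples it with a Lagrangian-dual memory optimization, which this sketch does not attempt to reproduce.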
Similar Papers
Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
Distributed, Parallel, and Cluster Computing
Optimizes how MoE model inference is deployed across distributed serverless resources to lower cost.
Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
Distributed, Parallel, and Cluster Computing
Coordinates CPU and GPU execution to run MoE-based LLMs efficiently on memory-limited systems.
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Distributed, Parallel, and Cluster Computing
Loads experts on demand so edge-distributed devices can run MoE inference without caching all parameters.