Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
By: En-Ming Huang, Li-Shang Lin, Chun-Yi Lee
Large Language Models (LLMs) have achieved impressive results across a wide range of tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture-of-Experts (MoE) models offer an efficient alternative by selectively activating subsets of their parameters, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require memory well beyond typical consumer GPU capacities. Traditional offloading methods, which transfer model weights between the CPU and GPU, introduce latency that limits inference performance. This paper presents a novel CPU-GPU collaborative inference framework that incorporates an expert caching mechanism on the GPU to reduce data transfers and enable faster inference through cache hits. On a cache miss, computation is offloaded to the CPU, which handles it efficiently thanks to multithreading optimizations. Our evaluations demonstrate performance improvements and highlight the potential of CPU-GPU collaboration to maximize hardware utilization for single-request inference on consumer-grade systems. The implementation of our framework is available at https://github.com/elsa-lab/MoE-CPU-GPU-Collaborative-Inference.
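The hit/miss dispatch described in the abstract can be pictured with a minimal PyTorch sketch. This is an illustration under stated assumptions, not the authors' implementation: the class names, top-1 routing, single-token (batch-1) decoding, and the FIFO admission policy are all placeholders; the actual framework in the linked repository may differ.

import copy

import torch
import torch.nn as nn


class Expert(nn.Module):
    """A feed-forward expert whose weights stay on the CPU by default."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class CachedMoELayer(nn.Module):
    """Keeps up to `cache_size` experts resident on the GPU.

    Cache hit:  run the expert on the GPU, with no weight transfer.
    Cache miss: run the expert on the CPU instead of copying its weights
                to the GPU, so only the small activation crosses the bus.
    """

    def __init__(self, num_experts: int, d_model: int, d_ff: int,
                 cache_size: int, device: str = "cuda"):
        super().__init__()
        self.cpu_experts = nn.ModuleList(
            Expert(d_model, d_ff) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts).to(device)
        self.cache_size = cache_size
        self.device = device
        self.gpu_cache: dict[int, Expert] = {}  # expert id -> GPU copy

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Top-1 routing of a single decode token, x: [1, d_model] on GPU.
        expert_id = int(self.router(x).argmax(dim=-1))
        cached = self.gpu_cache.get(expert_id)
        if cached is not None:
            return cached(x)  # cache hit: the token never leaves the GPU
        # Cache miss: compute on the CPU and move only the activation back.
        y = self.cpu_experts[expert_id](x.cpu()).to(self.device)
        self._admit(expert_id)
        return y

    def _admit(self, expert_id: int) -> None:
        """FIFO admission/eviction; a placeholder for the real cache policy."""
        if len(self.gpu_cache) >= self.cache_size:
            self.gpu_cache.pop(next(iter(self.gpu_cache)))
        self.gpu_cache[expert_id] = copy.deepcopy(
            self.cpu_experts[expert_id]).to(self.device)


# Usage (single-token decode step):
#   layer = CachedMoELayer(num_experts=64, d_model=1024, d_ff=4096, cache_size=8)
#   y = layer(torch.randn(1, 1024, device="cuda"))

The point of the miss path is that a [1, d_model] activation is orders of magnitude smaller than an expert's weight matrices, so handing the miss to the (multithreaded) CPU avoids the PCIe transfer that makes traditional weight offloading slow.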
Similar Papers
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
Distributed, Parallel, and Cluster Computing
Loads experts on demand to run MoE inference across edge devices without an expert cache.
MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts
Computation and Language
Accelerates MoE inference on consumer GPUs using a mixture of big and little experts.
BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference
Machine Learning (CS)
Exploits expert redundancy to accelerate MoE inference under memory constraints.