
Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models

Published: April 10, 2025 | arXiv ID: 2504.07807v1

By: Hongcheng Guo, Juntao Yao, Boyang Wang and more

Potential Business Impact:

Makes big AI models smaller and faster.

Business Areas:

A/B Testing, Data and Analytics

Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Expert Pruning (C-Prune), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-Prune first performs layer-wise expert clustering, which groups functionally similar experts within each MoE layer using parameter similarity metrics. It then performs global cluster pruning, which eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-Prune effectively reduces model size while outperforming existing MoE pruning methods.
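The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity metric, the clustering procedure, or the importance score, so the choices below are assumptions made purely for illustration. Specifically, the sketch assumes cosine similarity on flattened expert weights, a greedy threshold-based clustering, and a hypothetical importance score equal to the summed routing frequency of a cluster's members. The names `cluster_layer`, `prune_clusters`, `usage`, `keep`, and `threshold` are all hypothetical.

```python
import numpy as np


def cluster_layer(weights, threshold=0.9):
    """Stage 1 (assumed form): greedy clustering within one MoE layer.

    `weights` has shape (n_experts, n_params), one flattened expert per row.
    An expert joins the first cluster whose representative (its first member)
    has cosine similarity >= threshold; otherwise it starts a new cluster.
    """
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    clusters = []
    for i in range(len(unit)):
        for members in clusters:
            if float(unit[i] @ unit[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def prune_clusters(layers, usage, keep, threshold=0.9):
    """Stage 2 (assumed form): rank clusters from all layers together.

    `layers` is a list of per-layer weight arrays, `usage[layer][expert]` is a
    hypothetical routing frequency, and `keep` is the global cluster budget.
    The score used here (summed usage) is a stand-in for the paper's
    unified importance score, which the abstract does not define.
    """
    scored = []
    for layer, weights in enumerate(layers):
        for members in cluster_layer(weights, threshold):
            score = float(sum(usage[layer][m] for m in members))
            scored.append((score, layer, members))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(layer, members) for _, layer, members in scored[:keep]]
```

Because clusters from every layer compete for a single budget, this sketch can remove all clusters from one layer; a practical variant would likely need a per-layer minimum, which is omitted here for brevity.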

Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

Computation and Language

Makes AI models smaller, faster, and just as smart.

9 Apr 2025 0

91%

Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning

Machine Learning (CS)

Teaches computers to learn better from different tasks.

3 Sep 2025 0

91%

DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning

Computation and Language

Makes big AI models smaller without losing smarts.

19 Sep 2025 0


Repos / Data Links

github.com

Page Count

12 pages
