FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion
By: Zhuoran Zhu, Chunyang Zhu, Hao Lin, et al.
Large-scale Mixture-of-Experts (MoE) models rely on \emph{expert parallelism} for efficient training and inference, which splits experts across devices and necessitates distributed data shuffling to route each token to its assigned experts. However, existing communication libraries handle this shuffling poorly; its overhead can account for over half of end-to-end runtime. We present FUSCO, an MoE-friendly communication library that achieves efficient and lightweight data shuffling by fusing data transformation with communication, based on the key observation that MoE's expert-major data layout conflicts with the device-major layout expected by communication operations. FUSCO captures the fine-grained data layout, which a pipelined communication engine then interprets to perform the required shuffling efficiently along the communication path. Lightweight planning and load-balancing mechanisms complement the engine by eliminating redundant communication and dispersing traffic. Evaluations on representative benchmarks show that FUSCO achieves up to 3.84$\times$ and 2.01$\times$ speedups over NCCL and DeepEP (the state-of-the-art MoE communication library), respectively. In end-to-end MoE tasks, compared to NCCL and DeepEP, FUSCO reduces training latency by 1.17--1.39$\times$ and 1.10--1.19$\times$, and lowers first-token generation latency in inference by 1.09--1.25$\times$ and 1.06--1.16$\times$.
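To make the layout conflict concrete, below is a minimal NumPy sketch (illustrative only, not FUSCO's API) of a standard all-to-all dispatch, assuming a contiguous expert-to-device mapping. The receive buffer arrives grouped by source device (device-major), while the local expert computation needs tokens grouped by expert (expert-major), so a separate permutation pass is normally required; this is the transformation that FUSCO proposes to fuse into the communication path.

```python
# Hypothetical sketch of the expert-major vs. device-major layout conflict.
# All names and shapes are illustrative assumptions, not FUSCO's interface.
import numpy as np

num_devices = 4
experts_per_device = 2
num_experts = num_devices * experts_per_device
hidden = 8
tokens_per_device = 16
rng = np.random.default_rng(0)

# Each source device holds tokens routed to arbitrary global experts.
src_tokens = [rng.standard_normal((tokens_per_device, hidden)) for _ in range(num_devices)]
src_expert = [rng.integers(0, num_experts, tokens_per_device) for _ in range(num_devices)]

target_device = 0  # simulate what device 0 receives from the all-to-all
recv_chunks, recv_experts = [], []
for src in range(num_devices):
    # Sender packs one contiguous chunk per destination device (device-major send layout).
    mask = src_expert[src] // experts_per_device == target_device
    recv_chunks.append(src_tokens[src][mask])
    recv_experts.append(src_expert[src][mask])

# Receive buffer: grouped by source device, with experts interleaved inside each chunk.
recv_buf = np.concatenate(recv_chunks)
recv_eid = np.concatenate(recv_experts)

# Expert-major layout needed by the local grouped GEMM requires an extra permutation,
# which FUSCO's fused engine performs along the communication path instead.
perm = np.argsort(recv_eid, kind="stable")
expert_major = recv_buf[perm]

print("received (device-major) shape:", recv_buf.shape)
print("expert-major shape after permutation:", expert_major.shape)
print("tokens per local expert:",
      np.bincount(recv_eid - target_device * experts_per_device, minlength=experts_per_device))
```

In this sketch the permutation materializes a second copy of the received tokens; the abstract's point is that such standalone transformation passes, repeated at dispatch and combine, are what fused transformation-communication avoids.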