ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
By: Xinhao Luo, Zihan Liu, Yangjie Zhou, and more
Potential Business Impact:
Makes AI models answer questions much faster.
Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions and lack structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to remain on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand the operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.
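The abstract does not show the primitives themselves, so the following CUDA sketch is only a rough illustration of the kind of on-chip reduction a ClusterReduce-style primitive targets, not the paper's actual API: the kernel name cluster_reduce_sum, the element count kElems, and the cluster size of 4 are assumptions. It sums partial vectors held by the thread blocks of one Hopper cluster through distributed shared memory, so only the final result touches off-chip memory.

// Illustrative sketch only (not the paper's implementation): a ClusterReduce-style
// sum across the thread blocks of one Hopper cluster, using distributed shared
// memory via cooperative_groups. Compile with nvcc -arch=sm_90.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int kElems = 256;   // elements reduced per block (hypothetical size)

__global__ void __cluster_dims__(4, 1, 1)
cluster_reduce_sum(const float* __restrict__ in, float* __restrict__ out) {
  __shared__ float partial[kElems];             // this block's partial results
  cg::cluster_group cluster = cg::this_cluster();
  const unsigned rank = cluster.block_rank();

  // 1. Each block stages its slice of the input in its own shared memory.
  for (int i = threadIdx.x; i < kElems; i += blockDim.x)
    partial[i] = in[rank * kElems + i];
  cluster.sync();                               // all partials visible cluster-wide

  // 2. Block 0 reads every peer block's shared memory directly over the
  //    intra-cluster interconnect and accumulates, never touching HBM.
  if (rank == 0) {
    for (unsigned r = 1; r < cluster.num_blocks(); ++r) {
      float* peer = cluster.map_shared_rank(partial, r);
      for (int i = threadIdx.x; i < kElems; i += blockDim.x)
        partial[i] += peer[i];
    }
    for (int i = threadIdx.x; i < kElems; i += blockDim.x)
      out[i] = partial[i];                      // only the final result hits DRAM
  }
  cluster.sync();    // keep peer blocks resident until block 0 finishes reading
}

The grid must be launched as a multiple of the cluster size, and in the framework described by the abstract such a reduction would be fused with the surrounding projection and attention stages inside one kernel rather than exposed as a standalone launch.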
Similar Papers
FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection
Distributed, Parallel, and Cluster Computing
Makes computer brains learn much faster.
ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation
Computation and Language
Helps computers group words by meaning better.
LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
Distributed, Parallel, and Cluster Computing
Makes giant AI models run much faster.