Score: 1

ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Published: November 17, 2025 | arXiv ID: 2511.13940v1

By: Stuart H. Sul, Simran Arora, Benjamin F. Spector, and more

BigTech Affiliations: Stanford University

Potential Business Impact:

Speeds up multi-GPU AI training and inference by making it far easier to write kernels that overlap computation with inter-GPU communication.

Business Areas:
Virtualization Hardware, Information Technology, Software

Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to 2.33× speedup for data- and tensor-parallel workloads, 4.08× for sequence-parallel workloads, and 1.22× for expert-parallel workloads.
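
The central idea in the abstract is compute-communication overlap: keeping the GPU's compute units busy on one chunk of work while the interconnect moves another. The abstract does not spell out PK's eight primitives or its programming template, so the sketch below only illustrates the underlying pattern in plain CUDA, using separate streams, an event dependency, and peer-to-peer copies. All names, sizes, and the two-GPU setup are illustrative assumptions, not PK's API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy compute kernel: scale one chunk of data in place.
__global__ void scale(float* buf, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= a;
}

int main() {
    const int N = 1 << 20;               // elements per chunk (illustrative)
    const int CHUNKS = 4;                // pipeline depth (illustrative)
    const size_t bytes = (size_t)N * sizeof(float);

    // Assumes at least two peer-capable GPUs (e.g. NVLink-connected).
    cudaSetDevice(1);
    float* dst; cudaMalloc(&dst, CHUNKS * bytes);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);    // allow direct GPU0 -> GPU1 transfers
    float* src; cudaMalloc(&src, CHUNKS * bytes);

    // Separate streams: compute on chunk c overlaps the transfer of chunk c-1.
    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);
    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    for (int c = 0; c < CHUNKS; ++c) {
        float* chunk = src + (size_t)c * N;
        // 1) Compute the current chunk on the compute stream.
        scale<<<(N + 255) / 256, 256, 0, compute>>>(chunk, N, 2.0f);
        // 2) Mark the chunk as ready and make the communication stream wait
        //    only on that kernel, not on the whole device.
        cudaEventRecord(ready, compute);
        cudaStreamWaitEvent(comm, ready, 0);
        // 3) Ship the chunk to the peer GPU asynchronously; the next loop
        //    iteration's kernel runs concurrently with this copy.
        cudaMemcpyPeerAsync(dst + (size_t)c * N, 1, chunk, 0, bytes, comm);
    }
    cudaStreamSynchronize(comm);
    printf("all chunks computed and transferred\n");
    return 0;
}
```

Hand-written stream and event choreography like this is exactly what a framework-level abstraction can hide; the paper reports expressing such overlapped multi-GPU kernels in fewer than 50 lines of device code.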

Country of Origin
🇺🇸 United States

Page Count
27 pages

Category
Computer Science:
Distributed, Parallel, and Cluster Computing