Score: 2

Iris: First-Class Multi-GPU Programming Experience in Triton

Published: November 16, 2025 | arXiv ID: 2511.12500v1

By: Muhammad Awad, Muhammad Osama, Brandon Potter

BigTech Affiliations: AMD

Potential Business Impact:

Makes it faster and easier to write programs that split work across multiple GPUs.

Business Areas:
GPU Hardware

Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. Iris provides tile-based symmetric memory abstractions that naturally align with Triton's programming model, enabling developers to write single-source kernels that seamlessly interleave computation and communication. We demonstrate a taxonomy of compute-communication overlap patterns, from bulk-synchronous to fine-grained workgroup specialization, that can be implemented with minimal code changes in Iris, often requiring just a few additional lines within the same Triton kernel. Our evaluation shows that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL for GEMM+All-Scatter workloads, demonstrating that high-level implementations can match or exceed heavily optimized libraries while dramatically simplifying multi-GPU programming.
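To make the single-source idea concrete, here is a minimal sketch in plain Triton, not the actual Iris API: one kernel computes a GEMM tile and immediately stores it out, in the spirit of the GEMM+All-Scatter pattern. Under Iris's symmetric-memory scheme such a store could target a peer GPU's copy of the buffer; here remote_ptr is an ordinary local tensor standing in for that peer mapping, so the example runs on a single GPU. The kernel name, remote_ptr, and the driver code are all illustrative assumptions.

```python
# Sketch only: standard Triton, with remote_ptr as a hypothetical stand-in
# for a peer-mapped symmetric-memory buffer (not Iris's real API).
import torch
import triton
import triton.language as tl


@triton.jit
def gemm_scatter_kernel(
    a_ptr, b_ptr, local_ptr, remote_ptr,  # remote_ptr: stand-in for a peer buffer
    M, N, K,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance owns one BLOCK_M x BLOCK_N output tile.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        rk = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + rm[:, None] * K + rk[None, :],
                    mask=(rm[:, None] < M) & (rk[None, :] < K), other=0.0)
        b = tl.load(b_ptr + rk[:, None] * N + rn[None, :],
                    mask=(rk[:, None] < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)

    offs = rm[:, None] * N + rn[None, :]
    mask = (rm[:, None] < M) & (rn[None, :] < N)
    # Compute and "communication" interleaved in one kernel: the finished
    # tile is written locally and pushed to the stand-in remote buffer.
    tl.store(local_ptr + offs, acc, mask=mask)
    tl.store(remote_ptr + offs, acc, mask=mask)


# Single-GPU driver: "remote" is just a second tensor in this sketch.
M = N = K = 256
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")
local = torch.empty(M, N, device="cuda")
remote = torch.empty_like(local)
grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
gemm_scatter_kernel[grid](a, b, local, remote, M, N, K,
                          BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```

Because each tile is pushed out as soon as it is produced, communication proceeds while other program instances are still computing, which is the kind of overlap the abstract's taxonomy describes; the few extra lines are just the second store, consistent with the paper's claim that patterns need only minor changes within the same Triton kernel.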

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
15 pages

Category
Computer Science: Distributed, Parallel, and Cluster Computing