Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Published: December 19, 2025 | arXiv ID: 2512.18134v1

By: Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad and more

Potential Business Impact:

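Automatically finds optimal schedules for tensor-core GPU kernels such as Flash Attention, replacing brittle compiler heuristics and hand tuning.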

Business Areas:

DSP Hardware

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.
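To make the idea of scheduling as constraint solving concrete, here is a minimal sketch of software pipelining posed as a search problem. It is not Twill's formulation, which the abstract does not spell out: the three-operation loop body, the unit names, the busy and latency numbers, and the helpers `feasible` and `find_schedule` are all hypothetical, and exhaustive search stands in for an off-the-shelf constraint solver. The sketch covers only SWP (modulo scheduling with a reservation table) and leaves out warp assignment.

```python
from itertools import product

# Hypothetical loop body, one entry per operation:
# name -> (functional unit, cycles the unit stays busy, cycles until the result is ready)
OPS = {
    "load":    ("copy_engine", 2, 4),
    "mma":     ("tensor_core", 3, 3),
    "softmax": ("cuda_cores",  2, 2),
}
# (producer, consumer, iteration distance); distance 0 means the same iteration
DEPS = [("load", "mma", 0), ("mma", "softmax", 0)]


def feasible(starts, ii):
    """Check one candidate schedule against dependence and resource constraints."""
    # Dependence constraint: a consumer may not start before its input is ready.
    for src, dst, dist in DEPS:
        latency = OPS[src][2]
        if starts[dst] + ii * dist < starts[src] + latency:
            return False
    # Resource constraint: with a new iteration issued every ii cycles,
    # no unit may be claimed twice in the same cycle modulo ii.
    reserved = set()
    for name, (unit, busy, _) in OPS.items():
        if busy > ii:
            return False
        for c in range(busy):
            slot = (unit, (starts[name] + c) % ii)
            if slot in reserved:
                return False
            reserved.add(slot)
    return True


def find_schedule(max_ii=8, horizon=12):
    """Return (ii, starts) for the smallest feasible initiation interval, or None."""
    names = list(OPS)
    for ii in range(1, max_ii + 1):
        for combo in product(range(horizon), repeat=len(names)):
            starts = dict(zip(names, combo))
            if feasible(starts, ii):
                return ii, starts
    return None


if __name__ == "__main__":
    ii, starts = find_schedule()
    for name, start in starts.items():
        print(f"{name}: cycle {start}, pipeline stage {start // ii}")
    print("initiation interval:", ii)
```

With these made-up numbers the search settles on an initiation interval of 3 cycles, bounded by the tensor core's busy time, with `load`, `mma`, and `softmax` falling into pipeline stages 0, 1, and 2. Because candidate intervals are tried in increasing order and every start time within the horizon is enumerated, the first hit is the smallest feasible interval for this toy model. The sketch minimizes only the interval, not total schedule length.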

Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References

Machine Learning (CS)

16 Oct 2025

Hardware vs. Software Implementation of Warp-Level Features in Vortex RISC-V GPU

Hardware Architecture

6 May 2025

Page Count

15 pages
