Cyclotron: Compilation of Recurrences to Distributed and Systolic Architectures
By: Shiv Sundram, Akhilesh Balasingam, Nathan Zhang, and others
Potential Business Impact:
Makes computers process data faster on many chips.
We present Cyclotron, a framework and compiler that uses recurrence equations to express streaming dataflow algorithms, which are then portably compiled to distributed topologies of interlinked processors. The framework provides an input language of recurrences over logical tensors, which is lowered into an intermediate language of recurrences over logical iteration spaces, and finally into per-processor programs of send, receive, and compute operations. In Cyclotron's IR, programs are optimized so that external memory interactions are confined to the boundaries of the iteration space. Within the inner iteration space, all data accesses become local: they target values residing in fast local memory or on neighboring processing units, avoiding costly memory movement. A scheduling language lets users define how data is streamed and broadcast between processors, enabling pipelined execution of computation kernels over distributed topologies of processing elements. We demonstrate the portability of our approach by compiling our IR both to a reconfigurable simulator of systolic arrays and chiplet-style distributed hardware, and to distributed-memory CPU clusters. In the simulated reconfigurable setting, we use the compiler for hardware design-space exploration in which link costs and latencies can be specified. In the distributed CPU setting, we show how recurrences and the scheduling language express various matrix-multiplication routines (Cannon, SUMMA, PUMMA, weight-stationary) and solvers (triangular solve and Cholesky). For matrix multiplication and the triangular solve, we generate distributed implementations competitive with ScaLAPACK.
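To make the recurrence-based formulation concrete, here is a minimal illustrative sketch in plain Python (not Cyclotron's actual input language, whose syntax the abstract does not show): matrix multiplication written as the classic recurrence over a 3-D iteration space, the same form a systolic-array compiler would lower, distribute, and pipeline.

```python
# Matrix multiplication as a recurrence over iteration space (i, j, k):
#
#   C[i, j, 0]     = 0
#   C[i, j, k + 1] = C[i, j, k] + A[i, k] * B[k, j]
#
# Each step of the k-recurrence depends only on the previous local value
# and one element each of A and B, which is what lets a compiler map it
# to local/neighbor accesses on a processor grid. Evaluated here
# sequentially in plain Python to show the semantics only.

def matmul_recurrence(A, B):
    n, m = len(A), len(A[0])      # A is n x m
    p = len(B[0])                 # B is m x p
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0                            # C[i, j, 0] = 0
            for k in range(m):
                acc += A[i][k] * B[k][j]       # one recurrence step
            C[i][j] = acc                      # final value C[i, j, m]
    return C
```

Distributed schedules such as Cannon or SUMMA differ only in how the (i, j, k) iteration space is tiled across processors and how A and B tiles are streamed between neighbors; the recurrence itself is unchanged.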
Similar Papers
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
Distributed, Parallel, and Cluster Computing
Makes AI run faster on many computers.
StarDist: A Code Generator for Distributed Graph Algorithms
Distributed, Parallel, and Cluster Computing
Makes big computer graphs work much faster.