Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
By: Vicki Carrica, Maxwell Onyango, Rabab Alomairy, et al.
Potential Business Impact:
Speeds up the triangular matrix operations behind many scientific-computing and machine-learning workloads, and does so on NVIDIA, AMD, and Apple GPUs alike.
This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.
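To illustrate the core idea, below is a minimal sketch (not the paper's code) of a recursive left-sided lower-triangular solve in Julia: the matrix is split in half, the off-diagonal update becomes a single GEMM call via mul!, and only small diagonal blocks hit a direct triangular solve. The function name rec_trsm! and the block-size threshold are illustrative choices; the authors' implementation targets GPU arrays through GPUArrays and KernelAbstractions, but the same recursion is written here against any AbstractMatrix.

using LinearAlgebra

# Sketch: solve L * X = B in place for lower-triangular L, funneling
# most of the flops into 5-argument mul! (a GEMM). Illustrative only;
# the block threshold of 256 is an assumed tuning parameter.
function rec_trsm!(L::AbstractMatrix, B::AbstractMatrix; block::Int = 256)
    n = size(L, 1)
    if n <= block
        ldiv!(LowerTriangular(L), B)   # base case: direct triangular solve
        return B
    end
    k = n ÷ 2
    L11 = view(L, 1:k,   1:k)
    L21 = view(L, k+1:n, 1:k)
    L22 = view(L, k+1:n, k+1:n)
    B1  = view(B, 1:k,   :)
    B2  = view(B, k+1:n, :)
    rec_trsm!(L11, B1; block)      # X1 = L11 \ B1 (recurse)
    mul!(B2, L21, B1, -1, 1)       # B2 -= L21 * X1  (the GEMM step)
    rec_trsm!(L22, B2; block)      # X2 = L22 \ B2 (recurse)
    return B
end

In principle, when L and B are GPU array types, Julia's multiple dispatch routes each mul! call to the vendor GEMM, which is the effect the paper exploits to approach cuBLAS- and rocBLAS-level throughput.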
Similar Papers
Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs
Distributed, Parallel, and Cluster Computing
Makes computers solve symmetric systems of equations much faster.
Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon
Performance
Makes Apple computers run quantized machine-learning math much faster.
Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
Distributed, Parallel, and Cluster Computing
Makes computers compute singular values quickly on many kinds of GPUs.