
Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision

Published: August 8, 2025 | arXiv ID: 2508.06339v1

By: Evelyne Ringoot, Rabab Alomairy, Valentin Churavy and more

BigTech Affiliations: Massachusetts Institute of Technology

Potential Business Impact:

Makes computers learn faster with better math.

This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has grown further in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, which reduces the matrix first to band form and then to bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified, type- and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024x1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices.

Efficient GPU-Centered Singular Value Decomposition Using the Divide-and-Conquer Method

Distributed, Parallel, and Cluster Computing

Makes computers find patterns in data much faster.

15 Aug 2025 0

88%

A GPU-resident Memory-Aware Algorithm for Accelerating Bidiagonalization of Banded Matrices

Distributed, Parallel, and Cluster Computing

Makes computers solve math problems much faster.

14 Oct 2025 1

87%

Design of A Low-Latency and Parallelizable SVD Dataflow Architecture on FPGA

Distributed, Parallel, and Cluster Computing

Speeds up computer analysis of big data streams.

16 Nov 2025 0


Country of Origin

🇩🇪 🇺🇸 United States, Germany

Page Count

12 pages
