Score: 2

Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Published: August 13, 2025 | arXiv ID: 2508.10202v1

By: Sreeram Venkat , Kasia Swirydowicz , Noah Wolfe and more

BigTech Affiliations: AMD

Potential Business Impact:

Makes supercomputers run faster on different parts.

The hardware diversity displayed in leadership-class computing facilities, alongside the immense performance boosts exhibited by today's GPUs when computing in lower precision, provide a strong incentive for scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using Hipify for performance portability and apply it to FFTMatvec-an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent observed performance. Performance optimizations for AMD GPUs are integrated directly into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 2,048 GPUs on the OLCF Frontier supercomputer.

Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Distributed, Parallel, and Cluster Computing

Makes supercomputers run faster on different parts.

13 Aug 2025 2

88%

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

Distributed, Parallel, and Cluster Computing

Makes computers solve problems faster and use less power.

20 Aug 2025 0

87%

Towards a Higher Roofline for Matrix-Vector Multiplication in Matrix-Free HOSFEM

Performance

Computers solve math problems faster by recalculating.

9 Apr 2025 1

View PDF Login to Bookmark

Country of Origin

🇺🇸 United States

Repos / Data Links

github.com github.com github.com github.com

Page Count

12 pages

Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Makes supercomputers run faster on different parts.

Technical Abstract

Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

Towards a Higher Roofline for Matrix-Vector Multiplication in Matrix-Free HOSFEM