Performance-Portable Optimization and Analysis of Multiple Right-Hand Sides in a Lattice QCD Solver
By: Shiting Long, Gustavo Ramirez-Hidalgo, Stepan Nassyr and more
Potential Business Impact:
Speeds up a lattice QCD solver on both x86 and Arm machines by solving for multiple right-hand sides at once.
Managing the high computational cost of iterative solvers for sparse linear systems is a known challenge in scientific computing. Moreover, scientific applications are often constrained by memory bandwidth, which makes it critical to optimize data locality and the efficiency of data transport. We extend the lattice QCD solver DD-$\alpha$AMG to handle multiple right-hand sides (rhs) in both the Wilson-Dirac operator evaluation and the GMRES solver, with and without odd-even preconditioning. To improve auto-vectorization, we introduce a flexible interface that supports various data layouts and implement a new data layout for better SIMD utilization. We evaluate our optimizations on both x86 and Arm clusters and observe similar speedups on each, demonstrating performance portability. A key contribution of this work is the performance analysis of our optimizations, which reveals the complexity introduced by architectural constraints and compiler behavior. Additionally, we explore different implementations that use SME, a new matrix instruction set extension for Arm, and provide an early assessment of its potential benefits.
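The idea of applying one sparse operator to several right-hand sides with an rhs-contiguous layout can be illustrated with a minimal sketch. It is not the DD-$\alpha$AMG implementation: the `csr_matrix` type and the `csr_apply_multi_rhs` function are hypothetical, the operator is a generic real-valued CSR matrix rather than the complex Wilson-Dirac operator, and the layout `X[j * nrhs + k]` is only one plausible choice. Each matrix entry is loaded once and reused for all right-hand sides, and the innermost loop runs over contiguous memory, which is the kind of loop a compiler can typically auto-vectorize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR (compressed sparse row) matrix; not a DD-alphaAMG type. */
typedef struct {
    size_t n_rows;
    const size_t *row_ptr;  /* length n_rows + 1 */
    const size_t *col_idx;  /* length row_ptr[n_rows] */
    const double *val;      /* length row_ptr[n_rows] */
} csr_matrix;

/*
 * Computes Y = A * X for nrhs right-hand sides at once.
 * Assumed layout (rhs index fastest): X[j * nrhs + k] is entry j of
 * right-hand side k, and likewise for Y.
 */
void csr_apply_multi_rhs(const csr_matrix *A, const double *restrict X,
                         double *restrict Y, size_t nrhs)
{
    assert(nrhs > 0);
    for (size_t i = 0; i < A->n_rows; ++i) {
        double *y = Y + i * nrhs;

        /* Clear the output row for every right-hand side. */
        for (size_t k = 0; k < nrhs; ++k)
            y[k] = 0.0;

        for (size_t p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
            const double a = A->val[p];              /* loaded once ...      */
            const double *x = X + A->col_idx[p] * nrhs;

            /* ... and reused across all right-hand sides; unit-stride loop. */
            for (size_t k = 0; k < nrhs; ++k)
                y[k] += a * x[k];
        }
    }
}
```

With a single right-hand side, every matrix entry read from memory feeds one multiply-add; with `nrhs` right-hand sides it feeds `nrhs` of them, which raises arithmetic intensity on bandwidth-bound hardware. Whether the inner loop is actually vectorized depends on the compiler and target, as the abstract's remark on compiler behavior suggests.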
Similar Papers
Solving advection equations with reduction multigrids on GPUs
Numerical Analysis
Makes computer programs run much faster on special chips.
Demystifying ARM SME to Optimize General Matrix Multiplications
Distributed, Parallel, and Cluster Computing
Makes computer math much faster for AI.
LAMMPS-KOKKOS: Performance Portable Molecular Dynamics Across Exascale Architectures
Distributed, Parallel, and Cluster Computing
Makes computer simulations run much faster.