Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems
By: David Jin, Alexis Montoison, Sungho Shin
Potential Business Impact:
Faster solution of block-tridiagonal systems directly improves the real-time performance of state estimators and optimal controllers.
Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of estimators and controllers. We present a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems. Our method employs a recursive Schur-complement reduction, transforming the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines. Performance benchmarks with our cross-platform (NVIDIA and AMD) implementation, BlockDSS, show substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. At the same time, the current implementation still invokes batched routines sequentially at each recursion level, and high efficiency requires block sizes large enough to amortize kernel launch overhead.
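To make the algorithm concrete, below is a minimal NumPy sketch of one plausible form of the recursive Schur-complement (block cyclic) reduction the abstract describes. It is not the BlockDSS API: the function name solve_block_tridiag is hypothetical, and np.linalg.solve is a dense stand-in for the batched Cholesky (potrf) and triangular-solve (trsm) kernels a GPU implementation would use on the SPD blocks.

    import numpy as np


    def solve_block_tridiag(D, E, f):
        """Solve an SPD block-tridiagonal system A x = f by recursive
        Schur-complement reduction (block cyclic reduction).

        D: (n, b, b) diagonal blocks; E: (n-1, b, b) subdiagonal blocks
        with A[i+1, i] = E[i] and A[i, i+1] = E[i].T; f: (n, b) right-hand
        side. Returns x with shape (n, b).
        """
        n, b, _ = D.shape
        if n == 1:  # base case: one dense SPD block
            return np.linalg.solve(D[0], f[0])[None, :]

        odd = np.arange(1, n, 2)   # block unknowns eliminated at this level
        even = np.arange(0, n, 2)  # block unknowns kept in the reduced system

        # All odd blocks are handled by one batched solve. This is the step
        # a GPU implementation would map to batched potrf/trsm kernels; the
        # eliminations within a level are mutually independent.
        rhs = np.zeros((len(odd), b, 2 * b + 1))
        rhs[:, :, :b] = E[odd - 1]          # coupling to x_{i-1}
        inner = odd < n - 1                 # the last odd row may lack E[i]
        rhs[inner, :, b:2 * b] = E[odd[inner]].transpose(0, 2, 1)
        rhs[:, :, 2 * b] = f[odd]
        sol = np.linalg.solve(D[odd], rhs)
        W_lo = sol[:, :, :b]                # D_i^{-1} E_{i-1}
        W_hi = sol[:, :, b:2 * b]           # D_i^{-1} E_i^T (zero at boundary)
        g = sol[:, :, 2 * b]                # D_i^{-1} f_i

        # Assemble the half-size reduced system over the even unknowns.
        Dr, fr = D[even].copy(), f[even].copy()
        Er = np.zeros((len(even) - 1, b, b))
        for k, i in enumerate(odd):
            Dr[k] -= E[i - 1].T @ W_lo[k]      # Schur update on row i-1
            fr[k] -= E[i - 1].T @ g[k]
            if i < n - 1:
                Dr[k + 1] -= E[i] @ W_hi[k]    # Schur update on row i+1
                fr[k + 1] -= E[i] @ g[k]
                Er[k] = -E[i] @ W_lo[k]        # new coupling of rows i-1, i+1

        x = np.zeros((n, b))
        x[even] = solve_block_tridiag(Dr, Er, fr)  # recurse on reduced system

        # Back-substitute the odd unknowns; also batchable within a level.
        for k, i in enumerate(odd):
            x[i] = g[k] - W_lo[k] @ x[i - 1]
            if i < n - 1:
                x[i] -= W_hi[k] @ x[i + 1]
        return x


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, b = 17, 4
        E = 0.2 * rng.standard_normal((n - 1, b, b))
        S = 0.2 * rng.standard_normal((n, b, b))
        D = (S + S.transpose(0, 2, 1)) / 2 + 4.0 * b * np.eye(b)  # dominance => SPD
        f = rng.standard_normal((n, b))
        x = solve_block_tridiag(D, E, f)

        A = np.zeros((n * b, n * b))  # dense assembly, only to check the residual
        for i in range(n):
            A[i * b:(i + 1) * b, i * b:(i + 1) * b] = D[i]
        for i in range(n - 1):
            A[(i + 1) * b:(i + 2) * b, i * b:(i + 1) * b] = E[i]
            A[i * b:(i + 1) * b, (i + 1) * b:(i + 2) * b] = E[i].T
        print("residual:", np.linalg.norm(A @ x.ravel() - f.ravel()))

In this sketch, all eliminations within a level are independent and can be fused into a single batched call, while the levels themselves depend on one another: roughly log2(n) batched calls must be issued in sequence, and each launch only pays off for sufficiently large blocks, which matches the limitation stated at the end of the abstract.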
Similar Papers
A GPU-resident Memory-Aware Algorithm for Accelerating Bidiagonalization of Banded Matrices
Distributed, Parallel, and Cluster Computing
Speeds up the bidiagonalization of banded matrices on GPUs.