Score: 3

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

Published: September 3, 2025 | arXiv ID: 2509.03015v3

By: David Jin, Alexis Montoison, Sungho Shin

BigTech Affiliations: Massachusetts Institute of Technology

Potential Business Impact:

Speeds up the linear-algebra bottleneck in real-time state estimation and optimal control by solving block-tridiagonal systems on GPUs.

Business Areas:
DSP Hardware

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of estimators and controllers. We present a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems. Our method employs a recursive Schur-complement reduction, transforming the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines. Performance benchmarks with our cross-platform (NVIDIA and AMD) implementation, BlockDSS, show substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. At the same time, the current implementation still invokes batched routines sequentially at each recursion level, and high efficiency requires block sizes large enough to amortize kernel launch overhead.
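The recursive Schur-complement reduction the abstract describes can be sketched in plain NumPy. The toy `cr_solve` below is an illustrative reconstruction of the general scheme (block cyclic reduction), not the BlockDSS implementation: at each level the odd-indexed unknowns are eliminated — those solves are mutually independent, which is what batched BLAS/LAPACK kernels execute in parallel — and the method recurses on the smaller Schur-complement system over the even-indexed blocks.

```python
import numpy as np

def cr_solve(D, L, b):
    """Solve a block-tridiagonal SPD system A x = b by cyclic reduction.

    D[i] : (m, m) diagonal blocks, L[i] : (m, m) sub-diagonal block A[i+1, i],
    b[i] : (m,) right-hand-side blocks. Illustrative sketch only.
    """
    n = len(D)
    if n == 1:
        return [np.linalg.solve(D[0], b[0])]
    m = D[0].shape[0]
    # Schur complement on the even-indexed blocks.
    Dn = [D[j].astype(float) for j in range(0, n, 2)]
    bn = [b[j].astype(float) for j in range(0, n, 2)]
    Ln = [np.zeros((m, m)) for _ in range(len(Dn) - 1)]
    # Eliminate each odd-indexed block; these solves are independent of one
    # another, so on a GPU they map onto a single batched factorize/solve call.
    for i in range(1, n, 2):
        W_left = np.linalg.solve(D[i], L[i - 1])    # D_i^{-1} L_{i-1}
        w_b = np.linalg.solve(D[i], b[i])           # D_i^{-1} b_i
        kl = (i - 1) // 2                           # reduced index of even block i-1
        Dn[kl] -= L[i - 1].T @ W_left
        bn[kl] -= L[i - 1].T @ w_b
        if i + 1 < n:
            W_right = np.linalg.solve(D[i], L[i].T)  # D_i^{-1} L_i^T
            kr = (i + 1) // 2                        # reduced index of even block i+1
            Dn[kr] -= L[i] @ W_right
            bn[kr] -= L[i] @ w_b
            Ln[kl] = -L[i] @ W_left                  # couples even blocks i-1 and i+1
    # Recurse on the reduced block-tridiagonal system.
    x_even = cr_solve(Dn, Ln, bn)
    x = [None] * n
    for k, j in enumerate(range(0, n, 2)):
        x[j] = x_even[k]
    # Back-substitute the odd unknowns; again independent, hence batchable.
    for i in range(1, n, 2):
        r = b[i] - L[i - 1] @ x[i - 1]
        if i + 1 < n:
            r = r - L[i].T @ x[i + 1]
        x[i] = np.linalg.solve(D[i], r)
    return x
```

Each recursion level roughly halves the number of blocks, giving a logarithmic number of levels; as the abstract notes, launching the batched routines sequentially per level means small block sizes leave kernel launch overhead unamortized.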

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
7 pages

Category
Computer Science:
Mathematical Software