Score: 3

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

Published: September 3, 2025 | arXiv ID: 2509.03015v3

By: David Jin, Alexis Montoison, Sungho Shin

BigTech Affiliations: Massachusetts Institute of Technology

Potential Business Impact:

Speeds up the linear-algebra bottleneck in real-time state estimation and optimal control by solving block-tridiagonal systems on GPUs.

Business Areas:
DSP Hardware

Block-tridiagonal systems are prevalent in state estimation and optimal control, and solving these systems is often the computational bottleneck. Improving the underlying solvers therefore has a direct impact on the real-time performance of estimators and controllers. We present a GPU-based implementation for the factorization and solution of block-tridiagonal symmetric positive definite (SPD) linear systems. Our method employs a recursive Schur-complement reduction, transforming the original system into a hierarchy of smaller, independent systems that can be solved in parallel using batched BLAS/LAPACK routines. Performance benchmarks with our cross-platform (NVIDIA and AMD) implementation, BlockDSS, show substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. At the same time, the current implementation still invokes batched routines sequentially at each recursion level, and high efficiency requires block sizes large enough to amortize kernel launch overhead.
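The recursive Schur-complement reduction the abstract describes can be sketched in plain NumPy. The toy `cr_solve` below is an illustrative reconstruction of the general scheme (block cyclic reduction), not the BlockDSS implementation: at each level the odd-indexed unknowns are eliminated — those solves are mutually independent, which is what batched BLAS/LAPACK kernels execute in parallel — and the method recurses on the smaller Schur-complement system over the even-indexed blocks.

```python
import numpy as np

def cr_solve(D, L, b):
    """Solve a block-tridiagonal SPD system A x = b by cyclic reduction.

    D[i] : (m, m) diagonal blocks, L[i] : (m, m) sub-diagonal block A[i+1, i],
    b[i] : (m,) right-hand-side blocks. Illustrative sketch only.
    """
    n = len(D)
    if n == 1:
        return [np.linalg.solve(D[0], b[0])]
    m = D[0].shape[0]
    # Schur complement on the even-indexed blocks.
    Dn = [D[j].astype(float) for j in range(0, n, 2)]
    bn = [b[j].astype(float) for j in range(0, n, 2)]
    Ln = [np.zeros((m, m)) for _ in range(len(Dn) - 1)]
    # Eliminate each odd-indexed block; these solves are independent of one
    # another, so on a GPU they map onto a single batched factorize/solve call.
    for i in range(1, n, 2):
        W_left = np.linalg.solve(D[i], L[i - 1])    # D_i^{-1} L_{i-1}
        w_b = np.linalg.solve(D[i], b[i])           # D_i^{-1} b_i
        kl = (i - 1) // 2                           # reduced index of even block i-1
        Dn[kl] -= L[i - 1].T @ W_left
        bn[kl] -= L[i - 1].T @ w_b
        if i + 1 < n:
            W_right = np.linalg.solve(D[i], L[i].T)  # D_i^{-1} L_i^T
            kr = (i + 1) // 2                        # reduced index of even block i+1
            Dn[kr] -= L[i] @ W_right
            bn[kr] -= L[i] @ w_b
            Ln[kl] = -L[i] @ W_left                  # couples even blocks i-1 and i+1
    # Recurse on the reduced block-tridiagonal system.
    x_even = cr_solve(Dn, Ln, bn)
    x = [None] * n
    for k, j in enumerate(range(0, n, 2)):
        x[j] = x_even[k]
    # Back-substitute the odd unknowns; again independent, hence batchable.
    for i in range(1, n, 2):
        r = b[i] - L[i - 1] @ x[i - 1]
        if i + 1 < n:
            r = r - L[i].T @ x[i + 1]
        x[i] = np.linalg.solve(D[i], r)
    return x
```

Each recursion level roughly halves the number of blocks, giving a logarithmic number of levels; as the abstract notes, launching the batched routines sequentially per level means small block sizes leave kernel launch overhead unamortized.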

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
7 pages

Category
Computer Science:
Mathematical Software