BLEST: Blazingly Efficient BFS using Tensor Cores

Published: December 26, 2025 | arXiv ID: 2512.21967v1

By: Deniz Elbek, Kamer Kaya

Breadth-First Search (BFS) is a fundamental graph kernel that underpins a wide range of applications. While modern GPUs provide specialised Matrix-Multiply-Accumulate (MMA) units, e.g., Tensor Cores (TC), with extremely high throughput, they target dense operations, making it non-trivial to exploit them for irregular, unstructured graph computations. In particular, fully utilising them for a BFS requires an efficient mapping of the edge operations onto TCs while avoiding redundancy, load imbalance, and synchronisation. We present BLEST, a TC-accelerated framework that reformulates the pull-based BFS pipeline around a bitmap-oriented structure and a carefully engineered execution layout. BLEST introduces Binarised Virtual Slice Sets (BVSS) to enforce warp-level load balancing and to eliminate frontier-oblivious work assignment. To improve both memory efficiency and update locality across diverse graphs, we apply two complementary graph reordering strategies: a compression-oriented ordering for social-like graphs and a bandwidth-reducing ordering for non-social graphs. At the compute level, we develop a batched SpMSpV multiplication pattern that uses the bitwise TC tiles to handle dot products without wasting output entries, thereby reducing the number of required MMA calls. Finally, BLEST combines kernel fusion with a lazy vertex update scheme to reduce host-side synchronisation, mitigate atomic overheads, and improve cache locality. Experiments show that BLEST delivers, on average, $3.58\times$, $4.64\times$ and $4.9\times$ speedup over BerryBees, Gunrock, and GSWITCH, respectively, across a broad set of real-world graphs.
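To make the pull-based, bitmap-oriented formulation more concrete, the following is a minimal CPU-side sketch in Python. It is not BLEST's implementation: it uses no Tensor Cores, BVSS, reordering, or kernel fusion. The function name `pull_bfs_bitmap` and the `in_neighbors` input layout are assumptions made for illustration only. It shows just the underlying idea that, in a pull step, each unvisited vertex tests its in-neighbour bitmap against the frontier bitmap, which amounts to a binary dot product (a bitwise AND followed by a non-zero check).

```python
def pull_bfs_bitmap(in_neighbors, source):
    """Level-synchronous pull-based BFS over integer bitmaps.

    in_neighbors[v] lists the vertices u that have an edge u -> v
    (a hypothetical input layout chosen for this sketch).
    Returns the BFS level of every vertex, or -1 if unreachable.
    """
    n = len(in_neighbors)

    # Pack each vertex's in-neighbour set into one Python int used as a bitmap:
    # bit u of rows[v] is set when the edge u -> v exists.
    rows = [0] * n
    for v, nbrs in enumerate(in_neighbors):
        for u in nbrs:
            rows[v] |= 1 << u

    level = [-1] * n
    level[source] = 0
    frontier = 1 << source  # bit v is set when v is in the current frontier
    depth = 0

    while frontier:
        depth += 1
        next_frontier = 0
        for v in range(n):
            if level[v] != -1:
                continue  # already visited, nothing to pull
            # Binary dot product: a non-zero AND means v has a parent
            # in the current frontier.
            if rows[v] & frontier:
                level[v] = depth
                next_frontier |= 1 << v
        frontier = next_frontier

    return level


# Small example: edges 0->1, 0->2, 1->3; vertex 4 is isolated.
levels = pull_bfs_bitmap([[], [0], [0], [1], []], source=0)
```

On a GPU, the per-vertex AND-and-test would be batched into bitwise matrix tiles rather than evaluated one vertex at a time; the abstract describes that mapping only at a high level, so it is not reproduced here.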

Performance-Driven Optimization of Parallel Breadth-First Search

Distributed, Parallel, and Cluster Computing

Makes computer searches on connected data faster.

1 Mar 2025 0

86%

Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores

Distributed, Parallel, and Cluster Computing

Speeds up finding similar things in big data.

28 Aug 2025 2

86%

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness

Hardware Architecture

Makes AI answer questions much faster and use less power.

12 Sep 2025 1
