An Adaptive Distributed Stencil Abstraction for GPUs
By: Aditya Bhosale, Laxmikant Kale
The scientific computing ecosystem in Python is largely confined to single-node parallelism, creating a gap between high-level prototyping in NumPy and high-performance execution on modern supercomputers. The increasing prevalence of hardware accelerators and the need for energy efficiency have made resource adaptivity a critical requirement, yet traditional HPC abstractions remain rigid. To address these challenges, we present an adaptive, distributed abstraction for stencil computations on multi-node GPU systems. The abstraction is built on CharmTyles, a framework based on the adaptive Charm++ runtime, and exposes a familiar NumPy-like syntax to minimize the effort of porting prototype code to production. We showcase the resource elasticity of our abstraction by dynamically rescaling a running application to a different number of nodes and present a performance analysis of the associated overheads. Furthermore, we demonstrate that our abstraction achieves significant performance improvements over both a specialized high-performance stencil DSL and a generalized NumPy replacement.
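To make the target programming style concrete, here is a minimal 2D Jacobi stencil written in plain NumPy, the kind of single-node prototype the abstraction aims to accept with little porting effort. This sketch is illustrative only: it shows the NumPy idiom the abstraction mirrors, not the actual CharmTyles API, which is not reproduced in this abstract.

```python
import numpy as np

def jacobi(grid: np.ndarray, iterations: int) -> np.ndarray:
    """Replace each interior cell with the average of its four neighbors.

    NumPy evaluates the right-hand side into a temporary before the slice
    assignment, so every sweep reads the previous iteration's values
    (true Jacobi semantics).
    """
    for _ in range(iterations):
        grid[1:-1, 1:-1] = 0.25 * (
            grid[:-2, 1:-1] + grid[2:, 1:-1] +   # north + south neighbors
            grid[1:-1, :-2] + grid[1:-1, 2:]     # west + east neighbors
        )
    return grid

# Example: 1024x1024 heat-diffusion grid with a hot top boundary.
u = np.zeros((1024, 1024))
u[0, :] = 100.0
u = jacobi(u, iterations=100)
```

This slice-based formulation is what makes the stencil pattern amenable to a distributed backend: each update touches only nearest-neighbor cells, so a runtime can partition the grid across GPUs and satisfy the dependencies with halo exchanges at tile boundaries.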