Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks

Published: November 4, 2025 | arXiv ID: 2511.02655v1

By: Johansell Villalobos, Josef Ruzicka, Silvio Rizzi

Potential Business Impact:

Helps scientific simulations run efficiently on different types of supercomputer hardware.

Business Areas:

Quantum Computing Science and Engineering

Scientific computing in the exascale era demands increased computational power to solve complex problems across various domains. With the rise of heterogeneous computing architectures, the need for vendor-agnostic performance portability frameworks has been highlighted. Libraries like Kokkos have become essential for enabling high-performance computing applications to execute efficiently across different hardware platforms with minimal code changes. In this direction, this paper presents preliminary time-to-solution results for two representative scientific computing applications: an N-body simulation and a structured grid simulation. Both applications used a distributed memory approach and hardware acceleration through four performance portability frameworks: Kokkos, OpenMP, RAJA, and OCCA. Experiments conducted on a single node of the Polaris supercomputer using four NVIDIA A100 GPUs revealed significant performance variability among frameworks. OCCA demonstrated faster execution times for small-scale validation problems, likely due to JIT compilation; however, its lack of optimized reduction algorithms may limit scalability for larger simulations when using its out-of-the-box API. OpenMP performed poorly in the structured grid simulation, most likely due to inefficiencies in inter-node data synchronization and communication. These findings highlight the need for further optimization to maximize each framework's capabilities. Future work will focus on enhancing reduction algorithms, data communication, and memory management, as well as performing scalability studies and a comprehensive statistical analysis to evaluate and compare framework performance.
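The abstract points to reduction algorithms as a factor that may limit scalability. As a rough illustration of what a reduction is in an N-body setting, the sketch below sums per-body kinetic energies with a pairwise (tree) reduction, the combining pattern that parallel reductions commonly follow. It is plain Python written for this summary, not code from the paper and not the API of Kokkos, OpenMP, RAJA, or OCCA; the function names `tree_reduce_sum` and `kinetic_energy` are hypothetical.

```python
from typing import List, Sequence, Tuple


def tree_reduce_sum(values: Sequence[float]) -> float:
    """Sum values by pairwise (tree) combination.

    Illustrative only: on a GPU, the inner loop iterations at each
    stride are independent and could run as parallel threads.
    """
    if not values:
        return 0.0
    buf: List[float] = list(values)
    n = len(buf)
    stride = 1
    while stride < n:
        # Combine element i with its partner at distance `stride`.
        for i in range(0, n - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
    return buf[0]


def kinetic_energy(
    masses: Sequence[float],
    velocities: Sequence[Tuple[float, float, float]],
) -> float:
    """Total kinetic energy: sum over bodies of 0.5 * m * |v|^2."""
    per_body = [
        0.5 * m * (vx * vx + vy * vy + vz * vz)
        for m, (vx, vy, vz) in zip(masses, velocities)
    ]
    return tree_reduce_sum(per_body)
```

A sequential loop needs one step per element, while the tree pattern needs about log2(N) rounds when each round runs in parallel, which is why the quality of a framework's reduction can matter more as the problem size grows.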

LAMMPS-KOKKOS: Performance Portable Molecular Dynamics Across Exascale Architectures

Distributed, Parallel, and Cluster Computing

Makes computer simulations of tiny things run much faster.

19 Aug 2025 2

90%

LAMMPS-KOKKOS: Performance Portable Molecular Dynamics Across Exascale Architectures

Distributed, Parallel, and Cluster Computing

Makes computer simulations run much faster.

19 Aug 2025 1

88%

ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Distributed, Parallel, and Cluster Computing

Makes AI run much faster on many computers.

17 Nov 2025 1

Page Count

6 pages
