An HPX Communication Benchmark: Distributed FFT using Collectives
By: Alexander Strack, Dirk Pflüger
Potential Business Impact:
Runs distributed FFTs up to 3x faster than an MPI-based reference.
Due to increasing core counts in modern processors, several task-based runtimes have emerged, among them HPX, the C++ Standard Library for Parallelism and Concurrency. Although the asynchronous many-task runtime HPX allows for implicit communication via an Active Global Address Space, it also supports explicit collective operations. Collectives are an efficient way to realize complex communication patterns. In this work, we benchmark the TCP, MPI, and LCI communication backends of HPX, called parcelports in HPX terminology. We use a distributed multi-dimensional FFT application that relies on collectives. Furthermore, we compare the performance of the HPX all-to-all and scatter collectives against an FFTW3 reference based on MPI+X on a 16-node cluster. Of the three parcelports, LCI performed best for both the scatter and all-to-all collectives. Moreover, the LCI parcelport was up to a factor of 3 faster than the MPI+X reference. Our results highlight the potential of message abstractions and the parcelports of HPX.
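To illustrate the kind of collective the abstract refers to, the following is a minimal sketch (not the paper's benchmark code) of an all-to-all exchange using HPX's documented collectives API. In a distributed multi-dimensional FFT, such an exchange realizes the data redistribution (transpose) between dimensions. The basename "fft_transpose", the chunk size, and the chunk contents are illustrative assumptions.

```cpp
// Minimal sketch of an HPX all-to-all exchange, mirroring the data
// redistribution (transpose) step of a distributed multi-dimensional FFT.
// The basename "fft_transpose" and the placeholder payload are assumptions.
#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>
#include <hpx/collectives/all_to_all.hpp>

#include <cstddef>
#include <vector>

int main()
{
    std::size_t const num_localities =
        hpx::get_num_localities(hpx::launch::sync);
    std::size_t const this_locality = hpx::get_locality_id();

    // One chunk of placeholder FFT data destined for each locality.
    std::vector<std::vector<double>> send(num_localities,
        std::vector<double>(1024, static_cast<double>(this_locality)));

    // Every locality calls all_to_all with the same basename; the future
    // resolves to one chunk received from each participating locality.
    hpx::future<std::vector<std::vector<double>>> exchanged =
        hpx::collectives::all_to_all("fft_transpose", std::move(send),
            hpx::collectives::num_sites_arg(num_localities),
            hpx::collectives::this_site_arg(this_locality));

    std::vector<std::vector<double>> received = exchanged.get();
    return 0;
}
```

Note that the choice of parcelport (TCP, MPI, or LCI) is a build- and runtime-configuration matter in HPX, so a sketch like this stays unchanged across the backends being benchmarked.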
Similar Papers
Understanding the Communication Needs of Asynchronous Many-Task Systems -- A Case Study of HPX+LCI
Distributed, Parallel, and Cluster Computing
Makes supercomputers run science faster.
The Big Send-off: High Performance Collectives on GPU-based Supercomputers
Distributed, Parallel, and Cluster Computing
Makes AI learn much faster on supercomputers.
Parallel FFTW on RISC-V: A Comparative Study including OpenMP, MPI, and HPX
Distributed, Parallel, and Cluster Computing
Makes many-core chips work faster for science.