Do MPI Derived Datatypes Actually Help? A Single-Node Cross-Implementation Study on Shared-Memory Communication
By: Temitayo Adefemi
Potential Business Impact:
Helps programmers pick the faster way for their programs to share data.
MPI's derived datatypes (DDTs) promise easier, copy-free communication of non-contiguous data, yet their practical performance remains debated and is often reported only for a single MPI stack. We present a cross-implementation assessment using three 2D applications: a Jacobi CFD solver, Conway's Game of Life, and a lattice-based image reconstruction. Each application is written in two ways: (i) a BASIC version with manual packing and unpacking of non-contiguous regions and (ii) a DDT version using MPI_Type_vector and MPI_Type_create_subarray, with extents adjusted via MPI_Type_create_resized. For API parity, we benchmark identical communication semantics: non-blocking point-to-point (Irecv/Isend + Waitall), neighborhood collectives (MPI_Neighbor_alltoallw), and MPI-4 persistent operations (*_init). We run strong and weak scaling on 1-4 ranks, validate bitwise-identical halos, and evaluate four widely used MPI implementations (MPICH, Open MPI, Intel MPI, and MVAPICH2), all on a single ARCHER2 node. Results are mixed. DDTs can be fastest, for example for the image reconstruction code on Intel MPI and MPICH, but can also be among the slowest on other stacks, such as Open MPI and MVAPICH2 for the same code. For the CFD solver, BASIC variants generally outperform DDTs across semantics, whereas for Game of Life the ranking flips depending on the MPI library. We also observe stack-specific anomalies, for example MPICH slowdowns with DDT neighborhood and persistent modes. Overall, no strategy dominates across programs, semantics, and MPI stacks; performance portability for DDTs is not guaranteed. We therefore recommend profiling both DDT-based and manual-packing designs under the intended MPI implementation and communication mode. Our study is limited to a single node and does not analyze memory overhead; multi-node and GPU-aware paths are left for future work.
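To make the two designs concrete, the following minimal sketch (not the paper's code) contrasts a DDT column halo built with MPI_Type_vector against manual packing into a contiguous buffer, both using the non-blocking Irecv/Isend + Waitall semantics mentioned above. The row-major (NX+2) x (NY+2) local array with a one-cell halo, the 1D left/right decomposition, and the names NX, NY, and u are illustrative assumptions.

```c
/* Minimal sketch: DDT column halo vs. manual packing (assumed layout). */
#include <mpi.h>
#include <stdlib.h>

#define NX 512   /* local interior rows (assumed)    */
#define NY 512   /* local interior columns (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Row-major local array with a one-cell halo on each side. */
    double *u = calloc((size_t)(NX + 2) * (NY + 2), sizeof(double));

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* --- DDT variant: one column = NX blocks of 1 double, stride NY+2. --- */
    MPI_Datatype column;
    MPI_Type_vector(NX, 1, NY + 2, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Request reqs[4];
    /* Receive into halo columns 0 and NY+1; send interior columns 1 and NY. */
    MPI_Irecv(&u[1 * (NY + 2) + 0],        1, column, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[1 * (NY + 2) + (NY + 1)], 1, column, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1 * (NY + 2) + 1],        1, column, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[1 * (NY + 2) + NY],       1, column, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* --- BASIC variant: pack, exchange contiguously, then unpack.
       Shown for one direction only: send rightward, receive from the left. --- */
    double *sendbuf = malloc(NX * sizeof(double));
    double *recvbuf = calloc(NX, sizeof(double));
    for (int i = 0; i < NX; i++)
        sendbuf[i] = u[(i + 1) * (NY + 2) + NY];   /* pack rightmost interior column */

    MPI_Request reqs2[2];
    MPI_Irecv(recvbuf, NX, MPI_DOUBLE, left,  2, MPI_COMM_WORLD, &reqs2[0]);
    MPI_Isend(sendbuf, NX, MPI_DOUBLE, right, 2, MPI_COMM_WORLD, &reqs2[1]);
    MPI_Waitall(2, reqs2, MPI_STATUSES_IGNORE);

    for (int i = 0; i < NX; i++)
        u[(i + 1) * (NY + 2) + 0] = recvbuf[i];    /* unpack into left halo column */

    MPI_Type_free(&column);
    free(sendbuf); free(recvbuf); free(u);
    MPI_Finalize();
    return 0;
}
```

A persistent version of either variant would create the requests once with MPI_Send_init/MPI_Recv_init and then reuse them every iteration via MPI_Startall and MPI_Waitall, which is the *_init semantics benchmarked in the study.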
Similar Papers
Persistent and Partitioned MPI for Stencil Communication
Distributed, Parallel, and Cluster Computing
Makes computer programs run much faster.
Layout-Agnostic MPI Abstraction for Distributed Computing in Modern C++
Distributed, Parallel, and Cluster Computing
Makes supercomputers easier to program.
Trace-based, time-resolved analysis of MPI application performance using standard metrics
Distributed, Parallel, and Cluster Computing
Finds hidden computer speed problems in programs.