Do MPI Derived Datatypes Actually Help? A Single-Node Cross-Implementation Study on Shared-Memory Communication

Published: November 17, 2025 | arXiv ID: 2511.13804v1

By: Temitayo Adefemi

Potential Business Impact:

Helps developers of parallel programs choose the faster way to share data between processes, since no single approach wins everywhere.

Business Areas:
High-Performance Computing

MPI's derived datatypes (DDTs) promise easier, copy-free communication of non-contiguous data, yet their practical performance remains debated and is often reported only for a single MPI stack. We present a cross-implementation assessment using three 2D applications: a Jacobi CFD solver, Conway's Game of Life, and a lattice-based image reconstruction. Each application is written in two ways: (i) a BASIC version with manual packing and unpacking of non-contiguous regions and (ii) a DDT version using MPI_Type_vector and MPI_Type_create_subarray with correct true extent via MPI_Type_create_resized. For API parity, we benchmark identical communication semantics: non-blocking point-to-point (Irecv/Isend + Waitall), neighborhood collectives (MPI_Neighbor_alltoallw), and MPI-4 persistent operations (*_init). We run strong and weak scaling on 1-4 ranks, validate bitwise-identical halos, and evaluate four widely used MPI implementations (MPICH, Open MPI, Intel MPI, and MVAPICH2) on a single ARCHER2 node. Results are mixed. DDTs can be fastest, for example for the image reconstruction code on Intel MPI and MPICH, but can also be among the slowest on other stacks, such as Open MPI and MVAPICH2 for the same code. For the CFD solver, BASIC variants generally outperform DDTs across semantics, whereas for Game of Life the ranking flips depending on the MPI library. We also observe stack-specific anomalies, for example MPICH slowdowns with DDT neighborhood and persistent modes. Overall, no strategy dominates across programs, semantics, and MPI stacks; performance portability for DDTs is not guaranteed. We therefore recommend profiling both DDT-based and manual-packing designs under the intended MPI implementation and communication mode. Our study is limited to a single node and does not analyze memory overhead; multi-node and GPU-aware paths are left for future work.
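
To make the two designs concrete, here is a minimal C sketch (not taken from the paper) of the DDT approach for a left/right column halo exchange, assuming a row-major (nrows+2) x (ncols+2) double array with one-cell halos; the function and variable names are illustrative.

#include <mpi.h>

/* Exchange one interior column with the left and right neighbours using a
 * derived datatype. 'left' and 'right' may be MPI_PROC_NULL at domain edges. */
void exchange_columns(double *u, int nrows, int ncols,
                      int left, int right, MPI_Comm comm)
{
    const int ld = ncols + 2;              /* row length including halo cells */
    MPI_Datatype col_tmp, column_t;

    /* One interior column: nrows blocks of 1 double, stride = row length. */
    MPI_Type_vector(nrows, 1, ld, MPI_DOUBLE, &col_tmp);

    /* Shrink the extent to sizeof(double) so that a count > 1 with this type
     * would address consecutive columns; this is the "correct true extent"
     * step the abstract attributes to MPI_Type_create_resized. */
    MPI_Type_create_resized(col_tmp, 0, (MPI_Aint)sizeof(double), &column_t);
    MPI_Type_commit(&column_t);
    MPI_Type_free(&col_tmp);

    MPI_Request req[4];
    /* Receive into halo columns 0 and ncols+1; send interior columns 1 and ncols. */
    MPI_Irecv(&u[1 * ld + 0],           1, column_t, left,  0, comm, &req[0]);
    MPI_Irecv(&u[1 * ld + (ncols + 1)], 1, column_t, right, 1, comm, &req[1]);
    MPI_Isend(&u[1 * ld + 1],           1, column_t, left,  1, comm, &req[2]);
    MPI_Isend(&u[1 * ld + ncols],       1, column_t, right, 0, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    MPI_Type_free(&column_t);
}

The BASIC variant described in the abstract would instead copy each column into a contiguous scratch buffer before MPI_Isend and unpack it after MPI_Waitall, and the persistent and neighborhood-collective versions would replace the Irecv/Isend pairs with *_init requests or MPI_Neighbor_alltoallw. The study's point is that which design is faster depends on the MPI implementation and communication mode, so both should be profiled.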

Page Count
9 pages

Category
Computer Science:
Distributed, Parallel, and Cluster Computing