Abstract
The advent of multi-/many-core processors in clusters favors hybrid parallel programming,
which combines the Message Passing Interface (MPI) for inter-node parallelism with a shared
memory model for on-node parallelism. Alongside the traditional hybrid approach
of MPI plus OpenMP, a new but promising hybrid approach of MPI plus MPI-3 shared-memory
extensions (MPI+MPI) is gaining traction. We describe an algorithmic approach for
collective operations (with allgather and broadcast as concrete examples) in the context
of hybrid MPI+MPI that minimizes memory consumption and memory copies. With this
approach, only a single copy of the data is maintained on each node and shared by all
on-node processes. This removes the unnecessary on-node copies of replicated data that are
required between MPI processes when the collectives are invoked in the context of pure MPI.
We compare our approach to collectives for hybrid MPI+MPI with the traditional one
for pure MPI, and discuss the synchronization that is required to
guarantee data integrity. The performance of our approach has been validated on a
Cray XC40 system (Cray MPI) and an NEC cluster (Open MPI), showing that it achieves comparable
or better performance for allgather operations. We have further validated our approach
with a standard computational kernel, namely distributed matrix multiplication, and
a Bayesian Probabilistic Matrix Factorization code.