ROCm/docs/how-to/gpu-enabled-mpi.md

# GPU-enabled MPI

The Message Passing Interface ([MPI](https://www.mpi-forum.org)) is a standard
API for distributed and parallel application development that can scale to
multi-node clusters. To facilitate the porting of applications to clusters with
GPUs, ROCm enables various technologies. These technologies allow users to
directly use GPU pointers in MPI calls and enable ROCm-aware MPI libraries to
deliver optimal performance for both intra-node and inter-node GPU-to-GPU
communication.

The AMD kernel driver exposes Remote Direct Memory Access (RDMA) through the
*PeerDirect* interfaces to allow Host Channel Adapters (HCA, a type of
Network Interface Card or NIC) to directly read and write to the GPU device
memory with RDMA capabilities. These interfaces are currently registered as a
*peer_memory_client* with Mellanox’s OpenFabrics Enterprise Distribution (OFED)
`ib_core` kernel module to allow high-speed DMA transfers between GPU and HCA.
These interfaces are used to optimize inter-node MPI message communication.

This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open
MPI project is an open source implementation of the MPI that is developed and maintained by a consortium of academic, research,
and industry partners.

Several MPI implementations can be made ROCm-aware by compiling them with
[Unified Communication Framework](https://www.openucx.org/) (UCX) support. One
notable exception is MVAPICH2: It directly supports AMD GPUs without using UCX,
and you can download it [here](http://mvapich.cse.ohio-state.edu/downloads/).
Use the latest version of the MVAPICH2-GDR package.

The Unified Communication Framework, is an open source cross-platform framework
whose goal is to provide a common set of communication interfaces that targets a
broad set of network programming models and interfaces. UCX is ROCm-aware, and
ROCm technologies are used directly to implement various network operation
primitives. For more details on the UCX design, refer to it's
[documentation](https://www.openucx.org/documentation).

## Building UCX

The following section describes how to set up UCX so it can be used to compile
Open MPI. The following environment variables are set, such that all software
components will be installed in the same base directory (we assume to install
them in your home directory; for other locations adjust the below environment
variables accordingly, and make sure you have write permission for that
location):

```shell
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR
```

```{note}
The following sequences of build commands assume either the ROCmCC or the AOMP
compiler is active in the environment, which will execute the commands.
```

## Install UCX

The next step is to set up UCX by compiling its source code and install it:

```shell
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.14.1
cd ucx
./autogen.sh
mkdir build
cd build
../configure -prefix=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
```

The [communication libraries tables](../reference/libraries/gpu-libraries/communication.md)
documents the compatibility of UCX versions with ROCm versions.

## Install Open MPI

These are the steps to build Open MPI:

```shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
    -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
    --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
```

## ROCm-enabled OSU

The OSU Micro Benchmarks v5.9 (OMB) can be used to evaluate the performance of
various primitives with an AMD GPU device and ROCm support. This functionality
is exposed when configured with `--enable-rocm` option. We can use the following
steps to compile OMB:

```shell
export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xfz osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure --prefix=$INSTALL_DIR/osu --enable-rocm \
    --with-rocm=/opt/rocm \
    CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
    LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
    $(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)
```

## Intra-node run

Before running an Open MPI job, it is essential to set some environment variables to
ensure that the correct version of Open MPI and UCX is being used.

```shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
```

The following command runs the OSU bandwidth benchmark between the first two GPU
devices (i.e., GPU 0 and GPU 1, same OAM) by default inside the same node. It
measures the unidirectional bandwidth from the first device to the other.

```shell
$OMPI_DIR/bin/mpirun -np 2               \
   -x UCX_TLS=sm,self,rocm               \
   --mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
```

To select different devices, for example 2 and 3, use the following command:

```shell
export HIP_VISIBLE_DEVICES=2,3
export HSA_ENABLE_SDMA=0
```

The following output shows the effective transfer bandwidth measured for
inter-die data transfer between GPU device 2 and 3 (same OAM). For messages
larger than 67MB, an effective utilization of about 150GB/sec is achieved, which
corresponds to 75% of the peak transfer bandwidth of 200GB/sec for that
connection:

![OSU execution showing transfer bandwidth increasing alongside payload increase](../data/how-to/gpu-enabled-mpi-1.png "Inter-GPU bandwidth with various payload sizes")

## Collective operations

Collective Operations on GPU buffers are best handled through the
Unified Collective Communication Library (UCC) component in Open MPI.
For this, the UCC library has to be configured and compiled with ROCm
support.

Please note the compatibility tables in the [communication libraries](../reference/libraries/gpu-libraries/communication.md)
for UCC versions with the various ROCm versions.

An example for configuring UCC and Open MPI with ROCm support
is shown below:

```shell
export UCC_DIR=$INSTALL_DIR/ucc
git clone https://github.com/openucx/ucc.git
cd ucc
./configure --with-rocm=/opt/rocm \
            --with-ucx=$UCX_DIR   \
            --prefix=$UCC_DIR
make -j && make install

# Configure and compile Open MPI with UCX, UCC, and ROCm support
cd ompi
./configure --with-rocm=/opt/rocm  \
            --with-ucx=$UCX_DIR    \
            --with-ucc=$UCC_DIR
            --prefix=$OMPI_DIR
```

To use the UCC component with an MPI application requires setting some
additional parameters:

```shell
mpirun --mca pml ucx --mca osc ucx \
       --mca coll_ucc_enable 1     \
       --mca coll_ucc_priority 100 -np 64 ./my_mpi_app
```