.. meta::
  :description: GPU-enabled Message Passing Interface
  :keywords: Message Passing Interface, MPI, AMD, ROCm

***************************************************************************************************
GPU-enabled Message Passing Interface
***************************************************************************************************

The Message Passing Interface (`MPI <https://www.mpi-forum.org>`_) is a standard API for distributed
and parallel application development that can scale to multi-node clusters. To facilitate the porting of
applications to clusters with GPUs, ROCm enables various technologies. You can use these
technologies to add GPU pointers to MPI calls and enable ROCm-aware MPI libraries to deliver optimal
performance for both intra-node and inter-node GPU-to-GPU communication.

The AMD kernel driver exposes remote direct memory access (RDMA) through *PeerDirect* interfaces.
This allows network interface cards (NICs) to directly read and write to RDMA-capable GPU device
memory, resulting in high-speed direct memory access (DMA) transfers between GPU and NIC. These
interfaces are used to optimize inter-node MPI message communication.

The Open MPI project is an open source implementation of the MPI standard. It's developed and maintained by
a consortium of academic, research, and industry partners. To compile Open MPI with ROCm support,
refer to the following sections:

* :ref:`open-mpi-ucx`
* :ref:`open-mpi-libfabric`

.. _open-mpi-ucx:

ROCm-aware Open MPI on InfiniBand and RoCE networks using UCX
================================================================

The `Unified Communication Framework <https://www.openucx.org/documentation>`_ (UCX) is an
open source, cross-platform framework designed to provide a common set of communication
interfaces for various network programming models and interfaces. UCX uses ROCm technologies to
implement various network operation primitives. UCX is the standard communication library for
InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnects. To optimize data
transfer operations, many MPI libraries, including Open MPI, can leverage UCX internally.

Both UCX and Open MPI have a compile option to enable ROCm support. To install UCX and compile Open MPI with ROCm support, use the following instructions.

1. Set environment variables to install all software components in the same base directory. We use the
   home directory in our example, but you can specify a different location if you want.

   .. code-block:: shell

      export INSTALL_DIR=$HOME/ompi_for_gpu
      export BUILD_DIR=/tmp/ompi_for_gpu_build
      mkdir -p $BUILD_DIR

2. Install UCX. To view UCX and ROCm version compatibility, refer to the
   `communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_.

   .. code-block:: shell

      export UCX_DIR=$INSTALL_DIR/ucx
      cd $BUILD_DIR
      git clone https://github.com/openucx/ucx.git -b v1.15.x
      cd ucx
      ./autogen.sh
      mkdir build
      cd build
      ../configure --prefix=$UCX_DIR \
          --with-rocm=/opt/rocm
      make -j $(nproc)
      make -j $(nproc) install

3. Install Open MPI. A quick way to verify both builds is shown after these steps.

   .. code-block:: shell

      export OMPI_DIR=$INSTALL_DIR/ompi
      cd $BUILD_DIR
      git clone --recursive https://github.com/open-mpi/ompi.git \
          -b v5.0.x
      cd ompi
      ./autogen.pl
      mkdir build
      cd build
      ../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
          --with-rocm=/opt/rocm
      make -j $(nproc)
      make install

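As a sanity check, you can confirm that the UCX build detects the ROCm transports and that Open MPI
was linked against this UCX installation. The ``ucx_info`` and ``ompi_info`` tools are installed by the
builds above; the exact component names printed may vary between versions.

.. code-block:: shell

   # List UCX transports; rocm entries (for example rocm_copy, rocm_ipc) indicate ROCm support
   $UCX_DIR/bin/ucx_info -d | grep -i rocm

   # Confirm that Open MPI picked up the UCX PML component
   $OMPI_DIR/bin/ompi_info | grep -i ucx
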
.. _rocm-enabled-osu:

ROCm-enabled OSU benchmarks
---------------------------------------------------------------------------------------------------------------

You can use OSU Micro Benchmarks (OMB) to evaluate the performance of various primitives on
ROCm-supported AMD GPUs. The ``--enable-rocm`` option exposes this functionality.

.. code-block:: shell

   export OSU_DIR=$INSTALL_DIR/osu
   cd $BUILD_DIR
   wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
   tar xfz osu-micro-benchmarks-7.2.tar.gz
   cd osu-micro-benchmarks-7.2
   ./configure --enable-rocm \
       --with-rocm=/opt/rocm \
       CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
       LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
       $(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
   make -j $(nproc)

Intra-node run
----------------------------------------------------------------------------------------------------------------

Before running an Open MPI job, you must set the following environment variables to ensure that
you're using the correct versions of Open MPI and UCX.

.. code-block:: shell

   export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
   export PATH=$OMPI_DIR/bin:$PATH

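Optionally, confirm that the ``mpirun`` found in your ``PATH`` is the one built above; the reported
path and version depend on your ``INSTALL_DIR`` and the Open MPI branch used.

.. code-block:: shell

   # Should point into $OMPI_DIR/bin and report the Open MPI 5.0.x build
   which mpirun
   mpirun --version
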
To run the OSU bandwidth benchmark between the first two GPU devices (``GPU 0`` and ``GPU 1``)
inside the same node, use the following code.

.. code-block:: shell

   $OMPI_DIR/bin/mpirun -np 2 \
     -x UCX_TLS=sm,self,rocm \
     --mca pml ucx \
     ./c/mpi/pt2pt/standard/osu_bw D D

This measures the unidirectional bandwidth from the first device (``GPU 0``) to the second device
(``GPU 1``). To select specific devices, for example ``GPU 2`` and ``GPU 3``, set the following
environment variable before launching the benchmark:

.. code-block:: shell

   export HIP_VISIBLE_DEVICES=2,3

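If you are unsure which device IDs to pick, ``rocm-smi`` (installed with ROCm under
``/opt/rocm/bin``) lists the GPUs in the node; the device indices it reports typically match the
HIP device IDs used by ``HIP_VISIBLE_DEVICES``.

.. code-block:: shell

   # List the GPUs visible to ROCm on this node
   rocm-smi
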
To force using a copy kernel instead of a DMA engine for the data transfer, use the following
command:

.. code-block:: shell

   export HSA_ENABLE_SDMA=0

The following output shows the effective transfer bandwidth measured for inter-die data transfer
between ``GPU 2`` and ``GPU 3`` on a system with MI250 GPUs. For messages larger than 67 MB, an effective
utilization of about 150 GB/sec is achieved:

.. image:: ../data/how-to/gpu-enabled-mpi-1.png
   :width: 400
   :alt: Inter-GPU bandwidth for various payload sizes

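The same launch line can be reused for the other point-to-point benchmarks built above. For
example, the following measures device-to-device latency, assuming the OSU 7.2 directory layout
shown earlier:

.. code-block:: shell

   $OMPI_DIR/bin/mpirun -np 2 \
     -x UCX_TLS=sm,self,rocm \
     --mca pml ucx \
     ./c/mpi/pt2pt/standard/osu_latency D D
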
Collective operations
----------------------------------------------------------------------------------------------------------------

Collective operations on GPU buffers are best handled through the Unified Collective Communication
(UCC) library component in Open MPI. To accomplish this, you must configure and compile the UCC
library with ROCm support.

.. note::

   You can verify UCC and ROCm version compatibility using the
   `communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_.

.. code-block:: shell

   export UCC_DIR=$INSTALL_DIR/ucc
   cd $BUILD_DIR
   git clone https://github.com/openucx/ucc.git -b v1.2.x
   cd ucc
   ./autogen.sh
   ./configure --with-rocm=/opt/rocm \
       --with-ucx=$UCX_DIR \
       --prefix=$UCC_DIR
   make -j && make install

   # Configure and compile Open MPI with UCX, UCC, and ROCm support
   cd $BUILD_DIR/ompi
   ./configure --with-rocm=/opt/rocm \
       --with-ucx=$UCX_DIR \
       --with-ucc=$UCC_DIR \
       --prefix=$OMPI_DIR
   make -j $(nproc)
   make install

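Once the rebuild finishes, you can check that the collective framework now includes the UCC
component; the component names in the output may differ slightly between Open MPI versions.

.. code-block:: shell

   # The UCC-backed collective component should appear in the component list
   $OMPI_DIR/bin/ompi_info | grep -i ucc
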
To use the UCC component with an MPI application, you must set additional parameters:

.. code-block:: shell

   mpirun --mca pml ucx --mca osc ucx \
     --mca coll_ucc_enable 1 \
     --mca coll_ucc_priority 100 -np 64 ./my_mpi_app

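If editing the launch line is inconvenient, for example under a job scheduler, the same MCA
parameters can instead be supplied through Open MPI's ``OMPI_MCA_<parameter>`` environment
variables:

.. code-block:: shell

   # Equivalent to the --mca options shown above
   export OMPI_MCA_pml=ucx
   export OMPI_MCA_osc=ucx
   export OMPI_MCA_coll_ucc_enable=1
   export OMPI_MCA_coll_ucc_priority=100
   mpirun -np 64 ./my_mpi_app
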
.. _open-mpi-libfabric:

ROCm-aware Open MPI using libfabric
================================================================

For network interconnects that are not covered in the previous category, such as HPE Slingshot,
ROCm-aware communication can often be achieved through the libfabric library. For more information,
refer to the `libfabric documentation <https://github.com/ofiwg/libfabric/wiki>`_.

.. note::

   When using Open MPI v5.0.x with libfabric support, shared memory communication between
   processes on the same node goes through the *ob1/sm* component. This component has
   fundamental support for GPU memory that is accomplished by using a staging host buffer.
   Consequently, the performance of device-to-device shared memory communication is lower than
   the theoretical peak performance allowed by the GPU-to-GPU interconnect.

1. Install libfabric. Note that libfabric is often pre-installed. To determine if it's already installed, run:

   .. code-block:: shell

      module avail libfabric

   Alternatively, you can download and compile libfabric with ROCm support. Note that not all
   components required to support some networks (e.g., HPE Slingshot) are available in the open source
   repository. Therefore, using a pre-installed libfabric library is strongly recommended over compiling
   libfabric manually.

   If a pre-compiled libfabric library is available on your system, you can skip the following step.

2. Compile libfabric with ROCm support. A quick check of the resulting installation is shown after these steps.

   .. code-block:: shell

      export OFI_DIR=$INSTALL_DIR/ofi
      cd $BUILD_DIR
      git clone https://github.com/ofiwg/libfabric.git -b v1.19.x
      cd libfabric
      ./autogen.sh
      ./configure --prefix=$OFI_DIR \
          --with-rocr=/opt/rocm
      make -j $(nproc)
      make install

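After the build completes, ``fi_info`` from the new installation can list the providers that were
compiled in. Providers that support GPU memory advertise the ``FI_HMEM`` capability in their
verbose output; the exact provider set depends on the libraries available on your system.

.. code-block:: shell

   # List the libfabric providers that were built
   $OFI_DIR/bin/fi_info -l

   # Show which providers advertise FI_HMEM (GPU memory) support
   $OFI_DIR/bin/fi_info -v | grep -i hmem
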
Installing Open MPI with libfabric support
----------------------------------------------------------------------------------------------------------------

To build Open MPI with libfabric, use the following code:

.. code-block:: shell

   export OMPI_DIR=$INSTALL_DIR/ompi
   cd $BUILD_DIR
   git clone --recursive https://github.com/open-mpi/ompi.git \
       -b v5.0.x
   cd ompi
   ./autogen.pl
   mkdir build
   cd build
   ../configure --prefix=$OMPI_DIR --with-ofi=$OFI_DIR \
       --with-rocm=/opt/rocm
   make -j $(nproc)
   make install

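Before running benchmarks, you can confirm that this Open MPI build picked up the libfabric (OFI)
components; the component names can vary between Open MPI releases.

.. code-block:: shell

   # OFI-backed components (such as the ofi MTL) indicate libfabric support
   $OMPI_DIR/bin/ompi_info | grep -i ofi
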
ROCm-aware OSU with Open MPI and libfabric
----------------------------------------------------------------------------------------------------------------

Compiling a ROCm-aware version of the OSU benchmarks with Open MPI and libfabric uses the same
process described in :ref:`rocm-enabled-osu`.

To run an OSU benchmark using multiple nodes, use the following code:

.. code-block:: shell

   export LD_LIBRARY_PATH=$OMPI_DIR/lib:$OFI_DIR/lib64:/opt/rocm/lib
   $OMPI_DIR/bin/mpirun -np 2 \
     ./c/mpi/pt2pt/standard/osu_bw D D

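The command above relies on the launch environment, such as a scheduler allocation, to place the
two ranks on different nodes. As a standalone sketch, assuming two hosts named ``node01`` and
``node02`` (hypothetical names), an explicit placement could look like this:

.. code-block:: shell

   # Hostfile listing one rank slot per node (hypothetical host names)
   printf "node01 slots=1\nnode02 slots=1\n" > hostfile

   # Launch one rank on each node
   $OMPI_DIR/bin/mpirun --hostfile hostfile -np 2 \
     --map-by ppr:1:node \
     ./c/mpi/pt2pt/standard/osu_bw D D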