Remove gpu-cluster-networking and 'Using MPI' page due to migration to Instinct Docs (#4201)

* remove 'Using MPI' and 'gpu-cluster-networking' sections due to migration to dcgpu

* remove gpu-cluster-networking from index page

---------

Co-authored-by: Alex Xu <alex.xu@amd.com>
alexxu-amd
2024-12-30 09:39:46 -05:00
committed by GitHub
parent f76145c2ad
commit 85bd6e98f5
4 changed files with 0 additions and 269 deletions

Binary file not shown (deleted image, 13 KiB).


@@ -1,264 +0,0 @@
.. meta::
:description: GPU-enabled Message Passing Interface
:keywords: Message Passing Interface, MPI, AMD, ROCm
***************************************************************************************************
GPU-enabled Message Passing Interface
***************************************************************************************************
The Message Passing Interface (`MPI <https://www.mpi-forum.org>`_) is a standard API for distributed
and parallel application development that can scale to multi-node clusters. To facilitate the porting of
applications to clusters with GPUs, ROCm enables various technologies. You can use these
technologies to pass GPU pointers to MPI calls and to enable ROCm-aware MPI libraries to deliver
optimal performance for both intra-node and inter-node GPU-to-GPU communication.
The AMD kernel driver exposes remote direct memory access (RDMA) through *PeerDirect* interfaces.
This allows network interface cards (NICs) to directly read and write to RDMA-capable GPU device
memory, resulting in high-speed direct memory access (DMA) transfers between GPU and NIC. These
interfaces are used to optimize inter-node MPI message communication.
The Open MPI project is an open source implementation of the MPI standard. It's developed and
maintained by a consortium of academic, research, and industry partners. To compile Open MPI with
ROCm support, refer to the following sections:
* :ref:`open-mpi-ucx`
* :ref:`open-mpi-libfabric`
.. _open-mpi-ucx:
ROCm-aware Open MPI on InfiniBand and RoCE networks using UCX
================================================================
The `Unified Communication Framework <https://www.openucx.org/documentation>`_ (UCX) is an
open source, cross-platform framework designed to provide a common set of communication
interfaces for various network programming models and interfaces. UCX uses ROCm technologies to
implement various network operation primitives and is the standard communication library for
InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnects. To optimize data
transfer operations, many MPI libraries, including Open MPI, can leverage UCX internally.
Both UCX and Open MPI provide a compile-time option to enable ROCm support. To install UCX and then compile Open MPI with ROCm support, use the following instructions.
1. Set environment variables to install all software components in the same base directory. We use the
home directory in our example, but you can specify a different location if you want.
.. code-block:: shell
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR
2. Install UCX. To view UCX and ROCm version compatibility, refer to the
`communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_.
.. code-block:: shell
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
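To confirm that the resulting UCX build picked up ROCm support, you can optionally query the
installed ``ucx_info`` utility; the exact transport names in the output (for example, ``rocm_cpy``
or ``rocm_ipc``) can vary by UCX version.
.. code-block:: shell
# Show the configuration UCX was built with
$UCX_DIR/bin/ucx_info -v
# List the available transports and look for ROCm entries
$UCX_DIR/bin/ucx_info -d | grep -i rocm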
3. Install Open MPI.
.. code-block:: shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make install
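As an optional sanity check, you can inspect the new installation with ``ompi_info`` to confirm
that the UCX component and ROCm support were detected; the exact wording of the output depends on
the Open MPI version.
.. code-block:: shell
# Confirm the UCX components were built
$OMPI_DIR/bin/ompi_info | grep -i ucx
# Look for ROCm-related build information
$OMPI_DIR/bin/ompi_info | grep -i rocm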
.. _rocm-enabled-osu:
ROCm-enabled OSU benchmarks
---------------------------------------------------------------------------------------------------------------
You can use OSU Micro Benchmarks (OMB) to evaluate the performance of various primitives on
ROCm-supported AMD GPUs. The ``--enable-rocm`` option exposes this functionality.
.. code-block:: shell
export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
tar xfz osu-micro-benchmarks-7.2.tar.gz
cd osu-micro-benchmarks-7.2
./configure --enable-rocm \
--with-rocm=/opt/rocm \
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
$(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)
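To check that the benchmarks were linked against the ROCm-aware Open MPI and the HIP runtime, you
can optionally inspect one of the resulting binaries, for example:
.. code-block:: shell
ldd ./c/mpi/pt2pt/standard/osu_bw | grep -E "mpi|amdhip"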
Intra-node run
----------------------------------------------------------------------------------------------------------------
Before running an Open MPI job, you must set the following environment variables to ensure that
you're using the correct versions of Open MPI and UCX.
.. code-block:: shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
To run the OSU bandwidth benchmark between the first two GPU devices (``GPU 0`` and ``GPU 1``)
inside the same node, use the following code.
.. code-block:: shell
$OMPI_DIR/bin/mpirun -np 2 \
-x UCX_TLS=sm,self,rocm \
--mca pml ucx \
./c/mpi/pt2pt/standard/osu_bw D D
This measures the unidirectional bandwidth from the first device (``GPU 0``) to the second device
(``GPU 1``). To select specific devices, for example ``GPU 2`` and ``GPU 3``, set the following
environment variable before launching the benchmark:
.. code-block:: shell
export HIP_VISIBLE_DEVICES=2,3
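With the variable exported in your shell, you can forward it to both MPI ranks with the ``-x``
option of ``mpirun``. The following is a sketch that combines this setting with the earlier
intra-node example:
.. code-block:: shell
$OMPI_DIR/bin/mpirun -np 2 \
-x HIP_VISIBLE_DEVICES \
-x UCX_TLS=sm,self,rocm \
--mca pml ucx \
./c/mpi/pt2pt/standard/osu_bw D D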
To force using a copy kernel instead of a DMA engine for the data transfer, use the following
command:
.. code-block:: shell
export HSA_ENABLE_SDMA=0
The following figure shows the effective transfer bandwidth measured for inter-die data transfers
between ``GPU 2`` and ``GPU 3`` on a system with MI250 GPUs. For messages larger than 67 MB, an
effective bandwidth of about 150 GB/s is achieved:
.. image:: ../data/how-to/gpu-enabled-mpi-1.png
:width: 400
:alt: Inter-GPU bandwidth for various payload sizes
Collective operations
----------------------------------------------------------------------------------------------------------------
Collective operations on GPU buffers are best handled through the Unified Collective Communication
(UCC) library component in Open MPI. To accomplish this, you must configure and compile the UCC
library with ROCm support.
.. note::
You can verify UCC and ROCm version compatibility using the
`communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_.
.. code-block:: shell
export UCC_DIR=$INSTALL_DIR/ucc
cd $BUILD_DIR
git clone https://github.com/openucx/ucc.git -b v1.2.x
cd ucc
./autogen.sh
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--prefix=$UCC_DIR
make -j && make install
# Configure and compile Open MPI with UCX, UCC, and ROCm support
cd $BUILD_DIR/ompi
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--with-ucc=$UCC_DIR \
--prefix=$OMPI_DIR
make -j $(nproc)
make install
To use the UCC component with an MPI application, you must set additional parameters:
.. code-block:: shell
mpirun --mca pml ucx --mca osc ucx \
--mca coll_ucc_enable 1 \
--mca coll_ucc_priority 100 -np 64 ./my_mpi_app
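To confirm that the ``coll/ucc`` component is present in your Open MPI build and to list its
tunable parameters, you can optionally run:
.. code-block:: shell
$OMPI_DIR/bin/ompi_info --param coll ucc --level 9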
.. _open-mpi-libfabric:
ROCm-aware Open MPI using libfabric
================================================================
For network interconnects that are not covered by the previous sections, such as HPE Slingshot,
ROCm-aware communication can often be achieved through the libfabric library. For more information,
refer to the `libfabric documentation <https://github.com/ofiwg/libfabric/wiki>`_.
.. note::
When using Open MPI v5.0.x with libfabric support, shared memory communication between
processes on the same node goes through the *ob1/sm* component. This component has
basic support for GPU memory, which is accomplished by staging data through a host buffer.
Consequently, the performance of device-to-device shared memory communication is lower than
the theoretical peak performance allowed by the GPU-to-GPU interconnect.
1. Install libfabric. Note that libfabric is often pre-installed. To determine if it's already installed, run:
.. code-block:: shell
module avail libfabric
Alternatively, you can download and compile libfabric with ROCm support. Note that not all
components required to support some networks (e.g., HPE Slingshot) are available in the open source
repository. Therefore, using a pre-installed libfabric library is strongly recommended over compiling
libfabric manually.
If a pre-compiled libfabric library is available on your system, you can skip the following step.
2. Compile libfabric with ROCm support.
.. code-block:: shell
export OFI_DIR=$INSTALL_DIR/ofi
cd $BUILD_DIR
git clone https://github.com/ofiwg/libfabric.git -b v1.19.x
cd libfabric
./autogen.sh
./configure --prefix=$OFI_DIR \
--with-rocr=/opt/rocm
make -j $(nproc)
make install
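After the build completes, you can optionally list the providers exposed by the new installation
with ``fi_info``; whether a provider with GPU (``FI_HMEM``) support appears depends on your network
hardware and libfabric version.
.. code-block:: shell
# List the available libfabric providers
$OFI_DIR/bin/fi_info -l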
Installing Open MPI with libfabric support
----------------------------------------------------------------------------------------------------------------
To build Open MPI with libfabric, use the following code:
.. code-block:: shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ofi=$OFI_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make install
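As with the UCX build, you can optionally verify that the OFI components were compiled into
Open MPI:
.. code-block:: shell
$OMPI_DIR/bin/ompi_info | grep -i ofi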
ROCm-aware OSU with Open MPI and libfabric
----------------------------------------------------------------------------------------------------------------
Compiling a ROCm-aware version of OSU benchmarks with Open MPI and libfabric uses the same
process described in :ref:`rocm-enabled-osu`.
To run an OSU benchmark using multiple nodes, use the following code:
.. code-block:: shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$OFI_DIR/lib64:/opt/rocm/lib
$OMPI_DIR/bin/mpirun --mca pml ob1 --mca btl_ofi_mode 2 -np 2 \
./c/mpi/pt2pt/standard/osu_bw D D
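When launching across nodes, you also need to tell ``mpirun`` where to place the ranks, for
example with the ``--host`` option or a hostfile. In the following sketch, ``node01`` and
``node02`` are placeholders for the hostnames of your own cluster.
.. code-block:: shell
$OMPI_DIR/bin/mpirun --host node01,node02 -np 2 \
--mca pml ob1 --mca btl_ofi_mode 2 \
./c/mpi/pt2pt/standard/osu_bw D D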


@@ -42,7 +42,6 @@ ROCm documentation is organized into the following categories:
* [Fine-tune LLMs and inference optimization](./how-to/llm-fine-tuning-optimization/index.rst)
* [System optimization](./how-to/system-optimization/index.rst)
* [AMD Instinct MI300X performance validation and tuning](./how-to/tuning-guides/mi300x/index.rst)
* [GPU cluster networking](https://dcgpu.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html)
* [System debugging](./how-to/system-debugging.md)
* [Use MPI](./how-to/gpu-enabled-mpi.rst)
* [Use advanced compiler features](./conceptual/compiler-topics.md)


@@ -94,10 +94,6 @@ subtrees:
title: System tuning
- file: how-to/tuning-guides/mi300x/workload.rst
title: Workload tuning
- url: https://dcgpu.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html
title: GPU cluster networking
- file: how-to/gpu-enabled-mpi.rst
title: Use MPI
- file: how-to/system-debugging.md
- file: conceptual/compiler-topics.md
title: Use advanced compiler features