Left nav updates (#2647)

* update gpu-enabled-mpi

update the documentation to also include libfabric based network interconnects,
not just UCX.

* add some technical terms to wordlist

* shorten left nav

* grid updates

---------

Co-authored-by: Edgar Gabriel <Edgar.Gabriel@amd.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Lisa
2023-11-24 07:15:10 -07:00
committed by GitHub
parent 0d6fc80070
commit 4adaff02a6
4 changed files with 118 additions and 88 deletions

View File

@@ -200,6 +200,7 @@ hipSPARSELt
hipTensor
HPC
HPCG
HPE
HPL
HSA
hsa
@@ -245,6 +246,7 @@ KVM
LAPACK
LCLK
LDS
libfabric
libjpeg
libs
linearized
@@ -383,6 +385,7 @@ Rickle
roadmap
roc
ROC
RoCE
rocAL
rocALUTION
rocalution
@@ -451,6 +454,7 @@ SKUs
skylake
sL
SLES
sm
SMEM
SMI
smi

View File

@@ -53,14 +53,14 @@ The following sequences of build commands assume either the ROCmCC or the AOMP
compiler is active in the environment in which the commands are executed.
```
## Install UCX
### Installing UCX
The next step is to set up UCX by compiling its source code and installing it:
```shell
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.14.1
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
@@ -74,7 +74,7 @@ make -j $(nproc) install
The [communication libraries tables](../reference/library-index.md)
document the compatibility of UCX versions with ROCm versions.
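Once UCX is installed, one optional way to confirm that the build picked up ROCm
support is to list the transports it detected; a ROCm-enabled build reports ROCm
transports in the output (a sanity check only, not required for the build):
```shell
# List the transports UCX detected; a ROCm-enabled build includes ROCm entries
$UCX_DIR/bin/ucx_info -d | grep -i rocm
```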
## Install Open MPI
### Installing Open MPI
These are the steps to build Open MPI:
@@ -90,12 +90,12 @@ cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
make install
```
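To verify that the resulting Open MPI installation was built with UCX support, you
can optionally inspect the compiled components with `ompi_info`:
```shell
# A UCX-enabled build lists UCX components (for example, the UCX PML)
$OMPI_DIR/bin/ompi_info | grep -i ucx
```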
## ROCm-enabled OSU
### ROCm-enabled OSU
The OSU Micro Benchmarks v5.9 (OMB) can be used to evaluate the performance of
The OSU Micro Benchmarks (OMB) can be used to evaluate the performance of
various primitives with an AMD GPU device and ROCm support. This functionality
is exposed when configured with the `--enable-rocm` option. We can use the following
steps to compile OMB:
@@ -103,10 +103,10 @@ steps to compile OMB:
```shell
export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xfz osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure --prefix=$INSTALL_DIR/osu --enable-rocm \
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
tar xfz osu-micro-benchmarks-7.2.tar.gz
cd osu-micro-benchmarks-7.2
./configure --enable-rocm \
--with-rocm=/opt/rocm \
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
@@ -114,7 +114,7 @@ cd osu-micro-benchmarks-5.9
make -j $(nproc)
```
## Intra-node run
### Intra-node run
Before running an Open MPI job, it is essential to set some environment variables to
ensure that the correct versions of Open MPI and UCX are being used.
@@ -125,31 +125,35 @@ export PATH=$OMPI_DIR/bin:$PATH
```
The following command runs the OSU bandwidth benchmark between the first two GPU
devices (i.e., GPU 0 and GPU 1, same OAM) by default inside the same node. It
devices (i.e., GPU 0 and GPU 1) by default inside the same node. It
measures the unidirectional bandwidth from the first device to the other.
```shell
$OMPI_DIR/bin/mpirun -np 2 \
-x UCX_TLS=sm,self,rocm \
--mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
--mca pml ucx \
./c/mpi/pt2pt/standard/osu_bw D D
```
To select different devices, for example 2 and 3, use the following command:
```shell
export HIP_VISIBLE_DEVICES=2,3
```
To force using a copy kernel instead of a DMA engine for the data transfer, use the following command:
```shell
export HSA_ENABLE_SDMA=0
```
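Putting the pieces together, a run that pins the benchmark to devices 2 and 3 and
disables SDMA could look like the following sketch (the environment variables are
forwarded to the ranks with `-x`):
```shell
# Example only: run the bandwidth benchmark on GPUs 2 and 3 with SDMA disabled
$OMPI_DIR/bin/mpirun -np 2 \
    -x UCX_TLS=sm,self,rocm \
    -x HIP_VISIBLE_DEVICES=2,3 \
    -x HSA_ENABLE_SDMA=0 \
    --mca pml ucx \
    ./c/mpi/pt2pt/standard/osu_bw D D
```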
The following output shows the effective transfer bandwidth measured for
inter-die data transfer between GPU device 2 and 3 (same OAM). For messages
larger than 67MB, an effective utilization of about 150GB/sec is achieved, which
corresponds to 75% of the peak transfer bandwidth of 200GB/sec for that
connection:
inter-die data transfer between GPU devices 2 and 3 on a system with MI250 GPUs. For messages
larger than 67MB, an effective utilization of about 150GB/sec is achieved:
![OSU execution showing transfer bandwidth increasing alongside payload increase](../data/how-to/gpu-enabled-mpi-1.png "Inter-GPU bandwidth with various payload sizes")
## Collective operations
### Collective operations
Collective operations on GPU buffers are best handled through the
Unified Collective Communication Library (UCC) component in Open MPI.
@@ -164,8 +168,9 @@ is shown below:
```shell
export UCC_DIR=$INSTALL_DIR/ucc
git clone https://github.com/openucx/ucc.git
git clone https://github.com/openucx/ucc.git -b v1.2.x
cd ucc
./autogen.sh
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--prefix=$UCC_DIR
@@ -187,3 +192,92 @@ mpirun --mca pml ucx --mca osc ucx \
--mca coll_ucc_enable 1 \
--mca coll_ucc_priority 100 -np 64 ./my_mpi_app
```
## ROCm-aware Open MPI using libfabric
For network interconnects that are not covered by the UCX-based approach described
above, such as HPE Slingshot, ROCm-aware communication can often be
achieved through the libfabric library. For more details on
libfabric, please refer to its
[documentation](https://github.com/ofiwg/libfabric/wiki).
### Installing libfabric
In many instances, libfabric is already pre-installed on the system. You can check
its availability with, for example:
```shell
module avail libfabric
```
Alternatively, you can also download and compile libfabric with ROCm support
yourself. Note, however, that not all components required to support networks such
as HPE Slingshot are available in the open-source repository. Using a pre-installed
libfabric library is therefore strongly preferred over compiling it yourself.
If a pre-compiled libfabric library is available on your system,
please skip the subsequent steps and go to [Installing Open MPI
with libfabric support](#installing-open-mpi-with-libfabric-support).
Compiling libfabric with ROCm support can be achieved with the following
steps:
```shell
export OFI_DIR=$INSTALL_DIR/ofi
cd $BUILD_DIR
git clone https://github.com/ofiwg/libfabric.git -b v1.19.x
cd libfabric
./autogen.sh
./configure --prefix=$OFI_DIR \
--with-rocr=/opt/rocm
make -j $(nproc)
make install
```
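As an optional check after the build, `fi_info` can list the providers that were
compiled in; providers that can handle device memory advertise the `FI_HMEM`
capability:
```shell
# List the libfabric providers that report FI_HMEM (device memory) support
$OFI_DIR/bin/fi_info -c FI_HMEM
```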
### Installing Open MPI with libfabric support
These are the steps to build Open MPI with libfabric:
```shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ofi=$OFI_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make install
```
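As with the UCX-based build, `ompi_info` can optionally be used to confirm that the
OFI (libfabric) components were built:
```shell
# An OFI-enabled build lists OFI components (for example, the OFI MTL)
$OMPI_DIR/bin/ompi_info | grep -i ofi
```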
### ROCm-aware OSU with Open MPI and libfabric
Compiling a ROCm-aware version of the OSU benchmarks with Open MPI and
libfabric is identical to the steps laid out in the section [ROCm-enabled
OSU](#rocm-enabled-osu).
Running an OSU benchmark using multiple nodes requires the following
steps:
```shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$OFI_DIR/lib64:/opt/rocm/lib
$OMPI_DIR/bin/mpirun -np 2 \
./c/mpi/pt2pt/standard/osu_bw D D
```
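The command above relies on the launch environment (for example, a batch scheduler)
to place the two ranks on different nodes. Outside of such an environment, you would
typically add an explicit host list; the hostnames below are placeholders:
```shell
# Hypothetical hostnames; replace them with the nodes in your allocation
$OMPI_DIR/bin/mpirun -np 2 --host node01,node02 \
    -x LD_LIBRARY_PATH \
    ./c/mpi/pt2pt/standard/osu_bw D D
```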
### Notes
When using Open MPI v5.0.x with libfabric support, shared-memory
communication between processes on the same node goes through the
*ob1/sm* component. While this component has basic support for GPU
memory, for ROCm devices it stages the data through a host buffer.
Consequently, the performance of device-to-device shared-memory
communication will be lower than the theoretical peak performance of
the GPU-to-GPU interconnect would allow.
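To see which PML and BTL components are selected at run time, for example to confirm
that on-node traffic goes through *ob1/sm*, you can raise the MCA verbosity levels
(a diagnostic sketch, not needed for normal runs):
```shell
# Print component selection details for the PML and BTL frameworks
$OMPI_DIR/bin/mpirun -np 2 \
    --mca pml_base_verbose 10 \
    --mca btl_base_verbose 10 \
    ./c/mpi/pt2pt/standard/osu_bw D D
```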

View File

@@ -1,6 +1,6 @@
# ROCm API libraries & tools
::::{grid} 1 2 2 2
::::{grid} 1 3 3 3
:class-container: rocm-doc-grid
:::{grid-item-card}

View File

@@ -81,74 +81,6 @@ subtrees:
entries:
- file: reference/library-index.md
title: API libraries & tools
subtrees:
- entries:
- url: ${project:composable_kernel}
title: Composable kernel
- url: ${project:hipblas}
title: hipBLAS
- url: ${project:hipblaslt}
title: hipBLASLt
- url: ${project:hipcc}
title: hipCC
- url: ${project:hipcub}
title: hipCUB
- url: ${project:hipfft}
title: hipFFT
- url: ${project:hipify}
title: HIPIFY
- url: ${project:hiprand}
title: hipRAND
- url: ${project:hip}
title: HIP runtime
- url: ${project:hipsolver}
title: hipSOLVER
- url: ${project:hipsparse}
title: hipSPARSE
- url: ${project:hipsparselt}
title: hipSPARSELt
- url: ${project:hiptensor}
title: hipTensor
- url: ${project:miopen}
title: MIOpen
- url: ${project:amdmigraphx}
title: MIGraphX
- url: ${project:rccl}
title: RCCL
- url: ${project:rocalution}
title: rocALUTION
- url: ${project:rocblas}
title: rocBLAS
- url: ${project:rocdbgapi}
title: ROCdbgapi
- url: ${project:rocfft}
title: rocFFT
- file: reference/rocmcc.md
title: ROCmCC
- url: ${project:rdc}
title: ROCm Data Center Tool
- url: ${project:rocm_smi_lib}
title: ROCm SMI LIB
- url: ${project:rocmvalidationsuite}
title: ROCm validation suite
- url: ${project:rocprim}
title: rocPRIM
- url: ${project:rocprofiler}
title: ROCProfiler
- url: ${project:rocrand}
title: rocRAND
- url: ${project:rocsolver}
title: rocSOLVER
- url: ${project:rocsparse}
title: rocSPARSE
- url: ${project:rocthrust}
title: rocThrust
- url: ${project:roctracer}
title: rocTracer
- url: ${project:rocwmma}
title: rocWMMA
- url: ${project:transferbench}
title: TransferBench
- caption: Conceptual
entries: