Initial GPU-aware MPI port (#2086)

* Initial GPU-aware MPI port

* Remove trailing spaces

* Allowlist word in gpu_aware_mpi
This commit is contained in:
Nagy-Egri Máté Ferenc
2023-05-04 17:42:22 +02:00
committed by GitHub
parent 66ed6adf6c
commit 62ed404058
4 changed files with 166 additions and 0 deletions

.wordlist.txt Normal file

@@ -0,0 +1,17 @@
# gpu_aware_mpi
DMA
GDR
HCA
MPI
MVAPICH
Mellanox's
NIC
OFED
OSU
OpenFabrics
PeerDirect
RDMA
UCX
ib_core
# isv_deployment_win
ABI


@@ -187,6 +187,7 @@ subtrees:
- file: how_to/magma_install/magma_install
- file: how_to/pytorch_install/pytorch_install
- file: how_to/tensorflow_install/tensorflow_install
- file: how_to/gpu_aware_mpi
- file: how_to/system_debugging
- caption: Examples

Binary file not shown.


@@ -0,0 +1,148 @@
# GPU-Enabled MPI
The Message Passing Interface ([MPI](https://www.mpi-forum.org)) is a standard
API for distributed and parallel application development that can scale to
multi-node clusters. To facilitate the porting of applications to clusters with
GPUs, ROCm enables various technologies. These technologies allow users to
directly use GPU pointers in MPI calls and enable ROCm-aware MPI libraries to
deliver optimal performance for both intra-node and inter-node GPU-to-GPU
communication.
The AMD kernel driver exposes Remote Direct Memory Access (RDMA) through the
*PeerDirect* interfaces to allow Host Channel Adapters (HCA, a type of
Network Interface Card or NIC) to directly read and write to the GPU device
memory with RDMA capabilities. These interfaces are currently registered as a
*peer_memory_client* with Mellanox's OpenFabrics Enterprise Distribution (OFED)
`ib_core` kernel module to allow high-speed DMA transfers between GPU and HCA.
These interfaces are used to optimize inter-node MPI message communication.
This chapter demonstrates how to set up Open MPI with the ROCm platform. The Open
MPI project is an open source implementation of the Message Passing Interface
(MPI) that is developed and maintained by a consortium of academic, research,
and industry partners.
Several MPI implementations can be made ROCm-aware by compiling them with
[Unified Communication Framework](http://www.openucx.org/) (UCX) support. One
notable exception is MVAPICH2: It directly supports AMD GPUs without using UCX,
and you can download it [here](http://mvapich.cse.ohio-state.edu/downloads/).
Use the latest version of the MVAPICH2-GDR package.
The Unified Communication Framework is an open source, cross-platform framework
whose goal is to provide a common set of communication interfaces targeting a
broad set of network programming models and interfaces. UCX is ROCm-aware, and
ROCm technologies are used directly to implement various network operation
primitives. For more details on the UCX design, refer to its
[documentation](http://www.openucx.org/documentation).
## Building UCX
The following section describes how to set up UCX so it can be used to compile
Open MPI. The following environment variables are set such that all software
components are installed in the same base directory. The commands assume
installation into your home directory; for other locations, adjust the
environment variables accordingly and make sure you have write permission for
that location:
```shell
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR
```
```note
The following build commands assume that either the ROCmCC or the AOMP
compiler is active in the environment that executes them.
```
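A quick way to check which compiler is active (a sketch, assuming ROCmCC was installed with ROCm under `/opt/rocm`):

```shell
# Put the ROCm LLVM compilers first on the PATH (this path is an assumption;
# adjust it if ROCm or AOMP is installed elsewhere on your system)
export PATH=/opt/rocm/llvm/bin:$PATH
clang --version   # should identify itself as the AMD/ROCm clang
```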
## Install UCX
The next step is to set up UCX by compiling its source code and installing it:
```shell
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.13.0
cd ucx
./autogen.sh
mkdir build
cd build
../contrib/configure-release --prefix=$UCX_DIR \
--with-rocm=/opt/rocm \
--without-cuda --enable-optimizations --disable-logging \
--disable-debug --disable-assertions \
--disable-params-check --without-java
make -j $(nproc)
make -j $(nproc) install
```
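Before moving on, it can be useful to confirm that UCX detected ROCm support (a sketch; `ucx_info` ships with UCX, and the exact transport names can vary between versions):

```shell
# List the transports UCX was built with; ROCm entries such as
# rocm_copy and rocm_ipc indicate a ROCm-aware build
$UCX_DIR/bin/ucx_info -d | grep -i rocm
```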
## Install Open MPI
These are the steps to build Open MPI:
```shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
--enable-mca-no-build=btl-uct --enable-mpi1-compatibility \
CC=clang CXX=clang++ FC=flang
make -j $(nproc)
make -j $(nproc) install
```
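To verify that Open MPI picked up the UCX point-to-point layer, the `ompi_info` tool can be queried (a sketch; the exact output format varies between Open MPI versions):

```shell
# The UCX PML component should be listed if the build found UCX
$OMPI_DIR/bin/ompi_info | grep -i "pml.*ucx"
```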
## ROCm-enabled OSU
The OSU Micro Benchmarks v5.9 (OMB) can be used to evaluate the performance of
various primitives with an AMD GPU device and ROCm support. This functionality
is exposed when OMB is configured with the `--enable-rocm` option. The
following steps compile OMB:
```shell
export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xfz osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure --prefix=$INSTALL_DIR/osu --enable-rocm \
--with-rocm=/opt/rocm \
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
$(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)
```
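The configure prefix above also allows an optional install step; OSU typically places the binaries under `libexec/osu-micro-benchmarks` (a sketch, assuming the v5.9 install layout):

```shell
# Optional: install the benchmarks instead of running them from the build tree
make -j $(nproc) install
ls $INSTALL_DIR/osu/libexec/osu-micro-benchmarks/mpi/pt2pt/
```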
## Intra-node Run
The following command runs the OSU bandwidth benchmark between the first two GPU
devices (by default, GPU 0 and GPU 1, on the same OAM) inside the same node. It
measures the unidirectional bandwidth from the first device to the other.
```shell
$OMPI_DIR/bin/mpirun -np 2 --mca btl '^openib' \
-x UCX_TLS=sm,self,rocm_copy,rocm_ipc \
--mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
```
To select different devices, for example devices 2 and 3, set the following
environment variables before the run:
```shell
export HIP_VISIBLE_DEVICES=2,3
export HSA_ENABLE_SDMA=0
```
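These variables must be visible to both MPI ranks. One way to ensure that (a sketch, reusing the benchmark invocation above) is to forward them explicitly with `mpirun`'s `-x` option:

```shell
$OMPI_DIR/bin/mpirun -np 2 --mca btl '^openib' \
    -x HIP_VISIBLE_DEVICES=2,3 -x HSA_ENABLE_SDMA=0 \
    -x UCX_TLS=sm,self,rocm_copy,rocm_ipc \
    --mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
```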
The following output shows the effective transfer bandwidth measured for
inter-die data transfer between GPU devices 2 and 3 (same OAM). For messages
larger than 67 MB, an effective utilization of about 150 GB/sec is achieved,
which corresponds to 75% of the peak transfer bandwidth of 200 GB/sec for that
connection:
:::{figure} /data/how_to/gpu_enabled_mpi_1.png
:name: mpi-bandwidth
:alt: OSU execution showing transfer bandwidth increasing with payload size.
Inter-GPU bandwidth with various payload sizes.
:::