.. meta::
   :description: Multi-node setup for AI training
   :keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training

.. _rocm-for-ai-multi-node-setup:

*********************************
Multi-node setup for AI workloads
*********************************

AMD provides ready-to-use Docker images for AMD Instinct™ MI300X and MI325X
GPUs containing ROCm-capable deep learning frameworks and essential software
components. These Docker images can run on multiple nodes and take advantage
of them when they are available. This page describes how to enable multi-node
training of AI workloads on AMD Instinct GPUs.

Prerequisites
=============

Before starting, ensure your environment meets the following requirements:

* Multi-node networking: your cluster should have a configured multi-node network. For setup
  instructions, see the `Multi-node network configuration for AMD Instinct accelerators
  <https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
  guide in the Instinct documentation.

* ROCm Docker container: use a prebuilt ROCm Docker container to simplify environment setup for AI workloads. See the following resources to get started:

  * :doc:`Training a model with Megatron-LM and ROCm <../training/benchmark-docker/megatron-lm>`

  * :doc:`Training a model with PyTorch and ROCm <../training/benchmark-docker/pytorch-training>`

  * :doc:`Training a model with JAX MaxText and ROCm <../training/benchmark-docker/jax-maxtext>`

* Slurm workload manager: required to run the :ref:`provided examples <multi-node-setup-training-examples>`.

Install required packages
=========================

To run multi-node workloads, ensure you have all the required packages installed based on your
network device. For example, on Ubuntu systems:

.. code-block:: shell

   apt install -y iproute2

   apt install -y linux-headers-"$(uname -r)" libelf-dev

   apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core strace libibmad5 libibnetdisc5 ibverbs-providers libibumad-dev libibumad3 libibverbs1 libnl-3-dev libnl-route-3-dev

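Optionally, after installing these packages, you can confirm that the RDMA userspace tools
see your devices. This quick check only uses the tools installed above:

.. code-block:: shell

   # List RDMA devices and show each RDMA link and its state.
   ibv_devices
   rdma link show
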
Compile and install the RoCE library
------------------------------------

If you're using Broadcom NICs, you need to compile and install the RoCE (RDMA
over Converged Ethernet) library. See the `RoCE cluster network configuration guide
for AMD Instinct accelerators
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/roce-network-config.html#roce-cluster-network-configuration-guide-for-amd-instinct-accelerators>`__
for more information.

See the `Ethernet networking guide for AMD
Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source
<https://docs.broadcom.com/doc/957608-AN2XX#page=81>`__ for more details.

.. important::

   It is crucial to install the exact same version of the RoCE library that
   is installed on your host system. Also, ensure that the path to these
   libraries on the host is correctly mounted into your Docker container.
   Failure to do so can lead to compatibility issues and communication
   failures.

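For reference, a container launch along the following lines bind-mounts the host's Broadcom
driver tree into the container. This is a minimal sketch, not the launch command for any
specific image; the image name and the host path are placeholders for your environment.

.. code-block:: shell

   # Minimal sketch only: adjust the image name, device flags, and host path.
   # The bind mount makes the host's Broadcom driver and RoCE library sources
   # visible inside the container.
   docker run -it --rm \
     --network=host --ipc=host \
     --device=/dev/kfd --device=/dev/dri \
     --group-add video \
     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
     -v /opt/broadcom_drivers:/opt/broadcom_drivers \
     <your-rocm-training-image> \
     bash
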
1. Set ``BUILD_DIR`` to the path on the host system where the Broadcom drivers and ``bnxt_rocelib`` source are located.
   Then, navigate to the ``bnxt_rocelib`` directory.

   .. code-block:: shell

      export BUILD_DIR=/path/to/your/broadcom_drivers_on_host
      cd $BUILD_DIR/drivers_linux/bnxt_rocelib/

2. Extract the ``libbnxt_re`` source. The ``bnxt_rocelib`` directory contains it as a ``.tar.gz`` archive.

   .. code-block:: shell

      tar -xf libbnxt_re-a.b.c.d.tar.gz
      cd libbnxt_re-a.b.c.d

3. Compile and install the RoCE library.

   .. code-block:: shell

      sh autogen.sh
      ./configure
      make
      find /usr/lib64/ /usr/lib -name "libbnxt_re-rdmav*.so" -exec mv {} {}.inbox \;
      make install all
      sh -c "echo /usr/local/lib >> /etc/ld.so.conf"
      ldconfig
      cp -f bnxt_re.driver /etc/libibverbs.d/
      find . -name "*.so" -exec md5sum {} \;
      BUILT_MD5SUM=$(find . -name "libbnxt_re-rdmav*.so" -exec md5sum {} \; | cut -d " " -f 1)

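The last command above captures the checksum of the freshly built library in ``BUILT_MD5SUM``.
As an optional follow-up (not part of the vendor procedure), you can compare it against the
library that the dynamic loader actually resolves, to confirm the rebuilt library is in use:

.. code-block:: shell

   # Compare the built library's checksum with the one the loader resolves.
   INSTALLED_LIB=$(ldconfig -p | grep -m1 "libbnxt_re-rdmav" | awk '{print $NF}')
   INSTALLED_MD5SUM=$(md5sum "$INSTALLED_LIB" | cut -d " " -f 1)
   if [ "$BUILT_MD5SUM" = "$INSTALLED_MD5SUM" ]; then
       echo "OK: the freshly built libbnxt_re is the library in use."
   else
       echo "Mismatch: check /etc/ld.so.conf and rerun ldconfig."
   fi
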
Environment setup
=================

Before running multi-node workloads, set these essential environment variables:

Master address
--------------

By default, ``localhost`` is used for single-node configurations. Change
``localhost`` to the master node's resolvable hostname or IP address:

.. code-block:: bash

   export MASTER_ADDR="${MASTER_ADDR:-localhost}"

Number of nodes
---------------

Set the number of nodes you want to train on (for example, ``2``, ``4``, or ``8``):

.. code-block:: bash

   export NNODES="${NNODES:-<num_nodes>}"

Node ranks
----------

Set the rank of each node (``0`` for the master node, ``1`` for the first worker node, and so on).
Node ranks must be unique across all nodes in the cluster.

.. code-block:: bash

   export NODE_RANK="${NODE_RANK:-<node_rank>}"

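For example, on the master node of a two-node cluster, the three variables above might be
set as follows. The hostname is an illustrative placeholder only:

.. code-block:: bash

   # Illustrative values for a two-node run.
   export MASTER_ADDR=node-0   # master node's resolvable hostname or IP
   export NNODES=2
   export NODE_RANK=0          # set to 1 on the second node
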
Network interface
-----------------

Update the network interface in the script to match your system's network interface. To
find your network interface, run the following (outside of any Docker container):

.. code-block:: bash

   ip a

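Optionally, the same information is available in a more compact form that lists only the
interfaces that are up:

.. code-block:: bash

   # Brief listing of interfaces that are up, with their addresses.
   ip -brief addr show up
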
Look for an active interface (status "UP") with an IP address in the same subnet as
your other nodes. Then, update the following variable in the script, for
example:

.. code-block:: bash

   export NCCL_SOCKET_IFNAME=ens50f0np0

This variable specifies which network interface to use for inter-node communication.
Setting this variable to the incorrect interface can result in communication failures
or significantly reduced performance.

.. tip::

   The following command sets ``NCCL_SOCKET_IFNAME`` to the network device name of the
   last RDMA link (after sorting):

   .. code-block:: bash

      export NCCL_SOCKET_IFNAME=$(rdma link show | awk '{print $NF}' | sort | tail -n1)

RDMA/IB interface
-----------------

Set the RDMA interfaces to use for communication. NICs from different vendors use
different RDMA interface names. To list all RDMA/IB devices, run:

.. code-block:: bash

   ibv_devices

Set ``NCCL_IB_HCA`` to a comma-separated list of your RDMA interfaces. For example, if
``rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7`` are your RDMA interfaces, then set:

.. code-block:: bash

   # If using Broadcom NICs
   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
   # If using Mellanox NICs
   # export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9

.. tip::

   Alternatively, to choose the RDMA interfaces automatically, you can use the
   following command. It lists the RDMA devices, sorts them, and selects the
   first eight.

   .. code-block:: bash

      export NCCL_IB_HCA=$(ibv_devices | awk 'NR>2 {print $1}' | sort | head -n 8 | paste -sd,)

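To double-check which network interface each RDMA device maps to, so that ``NCCL_IB_HCA``
and ``NCCL_SOCKET_IFNAME`` refer to the same fabric, you can inspect the RDMA links:

.. code-block:: bash

   # Each line shows an RDMA device (for example, rdma0 or mlx5_0) and its netdev.
   rdma link show
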
Global ID index
---------------

Update the global ID index if you're using RoCE.

.. code-block:: bash

   export NCCL_IB_GID_INDEX=3

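If you're unsure which GID index to use, the GID types exposed in sysfs can help. The
following is a minimal sketch; the device name and port number are placeholders for your NIC:

.. code-block:: bash

   # Print the GID type (for example, "RoCE v2") for each GID index on port 1
   # of a placeholder device named rdma0.
   for t in /sys/class/infiniband/rdma0/ports/1/gid_attrs/types/*; do
     printf '%s: %s\n' "$(basename "$t")" "$(cat "$t" 2>/dev/null || echo n/a)"
   done
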
.. _multi-node-setup-training-examples:

Multi-node training examples
============================

The following examples use the Slurm workload manager to launch jobs on
multiple nodes. To run these scripts as-is, you must have a Slurm environment
configured. The scripts are designed to work with both Broadcom Thor 2 and
Mellanox NICs by automatically installing the required libraries and setting
the necessary environment variables. For systems with Broadcom NICs, the
scripts assume the host's RoCE library is located in the ``/opt`` directory.

The following benchmarking examples demonstrate training a Llama 3 8B model
across multiple 8-GPU nodes, using FSDP for intra-node parallelism and data
parallelism (DP) for inter-node parallelism.

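Before submitting, you can confirm what Slurm sees. This is a quick sanity check,
assuming a working Slurm installation:

.. code-block:: shell

   # List partitions, node counts, generic resources (GPUs), and node names.
   sinfo -o "%P %D %G %N"
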
.. _rocm-for-ai-multi-node-setup-jax-train-example:

JAX MaxText
-----------

1. Download the desired multi-node benchmarking script from `<https://github.com/ROCm/MAD/tree/develop/scripts/jax-maxtext/gpu-rocm>`__.

   .. code-block:: shell

      wget https://raw.githubusercontent.com/ROCm/MAD/refs/heads/develop/scripts/jax-maxtext/gpu-rocm/llama3_8b_multinode.sh

   Or clone the `<https://github.com/ROCm/MAD>`__ repository.

   .. code-block:: shell

      git clone https://github.com/ROCm/MAD
      cd MAD/scripts/jax-maxtext/gpu-rocm

2. Run the benchmark for multi-node training.

   .. code-block:: shell

      sbatch -N <num_nodes> llama3_8b_multinode.sh

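Once submitted, you can monitor the job with standard Slurm tooling. For example (the
output file name depends on how the script configures Slurm; ``slurm-<jobid>.out`` is
the Slurm default):

.. code-block:: shell

   # Show your queued and running jobs, then follow the job's output file.
   squeue -u "$USER"
   tail -f slurm-<jobid>.out
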
.. _rocm-for-ai-multi-node-setup-pyt-train-example:

PyTorch training
----------------

.. note::

   The ROCm PyTorch Training Docker image now focuses on :doc:`Training a model
   with Primus and PyTorch <../training/benchmark-docker/primus-pytorch>`. The
   following example refers to the legacy workflow :ref:`Training a
   model with PyTorch <amd-pytorch-training-multinode-examples>`.

1. Download the ``run_multinode_train.sh`` benchmarking script from `<https://github.com/ROCm/MAD/tree/develop/scripts/pytorch_train>`__.

   .. code-block:: shell

      wget https://raw.githubusercontent.com/ROCm/MAD/refs/heads/develop/scripts/pytorch_train/run_multinode_train.sh

   Or clone the `<https://github.com/ROCm/MAD>`__ repository.

   .. code-block:: shell

      git clone https://github.com/ROCm/MAD
      cd MAD/scripts/pytorch_train

2. Run the benchmark for multi-node training.

   .. code-block:: shell

      sbatch -N <num_nodes> run_multinode_train.sh

.. seealso::

   See :ref:`Training a model with PyTorch <amd-pytorch-training-multinode-examples>` for more examples and information.

Megatron-LM
-----------

.. note::

   The Megatron-LM Docker image now focuses on :ref:`Training a model with
   Primus and Megatron <amd-primus-megatron-multi-node-examples>`. The
   following example refers to the legacy Megatron-LM :ref:`Training a model
   with Megatron-LM <amd-megatron-lm-multi-node-examples>` and might have
   limited support.

1. Download the ``train_llama_slurm.sh`` benchmarking script from
   `<https://github.com/ROCm/Megatron-LM/blob/rocm_dev/examples/llama/train_llama_slurm.sh>`__.

2. Set the network interface parameters as described in the guidelines above, then run the script.

   .. code-block:: shell

      cd </path/to/your/Megatron-LM>
      export NETWORK_INTERFACE=$NCCL_SOCKET_IFNAME
      export NCCL_IB_HCA=$NCCL_IB_HCA
      export IMAGE=docker.io/rocm/megatron-lm:latest  # or your preferred image
      export DATA_CACHE_PATH=/nfs/mounted/repo

      sbatch -N <num_nodes> examples/llama/train_llama_slurm.sh <MODEL_SIZE> <MBS> <GBS> <SEQ_LENGTH> <FSDP> <RECOMPUTE>

3. For example, to run a Llama 3 8B workload in BF16 precision, use the following command.

   .. code-block:: shell

      MODEL_NAME=llama3 sbatch -N 8 examples/llama/train_llama_slurm.sh 8 2 128 8192 0 0
      # Other parameters, such as TP and the FP8 datatype, can be adjusted in the script.

Further reading
===============

* `Multi-node network configuration for AMD Instinct accelerators <https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__

* `Ethernet networking guide for AMD Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source <https://docs.broadcom.com/doc/957608-AN2XX#page=81>`__