From 424e6148bdcf4157f3dd7766392f80c9b5527e01 Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Fri, 28 Mar 2025 11:25:06 -0400
Subject: [PATCH] Add MaxText training Docker doc

Add MaxText training Docker doc
---
 .wordlist.txt                                 |   4 +
 docs/how-to/gpu-performance/mi300x.rst        |   6 +-
 .../training/benchmark-docker/jax-maxtext.rst | 345 ++++++++++++++++++
 .../training/benchmark-docker/megatron-lm.rst |  21 +-
 docs/sphinx/_toc.yml.in                       |   2 +
 5 files changed, 364 insertions(+), 14 deletions(-)
 create mode 100644 docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst

diff --git a/.wordlist.txt b/.wordlist.txt
index a784fd483..2f7081dad 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -241,6 +241,7 @@ Makefile
 Makefiles
 Matplotlib
 Matrox
+MaxText
 Megatrends
 Megatron
 Mellanox
@@ -380,6 +381,7 @@ SIMDs
 SKU
 SKUs
 SLES
+SLURM
 SMEM
 SMI
 SMT
@@ -573,6 +575,7 @@ distro
 distros
 dkms
 dtype
+eb
 el
 embeddings
 enablement
@@ -626,6 +629,7 @@ hipify
 hipsolver
 hipsparse
 hlist
+hostname
 hotspotting
 hpc
 hpp
diff --git a/docs/how-to/gpu-performance/mi300x.rst b/docs/how-to/gpu-performance/mi300x.rst
index cc65f14ec..6dce7d9b6 100644
--- a/docs/how-to/gpu-performance/mi300x.rst
+++ b/docs/how-to/gpu-performance/mi300x.rst
@@ -14,9 +14,9 @@ instructions on system settings and application :doc:`workload tuning
 leverage the maximum capabilities of these accelerators and achieve superior
 performance.
 
-* :doc:`../system-optimization/mi300x` covers essential system settings and
-  system management practices to configure your AMD Instinct MI300X system for
-  performance.
+* `AMD Instinct MI300X system optimization `__
+  covers essential system settings and system management practices to configure
+  your AMD Instinct MI300X system for performance.
 
 * :doc:`../rocm-for-ai/inference-optimization/workload` covers steps to
   optimize the performance of AMD Instinct MI300X series accelerators for HPC
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst
new file mode 100644
index 000000000..78a4cf774
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst
@@ -0,0 +1,345 @@
+.. meta::
+   :description: How to train a model using JAX MaxText for ROCm.
+   :keywords: ROCm, AI, LLM, train, jax, torch, Llama, flux, tutorial, docker
+
+**************************************
+Training a model with MaxText for ROCm
+**************************************
+
+MaxText is a high-performance, open-source framework built on the Google JAX
+machine learning library to train LLMs at scale. The MaxText framework for
+ROCm is an optimized fork of the upstream
+``__, enabling efficient AI workloads
+on AMD Instinct MI300X series accelerators.
+
+The MaxText for ROCm training Docker image (``rocm/jax-training:maxtext-v25.4``)
+provides a prebuilt environment for training on AMD Instinct MI300X and MI325X
+accelerators, including essential components like JAX, XLA, ROCm libraries, and
+MaxText utilities. It includes the following software components:
+
++--------------------------+--------------------------------+
+| Software component       | Version                        |
++==========================+================================+
+| ROCm                     | 6.3.0                          |
++--------------------------+--------------------------------+
+| JAX                      | 0.4.31                         |
++--------------------------+--------------------------------+
+| Python                   | 3.10                           |
++--------------------------+--------------------------------+
+| Transformer Engine       | 1.12.0.dev0+f81a3eb            |
++--------------------------+--------------------------------+
+| hipBLASLt                | git78ec8622                    |
++--------------------------+--------------------------------+
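+
+To confirm the versions shipped in the image you actually pulled, you can
+query them from inside the running container (see
+:ref:`amd-maxtext-download-docker` for how to start it). The paths and
+commands below are typical for ROCm-based JAX images but aren't guaranteed
+for every image build, so treat them as a suggested starting point.
+
+.. code-block:: shell
+
+   # ROCm release installed in the image (path follows the usual ROCm layout).
+   cat /opt/rocm/.info/version
+
+   # Python and JAX versions reported by the environment itself.
+   python3 --version
+   python3 -c "import jax; print(jax.__version__)"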
+
+Supported features and models
+=============================
+
+MaxText provides the following key features to train large language models efficiently:
+
+- Transformer Engine (TE)
+
+- Flash Attention (FA) 3
+
+- GEMM tuning
+
+- Multi-node support
+
+.. _amd-maxtext-model-support:
+
+The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
+
+* Llama 3.1 8B
+
+* Llama 3.1 70B
+
+* Llama 3 8B
+
+* Llama 3 70B
+
+* Llama 2 7B
+
+* Llama 2 70B
+
+* DeepSeek-V2-Lite
+
+.. note::
+
+   Some models, such as Llama 3, require an external license agreement through
+   a third party (for example, Meta).
+
+Unsupported features
+--------------------
+
+Currently, MaxText's default packed input format is not supported. Using this format
+with the current Docker image results in incorrect attention calculations
+across different input sequences. Support for packed input format is planned for a future release.
+
+System validation
+=================
+
+If you have already validated your system settings, including NUMA
+auto-balancing, skip this step. Otherwise, complete the :ref:`system validation
+and optimization steps ` to set up your system
+before starting training.
+
+Environment setup
+=================
+
+This Docker image is optimized for specific model configurations outlined
+as follows. Performance can vary for other training workloads, as AMD
+doesn’t validate configurations and run conditions outside those described.
+
+.. _amd-maxtext-multi-node-setup:
+
+Multi-node setup
+----------------
+
+For multi-node environments, ensure you have all the necessary packages for
+your network device, such as RDMA. If you're not using a multi-node setup
+with RDMA, skip ahead to :ref:`amd-maxtext-download-docker`.
+
+1. Install the following packages to build and install the RDMA driver.
+
+   .. code-block:: shell
+
+      sudo apt install iproute2 -y
+      sudo apt install -y linux-headers-"$(uname -r)" libelf-dev
+      sudo apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core strace libibmad5 libibnetdisc5 ibverbs-providers libibumad-dev libibumad3 libibverbs1 libnl-3-dev libnl-route-3-dev
+
+   Refer to your NIC manufacturer's documentation for further steps on
+   compiling and installing the RoCE driver. For example, for Broadcom,
+   see `Compiling Broadcom NIC software from source `_
+   in `Ethernet networking guide for AMD Instinct MI300X GPU clusters `_.
+
+2. Set the following environment variables.
+
+   a. Master address
+
+      Change ``localhost`` to the master node's resolvable hostname or IP address:
+
+      .. code-block:: bash
+
+         export MASTER_ADDR="${MASTER_ADDR:-localhost}"
+
+   b. Number of nodes
+
+      Set the number of nodes you want to train on (for example, ``2``, ``4``, or ``8``):
+
+      .. code-block:: bash
+
+         export NNODES="${NNODES:-1}"
+
+   c. Node ranks
+
+      Set the rank of each node (``0`` for master, ``1`` for the first worker node, and so on).
+      Node ranks should be unique across all nodes in the cluster.
+
+      .. code-block:: bash
+
+         export NODE_RANK="${NODE_RANK:-0}"
+
+   d. Network interface
+
+      Update the network interface in the script to match your system's network interface. To
+      find your network interface, run the following (outside of any Docker container):
+
+      .. code-block:: bash
+
+         ip a
+
+      Look for an active interface with an IP address in the same subnet as
+      your other nodes. Then, update the following variable in the script, for
+      example:
+
+      .. code-block:: bash
+
+         export NCCL_SOCKET_IFNAME=ens50f0np0
+
+      This variable specifies which network interface to use for inter-node communication.
+      Setting this variable to the incorrect interface can result in communication failures
+      or significantly reduced performance.
+
+   e. RDMA interface
+
+      Ensure the :ref:`required packages ` are installed on all nodes.
+      Then, set the RDMA interfaces to use for communication.
+
+      .. code-block:: bash
+
+         # If using Broadcom NIC
+         export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
+         # If using Mellanox NIC
+         export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9
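+
+After completing the multi-node setup, you can optionally sanity-check the
+RDMA devices and the network interface you selected before launching any
+training jobs. The following commands are only a suggested spot check --
+device and interface names vary by system, and the ``ibv_devinfo`` and
+``ibstat`` utilities come from the packages installed in step 1.
+
+.. code-block:: shell
+
+   # List RDMA-capable devices and confirm their ports are active.
+   ibv_devinfo | grep -E "hca_id|state"
+   ibstat
+
+   # Confirm the interface exported in NCCL_SOCKET_IFNAME is up and has an
+   # address in the same subnet as the other nodes.
+   ip -brief addr show "${NCCL_SOCKET_IFNAME}"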
+
+.. _amd-maxtext-download-docker:
+
+Download the Docker image
+-------------------------
+
+1. Use the following command to pull the Docker image from Docker Hub.
+
+   .. code-block:: shell
+
+      docker pull rocm/jax-training:maxtext-v25.4
+
+2. Run the Docker container.
+
+   .. code-block:: shell
+
+      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME/.ssh:/root/.ssh --shm-size 128G --name maxtext_training rocm/jax-training:maxtext-v25.4
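+
+Before launching a workload, it's worth confirming that the container can see
+the accelerators and that JAX enumerates them. The following commands are a
+suggested check to run from inside the container started in the previous step;
+the exact device count depends on your system, and ``rocm-smi`` is assumed to
+be available in the image.
+
+.. code-block:: shell
+
+   # List the accelerators visible to the container.
+   rocm-smi
+
+   # Confirm JAX detects the ROCm devices (expect one entry per GPU).
+   python3 -c "import jax; print(jax.devices())"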
+
+.. _amd-maxtext-get-started:
+
+Getting started
+===============
+
+The following examples demonstrate how to get started with single node
+and multi-node training using the benchmarking scripts provided at
+``__.
+
+.. important::
+
+   The provided scripts launch a Docker container and execute a benchmark.
+   Ensure you run these commands outside of any existing Docker container.
+
+Before running any benchmarks, ensure the ``$HF_HOME`` environment variable is
+set correctly and points to your Hugging Face cache directory. Refer to the
+README at ``__
+for more detailed instructions.
+
+Single node training benchmarking examples
+------------------------------------------
+
+* Example 1: Single node training with Llama 2 7B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_7b.sh
+
+  Run the single node training benchmark:
+
+  .. code-block:: shell
+
+     IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama2_7b.sh
+
+* Example 2: Single node training with Llama 2 70B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_70b.sh
+
+  Run the single node training benchmark:
+
+  .. code-block:: shell
+
+     IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama2_70b.sh
+
+* Example 3: Single node training with Llama 3 8B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_8b.sh
+
+  Run the single node training benchmark:
+
+  .. code-block:: shell
+
+     IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama3_8b.sh
+
+* Example 4: Single node training with Llama 3 70B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_70b.sh
+
+  Run the single node training benchmark:
+
+  .. code-block:: shell
+
+     IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama3_70b.sh
+
+* Example 5: Single node training with DeepSeek V2 16B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/deepseek_v2_16b.sh
+
+  Run the single node training benchmark:
+
+  .. code-block:: shell
+
+     IMAGE="rocm/jax-training:maxtext-v25.4" bash ./deepseek_v2_16b.sh
+
+  .. note::
+
+     The TFLOP/s that MaxText reports for DeepSeek is not accurate. Use
+     tokens/s as the performance indicator.
+
+Multi-node training benchmarking examples
+-----------------------------------------
+
+The following examples use SLURM to run on multiple nodes. The commands might
+need to be adjusted for your own cluster setup.
+
+* Example 1: Multi-node training with Llama 2 7B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_7b_multinode.sh
+
+  Run the multi-node training benchmark, replacing ``<num_nodes>`` with the
+  number of nodes to train on. For example:
+
+  .. code-block:: shell
+
+     sbatch -N <num_nodes> llama2_7b_multinode.sh
+
+* Example 2: Multi-node training with Llama 2 70B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_70b_multinode.sh
+
+  Run the multi-node training benchmark. For example:
+
+  .. code-block:: shell
+
+     sbatch -N <num_nodes> llama2_70b_multinode.sh
+
+* Example 3: Multi-node training with Llama 3 8B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_8b_multinode.sh
+
+  Run the multi-node training benchmark. For example:
+
+  .. code-block:: shell
+
+     sbatch -N <num_nodes> llama3_8b_multinode.sh
+
+* Example 4: Multi-node training with Llama 3 70B
+
+  Download the benchmarking script:
+
+  .. code-block:: shell
+
+     wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_70b_multinode.sh
+
+  Run the multi-node training benchmark. For example:
+
+  .. code-block:: shell
+
+     sbatch -N <num_nodes> llama3_70b_multinode.sh
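+
+If you need to adapt the multi-node examples to your own launcher, the
+environment variables described in :ref:`amd-maxtext-multi-node-setup` can be
+derived from standard SLURM variables inside the batch script. The snippet
+below is only an illustrative sketch -- the provided ``*_multinode.sh``
+scripts may already set these values, so check the script you downloaded
+before adding anything.
+
+.. code-block:: bash
+
+   # Derive the rendezvous settings from SLURM's job environment.
+   export NNODES="${SLURM_JOB_NUM_NODES}"
+   export NODE_RANK="${SLURM_NODEID}"
+   # Use the first host in the allocation as the master node.
+   export MASTER_ADDR="$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)"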
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst
index 9fe4ccb52..eacd63ad6 100644
--- a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst
@@ -179,23 +179,22 @@ Network interface
    .. tab-item:: Llama
       :sync: llama
 
-      To avoid connectivity issues in multi-node deployments, ensure the correct network interface
-      is set in your training scripts.
+      Update the network interface in the script to match your system's network interface. To
+      find your network interface, run the following (outside of any Docker container):
 
-      1. Run the following command (outside the container) to find the active network interface on your system.
+      .. code-block:: bash
 
-         .. code-block:: shell
+         ip a
 
-            ip a
+      Look for an active interface that has an IP address in the same subnet as
+      your other nodes. Then, update the following variables in the script, for
+      example:
 
-      2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system’s network interface. For
-         example:
+      .. code-block:: bash
 
-         .. code-block:: shell
+         export NCCL_SOCKET_IFNAME=ens50f0np0
 
-            export NCCL_SOCKET_IFNAME=ens50f0np0
-
-            export GLOO_SOCKET_IFNAME=ens50f0np0
+         export GLOO_SOCKET_IFNAME=ens50f0np0
 
 Dataset options
 ^^^^^^^^^^^^^^^
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index feaf3d7fb..2387efda6 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -44,6 +44,8 @@ subtrees:
       - file: how-to/rocm-for-ai/training/benchmark-docker/megatron-lm
         title: Train a model with Megatron-LM
       - file: how-to/rocm-for-ai/training/benchmark-docker/pytorch-training
        title: Train a model with PyTorch
+      - file: how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext
+        title: Train a model with JAX MaxText
      - file: how-to/rocm-for-ai/training/scale-model-training.rst
        title: Scale model training