Compare commits
3 Commits
docs/7.11. ... doc/megatr

| Author | SHA1 | Date |
|---|---|---|
|  | 744eff47a0 |  |
|  | 8c3181295b |  |
|  | 267eda26ea |  |
@@ -592,6 +592,7 @@ hpp
 hsa
 hsakmt
 hyperparameter
+hyperparameters
 iDRAC
 ib_core
 inband
@@ -43,6 +43,7 @@ article_pages = [
     {"file": "how-to/rocm-for-ai/index", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/install", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/train-a-model", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/accelerate-training", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/deploy-your-model", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/hugging-face-models", "os": ["linux"]},
     {"file": "how-to/rocm-for-hpc/index", "os": ["linux"]},
BIN docs/data/how-to/rocm-for-ai/2-node-training-master.png (new file, 139 KiB)
BIN docs/data/how-to/rocm-for-ai/2-node-training-worker.png (new file, 30 KiB)
BIN docs/data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png (new file, 36 KiB)
BIN docs/data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png (new file, 35 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png (new file, 242 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png (new file, 155 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-8-gpu.png (new file, 242 KiB)
@@ -16,6 +16,8 @@ In this guide, you'll learn about:
 - :doc:`Installing ROCm and machine learning frameworks <install>`
 
+- :doc:`Scaling model training <scale-model-training>`
+
 - :doc:`Training a model <train-a-model>`
 
 - :doc:`Running models from Hugging Face <hugging-face-models>`
docs/how-to/rocm-for-ai/scale-model-training.rst (new file, 135 lines)
@@ -0,0 +1,135 @@
.. meta::
   :description: How to scale and accelerate model training
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

**********************
Scaling model training
**********************

To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, no single accelerator or GPU can store and
process all of the model parameters required for training. PyTorch addresses this computational constraint through
its distributed training frameworks.

.. _rocm-for-ai-pytorch-distributed:

PyTorch distributed
===================

Features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)

- `RPC-based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)

- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

This topic focuses on the distributed data-parallelism strategy, the most popular of the three. To get started with
DDP, you first need to understand how to coordinate the model and its training data across multiple accelerators or
GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data; the reducer then
uses ``allreduce`` to sum the gradients over the different workers.
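The following is a minimal sketch of this workflow, assuming a single node with 8 GPUs launched via
``torchrun --nproc_per_node=8 ddp_example.py``; the model, dataset, and hyperparameters are illustrative placeholders,
not part of the workflow description above.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.utils.data import DataLoader, TensorDataset
   from torch.utils.data.distributed import DistributedSampler

   def main():
       # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
       dist.init_process_group(backend="nccl")  # RCCL implements the NCCL API on ROCm
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # Step 2 of the workflow: every device gets a replica of the model.
       model = torch.nn.Linear(128, 10).cuda(local_rank)
       model = DDP(model, device_ids=[local_rank])

       # Step 1: DistributedSampler splits the global batch into local batches.
       dataset = TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,)))
       loader = DataLoader(dataset, batch_size=4, sampler=DistributedSampler(dataset))

       optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
       loss_fn = torch.nn.CrossEntropyLoss()

       for inputs, labels in loader:
           inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
           optimizer.zero_grad()
           loss = loss_fn(model(inputs), labels)   # step 3: forward pass
           loss.backward()                         # steps 3-4: backward; DDP allreduces gradients
           optimizer.step()                        # step 4: identical update on every replica

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()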
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_

* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_

.. _rocm-for-ai-pytorch-fsdp:

PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, DDP model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This
makes training some very large models feasible by allowing larger models or batch sizes to fit on-device. However, it
comes at the cost of increased communication volume, which is reduced by internal optimizations such as overlapping
communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, see `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.
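As a brief sketch (reusing the process-group setup from the DDP example above, with a placeholder model), wrapping a
module in FSDP looks like this:

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group(backend="nccl")
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   model = torch.nn.Sequential(
       torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
   ).cuda(local_rank)

   # Unlike DDP, FSDP shards parameters, gradients, and optimizer state across
   # ranks, gathering full parameters only while a layer is being computed.
   model = FSDP(model)

   # Construct the optimizer after wrapping so it references the sharded parameters.
   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)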
.. _rocm-for-ai-deepspeed:

DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under its
training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example
of training with DeepSpeed on an AMD accelerator or GPU.
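For orientation, here is a minimal, hedged sketch of initializing a model with DeepSpeed's ZeRO optimizer sharding;
the configuration values are illustrative placeholders rather than recommended settings.

.. code-block:: python

   import deepspeed
   import torch

   model = torch.nn.Linear(1024, 1024)  # placeholder model

   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer state and gradients
       "bf16": {"enabled": True},
       "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
   }

   # Returns a wrapped engine whose backward()/step() handle the distributed details.
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )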
.. _rocm-for-ai-automatic-mixed-precision:

Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time, memory, and cost required to train them. Any measure that reduces
training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.
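A typical AMP training step uses PyTorch's ``autocast`` context together with a gradient scaler; the model and data
below are placeholders. On ROCm, GPUs are exposed through the ``cuda`` device type.

.. code-block:: python

   import torch

   device = "cuda"
   model = torch.nn.Linear(512, 512).to(device)
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
   scaler = torch.amp.GradScaler(device)  # rescales the loss to avoid underflow in reduced precision

   inputs = torch.randn(8, 512, device=device)
   targets = torch.randn(8, 512, device=device)

   for _ in range(10):
       optimizer.zero_grad()
       # The forward pass runs selected ops in reduced precision.
       with torch.amp.autocast(device_type=device):
           loss = torch.nn.functional.mse_loss(model(inputs), targets)
       scaler.scale(loss).backward()
       scaler.step(optimizer)
       scaler.update()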
.. _rocm-for-ai-fine-tune:

Fine-tuning your model
======================

ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
example, LoRA, QLoRA, PEFT, and FSDP. A brief LoRA sketch follows the reading list below.

Learn more about challenges and solutions for model fine-tuning in :doc:`../llm-fine-tuning-optimization/index`.

The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/finetuning>`_
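As a brief illustration of the PEFT and LoRA techniques mentioned above, the following sketch attaches LoRA adapters
to a small Hugging Face model; the model name and hyperparameters are placeholders, not recommendations.

.. code-block:: python

   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder model

   # LoRA trains small low-rank adapter matrices instead of the full weights.
   lora_config = LoraConfig(
       r=8,                                  # rank of the adapter matrices
       lora_alpha=16,                        # scaling factor for adapter updates
       target_modules=["q_proj", "v_proj"],  # attention projections to adapt
       lora_dropout=0.05,
       task_type="CAUSAL_LM",
   )

   model = get_peft_model(model, lora_config)
   model.print_trainable_parameters()  # only a small fraction of weights remain trainable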
docs/how-to/rocm-for-ai/train-a-model.rst
@@ -1,140 +1,503 @@

.. meta::
   :description: How to train a model using ROCm Megatron-LM
   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch

**************************************
Training a model with ROCm Megatron-LM
**************************************

.. _amd-megatron-lm:

The ROCm Megatron-LM framework is a specialized fork of Megatron-LM designed to
enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X
accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
workloads. It is purpose-built to :ref:`support models <amd-megatron-lm-model-support>`
like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater
efficiency. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.

For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential
components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the
following software to accelerate training workloads:
+--------------------------+--------------------------------+
| Software component       | Version                        |
+==========================+================================+
| ROCm                     | 6.1                            |
+--------------------------+--------------------------------+
| PyTorch                  | 2.4.0                          |
+--------------------------+--------------------------------+
| PyTorch Lightning        | 2.4.0                          |
+--------------------------+--------------------------------+
| Megatron Core            | 0.9.0                          |
+--------------------------+--------------------------------+
| Transformer Engine       | 1.5.0                          |
+--------------------------+--------------------------------+
| Flash Attention          | v2.6                           |
+--------------------------+--------------------------------+
| Transformers             | 4.44.0                         |
+--------------------------+--------------------------------+
Supported features and models
=============================

Megatron-LM provides the following key features to train large language models efficiently:

- Transformer Engine (TE)

- APEX

- GEMM tuning

- Torch.compile

- 3D parallelism: TP + SP + CP

- Distributed optimizer

- Flash Attention (FA) 2

- Fused kernels

- Pre-training

.. _amd-megatron-lm-model-support:

The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.

* Llama 2 7B

* Llama 2 70B

* Llama 3 8B

* Llama 3 70B

* Llama 3.1 8B

* Llama 3.1 70B
Prerequisite system validation steps
====================================

Complete the following system validation and optimization steps to set up your system before starting training.

Disable NUMA auto-balancing
---------------------------

Generally, application performance can benefit from disabling NUMA auto-balancing. However,
it might be detrimental to performance with certain types of workloads.

Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
the output is ``1``, run the following command to disable NUMA auto-balancing.

.. code-block:: shell

   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

See :ref:`mi300x-disable-numa` for more information.

Hardware verification with ROCm
-------------------------------

Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
You can restore this setting to its default value with the ``rocm-smi -r`` command.

Run the command:

.. code-block:: shell

   rocm-smi --setperfdeterminism 1900

See :ref:`mi300x-hardware-verification-with-rocm` for more information.

RCCL Bandwidth Test
-------------------

ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
routines for GPUs. See the :doc:`RCCL documentation <rccl:index>` for more information. Before starting
pre-training, running an RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
for efficient distributed training.

Running the RCCL bandwidth test helps verify that:

- The GPUs can communicate across nodes or within a single node.

- The interconnect (such as InfiniBand, Ethernet, or Infinity Fabric) is functioning as expected and
  provides adequate bandwidth for communication.

- There are no hardware setup or cabling issues that could affect communication between GPUs.
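Before building the full bandwidth test, a quick sanity check of collective communication can be run from PyTorch
itself, for example with ``torchrun --nproc_per_node=8 check_allreduce.py`` (the script name is a placeholder). This
only confirms that ``allreduce`` completes; it does not measure bandwidth.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist

   dist.init_process_group(backend="nccl")  # RCCL backs the NCCL API on ROCm
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # Each rank contributes ones; after all_reduce every element equals the world size.
   x = torch.ones(1024, device="cuda")
   dist.all_reduce(x)
   assert torch.allclose(x, torch.full_like(x, dist.get_world_size()))

   if dist.get_rank() == 0:
       print(f"all_reduce OK across {dist.get_world_size()} ranks")
   dist.destroy_process_group()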
Tuning and optimizing hyperparameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In distributed training, specific hyperparameters related to distributed communication can be tuned based on
the results of the RCCL bandwidth test. These variables are already set in the Docker image:

.. code-block:: shell

   # force all RCCL streams to be high priority
   export TORCH_NCCL_HIGH_PRIORITY=1

   # specify which RDMA interfaces to use for communication
   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7

   # define the Global ID index used in RoCE mode
   export NCCL_IB_GID_INDEX=3

   # avoid a data corruption/mismatch issue that existed in past releases
   export RCCL_MSCCL_ENABLE=0

Running the RCCL Bandwidth Test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Running the RCCL bandwidth test before launching training is recommended to confirm that system performance is
sufficient. The test binaries are not included in the AMD Megatron-LM Docker image; follow the instructions at
`<https://github.com/ROCm/rccl-tests>`__ to build them. See :ref:`mi300x-rccl` for more information.

Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:

.. code-block:: shell

   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
   :width: 800

Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node setups is
recommended. So, a run on 8 GPUs looks something like:

.. code-block:: shell

   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
   :width: 800

Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
PyTorch and TensorFlow.

Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed.

.. code-block:: shell

   /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8,tw015:8 \
     --mca pml ucx \
     --mca btl ^openib \
     -x NCCL_SOCKET_IFNAME=ens50f0np0 \
     -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \
     -x NCCL_IB_GID_INDEX=3 \
     -x NCCL_MIN_NCHANNELS=40 \
     -x NCCL_DEBUG=version \
     $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
   :width: 800

.. _mi300x-amd-megatron-lm-training:

Start training on MI300X accelerators
=====================================

The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1.

Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker image.

.. _amd-megatron-lm-requirements:

Download the Docker image and required packages
-----------------------------------------------

1. Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull rocm/megatron-lm:24.12-dev

2. Launch the Docker container.

   .. code-block:: shell

      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $CACHE_DIR:/root/.cache --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash

3. Clone the ROCm Megatron-LM repository to a local directory and install the required packages on the host machine.

   .. code-block:: shell

      git clone https://github.com/ROCm/Megatron-LM
      cd Megatron-LM

   .. note::

      This release is validated with ``ROCm/Megatron-LM`` commit `bb93ccb
      <https://github.com/ROCm/Megatron-LM/tree/bb93ccbfeae6363c67b361a97a27c74ab86e7e92>`_.
      Checking out this specific commit is recommended for a stable and reproducible environment.

   .. code-block:: shell

      git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92

Prepare training datasets
-------------------------

If you already have the preprocessed data, you can skip this section.

Use the following command to process datasets. GPT data is used as an example; you can change the merge table, append
an end-of-document token, remove sentence splitting, and set the tokenizer type as needed.

.. code-block:: shell

   python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

In this case, the automatically generated output files are named ``my-gpt2_text_document.bin`` and
``my-gpt2_text_document.idx``.

.. image:: ../../data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png
   :width: 800

.. _amd-megatron-lm-environment-setup:

Environment setup
-----------------

In the ``examples/llama`` directory of Megatron-LM, if you're working with Llama 2 7B or Llama 2 70B, use the
``train_llama2.sh`` configuration script. Likewise, if you're working with Llama 3 or Llama 3.1, use
``train_llama3.sh`` and update the configuration script accordingly.

Network interface
^^^^^^^^^^^^^^^^^

To avoid connectivity issues, ensure the correct network interface is set in your training scripts.

1. Run the following command to find the active network interface on your system.

   .. code-block:: shell

      ip a

2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system's network interface. For
   example:

   .. code-block:: shell

      export NCCL_SOCKET_IFNAME=ens50f0np0
      export GLOO_SOCKET_IFNAME=ens50f0np0

Dataset options
^^^^^^^^^^^^^^^

You can use either mock data or real data for training.

* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset.

  .. code-block:: shell

     DATA_DIR="/root/.cache/data" # Change to where your dataset is stored
     DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence

  .. code-block:: shell

     --data-path $DATA_PATH

  Ensure that the files are accessible inside the Docker container.

* Mock data can be useful for testing and validation. If you're using mock data, replace ``--data-path $DATA_PATH``
  with the ``--mock-data`` option.

  .. code-block:: shell

     --mock-data

Tokenizer
^^^^^^^^^

Tokenization is the process of converting raw text into tokens that the model can process. For Llama
models, this typically involves sub-word tokenization, where words are broken down into smaller units based on
a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a
fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to
handle a variety of input sequences, including unseen words or domain-specific terms.

To train any of the Llama 2 models that this Docker image supports, use the ``Llama2Tokenizer``.

To train any of the Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``.
Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable.

For example, if you're using the Llama 3.1 8B model:

.. code-block:: shell

   TOKENIZER_MODEL=meta-llama/Llama-3.1-8B
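To see what this sub-word tokenization produces, you can load the same tokenizer through the Hugging Face
``transformers`` library; the sentence below is a placeholder, and gated models such as Llama 3.1 require an
authenticated Hugging Face account.

.. code-block:: python

   from transformers import AutoTokenizer

   # Same model ID as TOKENIZER_MODEL above.
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

   text = "Megatron-LM scales LLM training across MI300X accelerators."
   print(tokenizer.tokenize(text))  # sub-word pieces from the fixed vocabulary
   print(tokenizer.encode(text))    # integer token IDs the model consumes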
Run benchmark tests
-------------------

.. note::

   If you're running **multi-node training**, update the following environment variables. They can
   also be passed as command-line arguments.

   * Change ``localhost`` to the master node's hostname:

     .. code-block:: shell

        MASTER_ADDR="${MASTER_ADDR:-localhost}"

   * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):

     .. code-block:: shell

        NNODES="${NNODES:-1}"

   * Set the rank of each node (0 for the master node, 1 for the first worker node, and so on):

     .. code-block:: shell

        NODE_RANK="${NODE_RANK:-0}"

* Use this command to run a performance benchmark test of any of the Llama 2 models that this Docker image supports
  (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).

  .. code-block:: shell

     {variables} bash examples/llama/train_llama2.sh

* Use this command to run a performance benchmark test of any of the Llama 3 and Llama 3.1 models that this Docker
  image supports (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).

  .. code-block:: shell

     {variables} bash examples/llama/train_llama3.sh

.. _amd-megatron-lm-benchmark-test-vars:

The benchmark tests support the same set of variables:

+----------------------+-----------------------------+----------------------------------------+
| Name                 | Options                     | Description                            |
+======================+=============================+========================================+
| ``TEE_OUTPUT``       | 0 or 1                      | 0: disable training log                |
|                      |                             |                                        |
|                      |                             | 1: enable training log                 |
+----------------------+-----------------------------+----------------------------------------+
| ``MBS``              |                             | Micro batch size                       |
+----------------------+-----------------------------+----------------------------------------+
| ``BS``               |                             | Batch size                             |
+----------------------+-----------------------------+----------------------------------------+
| ``TP``               | 1, 2, 4, 8                  | Tensor parallelism                     |
+----------------------+-----------------------------+----------------------------------------+
| ``TE_FP8``           | 0 or 1                      | Datatype. 1: FP8                       |
|                      |                             |                                        |
|                      |                             | 0: BF16                                |
+----------------------+-----------------------------+----------------------------------------+
| ``NO_TORCH_COMPILE`` | 0 or 1                      | 1: enable torch.compile                |
|                      |                             |                                        |
|                      |                             | 0: disable torch.compile (default)     |
+----------------------+-----------------------------+----------------------------------------+
| ``SEQ_LENGTH``       |                             | Input sequence length                  |
+----------------------+-----------------------------+----------------------------------------+
| ``GEMM_TUNING``      | 0 or 1                      | 1: enable GEMM tuning                  |
|                      |                             |                                        |
|                      |                             | 0: disable GEMM tuning                 |
+----------------------+-----------------------------+----------------------------------------+
| ``USE_FLASH_ATTN``   | 0 or 1                      | 0: disable Flash Attention             |
|                      |                             |                                        |
|                      |                             | 1: enable Flash Attention              |
+----------------------+-----------------------------+----------------------------------------+
| ``ENABLE_PROFILING`` | 0 or 1                      | 0: disable torch profiling             |
|                      |                             |                                        |
|                      |                             | 1: enable torch profiling              |
+----------------------+-----------------------------+----------------------------------------+
| ``MODEL_SIZE``       |                             | The size of the model: 7B, 70B, etc.   |
+----------------------+-----------------------------+----------------------------------------+
| ``TOTAL_ITERS``      |                             | Total number of iterations             |
+----------------------+-----------------------------+----------------------------------------+
| ``transformer-impl`` | transformer_engine or local | Transformer Engine is used by default  |
+----------------------+-----------------------------+----------------------------------------+

Benchmarking examples
^^^^^^^^^^^^^^^^^^^^^

.. tab-set::

   .. tab-item:: Single node training
      :sync: single

      Use this command to run training with the Llama 2 7B model on a single node. You can specify the micro batch
      size (``MBS``), batch size (``BS``), datatype (``TE_FP8``), and so on.

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script
      <amd-megatron-lm-environment-setup>`.

      See the sample output:

      .. image:: ../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
         :width: 800

   .. tab-item:: Multi node training
      :sync: multi

      Launch the Docker container on each node.

      In this example, run training with the Llama 2 7B model on 2 nodes with a specific micro batch size, batch
      size, datatype, and so on.

      On the master node:

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      On the worker node:

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script
      <amd-megatron-lm-environment-setup>`.

      Sample output for 2-node training:

      Master node:

      .. image:: ../../data/how-to/rocm-for-ai/2-node-training-master.png
         :width: 800

      Worker node:

      .. image:: ../../data/how-to/rocm-for-ai/2-node-training-worker.png
         :width: 800
@@ -537,6 +537,8 @@ installation was successful, refer to the
 :doc:`rocm-install-on-linux:install/post-install`.
 Should verification fail, consult :doc:`/how-to/system-debugging`.
 
+.. _mi300x-hardware-verification-with-rocm:
+
 Hardware verification with ROCm
 -------------------------------
@@ -40,6 +40,8 @@ subtrees:
   title: Installation
 - file: how-to/rocm-for-ai/train-a-model.rst
   title: Train a model
+- file: how-to/rocm-for-ai/scale-model-training.rst
+  title: Scale model training
 - file: how-to/rocm-for-ai/hugging-face-models.rst
   title: Run models from Hugging Face
 - file: how-to/rocm-for-ai/deploy-your-model.rst