Update docs on Megatron-LM and PyTorch training Dockers (#4407)

* Update Megatron-LM and PyTorch Training Docker docs

Also restructure TOC

* Apply suggestions from code review

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

update "start training" text

Apply suggestions from code review

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

update conf.py

fix spacing

fix branding issue

add disable numa

reorg

remove extra text

(cherry picked from commit 389fa7071b)
2026-01-10 23:28:03 -05:00 · 2025-02-21 13:07:18 -05:00
parent 7ae7046301
commit 4af488e27d
8 changed files with 1029 additions and 508 deletions
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -117,6 +117,7 @@ FX
 Filesystem
 FindDb
 Flang
+FluxBenchmark
 Fortran
 Fuyu
 GALB
@@ -317,6 +318,7 @@ PipelineParallel
 PnP
 PowerEdge
 PowerShell
+Pretraining
 Profiler's
 PyPi
 Pytest
@@ -716,6 +718,7 @@ preprocessing
 preprocessor
 prequantized
 prerequisites
+pretraining
 profiler
 profilers
 protobuf
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -49,6 +49,9 @@ article_pages = [

    {"file": "how-to/rocm-for-ai/training/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/train-a-model", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/prerequisite-system-validation", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/train-a-model/benchmark-docker/megatron-lm", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/train-a-model/benchmark-docker/pytorch-training", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/scale-model-training", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
--- a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst
@@ -0,0 +1,541 @@
+:orphan:
+
+.. meta::
+   :description: How to train a model using Megatron-LM for ROCm.
+   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch
+
+******************************************
+Training a model with Megatron-LM for ROCm
+******************************************
+
+The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM,
+designed to enable efficient training of large-scale language models on AMD
+GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers
+enhanced scalability, performance, and resource utilization for AI workloads.
+It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and
+DeepSeek, enabling developers to train next-generation AI models more
+efficiently. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
+
+AMD provides a ready-to-use Docker image for MI300X accelerators containing
+essential components, including PyTorch, ROCm libraries, and Megatron-LM
+utilities. It contains the following software components to accelerate training
+workloads:
+
++--------------------------+--------------------------------+
+| Software component       | Version                        |
++==========================+================================+
+| ROCm                     | 6.3.0                          |
++--------------------------+--------------------------------+
+| PyTorch                  | 2.7.0a0+git637433              |
++--------------------------+--------------------------------+
+| Python                   | 3.10                           |
++--------------------------+--------------------------------+
+| Transformer Engine       | 1.11                           |
++--------------------------+--------------------------------+
+| Flash Attention          | 3.0.0                          |
++--------------------------+--------------------------------+
+| hipBLASLt                | git258a2162                    |
++--------------------------+--------------------------------+
+| Triton                   | 3.1                            |
++--------------------------+--------------------------------+
+
+Supported features and models
+=============================
+
+Megatron-LM provides the following key features to train large language models efficiently:
+
+- Transformer Engine (TE)
+
+- APEX
+
+- GEMM tuning
+
+- ``torch.compile``
+
+- 3D parallelism: TP + SP + CP
+
+- Distributed optimizer
+
+- Flash Attention (FA) 3
+
+- Fused kernels
+
+- Pre-training
+
+.. _amd-megatron-lm-model-support:
+
+The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
+
+* Llama 2 7B
+
+* Llama 2 70B
+
+* Llama 3 8B
+
+* Llama 3 70B
+
+* Llama 3.1 8B
+
+* Llama 3.1 70B
+
+* DeepSeek-V2-Lite
+
+.. note::
+
+   Some models, such as Llama 3, require an external license agreement through
+   a third party (for example, Meta).
+
+System validation
+=================
+
+If you have already validated your system settings, skip this step. Otherwise,
+complete the :ref:`system validation and optimization steps <train-a-model-system-validation>`
+to set up your system before starting training.
+
+Disable NUMA auto-balancing
+---------------------------
+
+Generally, application performance can benefit from disabling NUMA auto-balancing. However,
+it might be detrimental to performance with certain types of workloads.
+
+Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
+Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
+the output is ``1``, run the following command to disable NUMA auto-balancing.
+
+.. code-block:: shell
+
+   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
+
+See :ref:`mi300x-disable-numa` for more information.
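+
+As a minimal sketch (not part of the Docker image or its scripts), the check and
+the change described above can be combined so that the write only happens when
+auto-balancing is not already disabled. A value written to ``/proc`` this way does
+not persist across reboots.
+
+.. code-block:: shell
+
+   # Read the current setting; empty output is treated the same as "enabled".
+   current="$(cat /proc/sys/kernel/numa_balancing 2>/dev/null)"
+
+   if [ "$current" != "0" ]; then
+       # Same command as above; requires sudo privileges.
+       sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
+   fi
+
+   # Print the value to confirm; 0 means auto-balancing is disabled.
+   cat /proc/sys/kernel/numa_balancing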
+
+.. _mi300x-amd-megatron-lm-training:
+
+Environment setup
+=================
+
+The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
+training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2.
+
+Use the following instructions to set up the environment, configure the script to train models, and
+reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
+image.
+
+.. _amd-megatron-lm-requirements:
+ 
+Download the Docker image
+-------------------------
+
+1. Use the following command to pull the Docker image from Docker Hub.
+
+   .. code-block:: shell
+
+      docker pull rocm/megatron-lm:v25.3
+
+2. Launch the Docker container.
+
+   .. code-block:: shell
+
+      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v  $HOME/.ssh:/root/.ssh --shm-size 64G --name megatron_training_env rocm/megatron-lm:v25.3
+
+3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it.
+
+   .. code-block:: shell
+
+      docker start megatron_training_env
+      docker exec -it megatron_training_env bash
+
+The Docker container includes a pre-installed, verified version of Megatron-LM from the `release branch <https://github.com/ROCm/Megatron-LM/tree/megatron_release_v25.3>`_.
+
+.. _amd-megatron-lm-environment-setup:
+
+Configuration scripts
+---------------------
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      If you're working with Llama 2 7B or Llama 2 70B, use the ``train_llama2.sh`` configuration
+      script in the ``examples/llama`` directory of
+      `<https://github.com/ROCm/Megatron-LM/tree/megatron_release_v25.3/examples/llama>`__.
+      Likewise, if you're working with Llama 3 or Llama 3.1, then use ``train_llama3.sh`` and update
+      the configuration script accordingly.
+
+   .. tab-item:: DeepSeek V2
+      :sync: deepseek
+
+      Use the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2``
+      directory of
+      `<https://github.com/ROCm/Megatron-LM/tree/megatron_release_v25.3/examples/deepseek_v2>`__
+      and update the configuration script accordingly.
+
+Network interface
+^^^^^^^^^^^^^^^^^
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      To avoid connectivity issues in multi-node deployments, ensure the correct network interface
+      is set in your training scripts.
+
+      1. Run the following command (outside the container) to find the active network interface on your system.
+
+         .. code-block:: shell
+
+            ip a
+
+      2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system’s network interface. For
+         example:
+
+         .. code-block:: shell
+
+            export NCCL_SOCKET_IFNAME=ens50f0np0
+
+            export GLOO_SOCKET_IFNAME=ens50f0np0
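+
+      If it is unclear which interface to use, the following sketch prints the
+      interface that carries the default route. This is only a starting point and
+      an assumption about your network layout: on multi-node clusters, training
+      traffic often uses a different (for example, high-speed) interface, so confirm
+      the result against the ``ip a`` output. The ``IFACE`` variable name is
+      hypothetical and used only for this illustration.
+
+      .. code-block:: shell
+
+         # Find the device named after "dev" in the default route entry.
+         IFACE="$(ip route show default | awk '{for (i = 1; i < NF; i++) if ($i == "dev") {print $(i + 1); exit}}')"
+         echo "Default-route interface: ${IFACE}"
+
+         # Use it for both variables only after verifying it is the right one.
+         export NCCL_SOCKET_IFNAME="${IFACE}"
+         export GLOO_SOCKET_IFNAME="${IFACE}"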
+
+Dataset options
+^^^^^^^^^^^^^^^
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      You can use either mock data or real data for training.
+
+      * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default
+        value is ``1`` for enabled.
+
+        .. code-block:: bash
+
+           MOCK_DATA=1
+
+      * If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset.
+
+        .. code-block:: bash
+
+           MOCK_DATA=0
+
+           DATA_PATH=${DATA_PATH:-"/data/bookcorpus_text_sentence"}  # Change to where your dataset is stored
+
+        Ensure that the files are accessible inside the Docker container.
+
+   .. tab-item:: DeepSeek V2
+      :sync: deepseek
+
+      If you don't already have the dataset, download the DeepSeek dataset using the following
+      commands:
+
+      .. code-block:: shell
+
+         mkdir deepseek-datasets
+         cd deepseek-datasets
+         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json
+         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json
+         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json
+         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin
+         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx
+
+      You can use either mock data or real data for training.
+
+      * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default
+        value is ``1`` for enabled.
+
+        .. code-block:: bash
+
+           MOCK_DATA=1
+
+      * If you're using a real dataset, update the ``DATA_DIR`` variable to point to the location of your dataset.
+
+        .. code-block:: bash
+
+           MOCK_DATA=0
+
+           DATA_DIR="/root/data/deepseek-datasets"  # Change to where your dataset is stored
+
+        Ensure that the files are accessible inside the Docker container.
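+
+To confirm that a dataset is visible inside the container, you can list its
+directory from the host. This is a sketch that assumes the container name
+``megatron_training_env`` from the ``docker run`` command and the example
+``DATA_DIR`` value shown for DeepSeek V2; substitute your own path (or the
+location of your ``DATA_PATH`` files for Llama).
+
+.. code-block:: shell
+
+   # List the dataset files as seen from inside the running container.
+   # A "No such file or directory" error means the path is wrong or not mounted.
+   docker exec megatron_training_env ls -lh /root/data/deepseek-datasets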
+
+Tokenizer
+^^^^^^^^^
+
+Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama
+models, this typically involves sub-word tokenization, where words are broken down into smaller units based on
+a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a
+fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to
+handle a variety of input sequences, including unseen words or domain-specific terms.
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      To train any of the Llama 2 models that :ref:`this Docker image supports <amd-megatron-lm-model-support>`, use the ``Llama2Tokenizer``.
+
+      To train any of the Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``.
+      Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable.
+
+      For example, if you're using the Llama 3.1 8B model:
+
+      .. code-block:: shell
+
+         TOKENIZER_MODEL=meta-llama/Llama-3.1-8B
+
+   .. tab-item:: DeepSeek V2
+      :sync: deepseek
+
+      To train any of the DeepSeek V2 models that :ref:`this Docker image supports <amd-megatron-lm-model-support>`, use the ``DeepSeekV2Tokenizer``.
+
+Multi-node training
+^^^^^^^^^^^^^^^^^^^
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      If you're running multi-node training, update the following environment variables. They can
+      also be passed as command line arguments.
+
+      * Change ``localhost`` to the master node's hostname:
+
+        .. code-block:: shell
+
+           MASTER_ADDR="${MASTER_ADDR:-localhost}"
+
+      * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):
+
+        .. code-block:: shell
+
+           NNODES="${NNODES:-1}"
+
+      * Set the rank of each node (0 for master, 1 for the first worker node, and so on):
+
+        .. code-block:: shell
+
+           NODE_RANK="${NODE_RANK:-0}"
+
+      * Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an
+        NFS directory) for multi-node runs:
+
+        .. code-block:: shell
+
+           DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs
+
+      * For multi-node runs, make sure the correct network drivers are installed on the nodes. If
+        you're running inside a Docker container, either install the drivers inside the container
+        or pass the network drivers from the host when creating the container.
+
+Start training on AMD Instinct accelerators
+===========================================
+
+The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
+system performance, conduct training benchmarks, and achieve superior
+performance for models like Llama 3.1 and Llama 2. This container should not be
+expected to provide generalized performance across all training workloads. You
+can expect the container to perform in the model configurations described in
+the following section, but other configurations are not validated by AMD.
+
+Use the following instructions to set up the environment, configure the script
+to train models, and reproduce the benchmark results on MI300X series
+accelerators with the AMD Megatron-LM Docker image.
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      .. tab-set::
+
+         .. tab-item:: Single-node training
+            :sync: single-node
+
+            To run training on a single node, navigate to the Megatron-LM folder and use the
+            following command:
+
+            .. code-block:: shell
+
+               TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 bash examples/llama/train_llama3.sh
+
+         .. tab-item:: Multi-node training
+            :sync: multi-node
+
+            To run training on multiple nodes, launch the Docker container on each node. For example, for a two-node setup (``NODE0`` as the master node), use these commands.
+
+            * On the master node ``NODE0``:
+
+              .. code-block:: shell
+
+                 TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh
+
+            * On the worker node ``NODE1``:
+
+              .. code-block:: shell
+
+                 TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh
+
+
+   .. tab-item:: DeepSeek V2
+      :sync: deepseek
+
+      To run training on a single node, go to the ``/workspace/Megatron-LM`` folder and use the following command:
+
+      .. code-block:: shell
+
+         cd /workspace/Megatron-LM
+         GEMM_TUNING=1 PR=bf16 MBS=4 AC=none bash examples/deepseek_v2/train_deepseekv2.sh
+
+Key options
+-----------
+
+.. _amd-megatron-lm-benchmark-test-vars:
+
+The benchmark tests support the following sets of variables:
+
+.. tab-set::
+
+   .. tab-item:: Llama
+      :sync: llama
+
+      ``TEE_OUTPUT``
+        ``1`` to enable training logs or ``0`` to disable.
+
+      ``TE_FP8``
+        ``0`` for BF16 (default) or ``1`` for FP8 GEMMs.
+
+      ``GEMM_TUNING``
+        ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels.
+
+      ``USE_FLASH_ATTN``
+        ``1`` to enable Flash Attention.
+
+      ``ENABLE_PROFILING``
+        ``1`` to enable PyTorch profiling for performance analysis.
+
+      ``transformer-impl``
+        ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE.
+
+      ``MODEL_SIZE``
+        ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2.
+
+      ``TOTAL_ITERS``
+        The total number of iterations -- ``10`` by default.
+
+      ``MOCK_DATA``
+        ``1`` to use mock data or ``0`` to use real data provided by you.
+
+      ``MBS``
+        Micro batch size.
+
+      ``BS``
+        Global batch size.
+
+      ``TP``
+        Tensor parallel size (``1``, ``2``, ``4``, ``8``).
+
+      ``SEQ_LENGTH``
+        Input sequence length.
+
+   .. tab-item:: DeepSeek V2
+      :sync: deepseek
+
+      ``PR``
+        Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs.
+
+      ``GEMM_TUNING``
+        ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels.
+
+      ``TOTAL_ITERS``
+        The total number of iterations -- ``10`` by default.
+
+      ``MOCK_DATA``
+        ``1`` to use mock data or ``0`` to use real data provided by you.
+
+      ``MBS``
+        Micro batch size.
+
+      ``GBS``
+        Global batch size.
+
+Benchmarking examples
+---------------------
+
+.. tab-set::
+
+   .. tab-item:: Single-node training
+      :sync: single-node
+
+      Use this command to run training with the Llama 2 7B model on a single node. You can adjust options such as
+      ``MBS``, ``BS``, ``TP``, and the data type (``TE_FP8``).
+
+      .. code-block:: bash
+
+         TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 \
+         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
+
+      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup>`.
+
+      See the sample output:
+
+      .. image:: ../../../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
+         :width: 800
+
+   .. tab-item:: Multi-node training
+      :sync: multi-node
+
+      Launch the Docker container on each node.
+
+      In this example, training runs with the Llama 2 7B model on two nodes with specific ``MBS``, ``BS``, ``TP``, and
+      data type (``TE_FP8``) settings.
+
+      On the master node:
+
+      .. code-block:: bash
+
+         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 \
+         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
+
+      On the worker node:
+
+      .. code-block:: bash
+
+         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 \
+         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
+
+      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup>`.
+
+      Sample output for 2-node training:
+
+      Master node:
+
+      .. image:: ../../../../data/how-to/rocm-for-ai/2-node-training-master.png
+         :width: 800
+
+      Worker node:
+
+      .. image:: ../../../../data/how-to/rocm-for-ai/2-node-training-worker.png
+         :width: 800
+
+Previous versions
+=================
+
+This table lists previous versions of the ROCm Megatron-LM Docker image for training
+performance validation. For detailed information about available models for
+benchmarking, see the version-specific documentation.
+
+.. list-table::
+   :header-rows: 1
+   :stub-columns: 1
+
+   * - ROCm version
+     - Megatron-LM version
+     - PyTorch version
+     - Resources
+
+   * - 6.1
+     - 24.12-dev
+     - 2.4.0
+     - 
+       * `Documentation <https://rocm.docs.amd.com/en/docs-6.3.0/how-to/rocm-for-ai/train-a-model.html>`_
+       * `Docker Hub <https://hub.docker.com/layers/rocm/megatron-lm/24.12-dev/images/sha256-5818c50334ce3d69deeeb8f589d83ec29003817da34158ebc9e2d112b929bf2e>`_
--- a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst
@@ -0,0 +1,341 @@
+:orphan:
+
+.. meta::
+   :description: How to train a model using PyTorch for ROCm.
+   :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
+
+**************************************
+Training a model with PyTorch for ROCm
+**************************************
+
+PyTorch is an open-source machine learning framework that is widely used for
+model training with GPU-optimized components for transformer-based models.
+
+The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.3``) image
+provides a prebuilt optimized environment for fine-tuning and pretraining a
+model on AMD Instinct MI325X and MI300X accelerators. It includes the following
+software components to accelerate training workloads:
+
++--------------------------+--------------------------------+
+| Software component       | Version                        |
++==========================+================================+
+| ROCm                     | 6.3.0                          |
++--------------------------+--------------------------------+
+| PyTorch                  | 2.7.0a0+git637433              |
++--------------------------+--------------------------------+
+| Python                   | 3.10                           |
++--------------------------+--------------------------------+
+| Transformer Engine       | 1.11                           |
++--------------------------+--------------------------------+
+| Flash Attention          | 3.0.0                          |
++--------------------------+--------------------------------+
+| hipBLASLt                | git258a2162                    |
++--------------------------+--------------------------------+
+| Triton                   | 3.1                            |
++--------------------------+--------------------------------+
+
+.. _amd-pytorch-training-model-support:
+
+Supported models
+================
+
+The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
+
+* Llama 3.1 8B
+
+* Llama 3.1 70B
+
+* FLUX.1-dev
+
+.. note::
+
+   Only these models are supported in the following steps.
+
+   Some models, such as Llama 3, require an external license agreement through
+   a third party (for example, Meta).
+
+System validation
+=================
+
+If you have already validated your system settings, skip this step. Otherwise,
+complete the :ref:`system validation and optimization steps <train-a-model-system-validation>`
+to set up your system before starting training.
+
+Disable NUMA auto-balancing
+---------------------------
+
+Generally, application performance can benefit from disabling NUMA auto-balancing. However,
+it might be detrimental to performance with certain types of workloads.
+
+Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
+Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
+the output is ``1``, run the following command to disable NUMA auto-balancing.
+
+.. code-block:: shell
+
+   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
+
+See :ref:`mi300x-disable-numa` for more information.
+
+Environment setup
+=================
+
+This Docker image is optimized for specific model configurations outlined
+below. Performance can vary for other training workloads, as AMD 
+doesn’t validate configurations and run conditions outside those described.
+
+Download the Docker image
+-------------------------
+
+1. Use the following command to pull the Docker image from Docker Hub.
+
+   .. code-block:: shell
+
+      docker pull rocm/pytorch-training:v25.3
+
+2. Run the Docker container.
+
+   .. code-block:: shell
+
+      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v  $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.3
+
+3. Use these commands if you exit the ``training_env`` container and need to return to it.
+
+   .. code-block:: shell
+
+      docker start training_env
+      docker exec -it training_env bash
+
+4. In the Docker container, clone the `<https://github.com/ROCm/MAD>`__ repository and navigate to the benchmark scripts directory.
+
+   .. code-block:: shell
+
+      git clone https://github.com/ROCm/MAD
+      cd MAD/scripts/pytorch-train
+
+Prepare training datasets and dependencies
+------------------------------------------
+
+The following benchmarking examples may require downloading models and datasets
+from Hugging Face. To ensure successful access to gated repos, set your
+``HF_TOKEN``.
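+
+For example, you can export the token in the container shell before running the
+setup script. The value below is a placeholder, not a real token; generate your
+own in your Hugging Face account settings.
+
+.. code-block:: shell
+
+   # Placeholder value; replace it with your own Hugging Face access token.
+   export HF_TOKEN="hf_your_token_here"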
+
+Run the setup script to install libraries and datasets needed for benchmarking.
+
+.. code-block:: shell
+
+   ./pytorch_benchmark_setup.sh
+
+``pytorch_benchmark_setup.sh`` installs the following libraries:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Library
+     - Benchmark model
+     - Reference
+
+   * - ``accelerate``
+     - Llama 3.1 8B, FLUX
+     - `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
+
+   * - ``datasets``
+     - Llama 3.1 8B, 70B, FLUX
+     - `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
+
+   * - ``torchdata``
+     - Llama 3.1 70B
+     - `TorchData <https://pytorch.org/data/beta/index.html>`_
+
+   * - ``tomli``
+     - Llama 3.1 70B
+     - `Tomli <https://pypi.org/project/tomli/>`_
+
+   * - ``tiktoken``
+     - Llama 3.1 70B
+     - `tiktoken <https://github.com/openai/tiktoken>`_
+
+   * - ``blobfile``
+     - Llama 3.1 70B
+     - `blobfile <https://pypi.org/project/blobfile/>`_
+
+   * - ``tabulate``
+     - Llama 3.1 70B
+     - `tabulate <https://pypi.org/project/tabulate/>`_
+
+   * - ``wandb``
+     - Llama 3.1 70B
+     - `Weights & Biases <https://github.com/wandb/wandb>`_
+
+   * - ``sentencepiece``
+     - Llama 3.1 70B, FLUX
+     - `SentencePiece <https://github.com/google/sentencepiece>`_ 0.2.0
+
+   * - ``tensorboard``
+     - Llama 3.1 70B, FLUX
+     - `TensorBoard <https://www.tensorflow.org/tensorboard>`_ 2.18.0
+
+   * - ``csvkit``
+     - FLUX
+     - `csvkit <https://csvkit.readthedocs.io/en/latest/>`_ 2.0.1
+
+   * - ``deepspeed``
+     - FLUX
+     - `DeepSpeed <https://github.com/deepspeedai/DeepSpeed>`_ 0.16.2
+
+   * - ``diffusers``
+     - FLUX
+     - `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`_ 0.31.0
+
+   * - ``GitPython``
+     - FLUX
+     - `GitPython <https://github.com/gitpython-developers/GitPython>`_ 3.1.44
+
+   * - ``opencv-python-headless``
+     - FLUX
+     - `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`_ 4.10.0.84
+
+   * - ``peft``
+     - FLUX
+     - `PEFT <https://huggingface.co/docs/peft/en/index>`_ 0.14.0
+
+   * - ``protobuf``
+     - FLUX
+     - `Protocol Buffers <https://github.com/protocolbuffers/protobuf>`_ 5.29.2
+
+   * - ``pytest``
+     - FLUX
+     - `PyTest <https://docs.pytest.org/en/stable/>`_ 8.3.4
+
+   * - ``python-dotenv``
+     - FLUX
+     - `python-dotenv <https://pypi.org/project/python-dotenv/>`_ 1.0.1
+
+   * - ``seaborn``
+     - FLUX
+     - `Seaborn <https://seaborn.pydata.org/>`_ 0.13.2
+
+   * - ``transformers``
+     - FLUX
+     - `Transformers <https://huggingface.co/docs/transformers/en/index>`_ 4.47.0
+
+``pytorch_benchmark_setup.sh`` downloads the following models from Hugging Face:
+
+* `meta-llama/Llama-3.1-70B-Instruct <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
+
+* `black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
+
+Along with the following datasets:
+
+* `WikiText <https://huggingface.co/datasets/Salesforce/wikitext>`_
+
+* `bghira/pseudo-camera-10k <https://huggingface.co/datasets/bghira/pseudo-camera-10k>`_
+
+Start training on AMD Instinct accelerators
+===========================================
+
+The prebuilt PyTorch with ROCm training environment allows users to quickly validate
+system performance, conduct training benchmarks, and achieve superior
+performance for models like Llama 3.1 and FLUX. This container should not be
+expected to provide generalized performance across all training workloads. You
+can expect the container to perform in the model configurations described in
+the following section, but other configurations are not validated by AMD.
+
+Use the following instructions to set up the environment, configure the script
+to train models, and reproduce the benchmark results on MI300X series
+accelerators with the AMD PyTorch training Docker image.
+
+Once your environment is set up, use the following commands and examples to start benchmarking.
+
+Pretraining
+-----------
+
+To start the pretraining benchmark, use the following command with the
+appropriate options. See the following list of options and their descriptions.
+
+.. code-block:: shell
+
+   ./pytorch_benchmark_report.sh -t $training_mode -m $model_repo -p $datatype -s $sequence_length
+
+Options and available models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - Options
+     - Description
+
+   * - ``$training_mode``
+     - ``pretrain``
+     - Benchmark pretraining
+
+   * -
+     - ``finetune_fw``
+     - Benchmark full weight fine-tuning (Llama 3.1 70B with BF16)
+
+   * -
+     - ``finetune_lora``
+     - Benchmark LoRA fine-tuning (Llama 3.1 70B with BF16)
+
+   * - ``$datatype``
+     - FP8 or BF16
+     - Only Llama 3.1 8B supports FP8 precision.
+
+   * - ``$model_repo``
+     - Llama-3.1-8B
+     - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct>`_
+
+   * - 
+     - Llama-3.1-70B
+     - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
+
+   * - 
+     - Flux
+     - `FLUX.1 [dev] <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
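+
+As an illustration, the placeholders in the pretraining command can be set as
+shell variables using values from this table. The particular combination below
+(Llama 3.1 8B, BF16, sequence length 8192) is an assumption chosen for the
+example, not a recommended configuration.
+
+.. code-block:: shell
+
+   # Values taken from the options table; adjust them for your run.
+   training_mode=pretrain
+   model_repo=Llama-3.1-8B
+   datatype=BF16
+   sequence_length=8192
+
+   ./pytorch_benchmark_report.sh -t "$training_mode" -m "$model_repo" -p "$datatype" -s "$sequence_length"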
+
+Fine-tuning
+-----------
+
+To start the fine-tuning benchmark, use the following command. It runs the benchmarking example of Llama 3.1 70B
+with the WikiText dataset using the AMD fork of `torchtune <https://github.com/AMD-AIG-AIMA/torchtune>`_.
+
+.. code-block:: shell
+
+   ./pytorch_benchmark_report.sh -t {finetune_fw, finetune_lora} -p BF16 -m Llama-3.1-70B
+
+Benchmarking examples
+---------------------
+
+Here are some examples of how to use the command.
+
+* Example 1: Llama 3.1 70B with BF16 precision with `torchtitan <https://github.com/ROCm/torchtitan>`_.
+
+  .. code-block:: shell
+
+     ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Llama-3.1-70B -s 8192
+
+* Example 2: Llama 3.1 8B with FP8 precision using Transformer Engine (TE) and Hugging Face Accelerate.
+
+  .. code-block:: shell
+
+     ./pytorch_benchmark_report.sh -t pretrain -p FP8 -m Llama-3.1-8B -s 8192
+
+* Example 3: FLUX.1-dev with BF16 precision with FluxBenchmark.
+
+  .. code-block:: shell
+
+     ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Flux
+
+* Example 4: Torchtune full weight fine-tuning with Llama 3.1 70B
+
+  .. code-block:: shell
+
+     ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.1-70B
+
+* Example 5: Torchtune LoRA fine-tuning with Llama 3.1 70B
+
+  .. code-block:: shell
+
+     ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.1-70B
--- a/docs/how-to/rocm-for-ai/training/index.rst
+++ b/docs/how-to/rocm-for-ai/training/index.rst
@@ -19,6 +19,10 @@ training, fine-tuning, and inference. It leverages popular machine learning fram

 In this guide, you'll learn about:

- :doc:`Training a model <train-a-model>`
+- Training a model

- :doc:`Scale model training <scale-model-training>`
+  - :doc:`Train a model with Megatron-LM <benchmark-docker/megatron-lm>`
+
+  - :doc:`Train a model with PyTorch <benchmark-docker/pytorch-training>`
+
+- :doc:`Scaling model training <scale-model-training>`
--- a/docs/how-to/rocm-for-ai/training/prerequisite-system-validation.rst
+++ b/docs/how-to/rocm-for-ai/training/prerequisite-system-validation.rst
@@ -0,0 +1,130 @@
+:orphan:
+
+.. meta::
+   :description: Prerequisite system validation before using ROCm for AI.
+   :keywords: ROCm, AI, LLM, train, megatron, Llama, tutorial, docker, torch, pytorch, jax
+
+.. _train-a-model-system-validation:
+
+**********************************************
+Prerequisite system validation before training
+**********************************************
+
+Complete the following system validation and optimization steps to set up your system before starting training.
+
+Disable NUMA auto-balancing
+---------------------------
+
+Generally, application performance can benefit from disabling NUMA auto-balancing. However,
+it might be detrimental to performance with certain types of workloads.
+
+Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
+Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
+the output is ``1``, run the following command to disable NUMA auto-balancing.
+
+.. code-block:: shell
+
+   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
+
+See :ref:`mi300x-disable-numa` for more information.
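+
+The check and the fix can be combined. The following is a minimal sketch, assuming a POSIX shell.
+``numa_balancing_enabled`` and ``disable_numa_balancing`` are hypothetical helper names, and the
+optional file argument exists only so the check can be tried against a test file.
+
+.. code-block:: shell
+
+   # Hypothetical helper: succeeds (exit 0) unless the setting reads 0,
+   # matching the rule above that no output or 1 means "not disabled".
+   numa_balancing_enabled() {
+       [ "$(cat "${1:-/proc/sys/kernel/numa_balancing}" 2>/dev/null)" != "0" ]
+   }
+
+   # Hypothetical helper: disable NUMA auto-balancing only when it is not already disabled.
+   disable_numa_balancing() {
+       if numa_balancing_enabled; then
+           sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
+       fi
+   }
+
+Run ``disable_numa_balancing`` after each boot; a value written to ``/proc/sys`` does not persist across reboots.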
+
+Hardware verification with ROCm
+-------------------------------
+
+Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
+instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
+GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
+You can restore this setting to its default value with the ``rocm-smi -r`` command.
+
+Run the command:
+
+.. code-block:: shell
+
+   rocm-smi --setperfdeterminism 1900
+
+See :ref:`mi300x-hardware-verification-with-rocm` for more information.
+
+RCCL bandwidth test for multi-node setups
+-----------------------------------------
+
+ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
+routines for GPUs. See the :doc:`RCCL documentation <rccl:index>` for more information. Before starting
+pretraining, running an RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
+for efficient distributed training.
+
+Running the RCCL bandwidth test helps verify that:
+
+- The GPUs can communicate across nodes or within a single node.
+
+- The interconnect (such as InfiniBand, Ethernet, or Infinity Fabric) is functioning as expected and
+  provides adequate bandwidth for communication.
+
+- No hardware setup or cabling issues affect the communication between GPUs.
+
+Tuning and optimizing hyperparameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In distributed training, specific hyperparameters related to distributed communication can be tuned based on
+the results of the RCCL bandwidth test. These variables are already set in the Docker image:
+
+.. code-block:: shell
+
+   # force all RCCL streams to be high priority
+   export TORCH_NCCL_HIGH_PRIORITY=1
+
+   # specify which RDMA interfaces to use for communication
+   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
+
+   # define the Global ID index used in RoCE mode
+   export NCCL_IB_GID_INDEX=3
+
+   # avoid data corruption/mismatch issue that existed in past releases
+   export RCCL_MSCCL_ENABLE=0
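+
+To confirm the values inside a running container, a minimal sketch such as the following prints each
+variable. ``show_rccl_env`` is a hypothetical helper name; the variable list mirrors the block above.
+
+.. code-block:: shell
+
+   # Hypothetical helper: print each communication variable, or <unset> if it is not exported.
+   show_rccl_env() {
+       for var in TORCH_NCCL_HIGH_PRIORITY NCCL_IB_HCA NCCL_IB_GID_INDEX RCCL_MSCCL_ENABLE; do
+           printf '%s=%s\n' "$var" "$(printenv "$var" || echo '<unset>')"
+       done
+   }
+
+Call ``show_rccl_env`` before launching the bandwidth test to see which settings are in effect.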
+
+Running the RCCL bandwidth test
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Run the RCCL bandwidth test before launching training to help confirm that system communication
+performance is sufficient. The RCCL tests are not included in the AMD Megatron-LM Docker
+image; follow the instructions in `<https://github.com/ROCm/rccl-tests>`__ to get started.
+See :ref:`mi300x-rccl` for more information.
+
+Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:
+
+.. code-block:: shell
+
+   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8
+
+.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
+   :width: 800
+
+For performance-oriented runs on both single-node and multi-node systems, use one MPI process per GPU
+with ``-g 1``. For example, a run on 8 GPUs looks like this:
+
+.. code-block:: shell
+
+   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1
+
+.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
+   :width: 800
+
+Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
+for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
+PyTorch and TensorFlow.
+
+Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed.
+
+.. code-block:: shell
+
+   /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8,tw015:8 \
+   --mca pml ucx \
+   --mca btl ^openib \
+   -x NCCL_SOCKET_IFNAME=ens50f0np0 \
+   -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \
+   -x NCCL_IB_GID_INDEX=3 \
+   -x NCCL_MIN_NCHANNELS=40 \
+   -x NCCL_DEBUG=version \
+   $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1
+
+.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
+   :width: 800
--- a/docs/how-to/rocm-for-ai/training/train-a-model.rst
+++ b/docs/how-to/rocm-for-ai/training/train-a-model.rst
@@ -1,503 +0,0 @@
-.. meta::
-   :description: How to train a model using ROCm Megatron-LM
-   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch
-
-**************************************
-Training a model with ROCm Megatron-LM
-**************************************
-
-.. _amd-megatron-lm:
-
-The ROCm Megatron-LM framework is a specialized fork of the robust Megatron-LM, designed to
-enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X
-accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
-workloads. It is purpose-built to :ref:`support models <amd-megatron-lm-model-support>`
-like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater
-efficiency. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
-
-For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential
-components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the
-following software to accelerate training workloads:
-
-+--------------------------+--------------------------------+
-| Software component       | Version                        |
-+==========================+================================+
-| ROCm                     | 6.1                            |
-+--------------------------+--------------------------------+
-| PyTorch                  | 2.4.0                          |
-+--------------------------+--------------------------------+
-| PyTorch Lightning        | 2.4.0                          |
-+--------------------------+--------------------------------+
-| Megatron Core            | 0.9.0                          |
-+--------------------------+--------------------------------+
-| Transformer Engine       | 1.5.0                          |
-+--------------------------+--------------------------------+
-| Flash Attention          | v2.6                           |
-+--------------------------+--------------------------------+
-| Transformers             | 4.44.0                         |
-+--------------------------+--------------------------------+
-
-Supported features and models
-=============================
-
-Megatron-LM provides the following key features to train large language models efficiently:
-
- Transformer Engine (TE)
-
- APEX
-
- GEMM tuning
-
- Torch.compile
-
- 3D parallelism: TP + SP + CP
-
- Distributed optimizer
-
- Flash Attention (FA) 2
-
- Fused kernels
-
- Pre-training
-
-.. _amd-megatron-lm-model-support:
-
-The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
-
-* Llama 2 7B
-
-* Llama 2 70B
-
-* Llama 3 8B
-
-* Llama 3 70B
-
-* Llama 3.1 8B
-
-* Llama 3.1 70B
-
-Prerequisite system validation steps
-====================================
-
-Complete the following system validation and optimization steps to set up your system before starting training.
-
-Disable NUMA auto-balancing
---------------------------
-
-Generally, application performance can benefit from disabling NUMA auto-balancing. However,
-it might be detrimental to performance with certain types of workloads.
-
-Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
-Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
-the output is ``1``, run the following command to disable NUMA auto-balancing.
-
-.. code-block:: shell
-
-   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
-
-See :ref:`mi300x-disable-numa` for more information.
-
-Hardware verification with ROCm
-------------------------------
-
-Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
-instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
-GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
-You can restore this setting to its default value with the ``rocm-smi -r`` command.
-
-Run the command:
-
-.. code-block:: shell
-
-   rocm-smi --setperfdeterminism 1900
-
-See :ref:`mi300x-hardware-verification-with-rocm` for more information.
-
-RCCL Bandwidth Test
-------------------
-
-ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
-routines for GPUs. See the :doc:`RCCL documentation <rccl:index>` for more information. Before starting
-pre-training, running a RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
-for efficient distributed training.
-
-Running the RCCL bandwidth test helps verify that:
-
- The GPUs can communicate across nodes or within a single node.
-
- The interconnect (such as InfiniBand, Ethernet, or Infinite fabric) is functioning as expected and
-  provides adequate bandwidth for communication.
-
- No hardware setup or cabling issues could affect the communication between GPUs
-
-Tuning and optimizing hyperparameters
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-In distributed training, specific hyperparameters related to distributed communication can be tuned based on
-the results of the RCCL bandwidth test. These variables are already set in the Docker image:
-
-.. code-block:: shell
-
-   # force all RCCL streams to be high priority
-   export TORCH_NCCL_HIGH_PRIORITY=1
-
-   # specify which RDMA interfaces to use for communication
-   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
-
-   # define the Global ID index used in RoCE mode
-   export NCCL_IB_GID_INDEX=3
-
-   # avoid data corruption/mismatch issue that existed in past releases
-   export RCCL_MSCCL_ENABLE=0
-
-Running the RCCL Bandwidth Test
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-It's recommended you run the RCCL bandwidth test before launching training. It ensures system
-performance is sufficient to launch training. RCCL is not included in the AMD Megatron-LM Docker
-image; follow the instructions in `<https://github.com/ROCm/rccl-tests>`__ to get started.
-See :ref:`mi300x-rccl` for more information.
-
-Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:
-
-.. code-block:: shell
-
-   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8
-
-.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
-   :width: 800
-
-Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node is
-recommended. So, a run on 8 GPUs looks something like:
-
-.. code-block:: shell
-
-   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1
-
-.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
-   :width: 800
-
-Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
-for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
-PyTorch and TensorFlow.
-
-Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed.
-
-.. code-block::
-
-   /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8, tw015:8 \
-   --mca pml ucx \
-   --mca btl ^openib \
-   -x NCCL_SOCKET_IFNAME=ens50f0np0 \
-   -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \
-   -x NCCL_IB_GID_INDEX=3 \
-   -x NCCL_MIN_NCHANNELS=40 \
-   -x NCCL_DEBUG=version \
-   $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1
-
-.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
-   :width: 800
-
-.. _mi300x-amd-megatron-lm-training:
-
-Start training on MI300X accelerators
-=====================================
-
-The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
-training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1.
-
-Use the following instructions to set up the environment, configure the script to train models, and
-reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
-image.
-
-.. _amd-megatron-lm-requirements:
-
-Download the Docker image and required packages
-----------------------------------------------
-
-1. Use the following command to pull the Docker image from Docker Hub.
-
-   .. code-block:: shell
-
-      docker pull rocm/megatron-lm:24.12-dev
-
-2. Launch the Docker container.
-
-   .. code-block:: shell
-
-      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $CACHE_DIR:/root/.cache --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash
-
-3. Clone the ROCm Megatron-LM repository to a local directory and install the required packages on the host machine.
-
-   .. code-block:: shell
-
-      git clone https://github.com/ROCm/Megatron-LM
-      cd Megatron-LM
-
-   .. note::
-
-      This release is validated with ``ROCm/Megatron-LM`` commit `bb93ccb <https://github.com/ROCm/Megatron-LM/tree/bb93ccbfeae6363c67b361a97a27c74ab86e7e92>`_.
-      Checking out this specific commit is recommended for a stable and reproducible environment.
-
-      .. code-block:: shell
-         
-         git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92
-
-Prepare training datasets
-------------------------
-
-If you already have the preprocessed data, you can skip this section.
-
-Use the following command to process datasets. We use GPT data as an example. You may change the merge table, use an
-end-of-document token, remove sentence splitting, and use the tokenizer type.
-
-.. code-block:: shell
-
-   python tools/preprocess_data.py \
-       --input my-corpus.json \
-       --output-prefix my-gpt2 \
-       --vocab-file gpt2-vocab.json \
-       --tokenizer-type GPT2BPETokenizer \
-       --merge-file gpt2-merges.txt \
-       --append-eod
-
-In this case, the automatically generated output files are named ``my-gpt2_text_document.bin`` and
-``my-gpt2_text_document.idx``.
-
-.. image:: ../../../data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png
-   :width: 800
-
-.. _amd-megatron-lm-environment-setup:
-
-Environment setup
-----------------
-
-In the ``examples/llama`` directory of Megatron-LM, if you're working with Llama 2 7B or Llama 2 70 B, use the
-``train_llama2.sh`` configuration script. Likewise, if you're working with Llama 3 or Llama 3.1, then use
-``train_llama3.sh`` and update the configuration script accordingly.
-
-Network interface
-^^^^^^^^^^^^^^^^^
-
-To avoid connectivity issues, ensure the correct network interface is set in your training scripts.
-
-1. Run the following command to find the active network interface on your system.
-
-   .. code-block:: shell
-
-      ip a
-
-2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system’s network interface. For
-   example:
-
-   .. code-block:: shell
-
-      export NCCL_SOCKET_IFNAME=ens50f0np0
-
-      export GLOO_SOCKET_IFNAME=ens50f0np0
-
-Dataset options
-^^^^^^^^^^^^^^^
-
-You can use either mock data or real data for training.
-
-* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset.
-
-  .. code-block:: shell
-
-     DATA_DIR="/root/.cache/data" # Change to where your dataset is stored
-
-     DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence
-
-  .. code-block:: shell
-
-     --data-path $DATA_PATH
-
-  Ensure that the files are accessible inside the Docker container.
-
-* Mock data can be useful for testing and validation. If you're using mock data, replace ``--data-path $DATA_PATH`` with the ``--mock-data`` option.
-
-  .. code-block:: shell
-
-     --mock-data
-
-Tokenizer
-^^^^^^^^^
-
-Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama
-models, this typically involves sub-word tokenization, where words are broken down into smaller units based on
-a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a
-fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to
-handle a variety of input sequences, including unseen words or domain-specific terms.
-
-To train any of the Llama 2 models that this Docker image supports, use the ``Llama2Tokenizer``.
-
-To train any of Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``.
-Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable.
-
-For example, if you're using the Llama 3.1 8B model:
-
-.. code-block:: shell
-
-   TOKENIZER_MODEL=meta-llama/Llama-3.1-8B
-
-Run benchmark tests
-------------------
-
-.. note::
-
-   If you're running **multi node training**, update the following environment variables. They can
-   also be passed as command line arguments.
-
-   * Change ``localhost`` to the master node's hostname:
-
-     .. code-block:: shell
-
-        MASTER_ADDR="${MASTER_ADDR:-localhost}"
-
-   * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):
-
-     .. code-block:: shell
-
-        NNODES="${NNODES:-1}"
-
-   * Set the rank of each node (0 for master, 1 for the first worker node, and so on):
-
-     .. code-block:: shell
-
-        NODE_RANK="${NODE_RANK:-0}"
-
-* Use this command to run a performance benchmark test of any of the Llama 2 models that this Docker image supports (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).
-
-  .. code-block:: shell
-
-     {variables} bash examples/llama/train_llama2.sh
-
-* Use this command to run a performance benchmark test of any of the Llama 3 and Llama 3.1 models that this Docker image supports (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).
-
-  .. code-block:: shell
-
-     {variables} bash examples/llama/train_llama3.sh
-
-.. _amd-megatron-lm-benchmark-test-vars:
-
-The benchmark tests support the same set of variables:
-
-+--------------------------+-----------------------+-----------------------+
-| Name                     | Options               | Description           |
-+==========================+=======================+=======================+
-| ``TEE_OUTPUT``           | 0 or 1                | 0: disable training   |
-|                          |                       | log                   |
-|                          |                       |                       |
-|                          |                       | 1: enable training    |
-|                          |                       | log                   |
-+--------------------------+-----------------------+-----------------------+
-| ``MBS``                  |                       | Micro batch size      |
-+--------------------------+-----------------------+-----------------------+
-| ``BS``                   |                       | Batch size            |
-+--------------------------+-----------------------+-----------------------+
-| ``TP``                   | 1, 2, 4, 8            | Tensor parallel       |
-+--------------------------+-----------------------+-----------------------+
-| ``TE_FP8``               | 0 or 1                | Datatype.             |
-|                          |                       | If it is set to 1,    |
-|                          |                       | FP8.                  |
-|                          |                       |                       |
-|                          |                       | If it is set to 0.    |
-|                          |                       | BP16                  |
-+--------------------------+-----------------------+-----------------------+
-| ``NO_TORCH_COMPILE``     | 0 or 1                | If it is set to 1,    |
-|                          |                       | enable torch.compile. |
-|                          |                       |                       |
-|                          |                       | If it is set to 0.    |
-|                          |                       | Disable torch.compile |
-|                          |                       | (default)             |
-+--------------------------+-----------------------+-----------------------+
-| ``SEQ_LENGTH``           |                       | Input sequence length |
-+--------------------------+-----------------------+-----------------------+
-| ``GEMM_TUNING``          | 0 or 1                | If it is set to 1,    |
-|                          |                       | enable gemm tuning.   |
-|                          |                       |                       |
-|                          |                       | If it is set to 0,    |
-|                          |                       | disable gemm tuning   |
-+--------------------------+-----------------------+-----------------------+
-| ``USE_FLASH_ATTN``       | 0 or 1                | 0: disable flash      |
-|                          |                       | attention             |
-|                          |                       |                       |
-|                          |                       | 1: enable flash       |
-|                          |                       | attention             |
-+--------------------------+-----------------------+-----------------------+
-| ``ENABLE_PROFILING``     | 0 or 1                | 0: disable torch      |
-|                          |                       | profiling             |
-|                          |                       |                       |
-|                          |                       | 1: enable torch       |
-|                          |                       | profiling             |
-+--------------------------+-----------------------+-----------------------+
-| ``MODEL_SIZE``           |                       | The size of the mode: |
-|                          |                       | 7B/70B, etc.          |
-+--------------------------+-----------------------+-----------------------+
-| ``TOTAL_ITERS``          |                       | Total number of       |
-|                          |                       | iterations            |
-+--------------------------+-----------------------+-----------------------+
-| ``transformer-impl``     | transformer_engine or | Enable transformer    |
-|                          | local                 | engine by default     |
-+--------------------------+-----------------------+-----------------------+
-
-Benchmarking examples
-^^^^^^^^^^^^^^^^^^^^^
-
-.. tab-set::
-
-   .. tab-item:: Single node training
-      :sync: single
-
-      Use this command to run training with Llama 2 7B model on a single node. You can specify MBS, BS, FP,
-      datatype, and so on.
-
-      .. code-block:: bash
-
-         TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1
-         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
-
-      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup>`.
-
-      See the sample output:
-
-      .. image:: ../../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
-         :width: 800
-
-   .. tab-item:: Multi node training
-      :sync: multi
-
-      Launch the Docker container on each node.
-
-      In this example, run training with Llama 2 7B model on 2 nodes with specific MBS, BS, FP, datatype, and
-      so on.
-
-      On the master node:
-
-      .. code-block:: bash
-
-         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1
-         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
-
-      On the worker node:
-
-      .. code-block:: bash
-
-         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1
-         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh
-
-      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup>`.
-
-      Sample output for 2-node training:
-
-      Master node:
-
-      .. image:: ../../../data/how-to/rocm-for-ai/2-node-training-master.png
-         :width: 800
-
-      Worker node:
-
-      .. image:: ../../../data/how-to/rocm-for-ai/2-node-training-worker.png
-         :width: 800
-
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -40,11 +40,13 @@ subtrees:
        title: Training
        subtrees:
        - entries:
-          - file: how-to/rocm-for-ai/training/train-a-model.rst
-            title: Train a model
+          - file: how-to/rocm-for-ai/training/benchmark-docker/megatron-lm
+            title: Train a model with Megatron-LM
+          - file: how-to/rocm-for-ai/training/benchmark-docker/pytorch-training
+            title: Train a model with PyTorch
          - file: how-to/rocm-for-ai/training/scale-model-training.rst
            title: Scale model training
-      
+
      - file: how-to/rocm-for-ai/fine-tuning/index.rst
        title: Fine-tuning LLMs
        subtrees: