:orphan:

.. meta::
   :description: How to train a model using Megatron-LM for ROCm.
   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch

******************************************
Training a model with Megatron-LM for ROCm
******************************************

.. caution::

   This documentation does not reflect the latest version of the ROCm Megatron-LM
   training performance documentation. See :doc:`../megatron-lm` for the latest version.

The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM,
designed to enable efficient training of large-scale language models on AMD
GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers
enhanced scalability, performance, and resource utilization for AI workloads.
It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and
DeepSeek, enabling developers to train next-generation AI models more
efficiently. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.

AMD provides a ready-to-use Docker image for MI300X Series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:

+--------------------------+--------------------------------+
| Software component       | Version                        |
+==========================+================================+
| ROCm                     | 6.3.0                          |
+--------------------------+--------------------------------+
| PyTorch                  | 2.7.0a0+git637433              |
+--------------------------+--------------------------------+
| Python                   | 3.10                           |
+--------------------------+--------------------------------+
| Transformer Engine       | 1.11                           |
+--------------------------+--------------------------------+
| Flash Attention          | 3.0.0                          |
+--------------------------+--------------------------------+
| hipBLASLt                | git258a2162                    |
+--------------------------+--------------------------------+
| Triton                   | 3.1                            |
+--------------------------+--------------------------------+

Supported features and models
=============================

Megatron-LM provides the following key features to train large language models efficiently:

- Transformer Engine (TE)

- APEX

- GEMM tuning

- Torch.compile

- 3D parallelism: TP + SP + CP

- Distributed optimizer

- Flash Attention (FA) 3

- Fused kernels

- Pre-training

.. _amd-megatron-lm-model-support-25-4:

The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.

* Llama 3.1 8B

* Llama 3.1 70B

* Llama 3 8B

* Llama 3 70B

* Llama 2 7B

* Llama 2 70B

* DeepSeek-V2-Lite

.. note::

   Some models, such as Llama, require an external license agreement through
   a third party (for example, Meta).

.. _amd-megatron-lm-performance-measurements-v254:

Performance measurements
========================

To evaluate performance, the
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`__
page provides reference throughput and latency measurements for training
popular AI models.

.. important::

   The performance data presented in
   `Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
   only reflects the :doc:`latest version of this training benchmarking environment <../megatron-lm>`.
   The listed measurements should not be interpreted as the peak performance
   achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.

System validation
=================

If you have already validated your system settings, including NUMA
auto-balancing, skip this step. Otherwise, complete the :ref:`system validation
and optimization steps <train-a-model-system-validation>` to set up your system
before starting training.

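For a quick spot check, you can confirm the NUMA auto-balancing setting from a shell.
This is a minimal sketch, assuming the kernel exposes the standard sysctl path; ``0``
means auto-balancing is disabled, which is what the validation steps linked above expect.

.. code-block:: shell

   # Prints 0 if NUMA auto-balancing is disabled; if it prints 1, revisit
   # the system validation and optimization steps.
   cat /proc/sys/kernel/numa_balancing
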
.. _mi300x-amd-megatron-lm-training-v254:

Environment setup
=================

The prebuilt ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2.

Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.

.. _amd-megatron-lm-requirements-v254:

Download the Docker image
-------------------------

1. Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull rocm/megatron-lm:v25.4

2. Launch the Docker container.

   .. code-block:: shell

      docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name megatron_training_env rocm/megatron-lm:v25.4

3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it.

   .. code-block:: shell

      docker start megatron_training_env
      docker exec -it megatron_training_env bash

The Docker container includes a pre-installed, verified version of the ROCm Megatron-LM development branch `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__
(commit `fd6f0d <https://github.com/ROCm/Megatron-LM/tree/fd6f0d11d7f9480ace32f22eb7e4dab5314fa350>`_).

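Once inside the container, a quick sanity check confirms the GPUs are visible before you
start a run. This is a sketch, assuming ``rocm-smi`` and ``python3`` are on the container's
path, which is typical for ROCm images.

.. code-block:: shell

   # List the GPUs that ROCm detects.
   rocm-smi

   # Confirm the bundled PyTorch build can see them too.
   python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
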
.. _amd-megatron-lm-environment-setup-v254:

Configuration scripts
---------------------

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      If you're working with Llama 2 7B or Llama 2 70B, use the ``train_llama2.sh`` configuration
      script in the ``examples/llama`` directory of
      `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__.
      Likewise, if you're working with Llama 3 or Llama 3.1, use ``train_llama3.sh`` and update
      the configuration script accordingly.

   .. tab-item:: DeepSeek V2
      :sync: deepseek

      Use the ``train_deepseek_v2.sh`` configuration script in the ``examples/deepseek_v2``
      directory of
      `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__
      and update the configuration script accordingly.

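The training scripts are driven by environment variables. To see which settings a script
exposes before editing it, one option is to list the variable assignments near the top of
the script. This is a quick sketch, run from the Megatron-LM checkout inside the container.

.. code-block:: shell

   # Show the first overridable VAR=... assignments defined in the script.
   grep -nE '^[A-Z_0-9]+=' examples/llama/train_llama2.sh | head -n 20
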
Network interface
^^^^^^^^^^^^^^^^^

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      Update the network interface in the script to match your system's network interface. To
      find your network interface, run the following (outside of any Docker container):

      .. code-block:: bash

         ip a

      Look for an active interface that has an IP address in the same subnet as
      your other nodes. Then, update the following variables in the script, for
      example:

      .. code-block:: bash

         export NCCL_SOCKET_IFNAME=ens50f0np0
         export GLOO_SOCKET_IFNAME=ens50f0np0

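      If you're unsure which interface to use, you can ask the routing table which
      device reaches a peer node. This is a sketch; replace the hypothetical
      ``10.0.0.2`` with another node's IP address.

      .. code-block:: bash

         # Print the device the kernel would use to reach the peer node.
         ip route get 10.0.0.2 | awk '{for (i = 1; i < NF; i++) if ($i == "dev") print $(i + 1)}'
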
Dataset options
^^^^^^^^^^^^^^^

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      You can use either mock data or real data for training.

      * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default
        value is ``1`` for enabled.

        .. code-block:: bash

           MOCK_DATA=1

      * If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset.

        .. code-block:: bash

           MOCK_DATA=0
           DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored

        Ensure that the files are accessible inside the Docker container, as
        shown in the sketch after the download commands below.

      To download the dataset, set the ``DATASET`` variable to the dataset you'd like to use. Two datasets are supported: ``DATASET=wiki`` and ``DATASET=bookcorpus``.
      Use the following command to download the dataset.

      .. code-block:: shell

         DATASET=wiki bash examples/llama/prepare_dataset.sh       # For the wiki-en dataset
         DATASET=bookcorpus bash examples/llama/prepare_dataset.sh # For the bookcorpus dataset

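      To confirm the files are accessible from inside the container, you can list them
      through ``docker exec``. This is a quick sketch, assuming the container name and
      example ``DATA_PATH`` used above.

      .. code-block:: bash

         # The .bin/.idx files for the dataset prefix should be listed.
         docker exec megatron_training_env ls -lh /data/bookcorpus_text_sentence*
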
   .. tab-item:: DeepSeek V2
      :sync: deepseek

      If you don't already have the dataset, download the DeepSeek dataset using the following
      commands:

      .. code-block:: shell

         mkdir deepseek-datasets
         cd deepseek-datasets
         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json
         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json
         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json
         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin
         wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx

      You can use either mock data or real data for training.

      * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default
        value is ``1`` for enabled.

        .. code-block:: bash

           MOCK_DATA=1

      * If you're using a real dataset, update the ``DATA_DIR`` variable to point to the location of your dataset.

        .. code-block:: bash

           MOCK_DATA=0
           DATA_DIR="/root/data/deepseek-datasets" # Change to where your dataset is stored

        Ensure that the files are accessible inside the Docker container.

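      After the download completes, a quick check confirms the pre-tokenized mmap files
      are in place. This is a sketch, assuming the ``deepseek-datasets`` directory created
      above is your working directory's child.

      .. code-block:: shell

         # Both the .bin data file and its .idx index must be present.
         ls -lh deepseek-datasets/mmap_deepseekv2_datasets_text_document.{bin,idx}
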
Tokenizer
^^^^^^^^^

Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama
models, this typically involves sub-word tokenization, where words are broken down into smaller units based on
a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a
fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to
handle a variety of input sequences, including unseen words or domain-specific terms.

You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` variable as shown in the following examples.
If the tokenizer is not found, it'll be downloaded to the default tokenizer model path: ``${DATA_DIR}/tokenizer_llama3``
or ``${DATA_DIR}/tokenizer_llama2``.

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      To train any of the Llama 2 models that :ref:`this Docker image supports <amd-megatron-lm-model-support-25-4>`, use the ``Llama2Tokenizer``
      or the default ``HuggingFaceTokenizer``.

      To train any of the Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``.
      Set the Hugging Face model path in the ``TOKENIZER_MODEL`` variable.

      For example, if you're using the Llama 3.1 8B model:

      .. code-block:: shell

         TOKENIZER_MODEL=meta-llama/Llama-3.1-8B

      .. note::

         If you don't already have the Llama 3.1 tokenizer locally, set your
         personal Hugging Face access token ``HF_TOKEN`` to download the
         tokenizer. If you encounter the following error, set ``HF_TOKEN`` to
         your access-authorized Hugging Face token.

         .. code-block:: shell

            OSError: You are trying to access a gated repo.

            # pass your HF_TOKEN
            export HF_TOKEN=$your_personal_hf_token

   .. tab-item:: DeepSeek V2
      :sync: deepseek

      To train any of the DeepSeek V2 models that :ref:`this Docker image supports <amd-megatron-lm-model-support-25-4>`, use the ``DeepSeekV2Tokenizer``.

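If you prefer to fetch gated tokenizer files up front rather than during training, one
option is the Hugging Face CLI. This is a sketch, assuming the ``huggingface_hub`` CLI is
available in your environment and your ``HF_TOKEN`` grants access to the gated repository.

.. code-block:: shell

   # Set your access-authorized token, then download just the tokenizer files.
   export HF_TOKEN=$your_personal_hf_token
   huggingface-cli download meta-llama/Llama-3.1-8B tokenizer.json tokenizer_config.json
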
Multi-node training
^^^^^^^^^^^^^^^^^^^

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      If you're running multi-node training, update the following environment variables. They can
      also be passed as command line arguments.

      * Change ``localhost`` to the master node's hostname:

        .. code-block:: shell

           MASTER_ADDR="${MASTER_ADDR:-localhost}"

      * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):

        .. code-block:: shell

           NNODES="${NNODES:-1}"

      * Set the rank of each node (``0`` for the master node, ``1`` for the first worker node, and so on):

        .. code-block:: shell

           NODE_RANK="${NODE_RANK:-0}"

      * Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an
        NFS directory) for multi-node runs:

        .. code-block:: shell

           DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs

      * For multi-node runs, make sure the correct network drivers are installed on the nodes. If
        inside a Docker container, either install the drivers inside the Docker container or pass the network
        drivers from the host while creating the Docker container.

        .. code-block:: shell

           # Specify which RDMA interfaces to use for communication
           export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7

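      To verify that the RDMA interfaces named in ``NCCL_IB_HCA`` are actually visible on a
      node, you can list them. This is a sketch; ``ibv_devices`` comes from the ``rdma-core``
      utilities, which may need to be installed first.

      .. code-block:: shell

         # Each device listed here should match an entry in NCCL_IB_HCA.
         ibv_devices
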
Start training on AMD Instinct GPUs
===================================

The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
system performance, conduct training benchmarks, and achieve superior
performance for models like Llama 3.1 and Llama 2. This container should not be
expected to provide generalized performance across all training workloads. You
can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.

Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD Megatron-LM Docker image.

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      .. tab-set::

         .. tab-item:: Single node training
            :sync: single-node

            To run training on a single node, navigate to the Megatron-LM folder and use one of the
            following commands.

            - For Llama 3.1 8B FP8:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

            - For Llama 3.1 8B BF16:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=0 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

            - For Llama 2 7B FP8:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

            - For Llama 2 7B BF16:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=0 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

            To run training with FSDP2 enabled, add the ``FSDP=1`` argument. For example:

            - For Llama 3 70B BF16:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=3 BS=24 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=8192 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

            - For Llama 2 70B BF16:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=3 BS=56 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=4096 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

            .. note::

               It's suggested to use ``TP=1`` when FSDP is enabled for higher throughput. FSDP2 is not supported with pipeline parallelism,
               expert parallelism, MCore's distributed optimizer, gradient accumulation fusion, or ``FP16`` precision.

         .. tab-item:: Multi-node training
            :sync: multi-node

            To run training on multiple nodes, launch the Docker container on each node. For example, for a two-node setup (``NODE0`` as the master node), use these commands.

            * On the master node ``NODE0``:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh

            * On the worker node ``NODE1``:

              .. code-block:: shell

                 TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh

   .. tab-item:: DeepSeek V2
      :sync: deepseek

      To run training on a single node, go to the ``/workspace/Megatron-LM`` folder and use the following command:

      .. code-block:: shell

         cd /workspace/Megatron-LM
         GEMM_TUNING=1 PR=bf16 MBS=4 AC=none SEQ_LEN=4096 PAD_LEN=4096 TRAIN_ITERS=50 bash examples/deepseek_v2/train_deepseekv2.sh

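While a run is in progress, you can watch GPU utilization and memory from a second shell
in the container. This is a simple sketch using the standard ``watch`` utility alongside
``rocm-smi``.

.. code-block:: shell

   # Refresh the ROCm GPU status every two seconds during training.
   watch -n 2 rocm-smi
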
Key options
-----------

.. _amd-megatron-lm-benchmark-test-vars-v254:

The benchmark tests support the following sets of variables:

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      ``TEE_OUTPUT``
         ``1`` to enable training logs or ``0`` to disable.

      ``TE_FP8``
         ``0`` for BF16 or ``1`` for FP8 -- ``0`` by default.

      ``GEMM_TUNING``
         ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels.

      ``USE_FLASH_ATTN``
         ``1`` to enable Flash Attention.

      ``FSDP``
         ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``,
         ``--overlap-param-gather``, and ``--sequence-parallel`` are automatically disabled.

      ``ENABLE_PROFILING``
         ``1`` to enable PyTorch profiling for performance analysis.

      ``transformer-impl``
         ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE.

      ``MODEL_SIZE``
         ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2.

      ``TOTAL_ITERS``
         The total number of iterations -- ``10`` by default.

      ``MOCK_DATA``
         ``1`` to use mock data or ``0`` to use real data you provide.

      ``MBS``
         Micro batch size.

      ``BS``
         Global batch size.

      ``TP``
         Tensor parallel (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on.

      ``SEQ_LENGTH``
         Input sequence length.

   .. tab-item:: DeepSeek V2
      :sync: deepseek

      ``PR``
         Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs.

      ``GEMM_TUNING``
         ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels.

      ``TRAIN_ITERS``
         The total number of iterations.

      ``MOCK_DATA``
         ``1`` to use mock data or ``0`` to use real data you provide.

      ``MBS``
         Micro batch size.

      ``GBS``
         Global batch size.

      ``SEQ_LEN``
         Input sequence length.

      ``AC``
         Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default.

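To see how the batch-size options interact, here is a minimal sketch (assuming pure data
parallelism, with ``TP=1`` and no pipeline parallelism) of how many gradient-accumulation
steps each training iteration performs:

.. code-block:: bash

   # Global batch = micro batch x data-parallel ranks x accumulation steps.
   MBS=2; BS=128; GPUS_PER_NODE=8; NNODES=1
   DP=$((GPUS_PER_NODE * NNODES))   # data-parallel size when TP=1
   echo $((BS / (MBS * DP)))        # 128 / (2 * 8) = 8 accumulation steps
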
Benchmarking examples
---------------------

.. tab-set::

   .. tab-item:: Llama
      :sync: llama

      .. tab-set::

         .. tab-item:: Single node training
            :sync: single-node

            Use this command to run training with the Llama 2 7B model on a single node. You can adjust
            settings such as the micro batch size (``MBS``), global batch size (``BS``), and precision.

            .. code-block:: bash

               TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

            You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup-v254>`.

            See the sample output:

            .. image:: /data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
               :width: 800

         .. tab-item:: Multi-node training
            :sync: multi-node

            Launch the Docker container on each node.

            In this example, run training with the Llama 2 7B model on two nodes with a specific micro
            batch size, global batch size, precision, and so on.

            On the master node:

            .. code-block:: bash

               TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

            On the worker node:

            .. code-block:: bash

               TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

            You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup-v254>`.

            Sample output for 2-node training:

            Master node:

            .. image:: /data/how-to/rocm-for-ai/2-node-training-master.png
               :width: 800

            Worker node:

            .. image:: /data/how-to/rocm-for-ai/2-node-training-worker.png
               :width: 800

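To pull throughput figures out of a finished run, one approach (a sketch; assumes the
standard Megatron-LM log format and the ``$TRAIN_LOG`` path mentioned above) is to filter
the per-iteration timing lines:

.. code-block:: bash

   # Print the per-iteration timing lines from the training log.
   grep "elapsed time per iteration" "$TRAIN_LOG"
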
Previous versions
=================

See :doc:`megatron-lm-history` to find documentation for previous releases
of the ``ROCm/megatron-lm`` Docker image.