
:orphan:

.. meta::
   :description: How to train a model using Megatron-LM for ROCm.
   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch

******************************************
Training a model with Megatron-LM for ROCm
******************************************

.. caution::

   This documentation does not reflect the latest ROCm Megatron-LM training
   performance documentation. See :doc:`../megatron-lm` for the latest version.

The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`_ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability,
performance, and resource utilization for AI workloads. It is purpose-built to
support models like Llama, DeepSeek, and Mixtral, enabling developers to train
next-generation AI models more efficiently.

AMD provides a ready-to-use Docker image for MI300X Series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:

+--------------------------+--------------------------------+
| Software component       | Version                        |
+==========================+================================+
| ROCm                     | 6.3.4                          |
+--------------------------+--------------------------------+
| PyTorch                  | 2.8.0a0+gite2f9759             |
+--------------------------+--------------------------------+
| Python                   | 3.12 or 3.10                   |
+--------------------------+--------------------------------+
| Transformer Engine       | 1.13.0+bb061ade                |
+--------------------------+--------------------------------+
| Flash Attention          | 3.0.0                          |
+--------------------------+--------------------------------+
| hipBLASLt                | 0.13.0-4f18bf6                 |
+--------------------------+--------------------------------+
| Triton                   | 3.3.0                          |
+--------------------------+--------------------------------+
| RCCL                     | 2.22.3                         |
+--------------------------+--------------------------------+

Megatron-LM provides the following key features to train large language models efficiently:

- Transformer Engine (TE)
- APEX
- GEMM tuning
- Torch.compile
- 3D parallelism: TP + SP + CP
- Distributed optimizer
- Flash Attention (FA) 3
- Fused kernels
- Pre-training

.. _amd-megatron-lm-model-support-v255:

The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.5-benchmark-models.yaml

   Supported models
   ================

   The following models are supported for training performance benchmarking with Megatron-LM and ROCm.
   Some instructions, commands, and training recommendations in this documentation might
   vary by model -- select one to get started.

   {% set model_groups = data["megatron-lm_benchmark"].model_groups %}

   .. raw:: html

      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
        <div class="row">
          <div class="col-2 me-2 model-param-head">Model</div>
          <div class="row col-10">
            {% for model_group in model_groups %}
            <div class="col-4 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
            {% endfor %}
          </div>
        </div>

        <div class="row mt-1">
          <div class="col-2 me-2 model-param-head">Model variant</div>
          <div class="row col-10">
            {% for model_group in model_groups %}
            {% set models = model_group.models %}
            {% for model in models %}
            {% if models|length % 3 == 0 %}
            <div class="col-4 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
            {% else %}
            <div class="col-6 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
            {% endif %}
            {% endfor %}
            {% endfor %}
          </div>
        </div>
      </div>

.. note::

   Some models, such as Llama, require an external license agreement through
   a third party (for example, Meta).

.. _amd-megatron-lm-performance-measurements-v255:

Performance measurements
========================

To evaluate performance, the
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`__
page provides reference throughput and latency measurements for training
popular AI models.

.. important::

   The performance data presented in
   `Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
   only reflects the latest version of this training benchmarking environment.
   The listed measurements should not be interpreted as the peak performance
   achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.

System validation
=================

Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.

To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
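
For example, a quick spot check that NUMA auto-balancing is disabled -- one of the settings
covered by the system optimization guide -- can be run on the host before you start. This is
only a sanity check, not a substitute for the full validation steps.

.. code-block:: shell

   # 0 means NUMA auto-balancing is disabled (the recommended setting for training); 1 means enabled
   cat /proc/sys/kernel/numa_balancing

   # Disable it if necessary (requires root privileges)
   echo 0 | sudo tee /proc/sys/kernel/numa_balancing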

.. _mi300x-amd-megatron-lm-training-v255:

Environment setup
=================

Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.

.. _amd-megatron-lm-requirements-v255:

Download the Docker image
-------------------------

1. Use the following command to pull the Docker image from Docker Hub.

   .. tab-set::

      .. tab-item:: Ubuntu 24.04 + Python 3.12
         :sync: py312

         .. code-block:: shell

            docker pull rocm/megatron-lm:v25.5_py312

      .. tab-item:: Ubuntu 22.04 + Python 3.10
         :sync: py310

         .. code-block:: shell

            docker pull rocm/megatron-lm:v25.5_py310

2. Launch the Docker container.

   .. tab-set::

      .. tab-item:: Ubuntu 24.04 + Python 3.12
         :sync: py312

         .. code-block:: shell

            docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 128G --name megatron_training_env rocm/megatron-lm:v25.5_py312

      .. tab-item:: Ubuntu 22.04 + Python 3.10
         :sync: py310

         .. code-block:: shell

            docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 128G --name megatron_training_env rocm/megatron-lm:v25.5_py310

3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it.

   .. code-block:: shell

      docker start megatron_training_env
      docker exec -it megatron_training_env bash

The Docker container includes a pre-installed, verified version of the ROCm
Megatron-LM development branch
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__, including necessary
training scripts.
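
Once inside the container, you can optionally confirm that the expected PyTorch build is
present. This is a minimal sanity check, assuming ``python3`` is on the container's ``PATH``,
and it should report versions consistent with the software component table above.

.. code-block:: shell

   # Print the PyTorch version and the HIP version it was built against
   python3 -c "import torch; print('torch', torch.__version__); print('HIP', torch.version.hip)"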

.. _amd-megatron-lm-environment-setup-v255:

Configuration
=============

.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b

   Update the ``train_llama3.sh`` configuration script in the `examples/llama
   <https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ directory to configure your training run.
   Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v255>`.

.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b

   Update the ``train_llama2.sh`` configuration script in the `examples/llama
   <https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ directory to configure your training run.
   Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v255>`.

.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy

   Update the ``train_deepseekv3.sh`` configuration script in the `examples/deepseek_v3
   <https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ directory to configure your training run.
   Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v255>`.

.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b

   Update the ``train_deepseekv2.sh`` configuration script in the `examples/deepseek_v2
   <https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ directory to configure your training run.
   Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v255>`.

.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy

   Update the ``train_mixtral_moe.sh`` configuration script in the `examples/mixtral
   <https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ directory to configure your training run.
   Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v255>`.

.. note::

   See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v255>` for more information on configuration options.

Network interface
-----------------

Update the network interface in the script to match your system's network interface. To
find your network interface, run the following (outside of any Docker container):

.. code-block:: bash

   ip a

Look for an active interface that has an IP address in the same subnet as
your other nodes. Then, update the following variables in the script, for
example:

.. code-block:: bash

   export NCCL_SOCKET_IFNAME=ens50f0np0
   export GLOO_SOCKET_IFNAME=ens50f0np0
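
If you are unsure which interface to pick, the interface that carries the default route is
often -- though not always -- the right choice. The following one-liner is only a convenience;
verify the result against the ``ip a`` output and your cluster's network layout.

.. code-block:: bash

   # Print the interface used for the default route (assumes 8.8.8.8 is routable from this host)
   ip route get 8.8.8.8 | awk '{for (i = 1; i <= NF; i++) if ($i == "dev") print $(i + 1)}'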

.. _amd-megatron-lm-tokenizer-v255:

Tokenizer
---------

You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` variable, as shown in the following examples.
If the tokenizer is not found, it will be downloaded if it is publicly available.
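
For gated models, it can save time to pre-fetch the tokenizer once before launching training.
A minimal way to do this, assuming the ``transformers`` library is available in the container
and ``HF_TOKEN`` is already exported for gated repositories, is shown below (the model name is
only an example).

.. code-block:: shell

   # Download the tokenizer into the local Hugging Face cache
   python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B')"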

.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b

   If you do not have the Llama 3.3 tokenizer locally, you need to use your
   personal Hugging Face access token ``HF_TOKEN`` to download the tokenizer.
   See `Llama-3.3-70B-Instruct
   <https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct>`_. After you are
   authorized, use your ``HF_TOKEN`` to download the tokenizer and set the
   ``TOKENIZER_MODEL`` variable to the tokenizer path.

   .. code-block:: shell

      export HF_TOKEN=<Your personal Hugging Face access token>

   The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path.

   .. code-block:: shell

      TOKENIZER_MODEL="meta-llama/Llama-3.3-70B-Instruct"

.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b

   The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path.

   .. code-block:: shell

      TOKENIZER_MODEL="meta-llama/Llama-3.1-8B"

.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b

   The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path.

   .. code-block:: shell

      TOKENIZER_MODEL="meta-llama/Llama-3.1-70B"

.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b

   The training script uses either the ``Llama2Tokenizer`` or the ``HuggingFaceTokenizer`` by default.

.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy

   The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path.

   .. code-block:: shell

      TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"

.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b

   The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path.

   .. code-block:: shell

      TOKENIZER_MODEL="deepseek-ai/DeepSeek-V2-Lite"

.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy

   Download the Mixtral tokenizer.

   .. code-block:: shell

      mkdir tokenizer
      cd tokenizer
      export HF_TOKEN=<Your personal Hugging Face access token>
      wget --header="Authorization: Bearer $HF_TOKEN" -O ./tokenizer.model https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/tokenizer.model

   The training script uses this tokenizer. Set ``TOKENIZER_MODEL`` to the path of the downloaded tokenizer file.

   .. code-block:: shell

      TOKENIZER_MODEL=tokenizer/tokenizer.model

Dataset options
---------------

You can use either mock data or real data for training.

* Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle
  between mock and real data. The default value is ``1`` for enabled.

  .. code-block:: bash

     MOCK_DATA=1

* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of
  your dataset.

  .. code-block:: bash

     MOCK_DATA=0

     DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored

  Ensure that the files are accessible inside the Docker container. A quick way to confirm this
  is shown after this list.
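
Before launching a run on real data, it can help to confirm from inside the container that the
indexed dataset files the prefix points to are visible. This is only a quick sanity check and
assumes your dataset has already been preprocessed into the ``.bin``/``.idx`` pair.

.. code-block:: bash

   # DATA_PATH is a file name prefix; the preprocessed dataset consists of <prefix>.bin and <prefix>.idx
   ls -lh "${DATA_PATH}.bin" "${DATA_PATH}.idx"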

Download the dataset
^^^^^^^^^^^^^^^^^^^^

.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b

   For Llama models, use the `prepare_dataset.sh
   <https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`_ script
   to prepare your dataset.
   To download the dataset, set the ``DATASET`` variable to the dataset you'd
   like to use. Three datasets are supported: ``DATASET=wiki``, ``DATASET=fineweb``, and
   ``DATASET=bookcorpus``.

   .. code-block:: shell

      DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh # for the wiki-en dataset
      DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh # for the bookcorpus dataset

   ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer.
   Remember to either pre-download the tokenizer or set up Hugging Face access
   when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v255>` section.

   .. note::

      When training, set ``DATA_PATH`` to the file name prefix of the ``.bin`` and ``.idx`` files,
      as in the following example:

      .. code-block:: shell

         DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored.

.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy

   If you don't already have the dataset, download the DeepSeek dataset using the following
   commands:

   .. code-block:: shell

      mkdir deepseek-datasets
      cd deepseek-datasets
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx

   To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset.

   .. code-block:: bash

      MOCK_DATA=0 # Train on real data

      DATA_DIR="<path-to>/deepseek-datasets" # Change to where your dataset is stored

   Ensure that the files are accessible inside the Docker container.

.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b

   If you don't already have the dataset, download the DeepSeek dataset using the following
   commands:

   .. code-block:: shell

      mkdir deepseek-datasets
      cd deepseek-datasets
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx

   To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset.

   .. code-block:: bash

      MOCK_DATA=0 # Train on real data

      DATA_DIR="<path-to>/deepseek-datasets" # Change to where your dataset is stored

   Ensure that the files are accessible inside the Docker container.

.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy

   If you don't already have the dataset, download the Mixtral dataset using the following
   commands:

   .. code-block:: shell

      mkdir mixtral-datasets
      cd mixtral-datasets
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.bin
      wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.idx

   To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset.

   .. code-block:: bash

      MOCK_DATA=0 # Train on real data

      DATA_DIR="<path-to>/mixtral-datasets" # Change to where your dataset is stored

   Ensure that the files are accessible inside the Docker container.

Multi-node configuration
------------------------

If you're running multi-node training, update the following environment variables. They can
also be passed as command line arguments. Refer to the following example configurations.

* Change ``localhost`` to the master node's hostname:

  .. code-block:: shell

     MASTER_ADDR="${MASTER_ADDR:-localhost}"

* Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):

  .. code-block:: shell

     NNODES="${NNODES:-1}"

* Set the rank of each node (0 for master, 1 for the first worker node, and so on):

  .. code-block:: shell

     NODE_RANK="${NODE_RANK:-0}"

* Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an
  NFS directory) for multi-node runs:

  .. code-block:: shell

     DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs

* For multi-node runs, make sure the correct network drivers are installed on the nodes. If
  inside a Docker container, either install the drivers inside the Docker container or pass the
  network drivers from the host while creating the Docker container. To find the RDMA device
  names to use here, see the snippet after this list.

  .. code-block:: shell

     # Specify which RDMA interfaces to use for communication
     export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
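
The HCA names vary between systems. If you are unsure what to set ``NCCL_IB_HCA`` to, listing
the RDMA devices on each node is a quick way to find candidates. This assumes the ``rdma-core``
utilities are installed on the host or inside the container.

.. code-block:: shell

   # List the RDMA devices visible on this node; use the reported device names for NCCL_IB_HCA
   ibv_devices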

Getting started
===============

The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
system performance, conduct training benchmarks, and achieve superior
performance for models like Llama, DeepSeek, and Mixtral. This container should not be
expected to provide generalized performance across all training workloads. You
can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.

.. _amd-megatron-lm-run-training-v255:

Run training
------------

Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v255>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment.

Single node training
^^^^^^^^^^^^^^^^^^^^

.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b

   To run the training on a single node for Llama 3.3 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument.
   For example, use the following command:

   .. code-block:: shell

      TEE_OUTPUT=1 RECOMPUTE=1 SEQ_LENGTH=8192 MBS=2 BS=16 TE_FP8=0 TP=1 PP=1 FSDP=1 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

   .. note::

      It is suggested to use ``TP=1`` when FSDP is enabled for higher
      throughput. FSDP-v2 is not supported with pipeline parallelism, expert
      parallelism, MCore's distributed optimizer, gradient accumulation fusion,
      or FP16.

      Currently, FSDP is only compatible with BF16 precision.

.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b

   To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the
   following command.

   .. code-block:: shell

      TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

   For Llama 3.1 8B BF16, use the following command:

   .. code-block:: shell

      TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=0 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b

   To run the training on a single node for Llama 3.1 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument.
   For example, use the following command:

   .. code-block:: shell

      TEE_OUTPUT=1 MBS=3 BS=24 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=8192 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh

   .. note::

      It is suggested to use ``TP=1`` when FSDP is enabled for higher
      throughput. FSDP-v2 is not supported with pipeline parallelism, expert
      parallelism, MCore's distributed optimizer, gradient accumulation fusion,
      or FP16.

      Currently, FSDP is only compatible with BF16 precision.

.. container:: model-doc pyt_megatron_lm_train_llama-2-7b

   To run training on a single node for Llama 2 7B FP8, navigate to the Megatron-LM folder and use the
   following command.

   .. code-block:: shell

      TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

   For Llama 2 7B BF16, use the following command:

   .. code-block:: shell

      TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=0 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

.. container:: model-doc pyt_megatron_lm_train_llama-2-70b

   To run the training on a single node for Llama 2 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument.
   For example, use the following command:

   .. code-block:: shell

      TEE_OUTPUT=1 MBS=7 BS=56 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=4096 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh

   .. note::

      It is suggested to use ``TP=1`` when FSDP is enabled for higher
      throughput. FSDP-v2 is not supported with pipeline parallelism, expert
      parallelism, MCore's distributed optimizer, gradient accumulation fusion,
      or FP16.

      Currently, FSDP is only compatible with BF16 precision.

.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy

   To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with a 3-layer proxy,
   navigate to the Megatron-LM folder and use the following command.

   .. code-block:: shell

      FORCE_BANLANCE=true \
      RUN_ENV=cluster \
      MODEL_SIZE=671B \
      TRAIN_ITERS=50 \
      SEQ_LEN=4096 \
      NUM_LAYERS=3 \
      MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=32 \
      PR=bf16 \
      TP=1 PP=1 ETP=1 EP=8 \
      GEMM_TUNING=1 \
      NVTE_CK_USES_BWD_V3=1 \
      USE_GROUPED_GEMM=true MOE_USE_LEGACY_GROUPED_GEMM=true \
      GPT_LAYER_IN_TE=true \
      bash examples/deepseek_v3/train_deepseekv3.sh

.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b

   To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel),
   navigate to the Megatron-LM folder and use the following command.

   .. code-block:: shell

      GEMM_TUNING=1 PR=bf16 MBS=4 AC=none SEQ_LEN=4096 PAD_LEN=4096 TRAIN_ITERS=50 bash examples/deepseek_v2/train_deepseekv2.sh

.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b

   To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
   navigate to the Megatron-LM folder and use the following command.

   .. code-block:: shell

      RECOMPUTE_NUM_LAYERS=0 TEE_OUTPUT=1 MBS=1 GBS=16 TP_SIZE=1 PP_SIZE=1 AC=none PR=bf16 EP_SIZE=8 ETP_SIZE=1 SEQLEN=4096 FORCE_BALANCE=true MOCK_DATA=1 RUN_ENV=cluster MODEL_SIZE=8x7B TRAIN_ITERS=50 bash examples/mixtral/train_mixtral_moe.sh

.. container:: model-doc pyt_megatron_lm_train_mixtral-8x22b-proxy

   To run training on a single node for Mixtral 8x22B (MoE with expert parallel) with a 4-layer proxy,
   navigate to the Megatron-LM folder and use the following command.

   .. code-block:: shell

      RECOMPUTE_NUM_LAYERS=4 TEE_OUTPUT=1 MBS=1 GBS=16 TP_SIZE=1 PP_SIZE=1 AC=full NUM_LAYERS=4 PR=bf16 EP_SIZE=8 ETP_SIZE=1 SEQLEN=8192 FORCE_BALANCE=true MOCK_DATA=1 RUN_ENV=cluster MODEL_SIZE=8x22B TRAIN_ITERS=50 bash examples/mixtral/train_mixtral_moe.sh

Multi-node training
^^^^^^^^^^^^^^^^^^^

To run training on multiple nodes, launch the Docker container on each node.
For example, for Llama 3 using a two-node setup (``NODE0`` as the master node),
use these commands.

* On the master node ``NODE0``:

  .. code-block:: shell

     TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh

* On the worker node ``NODE1``:

  .. code-block:: shell

     TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh

Or, for DeepSeek-V3, an example script ``train_deepseek_v3_slurm.sh`` is
provided in
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to
enable training at scale in a SLURM environment. For example, to run
training on 16 nodes, try the following command:

.. code-block:: shell

   sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh

.. _amd-megatron-lm-benchmark-test-vars-v255:

Key options
-----------

The benchmark tests support the following sets of variables.

``TEE_OUTPUT``
   ``1`` to enable training logs or ``0`` to disable.

``TE_FP8``
   ``0`` for BF16 or ``1`` for FP8 -- ``0`` by default.

``GEMM_TUNING``
   ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels.

``USE_FLASH_ATTN``
   ``1`` to enable Flash Attention.

``FSDP``
   ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``,
   ``--overlap-param-gather``, and ``--sequence-parallel`` are automatically disabled.

``ENABLE_PROFILING``
   ``1`` to enable PyTorch profiling for performance analysis.

``transformer-impl``
   ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE.

``MODEL_SIZE``
   ``8B`` or ``70B`` for Llama 3 and 3.1, or ``7B`` or ``70B`` for Llama 2, for example.

``TOTAL_ITERS``
   The total number of iterations -- ``10`` by default.

``MOCK_DATA``
   ``1`` to use mock data or ``0`` to use real data you provide.

``MBS``
   Micro batch size.

``BS``
   Global batch size.

``TP`` / ``TP_SIZE``
   Tensor parallel size (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on.

``EP`` / ``EP_SIZE``
   Expert parallel size for MoE models.

``SEQ_LENGTH``
   Input sequence length.

``PR``
   Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs.

``AC``
   Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default.

``NUM_LAYERS``
   Use a reduced number of layers as a proxy model.

``RECOMPUTE_NUM_LAYERS``
   Number of layers used for checkpointing recompute.
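
As an illustration of how these variables combine, the following hypothetical invocation
enables FP8 GEMMs, GEMM tuning, and profiling for a short Llama 3.1 8B run. The exact set of
variables each training script honors is defined by the script itself, so treat this as a
template rather than a validated recipe.

.. code-block:: shell

   TEE_OUTPUT=1 TE_FP8=1 GEMM_TUNING=1 ENABLE_PROFILING=1 MBS=2 BS=128 TP=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=10 bash examples/llama/train_llama3.sh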

Previous versions
=================

See :doc:`megatron-lm-history` to find documentation for previous releases
of the ``ROCm/megatron-lm`` Docker image.