.. meta::
   :description: How to train a model using Megatron-LM for ROCm.
   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch

********************************************
Training a model with Primus and Megatron-LM
********************************************

`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
training framework for AMD Instinct GPUs designed to support multiple training
engine backends -- including Megatron -- to deliver scalable, high-performance
model training. Performance acceleration is powered by `Primus Turbo
<https://github.com/AMD-AGI/Primus-Turbo>`__ and ROCm libraries.

.. note::

   For a unified training solution on AMD GPUs with ROCm, the `rocm/megatron-lm
   <https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be
   deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
   The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
   including Megatron-LM and :doc:`torchtitan <primus-pytorch>`.

Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM
training <megatron-lm>` workflow. To learn how to migrate workloads from
Megatron-LM to Primus with Megatron, see
:doc:`previous-versions/megatron-lm-primus-migration-guide`.

AMD provides ready-to-use Docker images for MI355X, MI350X,
MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and
Megatron-LM.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml

   {% set dockers = data.dockers %}
   .. tab-set::

      {% for supported_gpus, docker in dockers.items() %}
      .. tab-item:: {{ supported_gpus }}
         :sync: {{ supported_gpus }}

         .. list-table::
            :header-rows: 1

            * - Software component
              - Version
            {% for component_name, component_version in docker.components.items() %}
            * - {{ component_name }}
              - {{ component_version }}
            {% endfor %}
      {% endfor %}

.. _amd-primus-megatron-lm-model-support-v259:

Supported models
================

The following models are pre-optimized for performance on AMD Instinct GPUs.
Some instructions, commands, and training examples in this documentation
might vary by model -- select one to get started.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml

   {% set model_groups = data.model_groups %}
   .. raw:: html

      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
        <div class="row gx-0">
          <div class="col-2 me-1 px-2 model-param-head">Model</div>
          <div class="row col-10 pe-0">
            {% for model_group in model_groups %}
            <div class="col-3 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
            {% endfor %}
          </div>
        </div>

        <div class="row gx-0 pt-1">
          <div class="col-2 me-1 px-2 model-param-head">Variant</div>
          <div class="row col-10 pe-0">
            {% for model_group in model_groups %}
            {% set models = model_group.models %}
            {% for model in models %}
            {% if models|length % 3 == 0 %}
            <div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
            {% else %}
            <div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
            {% endif %}
            {% endfor %}
            {% endfor %}
          </div>
        </div>
      </div>

.. note::

   Some models, such as Llama, require an external license agreement through
   a third party (for example, Meta).

System validation
=================

Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.

To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.

.. _mi300x-amd-primus-megatron-lm-training-v259:

Environment setup
=================

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml

   Use the following instructions to set up the environment, configure the script to train models, and
   reproduce the benchmark results on AMD Instinct GPUs.

.. _amd-primus-megatron-lm-requirements-v259:

Pull the Docker image
---------------------

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml

   {% set dockers = data.dockers %}

   1. Pull the appropriate Docker image for your AMD GPU architecture from Docker Hub.

      .. tab-set::

         {% for supported_gpus, docker in dockers.items() %}
         .. tab-item:: {{ supported_gpus }}
            :sync: {{ supported_gpus }}

            .. code-block:: shell

               docker pull {{ docker.pull_tag }}
         {% endfor %}

   2. Launch the Docker container.

      .. tab-set::

         {% for supported_gpus, docker in dockers.items() %}
         .. tab-item:: {{ supported_gpus }}
            :sync: {{ supported_gpus }}

            .. code-block:: shell

               docker run -it \
                 --device /dev/dri \
                 --device /dev/kfd \
                 --device /dev/infiniband \
                 --network host --ipc host \
                 --group-add video \
                 --cap-add SYS_PTRACE \
                 --security-opt seccomp=unconfined \
                 --privileged \
                 -v $HOME:$HOME \
                 --shm-size 128G \
                 --name primus_training_env \
                 {{ docker.pull_tag }}
         {% endfor %}

   3. Use these commands if you exit the ``primus_training_env`` container and need to return to it.

      .. code-block:: shell

         docker start primus_training_env
         docker exec -it primus_training_env bash

   The Docker container hosts verified commit ``e16b27b`` of the `Primus
   <https://github.com/AMD-AGI/Primus/tree/e16b27b>`__ repository.

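   To confirm which Primus commit is checked out inside the container, a quick check
   such as the following can help -- run it from the repository directory, shown as
   ``/workspace/Primus`` in the single-node instructions below:

   .. code-block:: shell

      cd /workspace/Primus
      git rev-parse --short HEAD   # expected to print e16b27b
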
.. _amd-primus-megatron-lm-environment-setup-v259:

Configuration
=============

Primus defines a training configuration in YAML for each model in
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml

   {% set model_groups = data.model_groups %}
   {% for model_group in model_groups %}
   {% for model in model_group.models %}
   .. container:: model-doc {{ model.mad_tag }}

      For example, to update training parameters for {{ model.model }}, you can
      update ``examples/megatron/configs/{{ model.config_name }}``. Training
      configuration YAML files for other models follow this naming convention.

   {% endfor %}
   {% endfor %}

.. note::

   See :ref:`Key options <amd-primus-megatron-lm-benchmark-test-vars-v259>` for more information on configuration options.

Dataset options
---------------

You can use either mock data or real data for training.

* Mock data can be useful for testing and validation. Use the ``mock_data`` field to toggle between mock and real data. The default
  value is ``true`` (enabled).

  .. code-block:: yaml

     mock_data: true

* If you're using a real dataset, update the ``train_data_path`` field to point to the location of your dataset.

  .. code-block:: yaml

     mock_data: false
     train_data_path: /path/to/your/dataset

  Ensure that the files are accessible inside the Docker container -- for example,
  by bind-mounting the dataset directory when launching the container, as shown below.

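If your dataset lives outside a directory already mounted into the container (the launch
command above mounts ``$HOME``), a bind mount makes it visible at the same path inside the
container, so ``train_data_path`` resolves identically on the host and in the container.
A minimal sketch -- the paths are placeholders, and the other ``docker run`` flags from the
environment setup still apply:

.. code-block:: shell

   # Mount the host dataset directory into the container at the same path.
   docker run -it \
     -v /path/to/your/dataset:/path/to/your/dataset \
     --name primus_training_env \
     <docker_image>
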
.. _amd-primus-megatron-lm-tokenizer-v259:

Tokenizer
---------

Set the ``HF_TOKEN`` environment variable to a Hugging Face access token with the
permissions required to access each model's tokenizer.

.. code-block:: bash

   # Export your HF_TOKEN in the workspace
   export HF_TOKEN=<your_hftoken>

.. note::

   In Primus, each model uses a tokenizer from Hugging Face. For example, the Llama
   3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and
   ``tokenizer_type: Llama3Tokenizer``, as defined in the `llama3.1-8B model
   <https://github.com/AMD-AGI/Primus/blob/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs/llama3.1_8B-pretrain.yaml>`__
   definition.

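To verify that your token can reach a gated tokenizer before starting a long run, a quick
sanity check such as the following can help. It assumes the ``transformers`` library is
available in the container (which reads ``HF_TOKEN`` from the environment); the model ID
is an example:

.. code-block:: shell

   python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B')" && echo "Tokenizer access OK"
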
.. _amd-primus-megatron-lm-run-training-v259:

Run training
============

Use the following example commands to set up the environment, configure
:ref:`key options <amd-primus-megatron-lm-benchmark-test-vars-v259>`, and run training on
AMD Instinct GPUs using Primus with the Megatron backend.

Single node training
--------------------

To run training on a single node, navigate to ``/workspace/Primus`` and use the following setup commands:

.. code-block:: shell

   cd /workspace/Primus
   pip install -r requirements.txt
   export HSA_NO_SCRATCH_RECLAIM=1
   export NVTE_CK_USES_BWD_V3=1

.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.3 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run pre-training for Llama 3.3 70B BF16, run:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 6 \
            --global_batch_size 48

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 2 \
            --global_batch_size 16

.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.1 8B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run pre-training for Llama 3.1 8B FP8, run:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid \
            --micro_batch_size 4 \
            --global_batch_size 512

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid

   For Llama 3.1 8B BF16, use the following command:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 4 \
            --global_batch_size 512

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50

.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.1 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run pre-training for Llama 3.1 70B BF16, run:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 4 \
            --global_batch_size 32

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50

   To run the training on a single node for Llama 3.1 70B FP8, use the following command.

   .. note::

      The MI300X configuration uses a proxy model. On MI300X GPUs, use two or more nodes
      to run the full Llama 3.1 70B model with FP8 precision. MI355X and MI350X GPUs
      can support the full 70B model with FP8 precision on a single node.

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid \
            --no_fp8_weight_transpose_cache true \
            --micro_batch_size 3 \
            --global_batch_size 24

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --num_layers 40 \
            --fp8 hybrid \
            --no_fp8_weight_transpose_cache true

.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 2 7B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run pre-training for Llama 2 7B FP8, run:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid \
            --micro_batch_size 13 \
            --global_batch_size 416

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid

   To run pre-training for Llama 2 7B BF16, run:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 10 \
            --global_batch_size 640

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50

.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 2 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run pre-training for Llama 2 70B BF16, run:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 17 \
            --global_batch_size 272

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
            bash ./examples/run_pretrain.sh \
            --train_iters 50

.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to DeepSeek-V3.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with a 3-layer proxy,
   use the following command:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/deepseek_v3-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --num_layers 3 \
            --moe_layer_freq 1 \
            --train_iters 50 \
            --micro_batch_size 8 \
            --global_batch_size 64

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/deepseek_v3-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --num_layers 3 \
            --moe_layer_freq 1 \
            --train_iters 50

.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to DeepSeek-V2-Lite.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16,
   use the following command:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/deepseek_v2_lite-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 12 \
            --global_batch_size 768

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/deepseek_v2_lite-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --global_batch_size 256

.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Mixtral 8x7B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
   use the following command:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 4 \
            --global_batch_size 256

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50

.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Mixtral 8x22B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) with a 4-layer proxy,
   use the following command:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --num_layers 4 \
            --pipeline_model_parallel_size 1 \
            --micro_batch_size 2 \
            --global_batch_size 16

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --num_layers 4 \
            --pipeline_model_parallel_size 1 \
            --micro_batch_size 1 \
            --global_batch_size 16

.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Qwen 2.5 7B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run training on a single node for Qwen 2.5 7B BF16, use the following
   command:

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 16 \
            --global_batch_size 768

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50

   For FP8, use the following command.

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid \
            --micro_batch_size 20 \
            --global_batch_size 800

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --fp8 hybrid

.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Qwen 2.5 72B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To run the training on a single node for Qwen 2.5 72B BF16, use the following command.

   .. tab-set::

      .. tab-item:: MI355X and MI350X
         :sync: MI355X and MI350X

         .. code-block:: shell

            EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50 \
            --micro_batch_size 16 \
            --global_batch_size 256

      .. tab-item:: MI325X and MI300X
         :sync: MI325X and MI300X

         .. code-block:: shell

            EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
            bash examples/run_pretrain.sh \
            --train_iters 50

.. _amd-primus-megatron-multi-node-examples-v259:

Multi-node training examples
----------------------------

Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training.

To run training on multiple nodes, you can use the
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__
script to launch the multi-node workload. Use the following steps to set up your environment:

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml

   {% set dockers = data.dockers %}
   .. tab-set::

      {% for supported_gpus, docker in dockers.items() %}
      .. tab-item:: {{ supported_gpus }}
         :sync: {{ supported_gpus }}

         .. code-block:: shell

            git clone --recurse-submodules https://github.com/AMD-AGI/Primus.git
            cd Primus
            git checkout e16b27b

            export DOCKER_IMAGE={{ docker.pull_tag }}
            export HF_TOKEN=<your_HF_token>
            export HSA_NO_SCRATCH_RECLAIM=1
            export NVTE_CK_USES_BWD_V3=1
            export NCCL_IB_HCA=<your_NCCL_IB_HCA>                # RDMA interfaces to use for communication
            export NCCL_SOCKET_IFNAME=<your_NCCL_SOCKET_IFNAME>  # your network interface
            export GLOO_SOCKET_IFNAME=<your_GLOO_SOCKET_IFNAME>  # your network interface
            export NCCL_IB_GID_INDEX=3                           # InfiniBand GID index for NCCL communication; 3 is the default for RoCE
      {% endfor %}

.. note::

   * Make sure the correct network drivers are installed on the nodes. If you're running inside a Docker container, either install the drivers inside the container or pass them through from the host when creating the container.
   * If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus tries to auto-detect them. However, because NICs vary across clusters, it's recommended to explicitly export the NCCL parameters for your cluster, as shown in the sketch after this list.
   * To find your network interface, you can use ``ip a``.
   * To find RDMA interfaces, you can use ``ibv_devices`` to list all the RDMA/IB devices.
   * Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v259`) as appropriate.

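A minimal sketch of discovering and exporting the interface names -- the ``ens51f0`` and
``mlx5_*`` values are examples only; substitute the names reported on your cluster:

.. code-block:: shell

   ip a            # list network interfaces; pick the one on your cluster network
   ibv_devices     # list RDMA/IB devices

   # Example values only -- use the names printed by the commands above.
   export NCCL_SOCKET_IFNAME=ens51f0
   export GLOO_SOCKET_IFNAME=ens51f0
   export NCCL_IB_HCA=mlx5_0,mlx5_1
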
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.1 8B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Llama 3.1 8B FP8 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
      bash ./examples/run_slurm_pretrain.sh \
      --global_batch_size 1024 \
      --fp8 hybrid

.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 2 7B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Llama 2 7B FP8 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
      bash ./examples/run_slurm_pretrain.sh \
      --global_batch_size 2048 \
      --fp8 hybrid

.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.1 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Llama 3.1 70B FP8 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 4 \
      --global_batch_size 256 \
      --recompute_num_layers 80 \
      --no_fp8_weight_transpose_cache true \
      --fp8 hybrid

   To train Llama 3.1 70B BF16 on 8 nodes, run:

   .. code-block:: shell

      NNODES=8 \
      EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 1 \
      --global_batch_size 256 \
      --recompute_num_layers 12

.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 2 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Llama 2 70B FP8 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 10 \
      --global_batch_size 640 \
      --recompute_num_layers 80 \
      --no_fp8_weight_transpose_cache true \
      --fp8 hybrid

   To train Llama 2 70B BF16 on 8 nodes, run:

   .. code-block:: shell

      NNODES=8 \
      EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
      bash ./examples/run_slurm_pretrain.sh \
      --micro_batch_size 2 \
      --global_batch_size 1536 \
      --recompute_num_layers 12

.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.3 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Llama 3.3 70B FP8 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 4 \
      --global_batch_size 256 \
      --recompute_num_layers 80 \
      --no_fp8_weight_transpose_cache true \
      --fp8 hybrid

   To train Llama 3.3 70B BF16 on 8 nodes, run:

   .. code-block:: shell

      NNODES=8 \
      EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 1 \
      --global_batch_size 256 \
      --recompute_num_layers 12

.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Mixtral 8x7B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Mixtral 8x7B BF16 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 2 \
      --global_batch_size 256

.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Qwen 2.5 72B.
   See :ref:`amd-primus-megatron-lm-model-support-v259` to switch to another available model.

   To train Qwen 2.5 72B FP8 on 8 nodes, run:

   .. code-block:: shell

      # Adjust the training parameters.
      # For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case.
      NNODES=8 \
      EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
      bash examples/run_slurm_pretrain.sh \
      --micro_batch_size 8 \
      --global_batch_size 512 \
      --recompute_num_layers 80 \
      --no_fp8_weight_transpose_cache true \
      --fp8 hybrid

.. _amd-primus-megatron-lm-benchmark-test-vars-v259:

Key options
-----------

The following are key options to take note of:

fp8
   Set to ``hybrid`` to enable FP8 GEMMs.

use_torch_fsdp2
   ``use_torch_fsdp2: 1`` enables PyTorch FSDP v2. If FSDP is enabled,
   set ``use_distributed_optimizer`` and ``overlap_param_gather`` to ``false``,
   as in the sketch below.

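   A minimal sketch of that combination, using the same flat YAML notation as the
   dataset options above (assumed to be set in the model's training configuration
   or via the corresponding command-line overrides):

   .. code-block:: yaml

      # Enable FSDP v2 and disable the options that conflict with it.
      use_torch_fsdp2: 1
      use_distributed_optimizer: false
      overlap_param_gather: false
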
profile
   To enable PyTorch profiling, set these parameters:

   .. code-block:: yaml

      profile: true
      use_pytorch_profiler: true
      profile_step_end: 7
      profile_step_start: 6

train_iters
   The total number of training iterations (default: 50).

mock_data
   ``true`` by default.

micro_batch_size
   Batch size per model instance for a single forward/backward pass.

global_batch_size
   Total batch size per training iteration. It should be a multiple of
   ``micro_batch_size`` times the data-parallel size; the remaining factor is
   accumulated over gradient-accumulation steps.

recompute_granularity
   Controls activation checkpointing.

num_layers
   Use a reduced number of layers, as with proxy models.

Known issues
============

PyTorch Profiler may produce inaccurate traces when CPU activity profiling is enabled.

Further reading
===============

- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
  Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.

- To learn more about system settings and management practices to configure your system for
  AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.

- For a list of other ready-made Docker images for AI with ROCm, see
  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

Previous versions
=================

See :doc:`previous-versions/megatron-lm-history` to find documentation for previous releases
of the ``ROCm/megatron-lm`` Docker image.

This training environment now uses Primus with Megatron as the primary
configuration. Limited support for the legacy ROCm Megatron-LM is still
available; see the :doc:`megatron-lm` documentation.