.. meta::
   :description: How to train a model using PyTorch for ROCm.
   :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker

****************************************
Training a model with Primus and PyTorch
****************************************

`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible LLM training
framework designed to streamline LLM training on AMD Instinct GPUs using a modular,
reproducible configuration paradigm. Primus now supports the PyTorch torchtitan backend.

.. note::

   For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
   <https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
   deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
   The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
   including torchtitan and :doc:`Megatron-LM <primus-megatron>`.

   Primus with the PyTorch torchtitan backend is designed to replace the
   :doc:`ROCm PyTorch training <pytorch-training>` workflow. See
   :doc:`pytorch-training` for steps to run workloads without Primus.

AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
MI300X GPUs containing essential components for Primus and PyTorch training
with Primus Turbo optimizations.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml

   .. tab-set::

      .. tab-item:: {{ data.docker.pull_tag }}
         :sync: {{ data.docker.pull_tag }}

         .. list-table::
            :header-rows: 1

            * - Software component
              - Version
            {% for component_name, component_version in data.docker.components.items() %}
            * - {{ component_name }}
              - {{ component_version }}
            {% endfor %}

.. _amd-primus-pytorch-model-support-v2510:

Supported models
================

The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml

   {% set model_groups = data.model_groups %}
   .. raw:: html

      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
         <div class="row gx-0">
            <div class="col-2 me-1 px-2 model-param-head">Model</div>
            <div class="row col-10 pe-0">
               {% for model_group in model_groups %}
               <div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
               {% endfor %}
            </div>
         </div>

         <div class="row gx-0 pt-1">
            <div class="col-2 me-1 px-2 model-param-head">Variant</div>
            <div class="row col-10 pe-0">
               {% for model_group in model_groups %}
               {% set models = model_group.models %}
               {% for model in models %}
               {% if models|length % 3 == 0 %}
               <div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
               {% else %}
               <div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
               {% endif %}
               {% endfor %}
               {% endfor %}
            </div>
         </div>
      </div>

.. seealso::

   For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
   see :doc:`pytorch-training` (training without Primus).

.. _amd-primus-pytorch-performance-measurements-v2510:

System validation
=================

Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
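
As a quick spot check, you can confirm whether NUMA auto-balancing is currently enabled on the
host. A value of ``0`` means it is disabled, which the system optimization guide generally
recommends for training workloads.

.. code-block:: shell

   # 1 = NUMA auto-balancing enabled, 0 = disabled
   cat /proc/sys/kernel/numa_balancing

   # Disable it for the current boot (requires root); see the linked guide for details
   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'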

To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.

This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.

Pull the Docker image
=====================

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml

   Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull {{ data.docker.pull_tag }}
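
   To confirm the pull succeeded, you can list the image on the host; the tag you pulled
   should appear in the output.

   .. code-block:: shell

      docker images {{ data.docker.pull_tag }}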

Run training
============

Once the setup is complete, choose one of the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (training without Primus).
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
tweak some configurations (such as batch sizes).

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml

   {% set docker = data.docker %}
   {% set model_groups = data.model_groups %}

   .. tab-set::

      .. tab-item:: MAD-integrated benchmarking

         {% for model_group in model_groups %}
         {% for model in model_group.models %}

         .. container:: model-doc {{ model.mad_tag }}

            The following run command is tailored to {{ model.model }}.
            See :ref:`amd-primus-pytorch-model-support-v2510` to switch to another available model.

            1. Clone the ROCm Model Automation and Dashboarding (`MAD <https://github.com/ROCm/MAD>`__) repository to a local
               directory and install the required packages on the host machine.

               .. code-block:: shell

                  git clone https://github.com/ROCm/MAD
                  cd MAD
                  pip install -r requirements.txt

            2. Use the following command to run the performance benchmark test on the {{ model.model }} model
               on one node with the {{ model.precision }} data type on the host machine.

               .. code-block:: shell

                  export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
                  madengine run \
                     --tags {{ model.mad_tag }} \
                     --keep-model-dir \
                     --live-output \
                     --timeout 28800

               MAD launches a Docker container with the name
               ``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
               model are collected in ``~/MAD/perf.csv``.
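
               For a quick look at the results, you can pretty-print the CSV in the terminal.
               The exact columns depend on your MAD version, so treat this as an optional
               convenience step.

               .. code-block:: shell

                  column -s, -t < ~/MAD/perf.csv | less -S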

         {% endfor %}
         {% endfor %}

      .. tab-item:: Primus benchmarking

         {% for model_group in model_groups %}
         {% for model in model_group.models %}

         .. container:: model-doc {{ model.mad_tag }}

            The following run commands are tailored to {{ model.model }}.
            See :ref:`amd-primus-pytorch-model-support-v2510` to switch to another available model.

            .. rubric:: Download the Docker image and required packages

            1. Pull the ``{{ docker.pull_tag }}`` Docker image from Docker Hub.

               .. code-block:: shell

                  docker pull {{ docker.pull_tag }}

            2. Run the Docker container.

               .. code-block:: shell

                  docker run -it \
                     --device /dev/dri \
                     --device /dev/kfd \
                     --network host \
                     --ipc host \
                     --group-add video \
                     --cap-add SYS_PTRACE \
                     --security-opt seccomp=unconfined \
                     --privileged \
                     -v $HOME:$HOME \
                     -v $HOME/.ssh:/root/.ssh \
                     --shm-size 64G \
                     --name training_env \
                     {{ docker.pull_tag }}

               Use these commands if you exit the ``training_env`` container and need to return to it.

               .. code-block:: shell

                  docker start training_env
                  docker exec -it training_env bash

            .. rubric:: Prepare training datasets and dependencies

            The following benchmarking examples require downloading models and datasets
            from Hugging Face. To ensure successful access to gated repos, set your
            ``HF_TOKEN``.

            .. code-block:: shell

               export HF_TOKEN=$your_personal_hugging_face_access_token
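
            If the Hugging Face CLI is available in the container, you can optionally confirm
            that the token is valid before starting a run.

            .. code-block:: shell

               huggingface-cli whoami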

            .. rubric:: Pretraining

            To get started, navigate to the ``Primus`` directory in your container.

            .. code-block:: shell

               cd /workspace/Primus

            Now, to start the pretraining benchmark, use the ``run_pretrain.sh`` script
            included with Primus, passing the appropriate options.
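
            The benchmarking examples below all follow the same general pattern: the ``EXP``
            environment variable selects a training configuration file, and extra command-line
            arguments (such as ``--training.local_batch_size``) override settings from that
            configuration. The placeholders in this sketch are illustrative; use the
            configuration files that ship with your container.

            .. code-block:: shell

               # General form (illustrative): <GPU>, <model>, and <N> are placeholders
               EXP=examples/torchtitan/configs/<GPU>/<model>-pretrain.yaml \
               bash examples/run_pretrain.sh --training.local_batch_size <N>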

            .. rubric:: Benchmarking examples

            .. container:: model-doc primus_pyt_train_llama-3.1-8b

               Use the following command to train Llama 3.1 8B with BF16 precision using Primus torchtitan.

               .. tab-set::

                  .. tab-item:: MI355X and MI350X
                     :sync: MI355X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 6

                  .. tab-item:: MI325X
                     :sync: MI325X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 6

                  .. tab-item:: MI300X
                     :sync: MI300X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 4

               To train Llama 3.1 8B with FP8 precision, use the following command.

               .. tab-set::

                  .. tab-item:: MI355X and MI350X
                     :sync: MI355X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 8

                  .. tab-item:: MI325X
                     :sync: MI325X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 7

                  .. tab-item:: MI300X
                     :sync: MI300X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 5

            .. container:: model-doc primus_pyt_train_llama-3.1-70b

               Use the following command to train Llama 3.1 70B with BF16 precision using Primus torchtitan.

               .. tab-set::

                  .. tab-item:: MI355X and MI350X
                     :sync: MI355X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 8

                  .. tab-item:: MI325X
                     :sync: MI325X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 6

                  .. tab-item:: MI300X
                     :sync: MI300X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 4

               To train Llama 3.1 70B with FP8 precision, use the following command.

               .. tab-set::

                  .. tab-item:: MI355X and MI350X
                     :sync: MI355X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 6

                  .. tab-item:: MI325X
                     :sync: MI325X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 5

                  .. tab-item:: MI300X
                     :sync: MI300X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 3

            .. container:: model-doc primus_pyt_train_deepseek-v2

               Use the following command to train DeepSeek V2 16B with BF16 precision using Primus torchtitan.

               .. tab-set::

                  .. tab-item:: MI355X and MI350X
                     :sync: MI355X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 16

                  .. tab-item:: MI325X
                     :sync: MI325X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 10

                  .. tab-item:: MI300X
                     :sync: MI300X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 8

               To train DeepSeek V2 16B with FP8 precision, use the following command.

               .. tab-set::

                  .. tab-item:: MI355X and MI350X
                     :sync: MI355X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 16

                  .. tab-item:: MI325X
                     :sync: MI325X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 8

                  .. tab-item:: MI300X
                     :sync: MI300X

                     .. code-block:: shell

                        EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
                        bash examples/run_pretrain.sh --training.local_batch_size 8

         {% endfor %}
         {% endfor %}

Further reading
===============

- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
  Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.

- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.

- To learn more about system settings and management practices to configure your system for
  AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.

- For a list of other ready-made Docker images for AI with ROCm, see
  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

Previous versions
=================

See :doc:`previous-versions/pytorch-training-history` to find documentation for previous releases
of the ``rocm/pytorch-training`` Docker image.