Compare commits
3 Commits
docs/7.11. ... doc/megatr

| Author | SHA1 | Date |
|---|---|---|
|  | 744eff47a0 |  |
|  | 8c3181295b |  |
|  | 267eda26ea |  |
@@ -592,6 +592,7 @@ hpp
 hsa
 hsakmt
 hyperparameter
+hyperparameters
 iDRAC
 ib_core
 inband
@@ -43,6 +43,7 @@ article_pages = [
     {"file": "how-to/rocm-for-ai/index", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/install", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/train-a-model", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/accelerate-training", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/deploy-your-model", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/hugging-face-models", "os": ["linux"]},
     {"file": "how-to/rocm-for-hpc/index", "os": ["linux"]},
BIN docs/data/how-to/rocm-for-ai/2-node-training-master.png (new file, 139 KiB)
BIN docs/data/how-to/rocm-for-ai/2-node-training-worker.png (new file, 30 KiB)
BIN docs/data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png (new file, 36 KiB)
BIN docs/data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png (new file, 35 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png (new file, 242 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png (new file, 155 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-8-gpu.png (new file, 242 KiB)
@@ -16,6 +16,8 @@ In this guide, you'll learn about:
 - :doc:`Installing ROCm and machine learning frameworks <install>`
 
+- :doc:`Scaling model training <scale-model-training>`
+
 - :doc:`Training a model <train-a-model>`
 
 - :doc:`Running models from Hugging Face <hugging-face-models>`
docs/how-to/rocm-for-ai/scale-model-training.rst (new file, 135 lines)
@@ -0,0 +1,135 @@
.. meta::
   :description: How to scale and accelerate model training
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

**********************
Scaling model training
**********************

To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, no single accelerator or GPU can store and
process all of the model parameters required for training. PyTorch addresses this computational constraint through
its distributed training frameworks.

.. _rocm-for-ai-pytorch-distributed:

PyTorch distributed
===================

Features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)

- `RPC-based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)

- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

This topic focuses on the distributed data-parallelism strategy, the most popular of the three. To get started with
DDP, you first need to understand how to coordinate the model and its training data across multiple accelerators or
GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data; the reducer then
uses ``allreduce`` to sum the gradients over the different workers.
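The following is a minimal sketch of this workflow, assuming a single node with 8 GPUs launched via
``torchrun --nproc_per_node=8 ddp_example.py``; the model, dataset, and hyperparameters are illustrative placeholders,
not part of the workflow description above.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.utils.data import DataLoader, TensorDataset
   from torch.utils.data.distributed import DistributedSampler

   def main():
       # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
       dist.init_process_group(backend="nccl")  # RCCL implements the NCCL API on ROCm
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # Step 2 of the workflow: every device gets a replica of the model.
       model = torch.nn.Linear(128, 10).cuda(local_rank)
       model = DDP(model, device_ids=[local_rank])

       # Step 1: DistributedSampler splits the global batch into local batches.
       dataset = TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,)))
       loader = DataLoader(dataset, batch_size=4, sampler=DistributedSampler(dataset))

       optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
       loss_fn = torch.nn.CrossEntropyLoss()

       for inputs, labels in loader:
           inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
           optimizer.zero_grad()
           loss = loss_fn(model(inputs), labels)   # step 3: forward pass
           loss.backward()                         # steps 3-4: backward; DDP allreduces gradients
           optimizer.step()                        # step 4: identical update on every replica

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()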
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_

* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_

.. _rocm-for-ai-pytorch-fsdp:

PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, DDP model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This
makes training some very large models feasible by allowing larger models or batch sizes to fit on-device. However, it
comes at the cost of increased communication volume, which is reduced by internal optimizations such as overlapping
communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, see `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.
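As a brief sketch (reusing the process-group setup from the DDP example above, with a placeholder model), wrapping a
module in FSDP looks like this:

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group(backend="nccl")
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   model = torch.nn.Sequential(
       torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
   ).cuda(local_rank)

   # Unlike DDP, FSDP shards parameters, gradients, and optimizer state across
   # ranks, gathering full parameters only while a layer is being computed.
   model = FSDP(model)

   # Construct the optimizer after wrapping so it references the sharded parameters.
   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)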
.. _rocm-for-ai-deepspeed:

DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under its
training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example
of training with DeepSpeed on an AMD accelerator or GPU.
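For orientation, here is a minimal, hedged sketch of initializing a model with DeepSpeed's ZeRO optimizer sharding;
the configuration values are illustrative placeholders rather than recommended settings.

.. code-block:: python

   import deepspeed
   import torch

   model = torch.nn.Linear(1024, 1024)  # placeholder model

   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer state and gradients
       "bf16": {"enabled": True},
       "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
   }

   # Returns a wrapped engine whose backward()/step() handle the distributed details.
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )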
.. _rocm-for-ai-automatic-mixed-precision:

Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time, memory, and cost required to train them. Any measure that reduces
training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.
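A typical AMP training step uses PyTorch's ``autocast`` context together with a gradient scaler; the model and data
below are placeholders. On ROCm, GPUs are exposed through the ``cuda`` device type.

.. code-block:: python

   import torch

   device = "cuda"
   model = torch.nn.Linear(512, 512).to(device)
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
   scaler = torch.amp.GradScaler(device)  # rescales the loss to avoid underflow in reduced precision

   inputs = torch.randn(8, 512, device=device)
   targets = torch.randn(8, 512, device=device)

   for _ in range(10):
       optimizer.zero_grad()
       # The forward pass runs selected ops in reduced precision.
       with torch.amp.autocast(device_type=device):
           loss = torch.nn.functional.mse_loss(model(inputs), targets)
       scaler.scale(loss).backward()
       scaler.step(optimizer)
       scaler.update()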
.. _rocm-for-ai-fine-tune:

Fine-tuning your model
======================

ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
example, LoRA, QLoRA, PEFT, and FSDP. A brief LoRA sketch follows the reading list below.

Learn more about challenges and solutions for model fine-tuning in :doc:`../llm-fine-tuning-optimization/index`.

The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/finetuning>`_
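As a brief illustration of the PEFT and LoRA techniques mentioned above, the following sketch attaches LoRA adapters
to a small Hugging Face model; the model name and hyperparameters are placeholders, not recommendations.

.. code-block:: python

   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder model

   # LoRA trains small low-rank adapter matrices instead of the full weights.
   lora_config = LoraConfig(
       r=8,                                  # rank of the adapter matrices
       lora_alpha=16,                        # scaling factor for adapter updates
       target_modules=["q_proj", "v_proj"],  # attention projections to adapt
       lora_dropout=0.05,
       task_type="CAUSAL_LM",
   )

   model = get_peft_model(model, lora_config)
   model.print_trainable_parameters()  # only a small fraction of weights remain trainable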
docs/how-to/rocm-for-ai/train-a-model.rst
@@ -1,140 +1,503 @@

.. meta::
   :description: How to train a model using ROCm Megatron-LM
   :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch

**************************************
Training a model with ROCm Megatron-LM
**************************************

.. _amd-megatron-lm:

The ROCm Megatron-LM framework is a specialized fork of Megatron-LM designed to
enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X
accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
workloads. It is purpose-built to :ref:`support models <amd-megatron-lm-model-support>`
like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater
efficiency. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.

For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential
components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the
following software to accelerate training workloads:
+--------------------------+--------------------------------+
| Software component       | Version                        |
+==========================+================================+
| ROCm                     | 6.1                            |
+--------------------------+--------------------------------+
| PyTorch                  | 2.4.0                          |
+--------------------------+--------------------------------+
| PyTorch Lightning        | 2.4.0                          |
+--------------------------+--------------------------------+
| Megatron Core            | 0.9.0                          |
+--------------------------+--------------------------------+
| Transformer Engine       | 1.5.0                          |
+--------------------------+--------------------------------+
| Flash Attention          | v2.6                           |
+--------------------------+--------------------------------+
| Transformers             | 4.44.0                         |
+--------------------------+--------------------------------+
Supported features and models
=============================

Megatron-LM provides the following key features to train large language models efficiently:

- Transformer Engine (TE)

- APEX

- GEMM tuning

- Torch.compile

- 3D parallelism: TP + SP + CP

- Distributed optimizer

- Flash Attention (FA) 2

- Fused kernels

- Pre-training

.. _amd-megatron-lm-model-support:

The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.

* Llama 2 7B

* Llama 2 70B

* Llama 3 8B

* Llama 3 70B

* Llama 3.1 8B

* Llama 3.1 70B
Prerequisite system validation steps
====================================

Complete the following system validation and optimization steps to set up your system before starting training.

Disable NUMA auto-balancing
---------------------------

Generally, application performance can benefit from disabling NUMA auto-balancing. However,
it might be detrimental to performance with certain types of workloads.

Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
the output is ``1``, run the following command to disable NUMA auto-balancing.

.. code-block:: shell

   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

See :ref:`mi300x-disable-numa` for more information.

Hardware verification with ROCm
-------------------------------

Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
You can restore this setting to its default value with the ``rocm-smi -r`` command.

Run the command:

.. code-block:: shell

   rocm-smi --setperfdeterminism 1900

See :ref:`mi300x-hardware-verification-with-rocm` for more information.

RCCL Bandwidth Test
-------------------

ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
routines for GPUs. See the :doc:`RCCL documentation <rccl:index>` for more information. Before starting
pre-training, running an RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
for efficient distributed training.

Running the RCCL bandwidth test helps verify that:

- The GPUs can communicate across nodes or within a single node.

- The interconnect (such as InfiniBand, Ethernet, or Infinity Fabric) is functioning as expected and
  provides adequate bandwidth for communication.

- There are no hardware setup or cabling issues that could affect communication between GPUs.
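Before building the full bandwidth test, a quick sanity check of collective communication can be run from PyTorch
itself, for example with ``torchrun --nproc_per_node=8 check_allreduce.py`` (the script name is a placeholder). This
only confirms that ``allreduce`` completes; it does not measure bandwidth.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist

   dist.init_process_group(backend="nccl")  # RCCL backs the NCCL API on ROCm
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # Each rank contributes ones; after all_reduce every element equals the world size.
   x = torch.ones(1024, device="cuda")
   dist.all_reduce(x)
   assert torch.allclose(x, torch.full_like(x, dist.get_world_size()))

   if dist.get_rank() == 0:
       print(f"all_reduce OK across {dist.get_world_size()} ranks")
   dist.destroy_process_group()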
Tuning and optimizing hyperparameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In distributed training, specific hyperparameters related to distributed communication can be tuned based on
the results of the RCCL bandwidth test. These variables are already set in the Docker image:

.. code-block:: shell

   # force all RCCL streams to be high priority
   export TORCH_NCCL_HIGH_PRIORITY=1

   # specify which RDMA interfaces to use for communication
   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7

   # define the Global ID index used in RoCE mode
   export NCCL_IB_GID_INDEX=3

   # avoid a data corruption/mismatch issue that existed in past releases
   export RCCL_MSCCL_ENABLE=0

Running the RCCL Bandwidth Test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Running the RCCL bandwidth test before launching training is recommended to confirm that system performance is
sufficient. The test binaries are not included in the AMD Megatron-LM Docker image; follow the instructions at
`<https://github.com/ROCm/rccl-tests>`__ to build them. See :ref:`mi300x-rccl` for more information.

Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:

.. code-block:: shell

   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
   :width: 800

Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node setups is
recommended. So, a run on 8 GPUs looks something like:

.. code-block:: shell

   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
   :width: 800

Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
PyTorch and TensorFlow.

Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed.

.. code-block:: shell

   /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8,tw015:8 \
     --mca pml ucx \
     --mca btl ^openib \
     -x NCCL_SOCKET_IFNAME=ens50f0np0 \
     -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \
     -x NCCL_IB_GID_INDEX=3 \
     -x NCCL_MIN_NCHANNELS=40 \
     -x NCCL_DEBUG=version \
     $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
   :width: 800

.. _mi300x-amd-megatron-lm-training:

Start training on MI300X accelerators
=====================================

The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1.

Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker image.

.. _amd-megatron-lm-requirements:

Download the Docker image and required packages
-----------------------------------------------

1. Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull rocm/megatron-lm:24.12-dev

2. Launch the Docker container.

   .. code-block:: shell

      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $CACHE_DIR:/root/.cache --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash

3. Clone the ROCm Megatron-LM repository to a local directory and install the required packages on the host machine.

   .. code-block:: shell

      git clone https://github.com/ROCm/Megatron-LM
      cd Megatron-LM

   .. note::

      This release is validated with ``ROCm/Megatron-LM`` commit `bb93ccb
      <https://github.com/ROCm/Megatron-LM/tree/bb93ccbfeae6363c67b361a97a27c74ab86e7e92>`_.
      Checking out this specific commit is recommended for a stable and reproducible environment.

   .. code-block:: shell

      git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92

Prepare training datasets
-------------------------

If you already have the preprocessed data, you can skip this section.

Use the following command to process datasets. GPT data is used as an example; you can change the merge table, append
an end-of-document token, remove sentence splitting, and set the tokenizer type as needed.

.. code-block:: shell

   python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

In this case, the automatically generated output files are named ``my-gpt2_text_document.bin`` and
``my-gpt2_text_document.idx``.

.. image:: ../../data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png
   :width: 800

.. _amd-megatron-lm-environment-setup:

Environment setup
-----------------

In the ``examples/llama`` directory of Megatron-LM, if you're working with Llama 2 7B or Llama 2 70B, use the
``train_llama2.sh`` configuration script. Likewise, if you're working with Llama 3 or Llama 3.1, use
``train_llama3.sh`` and update the configuration script accordingly.

Network interface
^^^^^^^^^^^^^^^^^

To avoid connectivity issues, ensure the correct network interface is set in your training scripts.

1. Run the following command to find the active network interface on your system.

   .. code-block:: shell

      ip a

2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system's network interface. For
   example:

   .. code-block:: shell

      export NCCL_SOCKET_IFNAME=ens50f0np0
      export GLOO_SOCKET_IFNAME=ens50f0np0

Dataset options
^^^^^^^^^^^^^^^

You can use either mock data or real data for training.

* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset.

  .. code-block:: shell

     DATA_DIR="/root/.cache/data" # Change to where your dataset is stored
     DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence

  .. code-block:: shell

     --data-path $DATA_PATH

  Ensure that the files are accessible inside the Docker container.

* Mock data can be useful for testing and validation. If you're using mock data, replace ``--data-path $DATA_PATH``
  with the ``--mock-data`` option.

  .. code-block:: shell

     --mock-data

Tokenizer
^^^^^^^^^

Tokenization is the process of converting raw text into tokens that the model can process. For Llama
models, this typically involves sub-word tokenization, where words are broken down into smaller units based on
a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a
fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to
handle a variety of input sequences, including unseen words or domain-specific terms.

To train any of the Llama 2 models that this Docker image supports, use the ``Llama2Tokenizer``.

To train any of the Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``.
Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable.

For example, if you're using the Llama 3.1 8B model:

.. code-block:: shell

   TOKENIZER_MODEL=meta-llama/Llama-3.1-8B
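To see what this sub-word tokenization produces, you can load the same tokenizer through the Hugging Face
``transformers`` library; the sentence below is a placeholder, and gated models such as Llama 3.1 require an
authenticated Hugging Face account.

.. code-block:: python

   from transformers import AutoTokenizer

   # Same model ID as TOKENIZER_MODEL above.
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

   text = "Megatron-LM scales LLM training across MI300X accelerators."
   print(tokenizer.tokenize(text))  # sub-word pieces from the fixed vocabulary
   print(tokenizer.encode(text))    # integer token IDs the model consumes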
Run benchmark tests
-------------------

.. note::

   If you're running **multi-node training**, update the following environment variables. They can
   also be passed as command-line arguments.

   * Change ``localhost`` to the master node's hostname:

     .. code-block:: shell

        MASTER_ADDR="${MASTER_ADDR:-localhost}"

   * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):

     .. code-block:: shell

        NNODES="${NNODES:-1}"

   * Set the rank of each node (0 for the master node, 1 for the first worker node, and so on):

     .. code-block:: shell

        NODE_RANK="${NODE_RANK:-0}"

* Use this command to run a performance benchmark test of any of the Llama 2 models that this Docker image supports
  (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).

  .. code-block:: shell

     {variables} bash examples/llama/train_llama2.sh

* Use this command to run a performance benchmark test of any of the Llama 3 and Llama 3.1 models that this Docker
  image supports (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).

  .. code-block:: shell

     {variables} bash examples/llama/train_llama3.sh

.. _amd-megatron-lm-benchmark-test-vars:

The benchmark tests support the same set of variables:

+----------------------+-----------------------------+----------------------------------------+
| Name                 | Options                     | Description                            |
+======================+=============================+========================================+
| ``TEE_OUTPUT``       | 0 or 1                      | 0: disable training log                |
|                      |                             |                                        |
|                      |                             | 1: enable training log                 |
+----------------------+-----------------------------+----------------------------------------+
| ``MBS``              |                             | Micro batch size                       |
+----------------------+-----------------------------+----------------------------------------+
| ``BS``               |                             | Batch size                             |
+----------------------+-----------------------------+----------------------------------------+
| ``TP``               | 1, 2, 4, 8                  | Tensor parallelism                     |
+----------------------+-----------------------------+----------------------------------------+
| ``TE_FP8``           | 0 or 1                      | Datatype. 1: FP8                       |
|                      |                             |                                        |
|                      |                             | 0: BF16                                |
+----------------------+-----------------------------+----------------------------------------+
| ``NO_TORCH_COMPILE`` | 0 or 1                      | 1: enable torch.compile                |
|                      |                             |                                        |
|                      |                             | 0: disable torch.compile (default)     |
+----------------------+-----------------------------+----------------------------------------+
| ``SEQ_LENGTH``       |                             | Input sequence length                  |
+----------------------+-----------------------------+----------------------------------------+
| ``GEMM_TUNING``      | 0 or 1                      | 1: enable GEMM tuning                  |
|                      |                             |                                        |
|                      |                             | 0: disable GEMM tuning                 |
+----------------------+-----------------------------+----------------------------------------+
| ``USE_FLASH_ATTN``   | 0 or 1                      | 0: disable Flash Attention             |
|                      |                             |                                        |
|                      |                             | 1: enable Flash Attention              |
+----------------------+-----------------------------+----------------------------------------+
| ``ENABLE_PROFILING`` | 0 or 1                      | 0: disable torch profiling             |
|                      |                             |                                        |
|                      |                             | 1: enable torch profiling              |
+----------------------+-----------------------------+----------------------------------------+
| ``MODEL_SIZE``       |                             | The size of the model: 7B, 70B, etc.   |
+----------------------+-----------------------------+----------------------------------------+
| ``TOTAL_ITERS``      |                             | Total number of iterations             |
+----------------------+-----------------------------+----------------------------------------+
| ``transformer-impl`` | transformer_engine or local | Transformer Engine is used by default  |
+----------------------+-----------------------------+----------------------------------------+

Benchmarking examples
^^^^^^^^^^^^^^^^^^^^^

.. tab-set::

   .. tab-item:: Single node training
      :sync: single

      Use this command to run training with the Llama 2 7B model on a single node. You can specify the micro batch
      size (``MBS``), batch size (``BS``), datatype (``TE_FP8``), and so on.

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script
      <amd-megatron-lm-environment-setup>`.

      See the sample output:

      .. image:: ../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
         :width: 800

   .. tab-item:: Multi node training
      :sync: multi

      Launch the Docker container on each node.

      In this example, run training with the Llama 2 7B model on 2 nodes with a specific micro batch size, batch
      size, datatype, and so on.

      On the master node:

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      On the worker node:

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script
      <amd-megatron-lm-environment-setup>`.

      Sample output for 2-node training:

      Master node:

      .. image:: ../../data/how-to/rocm-for-ai/2-node-training-master.png
         :width: 800

      Worker node:

      .. image:: ../../data/how-to/rocm-for-ai/2-node-training-worker.png
         :width: 800
@@ -537,6 +537,8 @@ installation was successful, refer to the
 :doc:`rocm-install-on-linux:install/post-install`.
 Should verification fail, consult :doc:`/how-to/system-debugging`.
 
+.. _mi300x-hardware-verification-with-rocm:
+
 Hardware verification with ROCm
 -------------------------------
@@ -40,6 +40,8 @@ subtrees:
   title: Installation
 - file: how-to/rocm-for-ai/train-a-model.rst
   title: Train a model
+- file: how-to/rocm-for-ai/scale-model-training.rst
+  title: Scale model training
 - file: how-to/rocm-for-ai/hugging-face-models.rst
   title: Run models from Hugging Face
 - file: how-to/rocm-for-ai/deploy-your-model.rst