Add "How to use ROCm for AI" (#3117)
* Add Using ROCm for AI
  - Add PyTorch Docker installation images
  - Split doc into subtopics
  - Add metadata
  - Clean up index
  - Clean up hugging face guide
  - Clean up installation guide
  - Fix rST formatting
  - Clean up install and train-a-model
  - Clean up MAD
  - Delete unused file
  - Add ref anchors and clean up MAD doc
  - Add formatting fixes
  - Update toc and section index
  - Format some code blocks
  - Remove install guide and update toc
  - Chop installation guide
  - Clean up deployment and hugging face sections
  - Change headings to end in -ing
  - Fix spelling in Training a model
  - Delete MAD and split out install content
  - Fix formatting
  - Change words to satisfy spellcheck linter

* Add review suggestions and add helpful links
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
  - Add helpful links and add review suggestions
  - Remove fine-tuning link and links to D5 and MAGMA
  - Update docs/how-to/rocm-for-ai/deploy-your-model.rst
  Co-authored-by: Young Hui - AMD <145490163+yhuiYH@users.noreply.github.com>
  - Update DeepSpeed link
  - Add subheading to ML framework installation and closing blurb to hugging face models guide

* Reorder topics
BIN  docs/data/how-to/rocm-for-ai/pytorch_docker_install.png  Normal file (55 KiB)
BIN  docs/data/how-to/rocm-for-ai/pytorch_docker_install_output.png  Normal file (441 KiB)
113  docs/how-to/rocm-for-ai/deploy-your-model.rst  Normal file

@@ -0,0 +1,113 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
Deploying your model
********************

ROCm enables inference and deployment for various classes of models including CNN, RNN, LSTM, MLP, and transformers.
This section focuses on deploying transformer-based LLM models.

ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks.

.. _rocm-for-ai-serve-vllm:

Serving using vLLM
==================

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.

See the `GitHub repository <https://github.com/vllm-project/vllm>`_ and `official vLLM documentation
<https://docs.vllm.ai/>`_ for more information.

For guidance on using vLLM with ROCm, refer to `Installation with ROCm
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html>`_.

vLLM installation
-----------------

vLLM supports two ROCm-capable installation methods. Refer to the official documentation using the following links.

- `Build from source with Docker
  <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-docker-rocm>`_ (recommended)

- `Build from source <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm>`_

vLLM walkthrough
----------------

For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
Blogs <https://rocm.blogs.amd.com/artificial-intelligence/vllm/README.html>`_.
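As a quick local check before standing up a server, the following minimal offline-generation sketch can be used. It
assumes vLLM is already installed per the links above; the model name and sampling values are illustrative
placeholders, not prescribed by this guide.

.. code-block:: python

   # Minimal offline-generation sketch (assumes a working vLLM install on a
   # ROCm-capable system; model and sampling settings are illustrative).
   from vllm import LLM, SamplingParams

   llm = LLM(model="facebook/opt-125m")  # small model for a quick smoke test
   sampling = SamplingParams(temperature=0.8, max_tokens=32)

   outputs = llm.generate(["What is deep learning?"], sampling)
   for output in outputs:
       print(output.outputs[0].text)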
.. _rocm-for-ai-serve-hugging-face-tgi:

Serving using Hugging Face TGI
==============================

The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-generation-inference/index>`_
(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
<https://huggingface.co/docs/text-generation-inference/quicktour>`_ for more details.

TGI installation
----------------

The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.

TGI walkthrough
---------------

#. Set up the LLM server.

   Deploy the Llama 2 7B model with TGI using the official Docker image.

   .. code-block:: shell

      model=TheBloke/Llama-2-7B-fp16
      volume=$PWD
      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model

#. Set up the client.

   a. Open another shell session and run the following command to access the server with the client URL.

      .. code-block:: shell

         curl 127.0.0.1:8080/generate \
             -X POST \
             -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
             -H 'Content-Type: application/json'

   b. Access the server with request endpoints.

      .. code-block:: shell

         pip install requests
         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py

      ``requests_model.py`` should look like:

      .. code-block:: python

         import requests

         headers = {
             "Content-Type": "application/json",
         }

         data = {
             'inputs': 'What is Deep Learning?',
             'parameters': { 'max_new_tokens': 20 },
         }

         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)

         print(response.json())

vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.
210  docs/how-to/rocm-for-ai/hugging-face-models.rst  Normal file

@@ -0,0 +1,210 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world’s largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)

Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT <https://huggingface.co/bert-base-uncased>`_

- `BLOOM <https://huggingface.co/bigscience/bloom>`_

- `Llama <https://huggingface.co/huggyllama/llama-7b>`_

- `OPT <https://huggingface.co/facebook/opt-66b>`_

- `T5 <https://huggingface.co/t5-base>`_

.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to set
   up Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )

.. _rocm-for-ai-gptq:

GPTQ
----

To enable `GPTQ <https://arxiv.org/abs/2210.17323>`_, hosted wheels are available for ROCm.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )

.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      ..
      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ..
          provider="ROCMExecutionProvider"
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")

      result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
23  docs/how-to/rocm-for-ai/index.rst  Normal file

@@ -0,0 +1,23 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************

ROCm offers a suite of optimizations for AI workloads from large language models (LLMs) to image and video detection and
recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the broader
AI software ecosystem, including open frameworks, models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_.

In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`

- :doc:`Training a model <train-a-model>`

- :doc:`Running models from Hugging Face <hugging-face-models>`

- :doc:`Deploying your model <deploy-your-model>`
60  docs/how-to/rocm-for-ai/install.rst  Normal file

@@ -0,0 +1,60 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

.. _rocm-for-ai-install:

***********************************************
Installing ROCm and machine learning frameworks
***********************************************

Before getting started, install ROCm and supported machine learning frameworks.

.. grid:: 1

   .. grid-item-card:: Pre-install

      Each release of ROCm supports specific hardware and software configurations. Before installing, consult the
      :doc:`System requirements <rocm-install-on-linux:reference/system-requirements>` and
      :doc:`Installation prerequisites <rocm-install-on-linux:how-to/prerequisites>` guides.

      If you’re new to ROCm, refer to the :doc:`ROCm quick start install guide for Linux
      <rocm-install-on-linux:tutorial/quick-start>`.

      If you’re using a Radeon GPU for graphics-accelerated applications, refer to the
      :doc:`Radeon installation instructions <radeon:docs/install/install-radeon>`.

      ROCm supports two installation methods; there is no difference in the final ROCm installation between them.
      You can also opt for :ref:`single-version or multi-version installation
      <rocm-install-on-linux:installation-types>`.

      * :doc:`Using your Linux distribution's package manager <rocm-install-on-linux:how-to/native-install/index>`

      * :doc:`Using the AMDGPU installer <rocm-install-on-linux:how-to/amdgpu-install>`

.. grid:: 1

   .. grid-item-card:: Post-install

      Follow the :doc:`post-installation instructions <rocm-install-on-linux:how-to/native-install/post-install>` to
      configure your system linker and ``PATH``, and verify the installation.

      If you encounter any issues during installation, refer to the
      :doc:`Installation troubleshooting <rocm-install-on-linux:how-to/native-install/install-faq>` guide.

Machine learning frameworks
===========================

ROCm supports popular machine learning frameworks and libraries including `PyTorch
<https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package>`_, `TensorFlow
<https://tensorflow.org>`_, `JAX <https://jax.readthedocs.io/en/latest>`_, and `DeepSpeed
<https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/>`_.

Review the framework installation documentation. For ease of use, it's recommended to use the official ROCm prebuilt
Docker images with the framework pre-installed.

* :doc:`PyTorch for ROCm <rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* :doc:`TensorFlow for ROCm <rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* :doc:`JAX for ROCm <rocm-install-on-linux:how-to/3rd-party/jax-install>`

The sections that follow in :doc:`Training a model <train-a-model>` are geared toward a ROCm and PyTorch installation.
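Once a framework image or wheel is installed, a quick check from Python confirms that the GPU is visible. The following
is a minimal sketch assuming a PyTorch-for-ROCm install; on ROCm builds, PyTorch exposes AMD GPUs through the familiar
``torch.cuda`` API.

.. code-block:: python

   # Minimal sanity check for an assumed PyTorch-for-ROCm environment.
   import torch

   print(torch.__version__)              # ROCm wheels typically report a "+rocm" suffix
   print(torch.cuda.is_available())      # True when an AMD GPU is visible
   if torch.cuda.is_available():
       print(torch.cuda.get_device_name(0))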
137  docs/how-to/rocm-for-ai/train-a-model.rst  Normal file

@@ -0,0 +1,137 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

****************
Training a model
****************

The following is a brief overview of popular component paths for common AI development use cases, such as training,
LLMs, and inferencing.

Accelerating model training
===========================

When training a large model like GPT2 or Llama 2 70B, a single accelerator or GPU cannot store all the model parameters
required for training. What if you could convert the single-GPU training code to run on multiple accelerators or GPUs?
PyTorch offers distributed training solutions to facilitate this.

.. _rocm-for-ai-pytorch-distributed:

PyTorch distributed
-------------------

As of PyTorch 1.6.0, features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)

- `RPC-based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)

- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

In this guide, the focus is on the distributed data-parallelism strategy as it’s the most popular. To get started with
DDP, let’s first understand how to coordinate the model and its training data across multiple accelerators or GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each device can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer uses
``allreduce`` to sum up gradients over different workers.
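Translated into code, the workflow above looks roughly like the following minimal sketch. It is illustrative only: the
toy model, random data, and the ``torchrun`` launch command are assumptions, not part of this guide.

.. code-block:: python

   # Minimal DDP sketch (illustrative toy model and random data; assumes launch with
   # `torchrun --nproc_per_node=<num_gpus> ddp_example.py` on a multi-GPU ROCm node).
   import os
   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       dist.init_process_group("nccl")              # the same backend name selects RCCL on ROCm
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = nn.Linear(10, 1).to(local_rank)      # each rank holds a full replica
       model = DDP(model, device_ids=[local_rank])
       optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
       loss_fn = nn.MSELoss()

       for _ in range(10):                          # each rank trains on its own local batch
           x = torch.randn(4, 10, device=local_rank)
           y = torch.randn(4, 1, device=local_rank)
           optimizer.zero_grad()
           loss = loss_fn(model(x), y)
           loss.backward()                          # gradients are allreduced across ranks here
           optimizer.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()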
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_

* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_

.. _rocm-for-ai-pytorch-fsdp:

PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, in DDP model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes
the training of some very large models feasible by allowing larger models or batch sizes to fit on-device. However, this
comes at the cost of increased communication volume. The communication overhead is reduced by internal optimizations
like overlapping communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, refer to the `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.
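Relative to the DDP sketch above, the change is mainly the wrapper: instead of replicating the model, FSDP shards it.
The following is a minimal, illustrative sketch under the same assumed ``torchrun`` launch.

.. code-block:: python

   # Minimal FSDP sketch (illustrative model; assumes the same torchrun launch as
   # the DDP example above).
   import os
   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group("nccl")
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(local_rank)
   model = FSDP(model)   # parameters, gradients, and optimizer state are sharded across ranks

   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # construct the optimizer after wrapping

   x = torch.randn(8, 1024, device=local_rank)
   loss = model(x).sum()
   loss.backward()       # gradients are reduce-scattered across ranks
   optimizer.step()

   dist.destroy_process_group()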
.. _rocm-for-ai-deepspeed:

DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under
the training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example of
training with DeepSpeed on an AMD accelerator or GPU.
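For orientation, enabling ZeRO in an existing PyTorch script mostly comes down to passing a configuration to
``deepspeed.initialize`` and letting the returned engine drive each step. The sketch below is illustrative only: the toy
model, batch size, and ZeRO stage are assumptions, and the script would be launched with the ``deepspeed`` launcher.

.. code-block:: python

   # Minimal DeepSpeed ZeRO sketch (illustrative config and toy model; launch with
   # `deepspeed ds_example.py` on a ROCm node with DeepSpeed installed).
   import torch
   import torch.nn as nn
   import deepspeed

   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
       "fp16": {"enabled": True},
       "zero_optimization": {"stage": 2},   # shard optimizer state and gradients
   }

   model = nn.Linear(1024, 1024)            # toy model standing in for a real network
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )

   for _ in range(10):
       x = torch.randn(4, 1024, device=model_engine.device, dtype=torch.half)
       loss = model_engine(x).float().sum()
       model_engine.backward(loss)          # DeepSpeed handles loss scaling and gradient sync
       model_engine.step()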
.. _rocm-for-ai-automatic-mixed-precision:

Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time and memory needed to train them; that is, their cost also increases. Any
measure that reduces training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.
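In PyTorch, AMP typically combines ``torch.autocast`` with a gradient scaler. The minimal sketch below uses a toy model
and random data for illustration; on ROCm builds of PyTorch, the ``cuda`` device type maps to the AMD GPU.

.. code-block:: python

   # Minimal AMP sketch (toy model and random data for illustration).
   import torch
   import torch.nn as nn

   device = "cuda"                                   # the AMD GPU on ROCm builds of PyTorch
   model = nn.Linear(512, 512).to(device)
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
   loss_fn = nn.MSELoss()
   scaler = torch.cuda.amp.GradScaler()

   for _ in range(10):
       x = torch.randn(8, 512, device=device)
       y = torch.randn(8, 512, device=device)
       optimizer.zero_grad()
       with torch.autocast(device_type="cuda", dtype=torch.float16):
           loss = loss_fn(model(x), y)
       scaler.scale(loss).backward()                 # scaling avoids float16 gradient underflow
       scaler.step(optimizer)
       scaler.update()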
.. _rocm-for-ai-fine-tune:

Fine-tuning your model
======================

ROCm supports multiple fine-tuning techniques, for example, LoRA, QLoRA, PEFT, and FSDP. A minimal LoRA sketch follows
the list of resources below.

The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/finetuning>`_
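For a taste of what parameter-efficient fine-tuning looks like in code, the sketch below applies a LoRA adapter to a
causal language model with Hugging Face PEFT. The model name and hyperparameters are illustrative assumptions; refer to
the blogs above for complete, tested recipes.

.. code-block:: python

   # Minimal LoRA sketch with Hugging Face PEFT (illustrative model and settings).
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer
   from peft import LoraConfig, get_peft_model

   model_name = "facebook/opt-350m"               # placeholder model for illustration
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

   lora_config = LoraConfig(
       r=8,
       lora_alpha=16,
       lora_dropout=0.05,
       task_type="CAUSAL_LM",                     # PEFT picks default target modules for known architectures
   )
   model = get_peft_model(model, lora_config)
   model.print_trainable_parameters()             # only the small adapter matrices are trainable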
@@ -91,6 +91,7 @@ Our documentation is organized into the following categories:
      :img-alt: How-to documentation
      :padding: 2

* [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [System tuning for various architectures](./how-to/tuning-guides.md)
  * [MI100](./how-to/tuning-guides/mi100.md)
  * [MI200](./how-to/tuning-guides/mi200.md)
@@ -49,6 +49,15 @@ subtrees:

  - caption: How to
    entries:
    - file: how-to/rocm-for-ai/index.rst
      title: Using ROCm for AI
      subtrees:
      - entries:
        - file: how-to/rocm-for-ai/install.rst
          title: Installation
        - file: how-to/rocm-for-ai/train-a-model.rst
        - file: how-to/rocm-for-ai/hugging-face-models.rst
        - file: how-to/rocm-for-ai/deploy-your-model.rst
    - file: how-to/tuning-guides.md
      title: System optimization
      subtrees:

@@ -141,4 +150,3 @@ subtrees:
        title: Provide feedback
    - file: about/license.md
      title: ROCm license