Mirror of https://github.com/ROCm/ROCm.git (synced 2026-01-09 06:38:00 -05:00)
Merge pull request #3193 from peterjunpark/docs/6.1.1
docs/6.1.1: Add "How to use ROCm for AI" (#3117)
New binary files (images):

- docs/data/how-to/rocm-for-ai/pytorch_docker_install.png (new file, binary file not shown; 55 KiB)
- docs/data/how-to/rocm-for-ai/pytorch_docker_install_output.png (new file, binary file not shown; 441 KiB)
- two additional binary image files (not shown; 28 KiB and 157 KiB)
docs/how-to/rocm-for-ai/deploy-your-model.rst (new file, 113 lines)
@@ -0,0 +1,113 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
Deploying your model
********************

ROCm enables inference and deployment for various classes of models including CNN, RNN, LSTM, MLP, and transformers.
This section focuses on deploying transformer-based LLMs.

ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks.

.. _rocm-for-ai-serve-vllm:

Serving using vLLM
==================

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.

See the `GitHub repository <https://github.com/vllm-project/vllm>`_ and `official vLLM documentation
<https://docs.vllm.ai/>`_ for more information.

For guidance on using vLLM with ROCm, refer to `Installation with ROCm
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html>`_.

vLLM installation
-----------------

vLLM supports two ROCm-capable installation methods. Refer to the official documentation using the following links.

- `Build from source with Docker
  <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-docker-rocm>`_ (recommended)

- `Build from source <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm>`_

vLLM walkthrough
----------------

For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
Blogs <https://rocm.blogs.amd.com/artificial-intelligence/vllm/README.html>`_.

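Once vLLM is installed in a ROCm-enabled environment, a quick way to exercise it is through its offline-inference
Python API. The following is a minimal sketch, not part of the official walkthrough: the model ID and prompts are
placeholders, and the model is assumed to fit in the memory of a single accelerator or GPU.

.. code-block:: python

   # Minimal vLLM offline-inference sketch (illustrative).
   from vllm import LLM, SamplingParams

   prompts = [
       "What is deep learning?",
       "Write a haiku about GPUs.",
   ]
   sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

   # Any Hugging Face causal LM identifier can be used here; a small model
   # such as facebook/opt-125m keeps the smoke test lightweight.
   llm = LLM(model="facebook/opt-125m")

   outputs = llm.generate(prompts, sampling_params)
   for output in outputs:
       print(output.prompt, "->", output.outputs[0].text)
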
.. _rocm-for-ai-serve-hugging-face-tgi:

Serving using Hugging Face TGI
==============================

The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-generation-inference/index>`_
(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
<https://huggingface.co/docs/text-generation-inference/quicktour>`_ for more details.

TGI installation
----------------

The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.

TGI walkthrough
---------------

#. Set up the LLM server.

   Deploy the Llama2 7B model with TGI using the official Docker image.

   .. code-block:: shell

      model=TheBloke/Llama-2-7B-fp16
      volume=$PWD
      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model

#. Set up the client.

   a. Open another shell session and run the following command to access the server with the client URL.

      .. code-block:: shell

         curl 127.0.0.1:8080/generate \
            -X POST \
            -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
            -H 'Content-Type: application/json'

   b. Access the server with request endpoints.

      .. code-block:: shell

         pip install requests
         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py

      ``requests_model.py`` should look like:

      .. code-block:: python

         import requests

         headers = {
             "Content-Type": "application/json",
         }

         data = {
             'inputs': 'What is Deep Learning?',
             'parameters': { 'max_new_tokens': 20 },
         }

         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)

         print(response.json())

vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.

docs/how-to/rocm-for-ai/hugging-face-models.rst (new file, 210 lines)
@@ -0,0 +1,210 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world’s largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)

Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT <https://huggingface.co/bert-base-uncased>`_

- `BLOOM <https://huggingface.co/bigscience/bloom>`_

- `Llama <https://huggingface.co/huggyllama/llama-7b>`_

- `OPT <https://huggingface.co/facebook/opt-66b>`_

- `T5 <https://huggingface.co/t5-base>`_

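If you want to quickly confirm that one of these models runs end to end on your accelerator, the high-level
``pipeline`` API is a convenient smoke test. The snippet below is an illustrative sketch rather than part of this
guide's walkthrough: the BERT-based sentiment model and the input sentence are placeholders, and a ROCm build of
PyTorch is assumed so that ``device=0`` maps to an AMD GPU.

.. code-block:: python

   # Illustrative smoke test: run a small Hugging Face model on the first GPU.
   from transformers import pipeline

   # device=0 selects the first visible accelerator (a HIP device on ROCm builds of PyTorch).
   classifier = pipeline(
       task="text-classification",
       model="distilbert-base-uncased-finetuned-sst-2-english",
       device=0,
   )

   print(classifier("ROCm makes it easy to run this model on an AMD GPU."))
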
.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to run
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )

.. _rocm-for-ai-gptq:

GPTQ
----

To enable `GPTQ <https://arxiv.org/abs/2210.17323>`_ on ROCm, hosted wheels are available.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )

.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      ..
      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ..
          provider="ROCMExecutionProvider"
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")

      result = pipe("Both the music and visual were astounding, not to mention the actors performance.")

docs/how-to/rocm-for-ai/index.rst (new file, 23 lines)
@@ -0,0 +1,23 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************

ROCm offers a suite of optimizations for AI workloads, from large language models (LLMs) to image and video detection
and recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the
broader AI software ecosystem, including open frameworks, models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_.

In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`

- :doc:`Training a model <train-a-model>`

- :doc:`Running models from Hugging Face <hugging-face-models>`

- :doc:`Deploying your model <deploy-your-model>`

docs/how-to/rocm-for-ai/install.rst (new file, 60 lines)
@@ -0,0 +1,60 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

.. _rocm-for-ai-install:

***********************************************
Installing ROCm and machine learning frameworks
***********************************************

Before getting started, install ROCm and supported machine learning frameworks.

.. grid:: 1

   .. grid-item-card:: Pre-install

      Each release of ROCm supports specific hardware and software configurations. Before installing, consult the
      :doc:`System requirements <rocm-install-on-linux:reference/system-requirements>` and
      :doc:`Installation prerequisites <rocm-install-on-linux:how-to/prerequisites>` guides.

      If you’re new to ROCm, refer to the :doc:`ROCm quick start install guide for Linux
      <rocm-install-on-linux:tutorial/quick-start>`.

      If you’re using a Radeon GPU for graphics-accelerated applications, refer to the
      :doc:`Radeon installation instructions <radeon:docs/install/install-radeon>`.

      ROCm supports two methods for installation. There is no difference in the final ROCm installation between these
      two methods. You can also opt for :ref:`single-version or multi-version installation
      <rocm-install-on-linux:installation-types>`.

      * :doc:`Using your Linux distribution's package manager <rocm-install-on-linux:how-to/native-install/index>`

      * :doc:`Using the AMDGPU installer <rocm-install-on-linux:how-to/amdgpu-install>`

.. grid:: 1

   .. grid-item-card:: Post-install

      Follow the :doc:`post-installation instructions <rocm-install-on-linux:how-to/native-install/post-install>` to
      configure your system linker and PATH, and verify the installation.

      If you encounter any issues during installation, refer to the
      :doc:`Installation troubleshooting <rocm-install-on-linux:how-to/native-install/install-faq>` guide.

Machine learning frameworks
===========================

ROCm supports popular machine learning frameworks and libraries including `PyTorch
<https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package>`_, `TensorFlow
<https://tensorflow.org>`_, `JAX <https://jax.readthedocs.io/en/latest>`_, and `DeepSpeed
<https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/>`_.

Review the framework installation documentation. For ease of use, it's recommended to use the official prebuilt ROCm
Docker images with the framework pre-installed.

* :doc:`PyTorch for ROCm <rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* :doc:`TensorFlow for ROCm <rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* :doc:`JAX for ROCm <rocm-install-on-linux:how-to/3rd-party/jax-install>`

The sections that follow in :doc:`Training a model <train-a-model>` are geared toward a ROCm with PyTorch installation.

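After installing a framework, it's worth confirming that it can actually see your accelerator or GPU before moving on.
The following is a minimal sketch for a ROCm build of PyTorch; ROCm builds expose AMD GPUs through the ``torch.cuda``
interface, so the calls below are the expected way to query HIP devices.

.. code-block:: python

   # Quick sanity check for a ROCm build of PyTorch (illustrative).
   import torch

   print(torch.__version__)           # ROCm wheels report a "+rocm" suffix in the version
   print(torch.version.hip)           # HIP version string on ROCm builds; None on CUDA builds
   print(torch.cuda.is_available())   # True when an AMD GPU is visible to PyTorch

   if torch.cuda.is_available():
       print(torch.cuda.get_device_name(0))  # for example, an AMD Instinct accelerator
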
docs/how-to/rocm-for-ai/train-a-model.rst (new file, 137 lines)
@@ -0,0 +1,137 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

****************
Training a model
****************

The following is a brief overview of popular component paths per AI development use case, such as training, LLMs,
and inferencing.

Accelerating model training
===========================

To train a large model like GPT2 or Llama 2 70B, a single accelerator or GPU cannot store all the model parameters
required for training. What if you could convert the single-GPU training code to run on multiple accelerators or GPUs?
PyTorch offers distributed training solutions to facilitate this.

.. _rocm-for-ai-pytorch-distributed:

PyTorch distributed
-------------------

As of PyTorch 1.6.0, features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)

- `RPC-Based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)

- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

In this guide, the focus is on the distributed data-parallelism strategy, as it's the most popular. To get started with
DDP, let's first understand how to coordinate the model and its training data across multiple accelerators or GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each device can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer uses
``allreduce`` to sum up gradients over different workers.

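A skeleton of what this looks like in code is sketched below. This is an illustrative example rather than a recipe
from this guide: the model, batch shapes, and training loop are placeholders, and the script is assumed to be launched
with ``torchrun`` so the usual rank environment variables are set. On ROCm, the ``nccl`` backend name is backed by
RCCL.

.. code-block:: python

   # Minimal DDP sketch (illustrative); launch with, for example:
   #   torchrun --nproc_per_node=8 ddp_example.py
   import os

   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
       ddp_model = DDP(model, device_ids=[local_rank])
       optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

       for _ in range(10):                              # placeholder training loop
           inputs = torch.randn(32, 1024, device=local_rank)  # local batch
           loss = ddp_model(inputs).sum()
           loss.backward()          # gradients are allreduced across workers here
           optimizer.step()
           optimizer.zero_grad()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()
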
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_

* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_

.. _rocm-for-ai-pytorch-fsdp:

PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, in DDP, model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes
the training of some very large models feasible by allowing larger models or batch sizes to fit on-device. However, this
comes with the cost of increased communication volume. The communication overhead is reduced by internal optimizations
like overlapping communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, refer to the `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.

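In code, the change from DDP is mostly a matter of wrapping the model differently. The fragment below is a rough
sketch under the same assumptions as the DDP example above (an initialized process group, a device set per rank, and a
placeholder model); the auto-wrap policy and other FSDP options are deliberately omitted.

.. code-block:: python

   # Rough FSDP sketch (illustrative); run inside an initialized process group,
   # for example within the torchrun-launched script sketched above.
   import torch
   import torch.nn as nn
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   model = nn.Sequential(                     # placeholder model
       nn.Linear(1024, 4096),
       nn.ReLU(),
       nn.Linear(4096, 1024),
   ).cuda()

   # FSDP shards parameters, gradients, and optimizer state across ranks.
   fsdp_model = FSDP(model)
   optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

   inputs = torch.randn(8, 1024, device="cuda")   # local batch (placeholder)
   loss = fsdp_model(inputs).sum()
   loss.backward()
   optimizer.step()
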
.. _rocm-for-ai-deepspeed:

DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under
the training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example of
training with DeepSpeed on an AMD accelerator or GPU.

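At the code level, adopting DeepSpeed typically means handing your model and a configuration dictionary to
``deepspeed.initialize`` and using the returned engine for the forward, backward, and step calls. The fragment below
is a hedged sketch, not taken from the blog above: the model, batch, and ZeRO settings are placeholders.

.. code-block:: python

   # Hedged DeepSpeed sketch (illustrative); typically launched with the
   # `deepspeed` launcher so distributed environment variables are set.
   import deepspeed
   import torch
   import torch.nn as nn

   ds_config = {
       "train_micro_batch_size_per_gpu": 4,     # placeholder values
       "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
       "fp16": {"enabled": True},
       "zero_optimization": {"stage": 2},       # ZeRO stage 2: shard optimizer state and gradients
   }

   model = nn.Linear(1024, 1024)                # placeholder model

   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )

   inputs = torch.randn(4, 1024, device=model_engine.device, dtype=torch.half)
   loss = model_engine(inputs).sum()
   model_engine.backward(loss)   # DeepSpeed handles loss scaling and gradient allreduce
   model_engine.step()
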
.. _rocm-for-ai-automatic-mixed-precision:

Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time and memory needed to train them; that is, their cost also increases. Any
measure we can take to reduce training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.

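In PyTorch, AMP usually comes down to running the forward pass under ``torch.autocast`` and scaling the loss with a
gradient scaler. The following is a minimal sketch with a placeholder model and data; on ROCm builds of PyTorch the
``"cuda"`` device type refers to AMD GPUs.

.. code-block:: python

   # Minimal AMP training-step sketch (illustrative; placeholder model and data).
   import torch
   import torch.nn as nn

   model = nn.Linear(1024, 1024).cuda()
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
   scaler = torch.cuda.amp.GradScaler()

   for _ in range(10):
       inputs = torch.randn(32, 1024, device="cuda")
       optimizer.zero_grad()

       # Run the forward pass in mixed precision.
       with torch.autocast(device_type="cuda", dtype=torch.float16):
           loss = model(inputs).sum()

       # Scale the loss to avoid float16 gradient underflow, then step.
       scaler.scale(loss).backward()
       scaler.step(optimizer)
       scaler.update()
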
.. _rocm-for-ai-fine-tune:

Fine-tuning your model
======================

ROCm supports multiple fine-tuning techniques, for example, LoRA, QLoRA, PEFT, and FSDP.

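As a taste of what parameter-efficient fine-tuning looks like in practice, the sketch below attaches LoRA adapters to
a causal language model with the Hugging Face ``peft`` library. It is an illustrative sketch rather than a recipe from
the blogs listed below: the base model, target modules, and hyperparameters are placeholders to adapt to your setup.

.. code-block:: python

   # Illustrative LoRA setup with Hugging Face PEFT (placeholder model and hyperparameters).
   import torch
   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   base_model = AutoModelForCausalLM.from_pretrained(
       "facebook/opt-350m",            # placeholder base model
       torch_dtype=torch.float16,
   )

   lora_config = LoraConfig(
       r=16,                           # rank of the low-rank update matrices
       lora_alpha=32,
       lora_dropout=0.05,
       target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
       task_type="CAUSAL_LM",
   )

   model = get_peft_model(base_model, lora_config)
   model.print_trainable_parameters()  # only the LoRA adapters are trainable
   # The wrapped model can now be passed to a standard Trainer or training loop.
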
The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/finetuning>`_

@@ -91,6 +91,7 @@ Our documentation is organized into the following categories:
:img-alt: How-to documentation
:padding: 2

* [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [System tuning for various architectures](./how-to/tuning-guides.md)
  * [MI100](./how-to/tuning-guides/mi100.md)
  * [MI200](./how-to/tuning-guides/mi200.md)

@@ -49,6 +49,15 @@ subtrees:

- caption: How to
  entries:
  - file: how-to/rocm-for-ai/index.rst
    title: Using ROCm for AI
    subtrees:
    - entries:
      - file: how-to/rocm-for-ai/install.rst
        title: Installation
      - file: how-to/rocm-for-ai/train-a-model.rst
      - file: how-to/rocm-for-ai/hugging-face-models.rst
      - file: how-to/rocm-for-ai/deploy-your-model.rst
  - file: how-to/tuning-guides.md
    title: System optimization
    subtrees:

@@ -139,4 +148,3 @@ subtrees:

    title: Provide feedback
  - file: about/license.md
    title: ROCm license