Merge pull request #3193 from peterjunpark/docs/6.1.1

docs/6.1.1: Add "How to use ROCm for AI" (#3117)
This commit is contained in:
Peter Park
2024-05-30 13:55:43 -07:00
committed by GitHub
11 changed files with 553 additions and 1 deletion

(4 binary image files added, not shown: 55 KiB, 441 KiB, 28 KiB, 157 KiB)
how-to/rocm-for-ai/deploy-your-model.rst

@@ -0,0 +1,113 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
Deploying your model
********************

ROCm enables inference and deployment for various classes of models, including CNNs, RNNs, LSTMs, MLPs, and
transformers. This section focuses on deploying transformer-based LLMs. ROCm supports vLLM and Hugging Face TGI as
major LLM-serving frameworks.

.. _rocm-for-ai-serve-vllm:

Serving using vLLM
==================

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.

See the `GitHub repository <https://github.com/vllm-project/vllm>`_ and the `official vLLM documentation
<https://docs.vllm.ai/>`_ for more information.

For guidance on using vLLM with ROCm, refer to `Installation with ROCm
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html>`_.

vLLM installation
-----------------

vLLM supports two ROCm-capable installation methods. Refer to the official documentation using the following links:

- `Build from source with Docker
  <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-docker-rocm>`_ (recommended)
- `Build from source <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm>`_

vLLM walkthrough
----------------

For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
Blogs <https://rocm.blogs.amd.com/artificial-intelligence/vllm/README.html>`_.
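
vLLM can also run offline batch inference directly from Python. The following is a minimal sketch, assuming a working
ROCm build of vLLM; the model choice and sampling settings are illustrative only.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Load the model and define how completions should be sampled.
   llm = LLM(model="TheBloke/Llama-2-7B-fp16")
   sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

   # Generate completions for a batch of prompts.
   outputs = llm.generate(["What is deep learning?"], sampling_params)
   for output in outputs:
       print(output.outputs[0].text)
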
.. _rocm-for-ai-serve-hugging-face-tgi:
Serving using Hugging Face TGI
==============================
The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-generation-inference/index>`_
(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
<https://huggingface.co/docs/text-generation-inference/quicktour>`_ for more details.
TGI installation
----------------
The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.
TGI walkthrough
---------------

#. Set up the LLM server.

   Deploy the Llama2 7B model with TGI using the official Docker image.

   .. code-block:: shell

      model=TheBloke/Llama-2-7B-fp16
      volume=$PWD
      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model

#. Set up the client.

   a. Open another shell session and run the following command to access the server with the client URL.

      .. code-block:: shell

         curl 127.0.0.1:8080/generate \
             -X POST \
             -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
             -H 'Content-Type: application/json'

   b. Access the server with request endpoints.

      .. code-block:: shell

         pip install requests
         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py

      ``requests_model.py`` should look like:

      .. code-block:: python

         import requests

         headers = {
             "Content-Type": "application/json",
         }

         data = {
             'inputs': 'What is Deep Learning?',
             'parameters': {'max_new_tokens': 20},
         }

         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
         print(response.json())

vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.

how-to/rocm-for-ai/hugging-face-models.rst

@@ -0,0 +1,210 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world's largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

.. _rocm-for-ai-hugging-face-transformers:
Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   # Load the pretrained tokenizer and model.
   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   # Tokenize the input text and run a forward pass.
   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)
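
PyTorch built for ROCm exposes AMD accelerators through the same ``"cuda"`` device string used on other GPUs, so moving
this example onto an accelerator is a small change. The following is a sketch, assuming a ROCm-enabled PyTorch
installation:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2').to("cuda")  # "cuda" maps to the AMD GPU on ROCm builds

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt').to("cuda")
   output = model(**encoded_input)
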
Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT <https://huggingface.co/bert-base-uncased>`_
- `BLOOM <https://huggingface.co/bigscience/bloom>`_
- `Llama <https://huggingface.co/huggyllama/llama-7b>`_
- `OPT <https://huggingface.co/facebook/opt-66b>`_
- `T5 <https://huggingface.co/t5-base>`_

.. _rocm-for-ai-hugging-face-optimum:
Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:
Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:
Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to use
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      # Instantiate the model on the GPU with Flash Attention 2 enabled.
      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )
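
   With the model loaded, a short generation call serves as a smoke test. This continues the snippet above; the prompt
   and token budget are illustrative.

   .. code-block:: python

      # Tokenize a prompt, move it to the GPU, and generate a short completion.
      inputs = tokenizer("What is deep learning?", return_tensors="pt").to("cuda")
      output_ids = model.generate(**inputs, max_new_tokens=30)
      print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
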
.. _rocm-for-ai-gptq:
GPTQ
----

Hosted wheels are available to enable `GPTQ <https://arxiv.org/abs/2210.17323>`_ support on ROCm.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      # Instantiate the quantized model on the GPU.
      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )
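
   Once the model loads, a quick generation smoke test can use the Transformers ``pipeline`` API. This continues the
   snippet above; the prompt is illustrative.

   .. code-block:: python

      from transformers import pipeline

      # Wrap the quantized model and tokenizer in a text-generation pipeline.
      pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
      print(pipe("What is deep learning?", max_new_tokens=20)[0]["generated_text"])
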
.. _rocm-for-ai-onnx:
ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ...,
          provider="ROCMExecutionProvider",
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      # Export the model to ONNX and run it with the ROCm execution provider.
      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options,
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors' performance.")
      print(result)

how-to/rocm-for-ai/index.rst

@@ -0,0 +1,23 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************

ROCm offers a suite of optimizations for AI workloads, from large language models (LLMs) to image and video detection
and recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the
broader AI software ecosystem, including open frameworks, models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_.

In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`
- :doc:`Training a model <train-a-model>`
- :doc:`Running models from Hugging Face <hugging-face-models>`
- :doc:`Deploying your model <deploy-your-model>`

how-to/rocm-for-ai/install.rst

@@ -0,0 +1,60 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

.. _rocm-for-ai-install:

***********************************************
Installing ROCm and machine learning frameworks
***********************************************
Before getting started, install ROCm and supported machine learning frameworks.

.. grid:: 1

   .. grid-item-card:: Pre-install

      Each release of ROCm supports specific hardware and software configurations. Before installing, consult the
      :doc:`System requirements <rocm-install-on-linux:reference/system-requirements>` and
      :doc:`Installation prerequisites <rocm-install-on-linux:how-to/prerequisites>` guides.

      If you're new to ROCm, refer to the :doc:`ROCm quick start install guide for Linux
      <rocm-install-on-linux:tutorial/quick-start>`.

      If you're using a Radeon GPU for graphics-accelerated applications, refer to the
      :doc:`Radeon installation instructions <radeon:docs/install/install-radeon>`.

      ROCm supports two installation methods; the final ROCm installation is the same with either. You can also opt
      for :ref:`single-version or multi-version installation <rocm-install-on-linux:installation-types>`.

      * :doc:`Using your Linux distribution's package manager <rocm-install-on-linux:how-to/native-install/index>`
      * :doc:`Using the AMDGPU installer <rocm-install-on-linux:how-to/amdgpu-install>`

.. grid:: 1

   .. grid-item-card:: Post-install

      Follow the :doc:`post-installation instructions <rocm-install-on-linux:how-to/native-install/post-install>` to
      configure your system linker and ``PATH``, and verify the installation.

      If you encounter any issues during installation, refer to the
      :doc:`Installation troubleshooting <rocm-install-on-linux:how-to/native-install/install-faq>` guide.

Machine learning frameworks
===========================

ROCm supports popular machine learning frameworks and libraries, including `PyTorch
<https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package>`_, `TensorFlow
<https://tensorflow.org>`_, `JAX <https://jax.readthedocs.io/en/latest>`_, and `DeepSpeed
<https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/>`_.

Review the framework installation documentation. For ease of use, it's recommended to use the official ROCm prebuilt
Docker images, which ship with the framework pre-installed.

* :doc:`PyTorch for ROCm <rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* :doc:`TensorFlow for ROCm <rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* :doc:`JAX for ROCm <rocm-install-on-linux:how-to/3rd-party/jax-install>`
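
Once a framework is installed, a few lines of Python can confirm that it detects your accelerator. The following is a
minimal sketch, assuming a ROCm-enabled PyTorch build:

.. code-block:: python

   import torch

   print(torch.cuda.is_available())      # True when a ROCm-visible accelerator is present
   print(torch.cuda.get_device_name(0))  # name of the detected AMD accelerator or GPU
   print(torch.version.hip)              # HIP/ROCm version of the build (None on non-ROCm builds)
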
The sections that follow in :doc:`Training a model <train-a-model>` assume a working ROCm installation with PyTorch.

how-to/rocm-for-ai/train-a-model.rst

@@ -0,0 +1,137 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

****************
Training a model
****************

The following is a brief overview of popular component paths for each AI development use case, such as training, LLMs,
and inferencing.

Accelerating model training
===========================
To train a large model like GPT2 or Llama 2 70B, a single accelerator or GPU cannot store all the model parameters
required for training. What if you could convert the single-GPU training code to run on multiple accelerators or GPUs?
PyTorch offers distributed training solutions to facilitate this.
.. _rocm-for-ai-pytorch-distributed:
PyTorch distributed
-------------------

As of PyTorch 1.6.0, features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)
- `RPC-based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)
- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

This guide focuses on the distributed data-parallelism strategy as it's the most popular. To get started with DDP,
let's first understand how to coordinate the model and its training data across multiple accelerators or GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each device can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer
uses ``allreduce`` to sum up gradients over different workers.
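
The following is a minimal sketch of this workflow, assuming the script is launched with
``torchrun --nproc_per_node=<num_gpus>``; the model and batch are toy placeholders.

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   # One process per GPU; torchrun sets LOCAL_RANK for each process.
   dist.init_process_group(backend="nccl")  # ROCm builds route this through RCCL
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # Each rank holds a full replica of the model.
   model = torch.nn.Linear(10, 10).to(local_rank)
   ddp_model = DDP(model, device_ids=[local_rank])
   optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

   # Each rank processes its own local batch.
   inputs = torch.randn(4, 10).to(local_rank)
   loss = ddp_model(inputs).sum()
   loss.backward()   # gradients are synchronized across ranks with allreduce
   optimizer.step()

   dist.destroy_process_group()
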
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_
* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_

.. _rocm-for-ai-pytorch-fsdp:
PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, in DDP, model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes
the training of some very large models feasible by allowing larger models or batch sizes to fit on-device. However,
this comes at the cost of increased communication volume. The communication overhead is reduced by internal
optimizations like overlapping communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, refer to the `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.
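
The following is a minimal wrapping sketch, again assuming a ``torchrun`` launch; the model is a toy placeholder.

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group(backend="nccl")
   torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

   # Wrapping shards parameters, gradients, and optimizer state across ranks.
   model = torch.nn.Transformer().cuda()
   fsdp_model = FSDP(model)
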
.. _rocm-for-ai-deepspeed:
DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the
training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example
of training with DeepSpeed on an AMD accelerator or GPU.
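
The following is a minimal initialization sketch, assuming DeepSpeed is installed and the script is started with the
``deepspeed`` launcher; the configuration values and model are illustrative.

.. code-block:: python

   import torch
   import deepspeed

   # ZeRO stage 2 shards optimizer state and gradients across data-parallel ranks.
   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "zero_optimization": {"stage": 2},
       "fp16": {"enabled": True},
   }

   model = torch.nn.Linear(10, 10)
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )
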
.. _rocm-for-ai-automatic-mixed-precision:
Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time and memory needed to train them; in other words, the cost of training grows.
Reducing training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.
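
The following is a minimal AMP training-step sketch; the model, data, and optimizer are toy placeholders.

.. code-block:: python

   import torch

   model = torch.nn.Linear(10, 10).cuda()
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
   scaler = torch.cuda.amp.GradScaler()

   inputs = torch.randn(4, 10).cuda()
   target = torch.randn(4, 10).cuda()

   # Run the forward pass in mixed precision.
   with torch.cuda.amp.autocast():
       loss = torch.nn.functional.mse_loss(model(inputs), target)

   # Scale the loss to avoid underflow in float16 gradients, then unscale and step.
   scaler.scale(loss).backward()
   scaler.step(optimizer)
   scaler.update()
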
.. _rocm-for-ai-fine-tune:
Fine-tuning your model
======================

ROCm supports multiple fine-tuning techniques, for example, LoRA, QLoRA, PEFT, and FSDP; a minimal LoRA setup is
sketched after the list below.

The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/finetuning>`_
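
As noted above, the following is a minimal LoRA setup sketch using the `PEFT <https://github.com/huggingface/peft>`_
library; the base model and hyperparameters are illustrative, not recommendations.

.. code-block:: python

   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   base_model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-fp16")

   # Attach small low-rank adapter matrices to the attention projections.
   lora_config = LoraConfig(
       r=8,                                  # rank of the adapter matrices
       lora_alpha=32,
       target_modules=["q_proj", "v_proj"],  # Llama attention projections
       lora_dropout=0.05,
       task_type="CAUSAL_LM",
   )

   model = get_peft_model(base_model, lora_config)
   model.print_trainable_parameters()  # only the adapter weights are trainable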


@@ -91,6 +91,7 @@ Our documentation is organized into the following categories:
:img-alt: How-to documentation
:padding: 2
* [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [System tuning for various architectures](./how-to/tuning-guides.md)
  * [MI100](./how-to/tuning-guides/mi100.md)
  * [MI200](./how-to/tuning-guides/mi200.md)

_toc.yml

@@ -49,6 +49,15 @@ subtrees:
  - caption: How to
    entries:
      - file: how-to/rocm-for-ai/index.rst
        title: Using ROCm for AI
        subtrees:
          - entries:
              - file: how-to/rocm-for-ai/install.rst
                title: Installation
              - file: how-to/rocm-for-ai/train-a-model.rst
              - file: how-to/rocm-for-ai/hugging-face-models.rst
              - file: how-to/rocm-for-ai/deploy-your-model.rst
      - file: how-to/tuning-guides.md
        title: System optimization
        subtrees:
@@ -139,4 +148,3 @@ subtrees:
title: Provide feedback
- file: about/license.md
title: ROCm license