diff --git a/docs/data/how-to/rocm-for-ai/pytorch_docker_install.png b/docs/data/how-to/rocm-for-ai/pytorch_docker_install.png
new file mode 100644
index 000000000..8e655a3ce
Binary files /dev/null and b/docs/data/how-to/rocm-for-ai/pytorch_docker_install.png differ
diff --git a/docs/data/how-to/rocm-for-ai/pytorch_docker_install_output.png b/docs/data/how-to/rocm-for-ai/pytorch_docker_install_output.png
new file mode 100644
index 000000000..7ed4568d6
Binary files /dev/null and b/docs/data/how-to/rocm-for-ai/pytorch_docker_install_output.png differ
diff --git a/docs/data/how-to/rocm-for-ai/pytorch_docker_test_model_accuracy.png b/docs/data/how-to/rocm-for-ai/pytorch_docker_test_model_accuracy.png
new file mode 100644
index 000000000..68fe8c4d6
Binary files /dev/null and b/docs/data/how-to/rocm-for-ai/pytorch_docker_test_model_accuracy.png differ
diff --git a/docs/data/how-to/rocm-for-ai/pytorch_docker_test_model_compile.png b/docs/data/how-to/rocm-for-ai/pytorch_docker_test_model_compile.png
new file mode 100644
index 000000000..66d0b9704
Binary files /dev/null and b/docs/data/how-to/rocm-for-ai/pytorch_docker_test_model_compile.png differ
diff --git a/docs/how-to/rocm-for-ai/deploy-your-model.rst b/docs/how-to/rocm-for-ai/deploy-your-model.rst
new file mode 100644
index 000000000..fd9fe8584
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/deploy-your-model.rst
@@ -0,0 +1,113 @@
+.. meta::
+   :description: How to use ROCm for AI
+   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial
+
+********************
+Deploying your model
+********************
+
+ROCm enables inference and deployment for various classes of models including CNN, RNN, LSTM, MLP, and transformers.
+This section focuses on deploying transformer-based LLMs.
+
+ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks.
+
+.. _rocm-for-ai-serve-vllm:
+
+Serving using vLLM
+==================
+
+vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
+6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.
+
+See the `GitHub repository `_ and `official vLLM documentation
+`_ for more information.
+
+For guidance on using vLLM with ROCm, refer to `Installation with ROCm
+`_.
+
+vLLM installation
+-----------------
+
+vLLM supports two ROCm-capable installation methods. Refer to the official documentation using the following links.
+
+- `Build from source with Docker
+  `_ (recommended)
+
+- `Build from source `_
+
+vLLM walkthrough
+----------------
+
+For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
+Blogs `_.
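+
+Once vLLM is installed, a quick way to validate the setup is offline batch inference through vLLM's Python API. The
+following is a minimal sketch based on vLLM's quickstart; the model name, prompt, and sampling settings are only
+examples, so substitute any supported model you have access to.
+
+.. code-block:: python
+
+   from vllm import LLM, SamplingParams
+
+   # Load the model; on ROCm, vLLM targets the available AMD accelerators.
+   llm = LLM(model="facebook/opt-125m")
+
+   # Sampling settings for generation.
+   sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
+
+   # Run batched offline inference and print each completion.
+   outputs = llm.generate(["What is deep learning?"], sampling_params)
+   for output in outputs:
+       print(output.outputs[0].text)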
+
+.. _rocm-for-ai-serve-hugging-face-tgi:
+
+Serving using Hugging Face TGI
+==============================
+
+The `Hugging Face Text Generation Inference `_
+(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
+`_ for more details.
+
+TGI installation
+----------------
+
+The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
+``__.
+
+TGI walkthrough
+---------------
+
+#. Set up the LLM server.
+
+   Deploy the Llama 2 7B model with TGI using the official Docker image.
+
+   .. code-block:: shell
+
+      model=TheBloke/Llama-2-7B-fp16
+      volume=$PWD
+      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model
+
+#. Set up the client.
+
+   a. Open another shell session and run the following command to access the server with the client URL.
+
+      .. code-block:: shell
+
+         curl 127.0.0.1:8080/generate \
+             -X POST \
+             -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+             -H 'Content-Type: application/json'
+
+   b. Access the server with request endpoints.
+
+      .. code-block:: shell
+
+         pip install requests
+         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py
+
+      ``requests_model.py`` should look like:
+
+      .. code-block:: python
+
+         import requests
+
+         headers = {
+             "Content-Type": "application/json",
+         }
+
+         data = {
+             'inputs': 'What is Deep Learning?',
+             'parameters': {'max_new_tokens': 20},
+         }
+
+         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
+
+         print(response.json())
+
+vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
+performance, low latency, and scalability.
+
+Visit the topics in :doc:`Using ROCm for AI ` to learn about other ROCm-aware solutions for AI
+development.
diff --git a/docs/how-to/rocm-for-ai/hugging-face-models.rst b/docs/how-to/rocm-for-ai/hugging-face-models.rst
new file mode 100644
index 000000000..63b32e006
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/hugging-face-models.rst
@@ -0,0 +1,210 @@
+.. meta::
+   :description: How to use ROCm for AI
+   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial
+
+********************************
+Running models from Hugging Face
+********************************
+
+`Hugging Face `_ hosts the world's largest AI model repository for developers to obtain
+transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility
+in developing and deploying AI solutions.
+
+This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.
+
+.. _rocm-for-ai-hugging-face-transformers:
+
+Using Hugging Face Transformers
+-------------------------------
+
+First, `install the Hugging Face Transformers library `_,
+which lets you easily import any of the transformer models into your Python application.
+
+.. code-block:: shell
+
+   pip install transformers
+
+Here is an example of running `GPT2 `_:
+
+.. code-block:: python
+
+   from transformers import GPT2Tokenizer, GPT2Model
+
+   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+   model = GPT2Model.from_pretrained('gpt2')
+
+   text = "Replace me with any text you'd like."
+   encoded_input = tokenizer(text, return_tensors='pt')
+   output = model(**encoded_input)
+
+Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
+models should also function correctly.
+
+Here are some mainstream models to get you started:
+
+- `BERT `_
+
+- `BLOOM `_
+
+- `Llama `_
+
+- `OPT `_
+
+- `T5 `_
+
+.. _rocm-for-ai-hugging-face-optimum:
+
+Using Hugging Face with Optimum-AMD
+-----------------------------------
+
+Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.
+
+For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
+`Optimum-AMD `_ page on Hugging Face for guidance on
+using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.
+
+Hugging Face libraries natively support AMD Instinct accelerators. For other
+:doc:`ROCm-capable hardware `, support is currently not
+validated, but most features are expected to work without issues.
+
+.. _rocm-for-ai-install-optimum-amd:
+
+Installation
+~~~~~~~~~~~~
+
+Install Optimum-AMD using pip.
+
+.. code-block:: shell
+
+   pip install --upgrade --upgrade-strategy eager optimum[amd]
+
+Or, install from source.
+
+.. code-block:: shell
+
+   git clone https://github.com/huggingface/optimum-amd.git
+   cd optimum-amd
+   pip install -e .
+
+.. _rocm-for-ai-flash-attention:
+
+Flash Attention
+---------------
+
+#. Use `the Hugging Face team's example Dockerfile
+   `_ to enable
+   Flash Attention with ROCm.
+
+   .. code-block:: shell
+
+      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
+      volume=$PWD
+      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest
+
+#. Use Flash Attention 2 with `Transformers
+   `_ by adding the
+   ``use_flash_attention_2`` parameter to ``from_pretrained()``:
+
+   .. code-block:: python
+
+      import torch
+      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
+
+      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
+
+      with torch.device("cuda"):
+          model = AutoModelForCausalLM.from_pretrained(
+              "tiiuae/falcon-7b",
+              torch_dtype=torch.float16,
+              use_flash_attention_2=True,
+          )
+
+.. _rocm-for-ai-gptq:
+
+GPTQ
+----
+
+Hosted wheels are available to enable `GPTQ `_ on ROCm.
+
+#. First, :ref:`install Optimum-AMD `.
+
+#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation `_ for
+   in-depth guidance.
+
+   .. code-block:: shell
+
+      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
+
+   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.
+
+   .. code-block:: shell
+
+      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .
+
+#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
+   `_:
+
+   .. code-block:: python
+
+      import torch
+      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
+
+      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")
+
+      with torch.device("cuda"):
+          model = AutoModelForCausalLM.from_pretrained(
+              "TheBloke/Llama-2-7B-Chat-GPTQ",
+              torch_dtype=torch.float16,
+          )
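+
+Once loaded, the quantized model runs inference like any other Transformers causal LM. The following minimal sketch
+continues from the load step above; the prompt and generation settings are only examples.
+
+.. code-block:: python
+
+   # Tokenize a prompt and move it to the accelerator.
+   inputs = tokenizer("What is deep learning?", return_tensors="pt").to("cuda")
+
+   # Generate a short completion using the GPTQ-quantized weights.
+   output_ids = model.generate(**inputs, max_new_tokens=30)
+
+   print(tokenizer.decode(output_ids[0], skip_special_tokens=True))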
+
+.. _rocm-for-ai-onnx:
+
+ONNX
+----
+
+Hugging Face Optimum also supports the `ONNX Runtime `_ integration. For ONNX models, usage is
+straightforward.
+
+#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:
+
+   .. code-block:: python
+
+      from optimum.onnxruntime import ORTModelForSequenceClassification
+
+      ...
+
+      ort_model = ORTModelForSequenceClassification.from_pretrained(
+          ...
+          provider="ROCMExecutionProvider"
+      )
+
+#. Try running a `BERT text classification
+   `_ ONNX model with ROCm:
+
+   .. code-block:: python
+
+      from optimum.onnxruntime import ORTModelForSequenceClassification
+      from optimum.pipelines import pipeline
+      from transformers import AutoTokenizer
+      import onnxruntime as ort
+
+      session_options = ort.SessionOptions()
+      session_options.log_severity_level = 0
+
+      ort_model = ORTModelForSequenceClassification.from_pretrained(
+          "distilbert-base-uncased-finetuned-sst-2-english",
+          export=True,
+          provider="ROCMExecutionProvider",
+          session_options=session_options
+      )
+
+      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
+
+      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
+
+      result = pipe("Both the music and visuals were astounding, not to mention the actors' performance.")
diff --git a/docs/how-to/rocm-for-ai/index.rst b/docs/how-to/rocm-for-ai/index.rst
new file mode 100644
index 000000000..5be7b0995
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/index.rst
@@ -0,0 +1,23 @@
+.. meta::
+   :description: How to use ROCm for AI
+   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial
+
+*****************
+Using ROCm for AI
+*****************
+
+ROCm offers a suite of optimizations for AI workloads, from large language models (LLMs) to image and video detection
+and recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the
+broader AI software ecosystem, including open frameworks, models, and tools.
+
+For more information, see `What is ROCm? `_.
+
+In this guide, you'll learn about:
+
+- :doc:`Installing ROCm and machine learning frameworks `
+
+- :doc:`Training a model `
+
+- :doc:`Running models from Hugging Face `
+
+- :doc:`Deploying your model `
diff --git a/docs/how-to/rocm-for-ai/install.rst b/docs/how-to/rocm-for-ai/install.rst
new file mode 100644
index 000000000..28f5baba5
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/install.rst
@@ -0,0 +1,60 @@
+.. meta::
+   :description: How to use ROCm for AI
+   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial
+
+.. _rocm-for-ai-install:
+
+***********************************************
+Installing ROCm and machine learning frameworks
+***********************************************
+
+Before getting started, install ROCm and supported machine learning frameworks.
+
+.. grid:: 1
+
+   .. grid-item-card:: Pre-install
+
+      Each release of ROCm supports specific hardware and software configurations. Before installing, consult the
+      :doc:`System requirements ` and
+      :doc:`Installation prerequisites ` guides.
+
+If you're new to ROCm, refer to the :doc:`ROCm quick start install guide for Linux
+`.
+
+If you're using a Radeon GPU for graphics-accelerated applications, refer to the
+:doc:`Radeon installation instructions `.
+
+ROCm supports two installation methods; both result in the same ROCm installation. You can also opt for
+:ref:`single-version or multi-version installation `.
+
+* :doc:`Using your Linux distribution's package manager `
+
+* :doc:`Using the AMDGPU installer `
+
+.. grid:: 1
+
+   .. grid-item-card:: Post-install
+
+      Follow the :doc:`post-installation instructions ` to
+      configure your system linker and ``PATH``, and to verify the installation.
+
+      If you encounter any issues during installation, refer to the
+      :doc:`Installation troubleshooting ` guide.
+
+Machine learning frameworks
+===========================
+
+ROCm supports popular machine learning frameworks and libraries including `PyTorch
+`_, `TensorFlow
+`_, `JAX `_, and `DeepSpeed
+`_.
+
+Review the framework installation documentation. For ease of use, it's recommended to use the official ROCm prebuilt
+Docker images with the framework pre-installed; a sketch of this workflow follows the list below.
+
+* :doc:`PyTorch for ROCm `
+* :doc:`TensorFlow for ROCm `
+* :doc:`JAX for ROCm `
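+
+For example, starting a ROCm-enabled PyTorch container typically looks like the following sketch. The image tag is
+illustrative (check `Docker Hub `_ for current tags), and the device flags
+mirror the Docker invocations used elsewhere in this guide.
+
+.. code-block:: shell
+
+   # Pull the prebuilt ROCm PyTorch image (tag is an example; pick a current one).
+   docker pull rocm/pytorch:latest
+
+   # Start an interactive container with access to the AMD GPUs on the host.
+   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
+
+   # Inside the container, verify that PyTorch sees the accelerators.
+   python3 -c 'import torch; print(torch.cuda.is_available())'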
+
+The sections that follow in :doc:`Training a model ` are geared toward a ROCm with PyTorch
+installation.
diff --git a/docs/how-to/rocm-for-ai/train-a-model.rst b/docs/how-to/rocm-for-ai/train-a-model.rst
new file mode 100644
index 000000000..f9c585445
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/train-a-model.rst
@@ -0,0 +1,137 @@
+.. meta::
+   :description: How to use ROCm for AI
+   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial
+
+****************
+Training a model
+****************
+
+This section provides a brief overview of popular approaches to training and fine-tuning models on ROCm, from
+distributed data parallelism to mixed precision.
+
+Accelerating model training
+===========================
+
+To train a large model like GPT2 or Llama 2 70B, a single accelerator or GPU cannot store all the model parameters
+required for training. What if you could convert the single-GPU training code to run on multiple accelerators or
+GPUs? PyTorch offers distributed training solutions to facilitate this.
+
+.. _rocm-for-ai-pytorch-distributed:
+
+PyTorch distributed
+-------------------
+
+As of PyTorch 1.6.0, features in ``torch.distributed`` are categorized into three main components:
+
+- `Distributed data-parallel training
+  `_ (DDP)
+
+- `RPC-based distributed training `_ (RPC)
+
+- `Collective communication `_
+
+This guide focuses on the distributed data-parallel strategy, as it's the most popular. To get started with DDP,
+let's first understand how to coordinate the model and its training data across multiple accelerators or GPUs.
+
+The DDP workflow on multiple accelerators or GPUs is as follows:
+
+#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
+   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.
+
+#. Copy the model to every device so each device can process its local batches independently.
+
+#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
+   model for that local batch. This happens in parallel on multiple devices.
+
+#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
+   weights are then redistributed to each device.
+
+In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer
+uses ``allreduce`` to sum up gradients over different workers. A minimal code sketch of this workflow appears after
+the list of blogs below.
+
+See the following developer blogs for more in-depth explanations and examples.
+
+* `Multi GPU training with DDP — PyTorch Tutorials `_
+
+* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
+  `_
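+
+The following is a minimal, self-contained sketch of the DDP workflow described above, assuming a single node
+launched with ``torchrun --nproc_per_node=<num_gpus> train_ddp.py``. The toy model, batch, and file name are
+placeholders.
+
+.. code-block:: python
+
+   import os
+
+   import torch
+   import torch.distributed as dist
+   from torch.nn.parallel import DistributedDataParallel as DDP
+
+   def main():
+       # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker process.
+       # On ROCm, the NCCL backend is provided by RCCL.
+       dist.init_process_group(backend="nccl")
+       local_rank = int(os.environ["LOCAL_RANK"])
+       torch.cuda.set_device(local_rank)
+
+       # Each rank holds a full replica of the model (step 2 of the workflow).
+       model = torch.nn.Linear(10, 10).cuda(local_rank)
+       ddp_model = DDP(model, device_ids=[local_rank])
+
+       optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
+       loss_fn = torch.nn.MSELoss()
+
+       # Forward and backward on this rank's local batch (steps 1 and 3);
+       # DDP all-reduces the gradients during backward() (step 4).
+       inputs = torch.randn(4, 10).cuda(local_rank)
+       labels = torch.randn(4, 10).cuda(local_rank)
+       loss = loss_fn(ddp_model(inputs), labels)
+       loss.backward()
+       optimizer.step()
+
+       dist.destroy_process_group()
+
+   if __name__ == "__main__":
+       main()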
+
+.. _rocm-for-ai-pytorch-fsdp:
+
+PyTorch FSDP
+------------
+
+As noted in :ref:`PyTorch distributed `, in DDP model weights and optimizer states
+are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that
+shards model parameters, optimizer states, and gradients across DDP ranks.
+
+When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This
+makes the training of some very large models feasible by allowing larger models or batch sizes to fit on-device.
+However, it comes at the cost of increased communication volume, which is mitigated by internal optimizations like
+overlapping communication and computation.
+
+For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
+`_.
+
+For detailed training steps, refer to the `PyTorch FSDP examples
+`_.
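+
+Converting the DDP sketch above to FSDP is largely a matter of swapping the wrapper. The following minimal sketch
+assumes the same ``torchrun`` launch and process-group setup as the DDP example; a real model would also configure an
+auto-wrap policy and mixed precision.
+
+.. code-block:: python
+
+   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+
+   # Shard parameters, gradients, and optimizer state across ranks instead of replicating them.
+   fsdp_model = FSDP(model, device_id=local_rank)
+
+   # Construct the optimizer after wrapping so it references the sharded parameters.
+   optimizer = torch.optim.SGD(fsdp_model.parameters(), lr=1e-3)
+
+   loss = loss_fn(fsdp_model(inputs), labels)
+   loss.backward()
+   optimizer.step()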
+
+.. _rocm-for-ai-deepspeed:
+
+DeepSpeed
+---------
+
+`DeepSpeed `_ offers system innovations that make large-scale deep learning training effective,
+efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the
+training pillar.
+
+See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs
+`_ for a detailed example of
+training with DeepSpeed on an AMD accelerator or GPU.
+
+.. _rocm-for-ai-automatic-mixed-precision:
+
+Automatic mixed precision (AMP)
+-------------------------------
+
+As models increase in size, the time and memory needed to train them (that is, their cost) also increase. Any measure
+that reduces training time and memory usage through `automatic mixed precision
+`_ (AMP) is highly beneficial for most use cases.
+
+See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
+`_
+for more information about running AMP on an AMD accelerator.
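+
+In PyTorch, AMP typically combines autocasting with gradient scaling. The following is a minimal training-loop
+sketch based on the standard ``torch.cuda.amp`` recipe; it assumes ``model``, ``optimizer``, ``loss_fn``, and
+``dataloader`` are already defined. On ROCm, the ``cuda`` device type targets AMD accelerators.
+
+.. code-block:: python
+
+   import torch
+
+   # Scales the loss to avoid gradient underflow when computing in float16.
+   scaler = torch.cuda.amp.GradScaler()
+
+   for inputs, labels in dataloader:
+       optimizer.zero_grad()
+
+       # Run the forward pass in mixed precision.
+       with torch.autocast(device_type="cuda", dtype=torch.float16):
+           loss = loss_fn(model(inputs), labels)
+
+       # Backward on the scaled loss, then step the optimizer and update the scale.
+       scaler.scale(loss).backward()
+       scaler.step(optimizer)
+       scaler.update()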
+
+.. _rocm-for-ai-fine-tune:
+
+Fine-tuning your model
+======================
+
+ROCm supports multiple fine-tuning techniques, such as LoRA, QLoRA, PEFT, and FSDP.
+
+The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.
+
+* Fine-tuning Llama 2 with LoRA
+
+  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering — ROCm Blogs
+    `_
+
+* Fine-tuning Llama 2 with QLoRA
+
+  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs
+    `_
+
+* Fine-tuning a BERT-based LLM for a text classification task using JAX
+
+  * `LLM distributed supervised fine-tuning with JAX — ROCm Blogs
+    `_
+
+* Fine-tuning StarCoder using PEFT
+
+  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs — ROCm Blogs
+    `_
+
+* Recipes for fine-tuning Llama 2 and 3 with ``llama-recipes``
+
+  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
+    single/multi-node GPUs `_
diff --git a/docs/index.md b/docs/index.md
index dd30f5079..0f37a3c6a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -91,6 +91,7 @@ Our documentation is organized into the following categories:
 :img-alt: How-to documentation
 :padding: 2
 
+* [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
 * [System tuning for various architectures](./how-to/tuning-guides.md)
   * [MI100](./how-to/tuning-guides/mi100.md)
   * [MI200](./how-to/tuning-guides/mi200.md)
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 11855b6a6..a76f9dd9e 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -49,6 +49,15 @@ subtrees:
   - caption: How to
     entries:
+    - file: how-to/rocm-for-ai/index.rst
+      title: Using ROCm for AI
+      subtrees:
+        - entries:
+          - file: how-to/rocm-for-ai/install.rst
+            title: Installation
+          - file: how-to/rocm-for-ai/train-a-model.rst
+          - file: how-to/rocm-for-ai/hugging-face-models.rst
+          - file: how-to/rocm-for-ai/deploy-your-model.rst
     - file: how-to/tuning-guides.md
      title: System optimization
      subtrees:
@@ -139,4 +148,3 @@ subtrees:
        title: Provide feedback
   - file: about/license.md
     title: ROCm license
-