.. meta::
  :description: How to run models from Hugging Face on AMD GPUs.
  :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world’s largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)
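
This example runs on the CPU by default. On ROCm builds of PyTorch, AMD GPUs are addressed through the ``cuda`` device
type, so a minimal sketch of the same example on the GPU looks like this (assuming a ROCm-enabled PyTorch install):

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   # ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type.
   model = GPT2Model.from_pretrained('gpt2').to("cuda")

   text = "Replace me with any text you'd like."
   # Move the input tensors to the same device as the model.
   encoded_input = tokenizer(text, return_tensors='pt').to("cuda")
   output = model(**encoded_input)
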
Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started; a loading example follows the list:

- `BERT <https://huggingface.co/bert-base-uncased>`_
- `BLOOM <https://huggingface.co/bigscience/bloom>`_
- `Llama <https://huggingface.co/huggyllama/llama-7b>`_
- `OPT <https://huggingface.co/facebook/opt-66b>`_
- `T5 <https://huggingface.co/t5-base>`_
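
Each of these checkpoints can be loaded the same way as the GPT-2 example above. The following minimal sketch uses the
generic ``Auto`` classes; the specific checkpoint (``bert-base-uncased``) is only an illustration and can be swapped
for another model ID from the list:

.. code-block:: python

   from transformers import AutoModel, AutoTokenizer

   # Any encoder or decoder model ID listed above works here, e.g. "bigscience/bloom".
   # (Encoder-decoder models such as T5 also require decoder inputs in the forward call.)
   model_id = "bert-base-uncased"

   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModel.from_pretrained(model_id)

   inputs = tokenizer("ROCm runs Hugging Face models on AMD GPUs.", return_tensors="pt")
   outputs = model(**inputs)
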
.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct GPUs. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.
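
Before relying on these integrations, it can be worth confirming that your ROCm build of PyTorch actually sees the GPU.
A minimal check, assuming PyTorch was installed with ROCm support:

.. code-block:: python

   import torch

   # ROCm builds of PyTorch report AMD GPUs through the CUDA device API.
   print(torch.cuda.is_available())      # True when a supported AMD GPU is visible
   print(torch.cuda.get_device_name(0))  # prints the detected GPU's name
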
.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to use
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest
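
   Once inside the container, it can help to confirm the GPUs were passed through before running a workload. One
   common check, assuming the image ships the ROCm command-line tools, is ``rocm-smi``:

   .. code-block:: shell

      # Lists the AMD GPUs visible inside the container, with utilization and memory usage.
      rocm-smi
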
#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )
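
   Once the model is loaded with Flash Attention 2 enabled, generation works the same as for any other causal language
   model. A short usage sketch continuing from the code above (the prompt is only an illustration):

   .. code-block:: python

      inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")

      # Flash Attention 2 is used transparently inside the attention layers during generation.
      output_ids = model.generate(**inputs, max_new_tokens=20)
      print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
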
.. _rocm-for-ai-gptq:

GPTQ
----

Hosted ROCm wheels are available to enable `GPTQ <https://arxiv.org/abs/2210.17323>`_ quantization support.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD GPUs supporting ROCm, specify the ``ROCM_VERSION`` environment variable from
   within an AutoGPTQ source checkout.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_ as the backend:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )
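
   As with the Flash Attention example, the quantized model behaves like any other causal language model once loaded.
   A hedged continuation of the step above (the prompt is only an illustration):

   .. code-block:: python

      inputs = tokenizer("Explain GPTQ quantization in one sentence.", return_tensors="pt").to("cuda")

      # AutoGPTQ's kernels handle the quantized weight matrix multiplications during generation.
      output_ids = model.generate(**inputs, max_new_tokens=50)
      print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
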
.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ...,
          provider="ROCMExecutionProvider",
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options,
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors' performance.")
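
   ``pipeline`` returns a list with one dictionary per input, each holding a predicted ``label`` and a ``score``.
   Printing the result is a quick way to confirm the ROCm execution provider produced a sensible classification:

   .. code-block:: python

      # result is a list of {'label': ..., 'score': ...} dictionaries, one per input sentence.
      print(result)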