.. meta::
   :description: How to run models from Hugging Face on AMD GPUs.
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world's largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)
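
The example above runs on the CPU. ROCm builds of PyTorch expose AMD GPUs through the ``cuda`` device name, so moving
the model and inputs to the GPU follows the usual PyTorch pattern. Here is a minimal sketch of the same GPT2 example on
a GPU (assuming a ROCm-enabled PyTorch installation):

.. code-block:: python

   import torch
   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   # ROCm builds of PyTorch report AMD GPUs as "cuda" devices.
   model = GPT2Model.from_pretrained('gpt2').to("cuda")

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt').to("cuda")
   with torch.no_grad():
       output = model(**encoded_input)
   print(output.last_hidden_state.shape)
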
Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT <https://huggingface.co/bert-base-uncased>`_
- `BLOOM <https://huggingface.co/bigscience/bloom>`_
- `Llama <https://huggingface.co/huggyllama/llama-7b>`_
- `OPT <https://huggingface.co/facebook/opt-66b>`_
- `T5 <https://huggingface.co/t5-base>`_
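
Any of these can be loaded through the high-level ``pipeline`` API. The following is a minimal sketch using BERT for
masked-word prediction; the model name, prompt, and device index are illustrative:

.. code-block:: python

   from transformers import pipeline

   # device=0 selects the first GPU; ROCm PyTorch exposes AMD GPUs as "cuda" devices.
   unmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
   print(unmasker("Hello, I'm a [MASK] model."))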

.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct GPUs. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .
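
Before running any of the examples below, you can confirm that PyTorch sees your AMD GPU. This quick check assumes a
ROCm-enabled PyTorch is already installed:

.. code-block:: shell

   python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))'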

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to use
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .

      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video \
         --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
         -v $volume:/workspace --name transformer_amd \
         transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )
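
Once the model is loaded with Flash Attention 2 enabled, text generation works the same as with any other Transformers
model. The following sketch continues from the Falcon example above; the prompt and generation settings are
illustrative:

.. code-block:: python

   # Tokenize a prompt, move it to the GPU, and generate a completion.
   inputs = tokenizer("What is flash attention?", return_tensors="pt").to("cuda")
   outputs = model.generate(**inputs, max_new_tokens=50)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))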

.. _rocm-for-ai-gptq:

GPTQ
----

To enable `GPTQ <https://arxiv.org/abs/2210.17323>`_ support, hosted AutoGPTQ wheels are available for ROCm.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD GPUs supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )
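
The quantized model can then be used for inference like any other causal language model. The following sketch continues
from the example above; the chat-style prompt is illustrative:

.. code-block:: python

   # Llama-2 chat models expect the [INST] ... [/INST] prompt format.
   prompt = "[INST] What is ROCm? [/INST]"
   inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
   outputs = model.generate(**inputs, max_new_tokens=64)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))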

.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      ..
      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ..
          provider="ROCMExecutionProvider"
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
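
The ``result`` object is a list of dictionaries containing the predicted ``label`` and ``score``. Because
``log_severity_level`` is set to ``0``, ONNX Runtime also emits verbose logs, which should show node placements on the
``ROCMExecutionProvider`` and help confirm that inference is running on the AMD GPU.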