.. meta::
   :description: How to run models from Hugging Face on AMD GPUs.
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face `_ hosts the world’s largest AI model repository for developers to obtain transformer models. Hugging
Face models and tools significantly enhance productivity, performance, and accessibility in developing and deploying
AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD GPUs.

.. _rocm-for-ai-hugging-face-transformers:

Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library `_, which lets you easily import any of the transformer models
into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 `_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)

Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT `_
- `BLOOM `_
- `Llama `_
- `OPT `_
- `T5 `_

.. _rocm-for-ai-hugging-face-optimum:

Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD GPUs, refer to the `Optimum-AMD `_ page on Hugging Face for
guidance on using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct GPUs. For other :doc:`ROCm-capable hardware `, support is
currently not validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:

Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:

Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile `_ to build a container with Flash Attention enabled for ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .

      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video \
         --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
         -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers `_ by adding the ``use_flash_attention_2`` parameter to
   ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )
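
   For a quick end-to-end check, you can run a short generation with the model and tokenizer loaded above. This is a
   minimal sketch, not taken from the Optimum-AMD documentation; the prompt and ``max_new_tokens`` value are
   arbitrary.

   .. code-block:: python

      # Quick sanity check: generate a short completion with the Flash
      # Attention 2 model loaded above. Prompt and settings are illustrative.
      inputs = tokenizer("What is ROCm?", return_tensors="pt").to("cuda")

      with torch.no_grad():
          output_ids = model.generate(**inputs, max_new_tokens=64)

      print(tokenizer.decode(output_ids[0], skip_special_tokens=True))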

.. _rocm-for-ai-gptq:

GPTQ
----

To enable `GPTQ `_, hosted wheels are available for ROCm.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation `_ for in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD GPUs supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library `_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )

.. _rocm-for-ai-onnx:

ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime `_ integration. For ONNX models, usage is straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ...,
          provider="ROCMExecutionProvider",
      )

#. Try running a `BERT text classification `_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options,
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
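
   The pipeline returns standard Transformers pipeline output: a list of dictionaries with ``label`` and ``score``
   keys. As a quick sanity check that your ONNX Runtime build actually exposes the ROCm execution provider, query the
   available providers. The snippet below is a minimal sketch, not part of the Optimum documentation.

   .. code-block:: python

      import onnxruntime as ort

      # A ROCm-enabled ONNX Runtime build should list "ROCMExecutionProvider" here.
      print(ort.get_available_providers())

      # The pipeline result is a list of {"label": ..., "score": ...} dictionaries.
      print(result)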