.. meta::
   :description: Model fine-tuning and inference on a multi-GPU system
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference, accelerators, PyTorch, HuggingFace, torchtune

*****************************************************
Fine-tuning and inference using multiple GPUs
*****************************************************

This section explains how to fine-tune a model on a multi-accelerator system. See
:doc:`Single-accelerator fine-tuning <single-gpu-fine-tuning-and-inference>` for a single GPU setup.

.. _fine-tuning-llms-multi-gpu-env:

Environment setup
=================
This section was tested using the following hardware and software environment.

.. list-table::
   :stub-columns: 1

   * - Hardware
     - 4 AMD Instinct MI300X GPUs
   * - Software
     - ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
   * - Libraries
     - ``transformers``, ``datasets``, ``accelerate``, ``huggingface-hub``, ``peft``, ``trl``, ``scipy``
   * - Base model
     - ``meta-llama/Llama-2-7b-chat-hf``

.. _fine-tuning-llms-multi-gpu-env-setup:

Setting up the base implementation environment
----------------------------------------------

#. Install PyTorch for ROCm. Refer to the
   :doc:`PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`. For a consistent
   installation, it's recommended to use the official ROCm prebuilt Docker images with the framework pre-installed.

#. In the Docker container, check the availability of ROCm-capable GPUs using the following command.

   .. code-block:: shell

      rocm-smi --showproductname

#. Check that your GPUs are available to PyTorch.

   .. code-block:: python

      import torch
      print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
      print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())

   If successful, your output should look like this:

   .. code-block:: shell

      >>> print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
      Is a ROCm-GPU detected?  True
      >>> print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())
      How many ROCm-GPUs are detected?  4

.. tip::

   During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal.
   This tool helps you see which GPUs are involved.

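If you prefer to check memory usage from within your Python process instead, the following is a minimal sketch
that uses PyTorch's built-in CUDA memory queries (which report HIP devices when running on ROCm); the exact
numbers depend on your workload.

.. code-block:: python

   import torch

   # Report allocated and reserved memory for each visible GPU.
   for device_id in range(torch.cuda.device_count()):
       allocated_gib = torch.cuda.memory_allocated(device_id) / 1024**3
       reserved_gib = torch.cuda.memory_reserved(device_id) / 1024**3
       name = torch.cuda.get_device_name(device_id)
       print(f"GPU {device_id} ({name}): "
             f"{allocated_gib:.2f} GiB allocated, {reserved_gib:.2f} GiB reserved")
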
.. _fine-tuning-llms-multi-gpu-hugging-face-accelerate:

Hugging Face Accelerate for fine-tuning and inference
===========================================================
`Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`__ is a library that simplifies turning raw
PyTorch code for a single GPU into code for multiple GPUs for LLM fine-tuning and inference. It is
integrated with `Transformers <https://huggingface.co/docs/transformers/en/index>`__, so you can scale your PyTorch
code while maintaining performance and flexibility.

As a brief example of model fine-tuning and inference using multiple GPUs, let's use Transformers and load in the Llama
2 7B model.

Here, let's reuse the code in :ref:`Single-accelerator fine-tuning <fine-tuning-llms-single-gpu-download-model-dataset>`
to load the base model and tokenizer.

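For reference, a minimal sketch of the tokenizer loading step (mirroring the single-accelerator guide) might look
like the following; the exact arguments used there may differ slightly.

.. code-block:: python

   from transformers import AutoTokenizer

   base_model_name = "meta-llama/Llama-2-7b-chat-hf"

   # Load the tokenizer that matches the base model.
   tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
   tokenizer.pad_token = tokenizer.eos_token
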
Now, it's important to adjust how you load the model. Add the ``device_map`` parameter to your base model configuration.

.. code-block:: python

   ...
   base_model_name = "meta-llama/Llama-2-7b-chat-hf"

   # Load base model to GPU memory
   base_model = AutoModelForCausalLM.from_pretrained(
       base_model_name,
       device_map="auto",
       trust_remote_code=True)
   ...

   # Run training
   sft_trainer.train()

.. note::

   You can let Accelerate handle the device map computation by setting ``device_map`` to one of the supported options
   (``"auto"``, ``"balanced"``, ``"balanced_low_0"``, ``"sequential"``).

It's recommended to set the ``device_map`` parameter to ``"auto"`` to allow Accelerate to automatically and
efficiently allocate the model given the available resources (four GPUs in this case).

When you have more GPU memory available than the model size, here is the difference between each ``device_map``
option:

* ``"auto"`` and ``"balanced"`` evenly split the model across all available GPUs, making it possible for you to use a
  batch size greater than 1.

* ``"balanced_low_0"`` evenly splits the model on all GPUs except the first one, and only puts on GPU 0 what does not
  fit on the others. This option is great when you need to use GPU 0 for some processing of the outputs, like when
  using the ``generate`` function for Transformers models.

* ``"sequential"`` fits what it can on GPU 0, then moves on to GPU 1, and so forth. Not all GPUs might be used.

After loading the model this way, it's ready to make full use of the resources available to it.

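To confirm how Accelerate placed the model, and to sanity-check multi-GPU inference, you can use a short sketch like
the one below. It assumes ``base_model`` was loaded with ``device_map="auto"`` as shown above and that a matching
``tokenizer`` is already loaded; the prompt is only an example.

.. code-block:: python

   # Inspect which layers Accelerate assigned to which GPU.
   print(base_model.hf_device_map)

   # Quick multi-GPU inference sanity check.
   inputs = tokenizer("What is ROCm?", return_tensors="pt").to("cuda:0")
   outputs = base_model.generate(**inputs, max_new_tokens=50)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
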
.. _fine-tuning-llms-multi-gpu-torchtune:

torchtune for fine-tuning and inference
=============================================
`torchtune <https://meta-pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs.

#. Install torchtune using pip.

   .. code-block:: shell

      # Install torchtune with PyTorch release 2.2.2+
      pip install torchtune

      # To confirm that the package is installed correctly
      tune --help

   The output should look like this:

   .. code-block:: shell

      usage: tune [-h] {download,ls,cp,run,validate} ...

      Welcome to the TorchTune CLI!

      options:
        -h, --help            show this help message and exit

      subcommands:
        {download,ls,cp,run,validate}

#. torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction
   getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes.

   .. code-block:: shell

      RECIPE                           CONFIG
      full_finetune_single_device      llama2/7B_full_low_memory
                                       llama3/8B_full_single_device
                                       mistral/7B_full_low_memory
      full_finetune_distributed        llama2/7B_full
                                       llama2/13B_full
                                       llama3/8B_full
                                       mistral/7B_full
                                       gemma/2B_full
      lora_finetune_single_device      llama2/7B_lora_single_device
                                       llama2/7B_qlora_single_device
                                       llama3/8B_lora_single_device
                                       llama3/8B_qlora_single_device
                                       llama2/13B_qlora_single_device
                                       mistral/7B_lora_single_device

   The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning
   techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training,
   evaluation, quantization, or inference recipes.

   The following snippet shows the structure of a model's YAML configuration file:

   .. code-block:: yaml

      # Model arguments
      model:
        _component_: torchtune.models.llama2.lora_llama2_7b
        lora_attn_modules: ['q_proj', 'v_proj']
        apply_lora_to_mlp: False
        apply_lora_to_output: False
        lora_rank: 8
        lora_alpha: 16

      tokenizer:
        _component_: torchtune.models.llama2.llama2_tokenizer
        path: /tmp/Llama-2-7b-hf/tokenizer.model

      # Dataset and sampler
      dataset:
        _component_: torchtune.datasets.alpaca_cleaned_dataset
        train_on_input: True

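   Each ``_component_`` entry names a Python builder that torchtune instantiates with the remaining keys as keyword
   arguments. As a rough illustration only (builder signatures can differ between torchtune releases), the ``model``
   section above corresponds to something like:

   .. code-block:: python

      # Sketch: mirror the YAML "model" section in Python.
      # The keyword arguments are taken from the config shown above.
      from torchtune.models.llama2 import lora_llama2_7b

      model = lora_llama2_7b(
          lora_attn_modules=["q_proj", "v_proj"],
          apply_lora_to_mlp=False,
          apply_lora_to_output=False,
          lora_rank=8,
          lora_alpha=16,
      )
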
#. This configuration file defines the fine-tuning base model path, data set, hyper-parameters for optimizer and scheduler,
   and training data type. To download the base model for fine-tuning, run the following command:

   .. code-block:: shell

      tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token

   The output directory argument for ``--output-dir`` should match the model path specified in the YAML config file.

#. To launch ``lora_finetune_distributed`` on four devices, run the following
   command:

   .. code-block:: shell

      tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora

   If successful, you should see something like the following output:

   .. code-block:: shell

      INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ...
      INFO:torchtune.utils.logging:Model instantiation took 7.32 secs
      INFO:torchtune.utils.logging:Memory Stats after model init:
      {'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448}
      INFO:torchtune.utils.logging:Optimizer and loss are initialized.
      INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
      INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
      1|111|Loss: 1.5790324211120605:   7%|| 114/1618

Read more about inference frameworks in :doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`.