Update fine-tuning guide: title, improve readability in code blocks, fix typos (#3222)
* Fix typo
* Add torchtune link
* Add newlines before comments in code blocks for readability
* Update title
@@ -2,9 +2,9 @@
 :description: How to fine-tune LLMs with ROCm
 :keywords: ROCm, LLM, fine-tuning, usage, tutorial

-**************************
-Fine-tuning LLMs with ROCm
-**************************
+*******************************************
+Fine-tuning LLMs and inference optimization
+*******************************************

 ROCm empowers the fine-tuning and optimization of large language models, making them accessible and efficient for
 specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks,

@@ -32,7 +32,7 @@ Installing vLLM

 .. code-block:: shell

-   # Install from the source
+   # Install from source
    git clone https://github.com/ROCm/vllm.git
    cd vllm
    PYTORCH_ROCM_ARCH=gfx942 python setup.py install #MI300 series
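
Once vLLM is built, a quick offline-generation run is an easy way to confirm the installation. The following is a hedged sketch, not part of the guide; the model name and sampling settings are illustrative assumptions.

.. code-block:: python

   # Minimal vLLM smoke test (hedged sketch; the model name is only an example).
   from vllm import LLM, SamplingParams

   llm = LLM(model="facebook/opt-125m")
   sampling = SamplingParams(temperature=0.8, max_tokens=32)
   outputs = llm.generate(["Hello, my name is"], sampling)
   print(outputs[0].outputs[0].text)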

@@ -40,7 +40,7 @@ ROCm provides two different implementations of Flash Attention 2 modules. They c

 .. code-block:: shell

-   # Install from the source
+   # Install from source
    git clone https://github.com/ROCm/flash-attention.git
    cd flash-attention/
    GPU_ARCHS=gfx942 python setup.py install #MI300 series
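
After the build completes, the kernel can be exercised directly. The following is a hedged sketch assuming the package's (batch, seqlen, heads, head_dim) tensor layout and a GPU with half-precision support; the shapes are arbitrary.

.. code-block:: python

   # Hedged sketch: call the Flash Attention 2 kernel on fp16 GPU tensors.
   import torch
   from flash_attn import flash_attn_func

   q = torch.randn(2, 128, 16, 64, device="cuda", dtype=torch.float16)
   k = torch.randn(2, 128, 16, 64, device="cuda", dtype=torch.float16)
   v = torch.randn(2, 128, 16, 64, device="cuda", dtype=torch.float16)

   out = flash_attn_func(q, k, v, causal=True)  # output keeps the (batch, seqlen, heads, head_dim) shape
   print(out.shape)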

@@ -156,7 +156,7 @@ of the PyTorch compilation.

 .. code-block:: python

-   # Sample script to run LLM with the static key-value cache and pytorch compilation
+   # Sample script to run LLM with the static key-value cache and PyTorch compilation
    from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
    import torch
    from typing import Optional
@@ -180,7 +180,8 @@ of the PyTorch compilation.
        return new_token

    batch_size, seq_length = inputs["input_ids"].shape
-   # static key-value cache
+
+   # Static key-value cache
    max_cache_length = 1024
    max_new_tokens = 10
    model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)
@@ -190,6 +191,7 @@ of the PyTorch compilation.

    logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
    next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
+
    # torch compilation
    decode_one_tokens = torch.compile(decode_one_tokens, mode="max-autotune-no-cudagraphs",fullgraph=True)
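
The compilation call above can also be sanity-checked on a small stand-in function with the same options before running the full script. The function below is illustrative only and is not part of the guide's code.

.. code-block:: python

   # Hedged, self-contained sketch of torch.compile with the options used above.
   import torch

   def step(x: torch.Tensor) -> torch.Tensor:
       return torch.nn.functional.relu(x @ x.T)

   compiled_step = torch.compile(step, mode="max-autotune-no-cudagraphs", fullgraph=True)
   print(compiled_step(torch.randn(8, 8)).shape)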

@@ -221,10 +223,10 @@ page describes the options.

 .. code-block:: python

-   # To turn on TunableOps, simply set this environmental variable
+   # To turn on TunableOp, simply set this environment variable
    export PYTORCH_TUNABLEOP_ENABLED=1

-   # python
+   # Python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
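
Any GEMM issued after enabling the variable is enough to exercise TunableOp. The snippet below is a hedged sketch; it assumes ``PYTORCH_TUNABLEOP_ENABLED=1`` was exported before launching Python and that a ROCm-capable GPU is visible.

.. code-block:: python

   # Hedged sketch: run a GEMM that the tunable GEMM path can pick up.
   import torch

   a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
   b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
   c = a @ b          # matrix multiply routed through TunableOp when enabled
   torch.cuda.synchronize()
   print(c.shape)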

@@ -32,7 +32,7 @@ The AutoGPTQ library implements the GPTQ algorithm.

 .. code-block:: shell

-   # This will install pre-built wheel for a specific ROCm version
+   # This will install pre-built wheel for a specific ROCm version.

    pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

@@ -40,11 +40,11 @@ The AutoGPTQ library implements the GPTQ algorithm.

 .. code-block:: shell

-   # Clone the source code
+   # Clone the source code.
    git clone https://github.com/AutoGPTQ/AutoGPTQ.git
    cd AutoGPTQ

-   # Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device
+   # Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device.
    PYTORCH_ROCM_ARCH=gfx942 ROCM_VERSION=6.1 pip install .

    # Show the package after the installation

@@ -93,12 +93,14 @@ Using GPTQ with AutoGPTQ

 .. code-block:: python

-   # import auto_gptq class
+   # Import auto_gptq class.
    from auto_gptq import AutoGPTQForCausalLM
-   # load non-quantized model
+
+   # Load non-quantized model.
    base_model = AutoGPTQForCausalLM.from_pretrained(base_model_name, quantize_config, device_map = "auto")
    base_model.quantize(examples)
-   # save quantized model
+
+   # Save quantized model.
    base_model.save_quantized(quantized_model_name)
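
The excerpt above uses ``quantize_config`` and ``examples`` without showing their definitions. The following is a hedged sketch of how they are typically constructed with AutoGPTQ; the model name, 4-bit settings, and calibration sentence are illustrative assumptions.

.. code-block:: python

   # Hedged sketch: build the quantization config and calibration examples used above.
   from transformers import AutoTokenizer
   from auto_gptq import BaseQuantizeConfig

   base_model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model name
   tokenizer = AutoTokenizer.from_pretrained(base_model_name)

   quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
   examples = [tokenizer("ROCm supports fine-tuning and quantizing large language models.")]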

 Using GPTQ with Hugging Face Transformers

@@ -201,7 +203,7 @@ Installing bitsandbytes

 Using bitsandbytes primitives
 -----------------------------

-To get started with bitsandbytes primitives, use the following code a reference.
+To get started with bitsandbytes primitives, use the following code as reference.

 .. code-block:: python

@@ -230,7 +232,7 @@ To load a Transformers model in 4-bit, set ``load_int_4bt=true`` in ``BitsAndByt
        device_map="auto",
        quantization_config=quantization_config)

-   # check the memory footprint with get_memory_footprint method
+   # Check the memory footprint with get_memory_footprint method
    print(bnb_model_4bit.get_memory_footprint())

 To load a model in 8-bit for inference, use the ``load_in_8bit`` option.
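
As a hedged illustration of that option (the model name is a placeholder, not taken from the guide):

.. code-block:: python

   # Hedged sketch: load a model in 8-bit through Transformers and bitsandbytes.
   from transformers import AutoModelForCausalLM, BitsAndBytesConfig

   bnb_model_8bit = AutoModelForCausalLM.from_pretrained(
       "meta-llama/Llama-2-7b-chat-hf",                      # placeholder model name
       device_map="auto",
       quantization_config=BitsAndBytesConfig(load_in_8bit=True),
   )
   print(bnb_model_8bit.get_memory_footprint())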

@@ -130,8 +130,8 @@ After loading the model in this way, the model is fully ready to use the resourc
 torchtune for fine-tuning and inference
 =============================================

-torchtune is a PyTorch-native library for easy single and multi-accelerator or GPU model fine-tuning and inference with
-LLMs.
+`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-accelerator or
+GPU model fine-tuning and inference with LLMs.

 #. Install torchtune using pip.

@@ -157,80 +157,80 @@ LLMs.
       subcommands:
         {download,ls,cp,run,validate}

-torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction
-getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes.
+#. torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction
+   getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes.

-.. code-block:: shell
+   .. code-block:: shell

-   RECIPE                          CONFIG
-   full_finetune_single_device     llama2/7B_full_low_memory
-                                   llama3/8B_full_single_device
-                                   mistral/7B_full_low_memory
-   full_finetune_distributed       llama2/7B_full
-                                   llama2/13B_full
-                                   llama3/8B_full
-                                   mistral/7B_full
-                                   gemma/2B_full
-   lora_finetune_single_device     llama2/7B_lora_single_device
-                                   llama2/7B_qlora_single_device
-                                   llama3/8B_lora_single_device
-                                   llama3/8B_qlora_single_device
-                                   llama2/13B_qlora_single_device
-                                   mistral/7B_lora_single_device
+      RECIPE                          CONFIG
+      full_finetune_single_device     llama2/7B_full_low_memory
+                                      llama3/8B_full_single_device
+                                      mistral/7B_full_low_memory
+      full_finetune_distributed       llama2/7B_full
+                                      llama2/13B_full
+                                      llama3/8B_full
+                                      mistral/7B_full
+                                      gemma/2B_full
+      lora_finetune_single_device     llama2/7B_lora_single_device
+                                      llama2/7B_qlora_single_device
+                                      llama3/8B_lora_single_device
+                                      llama3/8B_qlora_single_device
+                                      llama2/13B_qlora_single_device
+                                      mistral/7B_lora_single_device

-The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning
-techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training,
-evaluation, quantization, or inference recipes.
+   The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning
+   techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training,
+   evaluation, quantization, or inference recipes.

-The snippet shows the architecture of a model's YAML configuration file:
+   The snippet shows the architecture of a model's YAML configuration file:

-.. code-block:: yaml
+   .. code-block:: yaml

-   # Model Arguments
-   model:
-     _component_: torchtune.models.llama2.lora_llama2_7b
-     lora_attn_modules: ['q_proj', 'v_proj']
-     apply_lora_to_mlp: False
-     apply_lora_to_output: False
-     lora_rank: 8
-     lora_alpha: 16
-
-   tokenizer:
-     _component_: torchtune.models.llama2.llama2_tokenizer
-     path: /tmp/Llama-2-7b-hf/tokenizer.model
-
-   # Dataset and Sampler
-   dataset:
-     _component_: torchtune.datasets.alpaca_cleaned_dataset
-     train_on_input: True
+      # Model arguments
+      model:
+        _component_: torchtune.models.llama2.lora_llama2_7b
+        lora_attn_modules: ['q_proj', 'v_proj']
+        apply_lora_to_mlp: False
+        apply_lora_to_output: False
+        lora_rank: 8
+        lora_alpha: 16
+
+      tokenizer:
+        _component_: torchtune.models.llama2.llama2_tokenizer
+        path: /tmp/Llama-2-7b-hf/tokenizer.model
+
+      # Dataset and sampler
+      dataset:
+        _component_: torchtune.datasets.alpaca_cleaned_dataset
+        train_on_input: True

-This configuration file defines the fine-tuning base model path, data set, hyper-parameters for optimizer and scheduler,
-and training data type. To download the base model for fine-tuning, run the following command:
+#. This configuration file defines the fine-tuning base model path, data set, hyper-parameters for optimizer and scheduler,
+   and training data type. To download the base model for fine-tuning, run the following command:

-.. code-block:: shell
+   .. code-block:: shell

-   tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token
+      tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token

-The output directory argument for ``--output-dir`` should map the model path specified in YAML config file.
+   The output directory argument for ``--output-dir`` should map the model path specified in YAML config file.

-To launch ``lora_finetune_distributed`` on four devices, run the following
-command:
+#. To launch ``lora_finetune_distributed`` on four devices, run the following
+   command:

-.. code-block:: shell
+   .. code-block:: shell

-   tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora
+      tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora

-If successful, you should see something like the following output:
+   If successful, you should see something like the following output:

-.. code-block:: shell
+   .. code-block:: shell

-   INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ...
-   INFO:torchtune.utils.logging:Model instantiation took 7.32 secs
-   INFO:torchtune.utils.logging:Memory Stats after model init:
-   {'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448}
-   INFO:torchtune.utils.logging:Optimizer and loss are initialized.
-   INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
-   INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
-   1|111|Loss: 1.5790324211120605: 7%|█ | 114/1618
+      INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ...
+      INFO:torchtune.utils.logging:Model instantiation took 7.32 secs
+      INFO:torchtune.utils.logging:Memory Stats after model init:
+      {'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448}
+      INFO:torchtune.utils.logging:Optimizer and loss are initialized.
+      INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
+      INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
+      1|111|Loss: 1.5790324211120605: 7%|█ | 114/1618

 Read more about inference frameworks in :doc:`LLM inference frameworks <llm-inference-frameworks>`.

@@ -7,7 +7,7 @@ Conceptual overview of fine-tuning LLMs
 ***************************************

 Large language models (LLMs) are trained on massive amounts of text data to generate coherent and fluent text. The
-underlying *transformer* architecture is the fundamental building block of all LLMs. Transformers serve as the
+underlying *transformer* architecture is the fundamental building block of all LLMs. Transformers
 enable LLMs to understand and generate text by capturing contextual relationships and long-range dependencies. To better
 understand the philosophy of the transformer architecture, review the foundational
 `Attention is all you need <https://arxiv.org/pdf/1706.03762.pdf>`_ paper.

@@ -60,7 +60,7 @@ overcome this issue of high memory consumption.
 LoRA accelerates the adjustment process and reduces related memory costs. To be precise, LoRA decomposes the portion of
 weight changes :math:`ΔW` into high-precision low-rank representations, which do not require the calculations of all
 :math:`ΔW`. It learns the decomposition representation of :math:`ΔW` during training, as shown in
-:ref:`the weight update diagram <fine-tuning-llms-concept-challenge>`. This is how LoRA saves on
+the :ref:`weight update diagram <fine-tuning-llms-concept-challenge>`. This is how LoRA saves on
 computing resources.
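
The decomposition is easy to see in a few lines of PyTorch. This is a hedged, self-contained sketch of the idea only, not the PEFT implementation; the dimensions, rank, and scaling values are illustrative.

.. code-block:: python

   # W is frozen; only the low-rank factors A and B are trained, so the update
   # delta_W = (alpha / r) * (B @ A) costs r * (d_in + d_out) parameters instead of d_in * d_out.
   import torch

   d_out, d_in, r, alpha = 512, 512, 8, 16
   W = torch.randn(d_out, d_in)          # frozen pretrained weight
   A = 0.01 * torch.randn(r, d_in)       # trainable low-rank factor
   B = torch.zeros(d_out, r)             # trainable low-rank factor (initialized to zero)

   delta_W = (alpha / r) * (B @ A)       # low-rank representation of the weight change
   x = torch.randn(4, d_in)
   y = x @ (W + delta_W).T               # forward pass with the adapted weight
   print(y.shape)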

 LoRA is integrated into the `Hugging Face Parameter-Efficient Fine-Tuning (PEFT)

@@ -87,7 +87,8 @@ Setting up the base implementation environment

 .. code-block:: shell

-   # Install `bitsandbytes` for ROCm 6.0+, use -DBNB_ROCM_ARCH to target specific GPU arch
+   # Install `bitsandbytes` for ROCm 6.0+.
+   # Use -DBNB_ROCM_ARCH to target a specific GPU architecture.
    git clone --recurse https://github.com/ROCm/bitsandbytes.git
    cd bitsandbytes
    git checkout rocm_enabled
@@ -95,13 +96,13 @@ Setting up the base implementation environment
    cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
    python setup.py install

-   # To leverage the SFTTrainer in TRL for model fine-tuning
+   # To leverage the SFTTrainer in TRL for model fine-tuning.
    pip install trl

-   # To leverage PEFT for efficiently adapting pre-trained language models
+   # To leverage PEFT for efficiently adapting pre-trained language models.
    pip install peft

-   # Install the other dependencies:
+   # Install the other dependencies.
    pip install transformers, datasets, huggingface-hub, scipy

 #. Check that the required packages can be imported.
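
A minimal check might look like the sketch below; the package names follow the pip installs above, and the version prints are just a convenience.

.. code-block:: python

   # Quick import check for the fine-tuning environment (hedged sketch).
   import torch
   import transformers
   import datasets
   import peft
   import trl
   import bitsandbytes as bnb

   print(torch.__version__, transformers.__version__)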

@@ -139,14 +140,14 @@ Download the base model and fine-tuning dataset

 .. code-block:: python

-   # Base model and tokenizer names
+   # Base model and tokenizer names.
    base_model_name = "meta-llama/Llama-2-7b-chat-hf"

-   # Load base model to GPU memory
+   # Load base model to GPU memory.
    device = "cuda:0"
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code = True).to(device)

-   # Load tokenizer
+   # Load tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(
            base_model_name,
            trust_remote_code = True)

@@ -159,14 +160,14 @@ Download the base model and fine-tuning dataset

 .. code-block::

-   # Dataset for fine-tuning
+   # Dataset for fine-tuning.
    training_dataset_name = "mlabonne/guanaco-llama2-1k"
    training_dataset = load_dataset(training_dataset_name, split = "train")

-   # Check the data
+   # Check the data.
    print(training_dataset)

-   # #11 is a QA sample in English
+   # Dataset 11 is a QA sample in English.
    print(training_dataset[11])

 #. With the base model and the dataset, let's start fine-tuning!

@@ -180,7 +181,7 @@ To set up ``SFTTrainer`` parameters, you can use the following code as reference

 .. code-block:: python

-   # Training Params for SFTTrainer
+   # Training parameters for SFTTrainer.
    training_arguments = TrainingArguments(
        output_dir = "./results",
        num_train_epochs = 1,

@@ -228,7 +229,7 @@ Compare the number of trainable parameters and training time under the two diffe
        bias = "none",
        task_type = "CAUSAL_LM"
    )
-   # View the number of Trainable Params
+   # View the number of trainable parameters.
    from peft import get_peft_model
    peft_model = get_peft_model(base_model, peft_config)
    peft_model.print_trainable_parameters()
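
The hunk shows only the tail of the ``peft_config`` definition. A hedged sketch of a complete ``LoraConfig`` consistent with those fields follows; the rank, alpha, and dropout values are illustrative assumptions.

.. code-block:: python

   # Hedged sketch: a full LoRA configuration matching the fields shown above.
   from peft import LoraConfig

   peft_config = LoraConfig(
       r = 8,
       lora_alpha = 16,
       lora_dropout = 0.1,
       bias = "none",
       task_type = "CAUSAL_LM",
   )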

@@ -244,7 +245,7 @@ Compare the number of trainable parameters and training time under the two diffe

 .. code-block:: python

-   # Initialize a sft trainer
+   # Initialize an SFT trainer.
    sft_trainer = SFTTrainer(
        model = base_model,
        train_dataset = training_dataset,
@@ -254,7 +255,7 @@ Compare the number of trainable parameters and training time under the two diffe
        args = training_arguments
    )

-   # Run the trainer
+   # Run the trainer.
    sft_trainer.train()

 The output should look like this:

@@ -302,7 +303,7 @@ Compare the number of trainable parameters and training time under the two diffe

 .. code-block:: python

-   # Trainer without LoRA config
+   # Trainer without LoRA config.
    trainer_full = SFTTrainer(
        model = base_model,
        train_dataset = training_dataset,
@@ -311,7 +312,7 @@ Compare the number of trainable parameters and training time under the two diffe
        args = training_arguments
    )

-   # Training
+   # Training.
    trainer_full.train()

 The output should look like this:

@@ -347,20 +348,20 @@ store, and load.

 .. code-block:: python

-   # PEFT adapter name
+   # PEFT adapter name.
    adapter_name = "llama-2-7b-enhanced-adapter"

-   # Save PEFT adapter
+   # Save PEFT adapter.
    sft_trainer.model.save_pretrained(adapter_name)

 The saved PEFT adapter should look like this on your system:

 .. code-block:: shell

-   # Access adapter directory
+   # Access adapter directory.
    cd llama-2-7b-enhanced-adapter

-   # List all adapter files
+   # List all adapter files.
    README.md adapter_config.json adapter_model.safetensors
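
To use the saved adapter later, it can be attached back onto the base model. This is a hedged sketch; the base model and adapter names mirror the ones used earlier in the guide.

.. code-block:: python

   # Hedged sketch: reload the base model and attach the saved PEFT adapter.
   from transformers import AutoModelForCausalLM
   from peft import PeftModel

   base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
   model_with_adapter = PeftModel.from_pretrained(base_model, "llama-2-7b-enhanced-adapter")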

 .. tab-item:: Saving a fully fine-tuned model

@@ -371,20 +372,20 @@ store, and load.

 .. code-block:: python

-   # fully fine-tuned model name
+   # Fully fine-tuned model name.
    new_model_name = "llama-2-7b-enhanced"

-   # Save the fully fine-tuned model
+   # Save the fully fine-tuned model.
    full_trainer.model.save_pretrained(new_model_name)

 The saved new full model should look like this on your system:

 .. code-block:: shell

-   # Access new model directory
+   # Access new model directory.
    cd llama-2-7b-enhanced

-   # List all model files
+   # List all model files.
    config.json model-00002-of-00006.safetensors model-00005-of-00006.safetensors
    generation_config.json model-00003-of-00006.safetensors model-00006-of-00006.safetensors
    model-00001-of-00006.safetensors model-00004-of-00006.safetensors model.safetensors.index.json
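
Reloading the fully fine-tuned model works like loading any Transformers checkpoint directory. A hedged sketch, reusing the directory name saved above and the base model's tokenizer:

.. code-block:: python

   # Hedged sketch: reload the saved full model for inference.
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model = AutoModelForCausalLM.from_pretrained("llama-2-7b-enhanced", device_map="auto")
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")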

@@ -92,7 +92,7 @@ Our documentation is organized into the following categories:
     :padding: 2

     * [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
-    * [Fine-tuning LLMs with ROCm](./how-to/fine-tuning-llms/index.rst)
+    * [Fine-tuning LLMs and inference optimization](./how-to/fine-tuning-llms/index.rst)
     * [System tuning for various architectures](./how-to/tuning-guides.md)
       * [MI100](./how-to/tuning-guides/mi100.md)
       * [MI200](./how-to/tuning-guides/mi200.md)

@@ -59,7 +59,7 @@ subtrees:
       - file: how-to/rocm-for-ai/hugging-face-models.rst
       - file: how-to/rocm-for-ai/deploy-your-model.rst
       - file: how-to/fine-tuning-llms/index.rst
-        title: Fine-tuning LLMs with ROCm
+        title: Fine-tuning LLMs and inference optimization
         subtrees:
         - entries:
           - file: how-to/fine-tuning-llms/overview.rst