Rename fine-tuning and optimization guide directory and fix index.md (#3242)

* Mv fine-tuning and optimization files
* Reorder index.md
* Rename images directory
* Fix internal links
@@ -65,4 +65,4 @@ through the following guides.

 * :doc:`rocm-for-ai/index`

-* :doc:`fine-tuning-llms/index`
+* :doc:`llm-fine-tuning-optimization/index`

@@ -77,7 +77,7 @@ Installing vLLM

 The following log message is displayed in your command line indicates that the server is listening for requests.

-.. image:: ../../data/how-to/fine-tuning-llms/vllm-single-gpu-log.png
+.. image:: ../../data/how-to/llm-fine-tuning-optimization/vllm-single-gpu-log.png
 :alt: vLLM API server log message
 :align: center

@@ -18,7 +18,7 @@ Attention (GQA), and Multi-Query Attention (MQA). This reduction in memory movem
 time-to-first-token (TTFT) latency for large batch sizes and long prompt sequences, thereby enhancing overall
 performance.

-.. image:: ../../data/how-to/fine-tuning-llms/attention-module.png
+.. image:: ../../data/how-to/llm-fine-tuning-optimization/attention-module.png
 :alt: Attention module of a large language module utilizing tiling
 :align: center

@@ -243,7 +243,7 @@ page describes the options.
 Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
 GemmTunableOp_float_TN,tn_200_100_20,Gemm_Rocblas_32323,0.00669595

-.. image:: ../../data/how-to/fine-tuning-llms/tunableop.png
+.. image:: ../../data/how-to/llm-fine-tuning-optimization/tunableop.png
 :alt: GEMM and TunableOp
 :align: center

@@ -31,7 +31,7 @@ Each accelerator or GPU has multiple Compute Units (CUs) and various CUs do comp
 can a compute kernel can allocate its task to? For the :doc:`AMD MI300X accelerator <../../reference/gpu-arch-specs>`, the
 grid should have at least 1024 thread blocks or workgroups.

-.. figure:: ../../data/how-to/fine-tuning-llms/compute-unit.png
+.. figure:: ../../data/how-to/llm-fine-tuning-optimization/compute-unit.png

 Schematic representation of a CU in the CDNA2 or CDNA3 architecture.

@@ -187,7 +187,7 @@ Kernel occupancy

 .. _fine-tuning-llms-occupancy-vgpr-table:

-.. figure:: ../../data/how-to/fine-tuning-llms/occupancy-vgpr.png
+.. figure:: ../../data/how-to/llm-fine-tuning-optimization/occupancy-vgpr.png
 :alt: Occupancy related to VGPR usage in an Instinct MI300X accelerator.
 :align: center

@@ -32,7 +32,7 @@ The template parameters of the instance are grouped into four parameter types:
 ================
 ### Figure 2
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-template_parameters.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-template_parameters.jpg
 The template parameters of the selected GEMM kernel are classified into four groups. These template parameter groups should be defined properly before running the instance.
 ```

@@ -126,7 +126,7 @@ The row and column, and stride information of input matrices are also passed to
 ================
 ### Figure 3
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-kernel_launch.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-kernel_launch.jpg
 Templated kernel launching consists of kernel instantiation, making arguments by passing in actual application parameters, creating an invoker, and running the instance through the invoker.
 ```

@@ -155,7 +155,7 @@ The first operation in the process is to perform the multiplication of input mat
 ================
 ### Figure 4
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-operation_flow.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-operation_flow.jpg
 Operation flow.
 ```

@@ -171,7 +171,7 @@ Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_ke
 ================
 ### Figure 5
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-root_instance.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-root_instance.jpg
 Use the ‘DeviceBatchedGemmMultiD_Xdl’ instance as a root.
 ```

@@ -421,7 +421,7 @@ Run `python setup.py install` to build and install the extension. It should look
 ================
 ### Figure 6
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-compilation.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-compilation.jpg
 Compilation and installation of the INT8 kernels.
 ```

@@ -433,7 +433,7 @@ The implementation architecture of running SmoothQuant models on MI300X GPUs is
 ================
 ### Figure 7
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-inference_flow.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg
 The implementation architecture of running SmoothQuant models on AMD MI300X accelerators.
 ```

@@ -459,7 +459,7 @@ Figure 8 shows the performance comparisons between the original FP16 and the Smo
 ================
 ### Figure 8
 ================ -->
-```{figure} ../../data/how-to/fine-tuning-llms/ck-comparisons.jpg
+```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg
 Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator.
 ```

@@ -41,7 +41,7 @@ The weight update is as follows: :math:`W_{updated} = W + ΔW`.
 If the weight matrix :math:`W` contains 7B parameters, then the weight update matrix :math:`ΔW` should also
 contain 7B parameters. Therefore, the :math:`ΔW` calculation is computationally and memory intensive.

-.. figure:: ../../data/how-to/fine-tuning-llms/weight-update.png
+.. figure:: ../../data/how-to/llm-fine-tuning-optimization/weight-update.png
 :alt: Weight update diagram

 (a) Weight update in regular fine-tuning. (b) Weight update in LoRA where the product of matrix A (:math:`M\times K`)
@@ -38,7 +38,7 @@ You can then visualize and view these metrics using an open-source profile visua
 shows transactions denoting the CPU activities that launch GPU kernels while the lower section shows the actual GPU
 activities where it processes the ``resnet18`` inferences layer by layer.

-.. figure:: ../../data/how-to/fine-tuning-llms/perfetto-trace.svg
+.. figure:: ../../data/how-to/llm-fine-tuning-optimization/perfetto-trace.svg

 Perfetto trace visualization example.

@@ -100,7 +100,7 @@ analyze bottlenecks and stressors for their computational workloads on AMD Insti
 Omniperf collects hardware counters in multiple passes, and will therefore re-run the application during each pass
 to collect different sets of metrics.

-.. figure:: ../../data/how-to/fine-tuning-llms/omniperf-analysis.png
+.. figure:: ../../data/how-to/llm-fine-tuning-optimization/omniperf-analysis.png

 Omniperf memory chat analysis panel.

@@ -130,7 +130,7 @@ hardware counters are also included.
 have the greatest impact on the end-to-end execution of the application and to discover what else is happening on the
 system during a performance bottleneck.

-.. figure:: ../../data/how-to/fine-tuning-llms/omnitrace-timeline.png
+.. figure:: ../../data/how-to/llm-fine-tuning-optimization/omnitrace-timeline.png

 Omnitrace timeline trace example.

@@ -110,7 +110,7 @@ Fine-tuning your model
 ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
 example, LoRA, QLoRA, PEFT, and FSDP.

-Learn more about challenges and solutions for model fine-tuning in :doc:`../fine-tuning-llms/index`.
+Learn more about challenges and solutions for model fine-tuning in :doc:`../llm-fine-tuning-optimization/index`.

 The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.

@@ -34,16 +34,16 @@ Our documentation is organized into the following categories:
* {doc}`Quick start guide<rocm-install-on-linux:tutorial/quick-start>`
* {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`
* {doc}`Package manager integration<rocm-install-on-linux:how-to/native-install/package-manager-integration>`
* {doc}`Install Docker containers<rocm-install-on-linux:how-to/docker>`
* {doc}`ROCm & Spack<rocm-install-on-linux:how-to/spack>`
* Windows
* {doc}`Windows install guide<rocm-install-on-windows:how-to/install>`
* {doc}`Application deployment guidelines<rocm-install-on-windows:conceptual/deployment-guidelines>`
* [Deep learning frameworks](./how-to/deep-learning-rocm.rst)
* {doc}`Install Docker containers<rocm-install-on-linux:how-to/docker>`
* {doc}`PyTorch for ROCm<rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* {doc}`TensorFlow for ROCm<rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* {doc}`JAX for ROCm<rocm-install-on-linux:how-to/3rd-party/jax-install>`
* {doc}`MAGMA for ROCm<rocm-install-on-linux:how-to/3rd-party/magma-install>`
* {doc}`ROCm & Spack<rocm-install-on-linux:how-to/spack>`
:::

:::{grid-item-card}
@@ -92,7 +92,7 @@ Our documentation is organized into the following categories:
 :padding: 2

 * [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
-* [Fine-tuning LLMs and inference optimization](./how-to/fine-tuning-llms/index.rst)
+* [Fine-tuning LLMs and inference optimization](./how-to/llm-fine-tuning-optimization/index.rst)
 * [System tuning for various architectures](./how-to/tuning-guides.md)
 * [MI100](./how-to/tuning-guides/mi100.md)
 * [MI200](./how-to/tuning-guides/mi200.md)

@@ -58,27 +58,27 @@ subtrees:
 - file: how-to/rocm-for-ai/train-a-model.rst
 - file: how-to/rocm-for-ai/hugging-face-models.rst
 - file: how-to/rocm-for-ai/deploy-your-model.rst
-- file: how-to/fine-tuning-llms/index.rst
+- file: how-to/llm-fine-tuning-optimization/index.rst
 title: Fine-tuning LLMs and inference optimization
 subtrees:
 - entries:
-- file: how-to/fine-tuning-llms/overview.rst
+- file: how-to/llm-fine-tuning-optimization/overview.rst
 title: Conceptual overview
-- file: how-to/fine-tuning-llms/fine-tuning-and-inference.rst
+- file: how-to/llm-fine-tuning-optimization/fine-tuning-and-inference.rst
 subtrees:
 - entries:
-- file: how-to/fine-tuning-llms/single-gpu-fine-tuning-and-inference.rst
+- file: how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.rst
 title: Using a single accelerator
-- file: how-to/fine-tuning-llms/multi-gpu-fine-tuning-and-inference.rst
+- file: how-to/llm-fine-tuning-optimization/multi-gpu-fine-tuning-and-inference.rst
 title: Using multiple accelerators
-- file: how-to/fine-tuning-llms/model-quantization.rst
-- file: how-to/fine-tuning-llms/model-acceleration-libraries.rst
-- file: how-to/fine-tuning-llms/llm-inference-frameworks.rst
-- file: how-to/fine-tuning-llms/optimizing-with-composable-kernel.md
+- file: how-to/llm-fine-tuning-optimization/model-quantization.rst
+- file: how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst
+- file: how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
+- file: how-to/llm-fine-tuning-optimization/optimizing-with-composable-kernel.md
 title: Optimizing with Composable Kernel
-- file: how-to/fine-tuning-llms/optimizing-triton-kernel.rst
+- file: how-to/llm-fine-tuning-optimization/optimizing-triton-kernel.rst
 title: Optimizing Triton kernels
-- file: how-to/fine-tuning-llms/profiling-and-debugging.rst
+- file: how-to/llm-fine-tuning-optimization/profiling-and-debugging.rst
 - file: how-to/tuning-guides.md
 title: System optimization
 subtrees: