Merge pull request #3193 from peterjunpark/docs/6.1.1

docs/6.1.1: Add "How to use ROCm for AI" (#3117)
This commit is contained in:
Peter Park
2024-05-30 13:55:43 -07:00
committed by GitHub
11 changed files with 553 additions and 1 deletion

(4 binary image files added, not shown: 55 KiB, 441 KiB, 28 KiB, 157 KiB)
how-to/rocm-for-ai/deploy-your-model.rst

@@ -0,0 +1,113 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
Deploying your model
********************

ROCm enables inference and deployment for various classes of models, including CNNs, RNNs, LSTMs, MLPs, and
transformers. This section focuses on deploying transformer-based LLMs. ROCm supports vLLM and Hugging Face TGI as
major LLM-serving frameworks.

.. _rocm-for-ai-serve-vllm:

Serving using vLLM
==================

vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM officially supports ROCm versions 5.7 and
6.0. AMD is actively working with the vLLM team to improve performance and support later ROCm versions.

See the `GitHub repository <https://github.com/vllm-project/vllm>`_ and the `official vLLM documentation
<https://docs.vllm.ai/>`_ for more information.

For guidance on using vLLM with ROCm, refer to `Installation with ROCm
<https://docs.vllm.ai/en/latest/getting_started/amd-installation.html>`_.

vLLM installation
-----------------

vLLM supports two ROCm-capable installation methods. Refer to the official documentation using the following links:

- `Build from source with Docker
  <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-docker-rocm>`_ (recommended)
- `Build from source <https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm>`_

vLLM walkthrough
----------------

For guidance on serving with vLLM, refer to the developer blog `Inferencing and serving with vLLM on AMD GPUs — ROCm
Blogs <https://rocm.blogs.amd.com/artificial-intelligence/vllm/README.html>`_.
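
vLLM can also run offline batch inference directly from Python. The following is a minimal sketch, assuming a working
ROCm build of vLLM; the model choice and sampling settings are illustrative only.

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Load the model and define how completions should be sampled.
   llm = LLM(model="TheBloke/Llama-2-7B-fp16")
   sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

   # Generate completions for a batch of prompts.
   outputs = llm.generate(["What is deep learning?"], sampling_params)
   for output in outputs:
       print(output.outputs[0].text)
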
.. _rocm-for-ai-serve-hugging-face-tgi:
Serving using Hugging Face TGI
==============================
The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-generation-inference/index>`_
(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI
<https://huggingface.co/docs/text-generation-inference/quicktour>`_ for more details.
TGI installation
----------------
The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.
TGI walkthrough
---------------

#. Set up the LLM server.

   Deploy the Llama2 7B model with TGI using the official Docker image.

   .. code-block:: shell

      model=TheBloke/Llama-2-7B-fp16
      volume=$PWD
      docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model

#. Set up the client.

   a. Open another shell session and run the following command to access the server with the client URL.

      .. code-block:: shell

         curl 127.0.0.1:8080/generate \
             -X POST \
             -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
             -H 'Content-Type: application/json'

   b. Access the server with request endpoints.

      .. code-block:: shell

         pip install requests
         PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py

      ``requests_model.py`` should look like:

      .. code-block:: python

         import requests

         headers = {
             "Content-Type": "application/json",
         }

         data = {
             'inputs': 'What is Deep Learning?',
             'parameters': {'max_new_tokens': 20},
         }

         response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
         print(response.json())

vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.

how-to/rocm-for-ai/hugging-face-models.rst

@@ -0,0 +1,210 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
Running models from Hugging Face
********************************

`Hugging Face <https://huggingface.co>`_ hosts the world's largest AI model repository for developers to obtain
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

.. _rocm-for-ai-hugging-face-transformers:
Using Hugging Face Transformers
-------------------------------

First, `install the Hugging Face Transformers library <https://huggingface.co/docs/transformers/en/installation>`_,
which lets you easily import any of the transformer models into your Python application.

.. code-block:: shell

   pip install transformers

Here is an example of running `GPT2 <https://huggingface.co/openai-community/gpt2>`_:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   # Load the pretrained tokenizer and model.
   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2')

   # Tokenize the input text and run a forward pass.
   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt')
   output = model(**encoded_input)
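
PyTorch built for ROCm exposes AMD accelerators through the same ``"cuda"`` device string used on other GPUs, so moving
this example onto an accelerator is a small change. The following is a sketch, assuming a ROCm-enabled PyTorch
installation:

.. code-block:: python

   from transformers import GPT2Tokenizer, GPT2Model

   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
   model = GPT2Model.from_pretrained('gpt2').to("cuda")  # "cuda" maps to the AMD GPU on ROCm builds

   text = "Replace me with any text you'd like."
   encoded_input = tokenizer(text, return_tensors='pt').to("cuda")
   output = model(**encoded_input)
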
Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core
models should also function correctly.

Here are some mainstream models to get you started:

- `BERT <https://huggingface.co/bert-base-uncased>`_
- `BLOOM <https://huggingface.co/bigscience/bloom>`_
- `Llama <https://huggingface.co/huggyllama/llama-7b>`_
- `OPT <https://huggingface.co/facebook/opt-66b>`_
- `T5 <https://huggingface.co/t5-base>`_

.. _rocm-for-ai-hugging-face-optimum:
Using Hugging Face with Optimum-AMD
-----------------------------------

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization, and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.

.. _rocm-for-ai-install-optimum-amd:
Installation
~~~~~~~~~~~~

Install Optimum-AMD using pip.

.. code-block:: shell

   pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

.. code-block:: shell

   git clone https://github.com/huggingface/optimum-amd.git
   cd optimum-amd
   pip install -e .

.. _rocm-for-ai-flash-attention:
Flash Attention
---------------

#. Use `the Hugging Face team's example Dockerfile
   <https://github.com/huggingface/optimum-amd/blob/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile>`_ to use
   Flash Attention with ROCm.

   .. code-block:: shell

      docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
      volume=$PWD
      docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest

#. Use Flash Attention 2 with `Transformers
   <https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2>`_ by adding the
   ``use_flash_attention_2`` parameter to ``from_pretrained()``:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

      # Instantiate the model on the GPU with Flash Attention 2 enabled.
      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "tiiuae/falcon-7b",
              torch_dtype=torch.float16,
              use_flash_attention_2=True,
          )
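
   With the model loaded, a short generation call serves as a smoke test. This continues the snippet above; the prompt
   and token budget are illustrative.

   .. code-block:: python

      # Tokenize a prompt, move it to the GPU, and generate a short completion.
      inputs = tokenizer("What is deep learning?", return_tensors="pt").to("cuda")
      output_ids = model.generate(**inputs, max_new_tokens=30)
      print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
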
.. _rocm-for-ai-gptq:
GPTQ
----

Hosted wheels are available to enable `GPTQ <https://arxiv.org/abs/2210.17323>`_ support on ROCm.

#. First, :ref:`install Optimum-AMD <rocm-for-ai-install-optimum-amd>`.

#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation <https://github.com/AutoGPTQ/AutoGPTQ#Installation>`_ for
   in-depth guidance.

   .. code-block:: shell

      pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/

   Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.

   .. code-block:: shell

      ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .

#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library
   <https://github.com/PanQiWei/AutoGPTQ>`_:

   .. code-block:: python

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")

      # Instantiate the quantized model on the GPU.
      with torch.device("cuda"):
          model = AutoModelForCausalLM.from_pretrained(
              "TheBloke/Llama-2-7B-Chat-GPTQ",
              torch_dtype=torch.float16,
          )
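
   Once the model loads, a quick generation smoke test can use the Transformers ``pipeline`` API. This continues the
   snippet above; the prompt is illustrative.

   .. code-block:: python

      from transformers import pipeline

      # Wrap the quantized model and tokenizer in a text-generation pipeline.
      pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
      print(pipe("What is deep learning?", max_new_tokens=20)[0]["generated_text"])
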
.. _rocm-for-ai-onnx:
ONNX
----

Hugging Face Optimum also supports the `ONNX Runtime <https://onnxruntime.ai>`_ integration. For ONNX models, usage is
straightforward.

#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification

      ort_model = ORTModelForSequenceClassification.from_pretrained(
          ...,
          provider="ROCMExecutionProvider",
      )

#. Try running a `BERT text classification
   <https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english>`_ ONNX model with ROCm:

   .. code-block:: python

      from optimum.onnxruntime import ORTModelForSequenceClassification
      from optimum.pipelines import pipeline
      from transformers import AutoTokenizer
      import onnxruntime as ort

      session_options = ort.SessionOptions()
      session_options.log_severity_level = 0

      # Export the model to ONNX and run it with the ROCm execution provider.
      ort_model = ORTModelForSequenceClassification.from_pretrained(
          "distilbert-base-uncased-finetuned-sst-2-english",
          export=True,
          provider="ROCMExecutionProvider",
          session_options=session_options,
      )

      tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
      pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
      result = pipe("Both the music and visual were astounding, not to mention the actors' performance.")
      print(result)

how-to/rocm-for-ai/index.rst

@@ -0,0 +1,23 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************

ROCm offers a suite of optimizations for AI workloads, from large language models (LLMs) to image and video detection
and recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the
broader AI software ecosystem, including open frameworks, models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_.

In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`
- :doc:`Training a model <train-a-model>`
- :doc:`Running models from Hugging Face <hugging-face-models>`
- :doc:`Deploying your model <deploy-your-model>`

how-to/rocm-for-ai/install.rst

@@ -0,0 +1,60 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

.. _rocm-for-ai-install:

***********************************************
Installing ROCm and machine learning frameworks
***********************************************
Before getting started, install ROCm and supported machine learning frameworks.

.. grid:: 1

   .. grid-item-card:: Pre-install

      Each release of ROCm supports specific hardware and software configurations. Before installing, consult the
      :doc:`System requirements <rocm-install-on-linux:reference/system-requirements>` and
      :doc:`Installation prerequisites <rocm-install-on-linux:how-to/prerequisites>` guides.

      If you're new to ROCm, refer to the :doc:`ROCm quick start install guide for Linux
      <rocm-install-on-linux:tutorial/quick-start>`.

      If you're using a Radeon GPU for graphics-accelerated applications, refer to the
      :doc:`Radeon installation instructions <radeon:docs/install/install-radeon>`.

      ROCm supports two installation methods; the final ROCm installation is the same with either. You can also opt
      for :ref:`single-version or multi-version installation <rocm-install-on-linux:installation-types>`.

      * :doc:`Using your Linux distribution's package manager <rocm-install-on-linux:how-to/native-install/index>`
      * :doc:`Using the AMDGPU installer <rocm-install-on-linux:how-to/amdgpu-install>`

.. grid:: 1

   .. grid-item-card:: Post-install

      Follow the :doc:`post-installation instructions <rocm-install-on-linux:how-to/native-install/post-install>` to
      configure your system linker and ``PATH``, and verify the installation.

      If you encounter any issues during installation, refer to the
      :doc:`Installation troubleshooting <rocm-install-on-linux:how-to/native-install/install-faq>` guide.

Machine learning frameworks
===========================

ROCm supports popular machine learning frameworks and libraries, including `PyTorch
<https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package>`_, `TensorFlow
<https://tensorflow.org>`_, `JAX <https://jax.readthedocs.io/en/latest>`_, and `DeepSpeed
<https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/>`_.

Review the framework installation documentation. For ease of use, it's recommended to use the official ROCm prebuilt
Docker images, which ship with the framework pre-installed.

* :doc:`PyTorch for ROCm <rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* :doc:`TensorFlow for ROCm <rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* :doc:`JAX for ROCm <rocm-install-on-linux:how-to/3rd-party/jax-install>`
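
Once a framework is installed, a few lines of Python can confirm that it detects your accelerator. The following is a
minimal sketch, assuming a ROCm-enabled PyTorch build:

.. code-block:: python

   import torch

   print(torch.cuda.is_available())      # True when a ROCm-visible accelerator is present
   print(torch.cuda.get_device_name(0))  # name of the detected AMD accelerator or GPU
   print(torch.version.hip)              # HIP/ROCm version of the build (None on non-ROCm builds)
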
The sections that follow in :doc:`Training a model <train-a-model>` assume a working ROCm installation with PyTorch.

how-to/rocm-for-ai/train-a-model.rst

@@ -0,0 +1,137 @@
.. meta::
   :description: How to use ROCm for AI
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

****************
Training a model
****************

The following is a brief overview of popular component paths for each AI development use case, such as training, LLMs,
and inferencing.

Accelerating model training
===========================
To train a large model like GPT2 or Llama 2 70B, a single accelerator or GPU cannot store all the model parameters
required for training. What if you could convert the single-GPU training code to run on multiple accelerators or GPUs?
PyTorch offers distributed training solutions to facilitate this.
.. _rocm-for-ai-pytorch-distributed:
PyTorch distributed
-------------------

As of PyTorch 1.6.0, features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)
- `RPC-based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)
- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

This guide focuses on the distributed data-parallelism strategy as it's the most popular. To get started with DDP,
let's first understand how to coordinate the model and its training data across multiple accelerators or GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each device can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer
uses ``allreduce`` to sum up gradients over different workers.
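
The following is a minimal sketch of this workflow, assuming the script is launched with
``torchrun --nproc_per_node=<num_gpus>``; the model and batch are toy placeholders.

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   # One process per GPU; torchrun sets LOCAL_RANK for each process.
   dist.init_process_group(backend="nccl")  # ROCm builds route this through RCCL
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # Each rank holds a full replica of the model.
   model = torch.nn.Linear(10, 10).to(local_rank)
   ddp_model = DDP(model, device_ids=[local_rank])
   optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

   # Each rank processes its own local batch.
   inputs = torch.randn(4, 10).to(local_rank)
   loss = ddp_model(inputs).sum()
   loss.backward()   # gradients are synchronized across ranks with allreduce
   optimizer.step()

   dist.destroy_process_group()
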
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_
* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_

.. _rocm-for-ai-pytorch-fsdp:
PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, in DDP, model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes
the training of some very large models feasible by allowing larger models or batch sizes to fit on-device. However,
this comes at the cost of increased communication volume. The communication overhead is reduced by internal
optimizations like overlapping communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, refer to the `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.
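
The following is a minimal wrapping sketch, again assuming a ``torchrun`` launch; the model is a toy placeholder.

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group(backend="nccl")
   torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

   # Wrapping shards parameters, gradients, and optimizer state across ranks.
   model = torch.nn.Transformer().cuda()
   fsdp_model = FSDP(model)
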
.. _rocm-for-ai-deepspeed:
DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the
training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example
of training with DeepSpeed on an AMD accelerator or GPU.
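
The following is a minimal initialization sketch, assuming DeepSpeed is installed and the script is started with the
``deepspeed`` launcher; the configuration values and model are illustrative.

.. code-block:: python

   import torch
   import deepspeed

   # ZeRO stage 2 shards optimizer state and gradients across data-parallel ranks.
   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "zero_optimization": {"stage": 2},
       "fp16": {"enabled": True},
   }

   model = torch.nn.Linear(10, 10)
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )
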
.. _rocm-for-ai-automatic-mixed-precision:
Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time and memory needed to train them; in other words, the cost of training grows.
Reducing training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.
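
The following is a minimal AMP training-step sketch; the model, data, and optimizer are toy placeholders.

.. code-block:: python

   import torch

   model = torch.nn.Linear(10, 10).cuda()
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
   scaler = torch.cuda.amp.GradScaler()

   inputs = torch.randn(4, 10).cuda()
   target = torch.randn(4, 10).cuda()

   # Run the forward pass in mixed precision.
   with torch.cuda.amp.autocast():
       loss = torch.nn.functional.mse_loss(model(inputs), target)

   # Scale the loss to avoid underflow in float16 gradients, then unscale and step.
   scaler.scale(loss).backward()
   scaler.step(optimizer)
   scaler.update()
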
.. _rocm-for-ai-fine-tune:
Fine-tuning your model
======================

ROCm supports multiple fine-tuning techniques, for example, LoRA, QLoRA, PEFT, and FSDP; a minimal LoRA setup is
sketched after the list below.

The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/finetuning>`_
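
As noted above, the following is a minimal LoRA setup sketch using the `PEFT <https://github.com/huggingface/peft>`_
library; the base model and hyperparameters are illustrative, not recommendations.

.. code-block:: python

   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   base_model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-fp16")

   # Attach small low-rank adapter matrices to the attention projections.
   lora_config = LoraConfig(
       r=8,                                  # rank of the adapter matrices
       lora_alpha=32,
       target_modules=["q_proj", "v_proj"],  # Llama attention projections
       lora_dropout=0.05,
       task_type="CAUSAL_LM",
   )

   model = get_peft_model(base_model, lora_config)
   model.print_trainable_parameters()  # only the adapter weights are trainable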


@@ -91,6 +91,7 @@ Our documentation is organized into the following categories:
:img-alt: How-to documentation
:padding: 2
* [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [System tuning for various architectures](./how-to/tuning-guides.md)
  * [MI100](./how-to/tuning-guides/mi100.md)
  * [MI200](./how-to/tuning-guides/mi200.md)

_toc.yml

@@ -49,6 +49,15 @@ subtrees:
  - caption: How to
    entries:
      - file: how-to/rocm-for-ai/index.rst
        title: Using ROCm for AI
        subtrees:
          - entries:
              - file: how-to/rocm-for-ai/install.rst
                title: Installation
              - file: how-to/rocm-for-ai/train-a-model.rst
              - file: how-to/rocm-for-ai/hugging-face-models.rst
              - file: how-to/rocm-for-ai/deploy-your-model.rst
      - file: how-to/tuning-guides.md
        title: System optimization
        subtrees:
@@ -139,4 +148,3 @@ subtrees:
title: Provide feedback
- file: about/license.md
title: ROCm license