Mirror of https://github.com/ROCm/ROCm.git

Commit: New TOC for ROCm for AI

Co-authored-by: Peter Park <peter.park@amd.com>
@@ -896,21 +896,21 @@ The following are GPU-accelerated PyTorch features not currently supported by RO
Use cases and recommendations
================================================================================

* :doc:`Using ROCm for AI: training a model </how-to/rocm-for-ai/train-a-model>` provides
* :doc:`Using ROCm for AI: training a model </how-to/rocm-for-ai/training/train-a-model>` provides
  guidance on how to leverage the ROCm platform for training AI models. It covers the steps, tools, and best practices
  for optimizing training workflows on AMD GPUs using PyTorch features.

* :doc:`Single-GPU fine-tuning and inference </how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference>`
* :doc:`Single-GPU fine-tuning and inference </how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference>`
  describes and demonstrates how to use the ROCm platform for the fine-tuning and inference of
  machine learning models, particularly large language models (LLMs), on systems with a single AMD
  Instinct MI300X accelerator. This page provides a detailed guide for setting up, optimizing, and
  executing fine-tuning and inference workflows in such environments.

* :doc:`Multi-GPU fine-tuning and inference optimization </how-to/llm-fine-tuning-optimization/multi-gpu-fine-tuning-and-inference>`
* :doc:`Multi-GPU fine-tuning and inference optimization </how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference>`
  describes and demonstrates the fine-tuning and inference of machine learning models on systems
  with multiple MI300X accelerators.

* The :doc:`Instinct MI300X workload optimization guide </how-to/tuning-guides/mi300x/workload>` provides detailed
* The :doc:`Instinct MI300X workload optimization guide </how-to/rocm-for-ai/inference-optimization/workload>` provides detailed
  guidance on optimizing workloads for the AMD Instinct MI300X accelerator using ROCm. This guide is aimed at helping
  users achieve optimal performance for deep learning and other high-performance computing tasks on the MI300X
  accelerator.
docs/conf.py (72 lines changed)
@@ -43,49 +43,34 @@ article_pages = [
    {"file": "compatibility/ml-compatibility/tensorflow-compatibility", "os": ["linux"]},
    {"file": "compatibility/ml-compatibility/jax-compatibility", "os": ["linux"]},
    {"file": "how-to/deep-learning-rocm", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/install", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/train-a-model", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/accelerate-training", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/deploy-your-model", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/hugging-face-models", "os": ["linux"]},
    {"file": "how-to/rocm-for-hpc/index", "os": ["linux"]},
    {"file": "how-to/llm-fine-tuning-optimization/index", "os": ["linux"]},
    {"file": "how-to/llm-fine-tuning-optimization/overview", "os": ["linux"]},
    {
        "file": "how-to/llm-fine-tuning-optimization/fine-tuning-and-inference",
        "os": ["linux"],
    },
    {
        "file": "how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference",
        "os": ["linux"],
    },
    {
        "file": "how-to/llm-fine-tuning-optimization/multi-gpu-fine-tuning-and-inference",
        "os": ["linux"],
    },
    {
        "file": "how-to/llm-fine-tuning-optimization/llm-inference-frameworks",
        "os": ["linux"],
    },
    {
        "file": "how-to/llm-fine-tuning-optimization/model-acceleration-libraries",
        "os": ["linux"],
    },
    {"file": "how-to/llm-fine-tuning-optimization/model-quantization", "os": ["linux"]},
    {
        "file": "how-to/llm-fine-tuning-optimization/optimizing-with-composable-kernel",
        "os": ["linux"],
    },
    {
        "file": "how-to/llm-fine-tuning-optimization/optimizing-triton-kernel",
        "os": ["linux"],
    },
    {
        "file": "how-to/llm-fine-tuning-optimization/profiling-and-debugging",
        "os": ["linux"],
    },
    {"file": "how-to/performance-validation/mi300x/vllm-benchmark", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/training/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/train-a-model", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/scale-model-training", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/inference/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/install", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/hugging-face-models", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/llm-inference-frameworks", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/vllm-benchmark", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference-optimization/model-quantization", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference-optimization/profiling-and-debugging", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference-optimization/workload", "os": ["linux"]},

    {"file": "how-to/system-optimization/index", "os": ["linux"]},
    {"file": "how-to/system-optimization/mi300x", "os": ["linux"]},
    {"file": "how-to/system-optimization/mi200", "os": ["linux"]},
@@ -104,6 +89,9 @@ extensions = ["rocm_docs", "sphinx_reredirects", "sphinx_sitemap"]

external_projects_current_project = "rocm"

# Uncomment if facing rate limit exceed issue with local build
# external_projects_remote_repository = ""

html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "https://rocm-stg.amd.com/")
html_context = {}
if os.environ.get("READTHEDOCS", "") == "True":
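Since this commit moves a large number of pages, the `sphinx_reredirects` extension already present in `extensions` is the natural mechanism for keeping old URLs alive. The following is a minimal, hypothetical sketch (these entries are illustrative and not part of this commit); the extension reads a `redirects` dict mapping old document paths to targets relative to the old page:

```python
# Hypothetical conf.py snippet: sphinx_reredirects consumes a `redirects` dict
# mapping old document paths to their new locations (targets are relative URLs).
redirects = {
    "how-to/rocm-for-ai/train-a-model":
        "training/train-a-model.html",
    "how-to/llm-fine-tuning-optimization/model-quantization":
        "../rocm-for-ai/inference-optimization/model-quantization.html",
}
```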
@@ -39,5 +39,11 @@ through the following guides.

* :doc:`rocm-for-ai/index`

* :doc:`llm-fine-tuning-optimization/index`
* :doc:`Training <rocm-for-ai/training/index>`

* :doc:`Fine-tuning LLMs <rocm-for-ai/fine-tuning/index>`

* :doc:`Inference <rocm-for-ai/inference/index>`

* :doc:`Inference optimization <rocm-for-ai/inference-optimization/index>`
@@ -1,37 +0,0 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial

*******************************************
Fine-tuning LLMs and inference optimization
*******************************************

ROCm empowers the fine-tuning and optimization of large language models, making them accessible and efficient for
specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks,
models, and tools.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_

Throughout the following topics, this guide discusses the goals and :ref:`challenges of fine-tuning a large language
model <fine-tuning-llms-concept-challenge>` like Llama 2. Then, it introduces :ref:`common methods of optimizing your
fine-tuning <fine-tuning-llms-concept-optimizations>` using techniques like LoRA with libraries like PEFT. In the
sections that follow, you'll find practical guides on libraries and tools to accelerate your fine-tuning.

- :doc:`Conceptual overview of fine-tuning LLMs <overview>`

- :doc:`Fine-tuning and inference <fine-tuning-and-inference>` using a
  :doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` or
  :doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` system.

- :doc:`Model quantization <model-quantization>`

- :doc:`Model acceleration libraries <model-acceleration-libraries>`

- :doc:`LLM inference frameworks <llm-inference-frameworks>`

- :doc:`Optimizing with Composable Kernel <optimizing-with-composable-kernel>`

- :doc:`Optimizing Triton kernels <optimizing-triton-kernel>`

- :doc:`Profiling and debugging <profiling-and-debugging>`
@@ -1,6 +1,6 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :keywords: ROCm, LLM, fine-tuning, inference, usage, tutorial
   :description: How to fine-tune models with ROCm
   :keywords: ROCm, LLM, fine-tuning, inference, usage, tutorial, deep learning, PyTorch, TensorFlow, JAX

*************************
Fine-tuning and inference
@@ -9,7 +9,7 @@ Fine-tuning and inference
Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries <rocm:reference/api-libraries>` and
:doc:`tools <rocm:reference/rocm-tools>` to optimize and train deep learning models. ROCm provides a comprehensive
ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and
ROCm-aware versions of :doc:`deep learning frameworks <../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX.
ROCm-aware versions of :doc:`deep learning frameworks <../../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX.

Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for
smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately
docs/how-to/rocm-for-ai/fine-tuning/index.rst (new file, 25 lines)
@@ -0,0 +1,25 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, GPUs, Llama, accelerators

*******************************************
Use ROCm for fine-tuning LLMs
*******************************************

Fine-tuning is an essential technique in machine learning, where a pre-trained model, typically trained on a large-scale dataset, is further refined to achieve better performance and adapt to a particular task or dataset of interest.

With AMD GPUs, the fine-tuning process benefits from the parallel processing capabilities and efficient resource management, ultimately leading to improved performance and faster model adaptation to the target domain.

The ROCm™ software platform helps you optimize this fine-tuning process by supporting various optimization techniques tailored for AMD GPUs. It empowers the fine-tuning of large language models, making them accessible and efficient for specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks, models, and tools.

Throughout the following topics, this guide discusses the goals and :ref:`challenges of fine-tuning a large language
model <fine-tuning-llms-concept-challenge>` like Llama 2. In the
sections that follow, you'll find practical guides on libraries and tools to accelerate your fine-tuning.

- :doc:`Conceptual overview of fine-tuning LLMs <overview>`

- :doc:`Fine-tuning and inference <fine-tuning-and-inference>` using a
  :doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` or
  :doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` system.
@@ -1,6 +1,6 @@
.. meta::
   :description: Model fine-tuning and inference on a multi-GPU system
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference, accelerators, PyTorch, HuggingFace, torchtune

*****************************************************
Fine-tuning and inference using multiple accelerators
@@ -233,4 +233,4 @@ GPU model fine-tuning and inference with LLMs.
   INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
   1|111|Loss: 1.5790324211120605: 7%|█ | 114/1618

Read more about inference frameworks in :doc:`LLM inference frameworks <llm-inference-frameworks>`.
Read more about inference frameworks in :doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`.
@@ -1,6 +1,6 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, optimzation, LoRA, walkthrough
   :description: Conceptual overview of fine-tuning LLMs
   :keywords: ROCm, LLM, Llama, fine-tuning, usage, tutorial, optimzation, LoRA, walkthrough, PEFT, Reinforcement

***************************************
Conceptual overview of fine-tuning LLMs
@@ -41,7 +41,7 @@ The weight update is as follows: :math:`W_{updated} = W + ΔW`.
If the weight matrix :math:`W` contains 7B parameters, then the weight update matrix :math:`ΔW` should also
contain 7B parameters. Therefore, the :math:`ΔW` calculation is computationally and memory intensive.

.. figure:: ../../data/how-to/llm-fine-tuning-optimization/weight-update.png
.. figure:: ../../../data/how-to/llm-fine-tuning-optimization/weight-update.png
   :alt: Weight update diagram

   (a) Weight update in regular fine-tuning. (b) Weight update in LoRA where the product of matrix A (:math:`M\times K`)
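The cost argument above is easy to check with arithmetic. A back-of-the-envelope sketch, assuming (as the figure suggests) that LoRA replaces the full :math:`ΔW` with the product of :math:`A` (:math:`M\times K`) and :math:`B` (:math:`K\times N`) for a small rank :math:`K`; the shapes are hypothetical:

```python
# Back-of-the-envelope LoRA parameter count for one weight matrix (assumed shapes).
M, N, K = 4096, 4096, 8            # layer dimensions and a small LoRA rank
full = M * N                       # trainable parameters in a full ΔW update
lora = M * K + K * N               # trainable parameters in A (M×K) plus B (K×N)
print(full, lora, f"{lora / full:.2%}")
# 16777216 65536 0.39%; the low-rank factors train well under 1% of the weights
```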
@@ -1,6 +1,6 @@
.. meta::
   :description: Model fine-tuning and inference on a single-GPU system
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference, SFTTrainer

****************************************************
Fine-tuning and inference using a single accelerator
@@ -80,7 +80,7 @@ Setting up the base implementation environment
#. Install the required dependencies.

   bitsandbytes is a library that facilitates quantization to improve the efficiency of deep learning models. Learn more
   about its use in :doc:`model-quantization`.
   about its use in :doc:`../inference-optimization/model-quantization`.

   See the :ref:`Optimizations for model fine-tuning <fine-tuning-llms-concept-optimizations>` for a brief discussion on
   PEFT and TRL.
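To make the bitsandbytes reference concrete, here is a minimal, illustrative load of a 4-bit quantized base model through the Hugging Face integration. This is a sketch, not a step from the guide, and the model ID is a placeholder:

```python
# Illustrative 4-bit quantized model load via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```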
@@ -507,4 +507,4 @@ If using multiple accelerators, see
popular libraries that simplify fine-tuning and inference in a multi-accelerator system.

Read more about inference frameworks like vLLM and Hugging Face TGI in
:doc:`LLM inference frameworks <llm-inference-frameworks>`.
:doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`.
@@ -1,28 +1,27 @@
.. meta::
   :description: How to use ROCm for AI
   :description: Learn how to use ROCm for AI.
   :keywords: ROCm, AI, machine learning, LLM, usage, tutorial

*****************
Using ROCm for AI
*****************
**************************
Use ROCm for AI
**************************

ROCm offers a suite of optimizations for AI workloads from large language models (LLMs) to image and video detection and
recognition, life sciences and drug discovery, autonomous driving, robotics, and more. ROCm proudly supports the broader
AI software ecosystem, including open frameworks, models, and tools.
ROCm™ is an open-source software platform that enables high-performance computing and machine learning applications. It features the ability to accelerate training, fine-tuning, and inference for AI application development. With ROCm, you can access the full power of AMD GPUs, which can significantly improve the performance and efficiency of AI workloads.

For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_
You can use ROCm to perform distributed training, which enables you to train models across multiple GPUs or nodes simultaneously. Additionally, ROCm supports mixed-precision training, which can help reduce the memory and compute requirements of training workloads. For fine-tuning, ROCm provides access to various algorithms and optimization techniques. In terms of inference, ROCm provides several techniques that can help you optimize your models for deployment, such as quantization, GEMM tuning, and optimization with composable kernel.

Overall, ROCm can be used to improve the performance and efficiency of your AI applications. With its training, fine-tuning, and inference support, ROCm provides a complete solution for optimizing AI workflows and achieving the optimum results possible on AMD GPUs.

In this guide, you'll learn about:
In this guide, you'll learn how to use ROCm for AI:

- :doc:`Installing ROCm and machine learning frameworks <install>`
- :doc:`Training <training/index>`

- :doc:`Scaling model training <scale-model-training>`
- :doc:`Fine-tuning LLMs <fine-tuning/index>`

- :doc:`Training a model <train-a-model>`
- :doc:`Inference <inference/index>`

- :doc:`Running models from Hugging Face <hugging-face-models>`
- :doc:`Inference optimization <inference-optimization/index>`

- :doc:`Deploying your model <deploy-your-model>`

To learn about ROCm for HPC applications and scientific computing, see
:doc:`../rocm-for-hpc/index`.
docs/how-to/rocm-for-ai/inference-optimization/index.rst (new file, 36 lines)
@@ -0,0 +1,36 @@
.. meta::
   :description: How to Use ROCm for AI inference optimization
   :keywords: ROCm, LLM, AI inference, Optimization, GPUs, usage, tutorial

*******************************************
Use ROCm for AI inference optimization
*******************************************

AI inference optimization is the process of improving the performance of machine learning models and speeding up the inference process. It includes:

- **Quantization**: This involves reducing the precision of model weights and activations while maintaining acceptable accuracy levels. Reduced precision improves inference efficiency because lower-precision data requires less storage and better utilizes the hardware's computation power (see the quick arithmetic check after this list).

- **Kernel optimization**: This technique involves optimizing computation kernels to exploit the underlying hardware capabilities. For example, the kernels can be optimized to use multiple GPU cores or utilize specialized hardware like tensor cores to accelerate the computations.

- **Libraries**: Libraries such as Flash Attention, xFormers, and PyTorch TunableOp are used to accelerate deep learning models and improve the performance of inference workloads.

- **Hardware acceleration**: Hardware acceleration techniques, like GPUs for AI inference, can significantly improve performance due to their parallel processing capabilities.

- **Pruning**: This involves removing unnecessary connections, layers, or weights from a pre-trained model while maintaining acceptable accuracy levels, resulting in a smaller model that requires fewer computational resources to run inference.

Utilizing these optimization techniques with the ROCm™ software platform can significantly reduce inference time, improve performance, and reduce the cost of your AI applications.
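The quantization bullet is easy to quantify. A quick, weights-only estimate for a hypothetical 7B-parameter model (activations and KV cache ignored):

```python
# Approximate weights-only memory footprint at different precisions.
params = 7e9                                  # assumed 7B-parameter model
for bits in (16, 8, 4):                       # FP16, INT8, INT4
    print(f"{bits:>2}-bit: {params * bits / 8 / 1e9:.1f} GB")
# 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```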
Throughout the following topics, this guide discusses optimization techniques for inference workloads.

- :doc:`Model quantization <model-quantization>`

- :doc:`Model acceleration libraries <model-acceleration-libraries>`

- :doc:`Optimizing with Composable Kernel <optimizing-with-composable-kernel>`

- :doc:`Optimizing Triton kernels <optimizing-triton-kernel>`

- :doc:`Profiling and debugging <profiling-and-debugging>`

- :doc:`Workload tuning <workload>`
@@ -1,5 +1,5 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :description: How to use model acceleration techniques and libraries to improve memory efficiency and performance.
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, Flash Attention, Hugging Face, xFormers, vLLM, PyTorch

****************************
@@ -20,7 +20,7 @@ Attention (GQA), and Multi-Query Attention (MQA). This reduction in memory movem
time-to-first-token (TTFT) latency for large batch sizes and long prompt sequences, thereby enhancing overall
performance.

.. image:: ../../data/how-to/llm-fine-tuning-optimization/attention-module.png
.. image:: ../../../data/how-to/llm-fine-tuning-optimization/attention-module.png
   :alt: Attention module of a large language model utilizing tiling
   :align: center
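As an illustration of the fused-attention behavior described above, here is a minimal sketch using PyTorch's built-in scaled dot-product attention. It assumes a ROCm build of PyTorch 2.3 or newer with a Flash Attention backend available; it is not a snippet from this page:

```python
# Fused attention via torch.nn.functional.scaled_dot_product_attention:
# the full S×S attention matrix is never materialized in GPU memory.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):  # restrict dispatch to the fused kernel
    out = F.scaled_dot_product_attention(q, k, v)
```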
@@ -245,7 +245,7 @@ page describes the options.
   Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
   GemmTunableOp_float_TN,tn_200_100_20,Gemm_Rocblas_32323,0.00669595

.. image:: ../../data/how-to/llm-fine-tuning-optimization/tunableop.png
.. image:: ../../../data/how-to/llm-fine-tuning-optimization/tunableop.png
   :alt: GEMM and TunableOp
   :align: center
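The CSV shown above is the file TunableOp writes and later replays. A hedged sketch of driving a tuning run, using the environment variables documented for PyTorch TunableOp; the GEMM shape mirrors the ``tn_200_100_20`` entry:

```python
# Illustrative TunableOp session: tune GEMMs once and record results to CSV.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"    # allow new tuning (not just replay)
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch  # import after the environment is set

a = torch.randn(200, 20, device="cuda", dtype=torch.float32)
b = torch.randn(20, 100, device="cuda", dtype=torch.float32)
c = a @ b     # the first occurrence of this GEMM shape triggers tuning
```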
@@ -1,5 +1,5 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :description: How to use model quantization techniques to speed up inference.
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, quantization, GPTQ, transformers, bitsandbytes

*****************************
@@ -1,6 +1,6 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, Triton, kernel, performance, optimization
   :description: How to optimize Triton kernels for ROCm.
   :keywords: ROCm, LLM, fine-tuning, usage, MI300X, tutorial, Triton, kernel, performance, optimization

*************************
Optimizing Triton kernels
@@ -13,7 +13,7 @@ and CUDA kernel optimization.
Refer to the
:ref:`Triton kernel performance optimization <mi300x-triton-kernel-performance-optimization>`
section of the :doc:`/how-to/tuning-guides/mi300x/workload` guide
section of the :doc:`workload` guide
for detailed information.

Triton kernel performance optimization includes the following topics.
@@ -1,8 +1,9 @@
<head>
  <meta charset="UTF-8">
  <meta name="description" content="SmoothQuant model inference on AMD Instinct MI300X using Composable Kernel">
  <meta name="keywords" content="Mixed Precision, Kernel, Inference, Linear Algebra">
</head>
---
myst:
  html_meta:
    "description": "How to optimize machine learning workloads with Composable Kernel (CK)."
    "keywords": "mixed, precision, kernel, inference, linear, algebra, ck, GEMM"
---

# Optimizing with Composable Kernel
@@ -32,7 +33,7 @@ The template parameters of the instance are grouped into four parameter types:
================
### Figure 2
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-template_parameters.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-template_parameters.jpg
The template parameters of the selected GEMM kernel are classified into four groups. These template parameter groups should be defined properly before running the instance.
```
@@ -126,7 +127,7 @@ The row and column, and stride information of input matrices are also passed to
================
### Figure 3
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-kernel_launch.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-kernel_launch.jpg
Templated kernel launching consists of kernel instantiation, making arguments by passing in actual application parameters, creating an invoker, and running the instance through the invoker.
```
@@ -155,7 +156,7 @@ The first operation in the process is to perform the multiplication of input mat
================
### Figure 4
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-operation_flow.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-operation_flow.jpg
Operation flow.
```
@@ -171,7 +172,7 @@ Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_ke
================
### Figure 5
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-root_instance.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-root_instance.jpg
Use the ‘DeviceBatchedGemmMultiD_Xdl’ instance as a root.
```
@@ -421,7 +422,7 @@ Run `python setup.py install` to build and install the extension. It should look
================
### Figure 6
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-compilation.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-compilation.jpg
Compilation and installation of the INT8 kernels.
```
@@ -433,7 +434,7 @@ The implementation architecture of running SmoothQuant models on MI300X GPUs is
================
### Figure 7
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg
The implementation architecture of running SmoothQuant models on AMD MI300X accelerators.
```
@@ -459,7 +460,7 @@ Figure 8 shows the performance comparisons between the original FP16 and the Smo
================
### Figure 8
================ -->
```{figure} ../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg
Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator.
```
@@ -1,12 +1,12 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, profiling, debugging, performance, Triton
   :description: How to use ROCm profiling and debugging tools.
   :keywords: ROCm, LLM, fine-tuning, usage, MI300X, tutorial, profiling, debugging, performance, Triton

***********************
Profiling and debugging
***********************

This section provides an index for further documentation on profiling and
debugging tools and their common usage patterns.

See :ref:`AMD Instinct MI300X™ workload optimization <mi300x-profiling-start>`
@@ -152,7 +152,7 @@ address any new bottlenecks that may emerge.

ROCm provides a prebuilt optimized Docker image that has everything required to implement
the tips in this section. It includes ROCm, vLLM, PyTorch, and tuning files in the CSV
format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
format. For more information, see :doc:`../inference/vllm-benchmark`.

.. _mi300x-profiling-tools:
@@ -173,7 +173,7 @@ tools available depending on their specific profiling needs.
For more information, see
:doc:`ROCm Compute Profiler documentation <rocprofiler-compute:index>`.

Refer to :doc:`/how-to/llm-fine-tuning-optimization/profiling-and-debugging`
Refer to :doc:`profiling-and-debugging`
to explore commonly used profiling tools and their usage patterns.

Once performance bottlenecks are identified, you can implement an informed workload
@@ -412,7 +412,7 @@ usage with ROCm.
ROCm provides a prebuilt optimized Docker image for validating the performance
of LLM inference with vLLM on the MI300X accelerator. The Docker image includes
ROCm, vLLM, PyTorch, and tuning files in the CSV format. For more information,
see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
see :doc:`../inference/vllm-benchmark`.

.. _mi300x-vllm-throughput-measurement:
@@ -1304,7 +1304,7 @@ performance (reduce latency) and improve benchmarking stability.
CK provides a rich set of template parameters for generating flexible accelerated
computing kernels for different application scenarios.

See :doc:`/how-to/llm-fine-tuning-optimization/optimizing-with-composable-kernel`
See :doc:`optimizing-with-composable-kernel`
for an overview of Composable Kernel GEMM kernels, information on tunable
parameters, and examples.
@@ -1,5 +1,5 @@
.. meta::
   :description: How to use ROCm for AI
   :description: How to deploy your model for AI inference using vLLM and Hugging Face TGI.
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

********************
@@ -119,4 +119,4 @@ TGI walkthrough
vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high
performance, low latency, and scalability.

Visit the topics in :doc:`Using ROCm for AI <index>` to learn about other ROCm-aware solutions for AI development.
Visit the topics in :doc:`Using ROCm for AI <../index>` to learn about other ROCm-aware solutions for AI development.
@@ -1,5 +1,5 @@
.. meta::
   :description: How to use ROCm for AI
   :description: How to run models from Hugging Face on AMD GPUs.
   :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial

********************************
docs/how-to/rocm-for-ai/inference/index.rst (new file, 22 lines)
@@ -0,0 +1,22 @@
.. meta::
   :description: How to use ROCm for AI inference workloads.
   :keywords: ROCm, AI, machine learning, LLM, AI inference, NLP, GPUs, usage, tutorial

****************************
Use ROCm for AI inference
****************************

AI inference is the process of deploying a trained machine learning model to make predictions or classifications on new data. This commonly involves using the model with real-time data and making quick decisions based on the predictions made by the model.

Understanding the ROCm™ software platform’s architecture and capabilities is vital for running AI inference. By leveraging the ROCm platform's capabilities, you can harness the power of high-performance computing and efficient resource management to run inference workloads, leading to faster predictions and classifications on real-time data.

Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X accelerators, and how to deploy trained models in production environments.

- :doc:`Installing ROCm and machine learning frameworks <install>`

- :doc:`Running models from Hugging Face <hugging-face-models>`

- :doc:`LLM inference frameworks <llm-inference-frameworks>`

- :doc:`Performance validation <vllm-benchmark>`

- :doc:`Deploying your model <deploy-your-model>`
@@ -1,5 +1,5 @@
.. meta::
   :description: How to use ROCm for AI
   :description: How to install ROCm and popular machine learning frameworks.
   :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial

.. _rocm-for-ai-install:
@@ -59,4 +59,4 @@ images with the framework pre-installed.

* :doc:`JAX for ROCm <rocm-install-on-linux:install/3rd-party/jax-install>`

The sections that follow in :doc:`Training a model <train-a-model>` are geared for a ROCm with PyTorch installation.
The sections that follow in :doc:`Training a model <../training/train-a-model>` are geared for a ROCm with PyTorch installation.
@@ -1,5 +1,5 @@
.. meta::
   :description: How to fine-tune LLMs with ROCm
   :description: How to implement the LLM inference frameworks with ROCm acceleration.
   :keywords: ROCm, LLM, fine-tuning, usage, tutorial, inference, vLLM, TGI, text generation inference

************************
@@ -8,8 +8,8 @@ LLM inference frameworks

This section discusses how to implement `vLLM <https://docs.vllm.ai/en/latest>`_ and `Hugging Face TGI
<https://huggingface.co/docs/text-generation-inference/en/index>`_ using
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` and
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` systems.
:doc:`single-accelerator <../fine-tuning/single-gpu-fine-tuning-and-inference>` and
:doc:`multi-accelerator <../fine-tuning/multi-gpu-fine-tuning-and-inference>` systems.

.. _fine-tuning-llms-vllm:
@@ -68,7 +68,7 @@ Installing vLLM

The following log message displayed in your command line indicates that the server is listening for requests.

.. image:: ../../data/how-to/llm-fine-tuning-optimization/vllm-single-gpu-log.png
.. image:: ../../../data/how-to/llm-fine-tuning-optimization/vllm-single-gpu-log.png
   :alt: vLLM API server log message
   :align: center
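Once that log line appears, the server exposes an OpenAI-compatible HTTP API. A minimal smoke test for illustration; the default port is 8000, and the model name must match whatever the server was launched with:

```python
# Hypothetical request against a local vLLM OpenAI-compatible endpoint.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-2-7b-hf",  # placeholder: use the served model name
    "prompt": "ROCm is",
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```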
@@ -141,7 +141,7 @@ Installing vLLM

ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in CSV
format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
format. For more information, see :doc:`vllm-benchmark`.

.. _fine-tuning-llms-tgi:
@@ -1,6 +1,6 @@
.. meta::
   :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified
      ROCm Docker image.
   :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
      ROCm vLLM Docker image.
   :keywords: model, MAD, automation, dashboarding, validate

***********************************************************
@@ -385,19 +385,22 @@ Further reading
===============

- For application performance optimization strategies for HPC and AI workloads,
  including inference with vLLM, see :doc:`/how-to/tuning-guides/mi300x/workload`.
  including inference with vLLM, see :doc:`../inference-optimization/workload`.

- To learn more about the options for latency and throughput benchmark scripts,
  see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.

- To learn more about system settings and management practices to configure your system for
  MI300X accelerators, see :doc:`/how-to/system-optimization/mi300x`.
  MI300X accelerators, see :doc:`../../system-optimization/mi300x`.

- To learn how to run LLM models from Hugging Face or your own model, see
  :doc:`Using ROCm for AI </how-to/rocm-for-ai/index>`.
  :doc:`Using ROCm for AI <../index>`.

- To learn how to optimize inference on LLMs, see
  :doc:`Fine-tuning LLMs and inference optimization </how-to/llm-fine-tuning-optimization/index>`.
  :doc:`Inference optimization <../inference-optimization/index>`.

- To learn how to fine-tune LLMs, see
  :doc:`Fine-tuning LLMs <../fine-tuning/index>`.

- To compare with the previous version of the ROCm vLLM Docker image for performance validation, refer to
  `LLM inference performance validation on AMD Instinct MI300X (ROCm 6.2.0) <https://rocm.docs.amd.com/en/docs-6.2.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_.
docs/how-to/rocm-for-ai/training/index.rst (new file, 21 lines)
@@ -0,0 +1,21 @@
.. meta::
   :description: How to use ROCm for training models
   :keywords: ROCm, LLM, training, GPUs, training model, scaling model, usage, tutorial

=======================
Use ROCm for training
=======================

Training models is the process of teaching a computer program to recognize patterns in data. This involves providing the computer with large amounts of labeled data and allowing it to learn from that data, adjusting the model's parameters.

The process of training models is computationally intensive, requiring specialized hardware like GPUs to accelerate computations and reduce training time. Training models on AMD GPUs involves leveraging the parallel processing capabilities of these GPUs to significantly speed up the model training process in machine learning and deep learning tasks.

Training models on AMD GPUs with the ROCm™ software platform allows you to use the powerful parallel processing capabilities and efficient compute resource management, significantly improving training time and overall performance in machine learning applications.

The ROCm software platform makes it easier to train models on AMD GPUs while maintaining compatibility with existing code and tools. The platform also provides features like multi-GPU support, allowing for scaling and parallelization of model training across multiple GPUs to enhance performance. A minimal launch sketch follows the list below.

In this guide, you'll learn about:

- :doc:`Training a model <train-a-model>`

- :doc:`Scale model training <scale-model-training>`
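As a sketch of the multi-GPU scaling mentioned above (illustrative only): the script below would be launched with ``torchrun --nproc-per-node=8 train.py``, and it relies on RCCL backing PyTorch's ``nccl`` backend on ROCm:

```python
# Minimal DistributedDataParallel setup (train.py); one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")             # RCCL provides this backend on ROCm
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
```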
@@ -105,7 +105,7 @@ Fine-tuning your model
ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
example, LoRA, QLoRA, PEFT, and FSDP.

Learn more about challenges and solutions for model fine-tuning in :doc:`../llm-fine-tuning-optimization/index`.
Learn more about challenges and solutions for model fine-tuning in :doc:`../fine-tuning/index`.

The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU.
@@ -164,7 +164,7 @@ Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:

   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
   :width: 800

Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node is
@@ -174,7 +174,7 @@ recommended. So, a run on 8 GPUs looks something like:

   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
   :width: 800

Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
@@ -195,7 +195,7 @@ Use the following script to run the RCCL test for four MI300X GPU nodes. Modify
      -x NCCL_DEBUG=version \
      $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
   :width: 800

.. _mi300x-amd-megatron-lm-training:
@@ -264,7 +264,7 @@ end-of-document token, remove sentence splitting, and use the tokenizer type.
In this case, the automatically generated output files are named ``my-gpt2_text_document.bin`` and
``my-gpt2_text_document.idx``.

.. image:: ../../data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png
.. image:: ../../../data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png
   :width: 800

.. _amd-megatron-lm-environment-setup:
@@ -462,7 +462,7 @@ Benchmarking examples

See the sample output:

.. image:: ../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
.. image:: ../../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
   :width: 800

.. tab-item:: Multi node training
@@ -493,11 +493,11 @@ Benchmarking examples

Master node:

.. image:: ../../data/how-to/rocm-for-ai/2-node-training-master.png
.. image:: ../../../data/how-to/rocm-for-ai/2-node-training-master.png
   :width: 800

Worker node:

.. image:: ../../data/how-to/rocm-for-ai/2-node-training-worker.png
.. image:: ../../../data/how-to/rocm-for-ai/2-node-training-worker.png
   :width: 800
@@ -1,6 +1,6 @@
.. meta::
   :description: How to use ROCm for HPC
   :keywords: ROCm, AI, high performance computing, HPC
   :description: How to use ROCm for high-performance computing (HPC).
   :keywords: ROCm, AI, high performance computing, HPC, science, scientific

******************
Using ROCm for HPC
@@ -1,5 +1,5 @@
.. meta::
   :description: AMD hardware optimization for specific workloads
   :description: Learn about AMD hardware optimization for HPC-specific and workstation workloads.
   :keywords: high-performance computing, HPC, Instinct accelerators, Radeon,
      tuning, tuning guide, AMD, ROCm
@@ -1,9 +1,9 @@
<head>
  <meta charset="UTF-8">
  <meta name="description" content="MI100 high-performance computing and tuning guide">
  <meta name="keywords" content="MI100, high-performance computing, HPC, BIOS
  settings, NBIO, AMD, ROCm">
</head>
---
myst:
  html_meta:
    "description": "AMD Instinct MI100 system settings optimization guide."
    "keywords": "Instinct, MI100, microarchitecture, AMD, ROCm"
---

# AMD Instinct MI100 system optimization
@@ -1,9 +1,9 @@
<head>
  <meta charset="UTF-8">
  <meta name="description" content="MI200 high-performance computing and tuning guide">
  <meta name="keywords" content="MI200, high-performance computing, HPC, BIOS
  settings, NBIO, AMD, ROCm">
</head>
---
myst:
  html_meta:
    "description": "Learn about AMD Instinct MI200 system settings and performance tuning."
    "keywords": "Instinct, MI200, microarchitecture, AMD, ROCm"
---

# AMD Instinct MI200 system optimization
@@ -1,5 +1,5 @@
.. meta::
   :description: AMD Instinct MI300A system settings
   :description: Learn about AMD Instinct MI300A system settings and performance tuning.
   :keywords: AMD, Instinct, MI300A, HPC, tuning, BIOS settings, NBIO, ROCm,
      environment variable, performance, accelerator, GPU, EPYC, GRUB,
      operating system
@@ -1,5 +1,5 @@
.. meta::
   :description: AMD Instinct MI300X system settings
   :description: Learn about AMD Instinct MI300X system settings and performance tuning.
   :keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
      environment variable, performance, accelerator, GPU, EPYC, GRUB,
      operating system
@@ -35,7 +35,7 @@ functioning correctly before trying to improve its overall performance. In this
section, the settings discussed mostly ensure proper functionality of your
Instinct-based system. Some settings discussed are known to improve performance
for most applications running on a MI300X system. See
:doc:`/how-to/tuning-guides/mi300x/workload` for how to improve performance for
:doc:`../rocm-for-ai/inference-optimization/workload` for how to improve performance for
specific applications or workloads.

.. _mi300x-bios-settings:
@@ -1,9 +1,9 @@
<head>
  <meta charset="UTF-8">
  <meta name="description" content="RDNA2 workstation tuning guide">
  <meta name="keywords" content="RDNA2, workstation, BIOS settings, installation, AMD,
  ROCm">
</head>
---
myst:
  html_meta:
    "description": "Learn about system settings and performance tuning for RDNA2-based GPUs."
    "keywords": "RDNA2, workstation, desktop, BIOS, installation, Radeon, pro, v620, w6000"
---

# AMD RDNA2 system optimization
@@ -1,3 +1,7 @@
.. meta::
   :description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
   :keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning

************************
AMD MI300X tuning guides
************************
@@ -8,8 +12,8 @@ accelerators. They include detailed instructions on system settings and
application tuning suggestions to help you fully leverage the capabilities of
these accelerators, thereby achieving optimal performance.

* :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`
* :doc:`../../rocm-for-ai/inference/vllm-benchmark`

* :doc:`/how-to/tuning-guides/mi300x/system`
* :doc:`../../system-optimization/mi300x`

* :doc:`/how-to/tuning-guides/mi300x/workload`
* :doc:`../../rocm-for-ai/inference-optimization/workload`
@@ -39,7 +39,6 @@ ROCm documentation is organized into the following categories:

* [Use ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [Use ROCm for HPC](./how-to/rocm-for-hpc/index.rst)
* [Fine-tune LLMs and inference optimization](./how-to/llm-fine-tuning-optimization/index.rst)
* [System optimization](./how-to/system-optimization/index.rst)
* [AMD Instinct MI300X performance validation and tuning](./how-to/tuning-guides/mi300x/index.rst)
* [System debugging](./how-to/system-debugging.md)
@@ -36,40 +36,62 @@ subtrees:
      title: Use ROCm for AI
      subtrees:
      - entries:
        - file: how-to/rocm-for-ai/install.rst
          title: Installation
        - file: how-to/rocm-for-ai/train-a-model.rst
          title: Train a model
        - file: how-to/rocm-for-ai/scale-model-training.rst
          title: Scale model training
        - file: how-to/rocm-for-ai/hugging-face-models.rst
          title: Run models from Hugging Face
        - file: how-to/rocm-for-ai/deploy-your-model.rst
          title: Deploy your model
        - file: how-to/rocm-for-hpc/index.rst
          title: Use ROCm for HPC
        - file: how-to/llm-fine-tuning-optimization/index.rst
          title: Fine-tune LLMs and inference optimization
          subtrees:
          - entries:
            - file: how-to/llm-fine-tuning-optimization/overview.rst
              title: Conceptual overview
            - file: how-to/llm-fine-tuning-optimization/fine-tuning-and-inference.rst
        - file: how-to/rocm-for-ai/training/index.rst
          title: Training
          subtrees:
          - entries:
            - file: how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.rst
              title: Use a single accelerator
            - file: how-to/llm-fine-tuning-optimization/multi-gpu-fine-tuning-and-inference.rst
              title: Use multiple accelerators
            - file: how-to/llm-fine-tuning-optimization/model-quantization.rst
            - file: how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst
            - file: how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
            - file: how-to/llm-fine-tuning-optimization/optimizing-with-composable-kernel.md
              title: Optimize with Composable Kernel
            - file: how-to/llm-fine-tuning-optimization/optimizing-triton-kernel.rst
              title: Optimize Triton kernels
            - file: how-to/llm-fine-tuning-optimization/profiling-and-debugging.rst
              title: Profile and debug
            - file: how-to/rocm-for-ai/training/train-a-model.rst
              title: Train a model
            - file: how-to/rocm-for-ai/training/scale-model-training.rst
              title: Scale model training

        - file: how-to/rocm-for-ai/fine-tuning/index.rst
          title: Fine-tuning LLMs
          subtrees:
          - entries:
            - file: how-to/rocm-for-ai/fine-tuning/overview.rst
              title: Conceptual overview
            - file: how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst
              title: Fine-tuning
              subtrees:
              - entries:
                - file: how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst
                  title: Use a single accelerator
                - file: how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst
                  title: Use multiple accelerators

        - file: how-to/rocm-for-ai/inference/index.rst
          title: Inference
          subtrees:
          - entries:
            - file: how-to/rocm-for-ai/inference/install.rst
              title: Installation
            - file: how-to/rocm-for-ai/inference/hugging-face-models.rst
              title: Run models from Hugging Face
            - file: how-to/rocm-for-ai/inference/llm-inference-frameworks.rst
              title: LLM inference frameworks
            - file: how-to/rocm-for-ai/inference/vllm-benchmark.rst
              title: Performance validation
            - file: how-to/rocm-for-ai/inference/deploy-your-model.rst
              title: Deploy your model

        - file: how-to/rocm-for-ai/inference-optimization/index.rst
          title: Inference optimization
          subtrees:
          - entries:
            - file: how-to/rocm-for-ai/inference-optimization/model-quantization.rst
            - file: how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst
            - file: how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md
              title: Optimize with Composable Kernel
            - file: how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel.rst
              title: Optimize Triton kernels
            - file: how-to/rocm-for-ai/inference-optimization/profiling-and-debugging.rst
              title: Profile and debug
            - file: how-to/rocm-for-ai/inference-optimization/workload.rst
              title: Workload tuning

      - file: how-to/rocm-for-hpc/index.rst
        title: Use ROCm for HPC
      - file: how-to/system-optimization/index.rst
        title: System optimization
        subtrees:
@@ -86,14 +108,6 @@ subtrees:
          title: AMD RDNA 2
      - file: how-to/tuning-guides/mi300x/index.rst
        title: AMD MI300X performance validation and tuning
        subtrees:
        - entries:
          - file: how-to/performance-validation/mi300x/vllm-benchmark.rst
            title: Performance validation
          - file: how-to/tuning-guides/mi300x/system.rst
            title: System tuning
          - file: how-to/tuning-guides/mi300x/workload.rst
            title: Workload tuning
      - file: how-to/system-debugging.md
      - file: conceptual/compiler-topics.md
        title: Use advanced compiler features