From 7d488322d9e3dc6fd3c7bd9bbf370c67ac8bf108 Mon Sep 17 00:00:00 2001
From: Peter Park <peter.park@amd.com>
Date: Wed, 30 Oct 2024 18:24:18 -0400
Subject: [PATCH] Update links to vllm perf validation doc (#3971)

* update links to vllm perf validation doc

* add PagedAttention to wordlist

(cherry picked from commit 0fe08d93d708fc1a19d5020c9f65669b692b15e7)

fix link
---
 .wordlist.txt                                       |  1 +
 .../llm-inference-frameworks.rst                    |  6 ++----
 docs/how-to/rocm-for-ai/deploy-your-model.rst       |  4 +---
 docs/how-to/tuning-guides/mi300x/workload.rst       | 13 +++++--------
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index 17f11e77b..2b7b7eb70 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -276,6 +276,7 @@ OpenSSL
 OpenVX
 OpenXLA
 Oversubscription
+PagedAttention
 PCC
 PCI
 PCIe
diff --git a/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst b/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
index 3ee672353..84e839391 100644
--- a/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
+++ b/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
@@ -16,7 +16,7 @@ This section discusses how to implement `vLLM <https://docs.vllm.ai/en/latest>`_
 vLLM inference
 ==============
 
-vLLM is renowned for its paged attention algorithm that can reduce memory consumption and increase throughput thanks to
+vLLM is renowned for its PagedAttention algorithm that can reduce memory consumption and increase throughput thanks to
 its paging scheme. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token lengths of the
 models, the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths. This paged attention
 is also effective when multiple requests share the same key and value contents for a large value of beam search or
@@ -139,9 +139,7 @@ Refer to :ref:`mi300x-vllm-optimization` for performance optimization tips.
 
 ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM 
 on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV 
-format. For more information, see the guide to 
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_ 
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _fine-tuning-llms-tgi:
 
diff --git a/docs/how-to/rocm-for-ai/deploy-your-model.rst b/docs/how-to/rocm-for-ai/deploy-your-model.rst
index 0435e83ee..7a96f32e0 100644
--- a/docs/how-to/rocm-for-ai/deploy-your-model.rst
+++ b/docs/how-to/rocm-for-ai/deploy-your-model.rst
@@ -46,9 +46,7 @@ Validating vLLM performance
 
 ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM 
 on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV 
-format. For more information, see the guide to 
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_ 
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _rocm-for-ai-serve-hugging-face-tgi:
 
diff --git a/docs/how-to/tuning-guides/mi300x/workload.rst b/docs/how-to/tuning-guides/mi300x/workload.rst
index 6857eae1b..56ad0e98a 100644
--- a/docs/how-to/tuning-guides/mi300x/workload.rst
+++ b/docs/how-to/tuning-guides/mi300x/workload.rst
@@ -152,9 +152,7 @@ address any new bottlenecks that may emerge.
 
 ROCm provides a prebuilt optimized Docker image that has everything required to implement
 the tips in this section. It includes ROCm, vLLM, PyTorch, and tuning files in the CSV 
-format. For more information, see the guide to 
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_ 
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _mi300x-profiling-tools:
 
@@ -378,11 +376,10 @@ Refer to `vLLM documentation <https://docs.vllm.ai/en/latest/models/performance.
 for additional performance tips. :ref:`fine-tuning-llms-vllm` describes vLLM
 usage with ROCm.
 
-ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM 
-on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV 
-format. For more information, see the guide to 
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_ 
-on the ROCm GitHub repository.
+ROCm provides a prebuilt optimized Docker image for validating the performance
+of LLM inference with vLLM on the MI300X accelerator. The Docker image includes
+ROCm, vLLM, PyTorch, and tuning files in the CSV format. For more information,
+see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 Maximize throughput
 -------------------