From 7d488322d9e3dc6fd3c7bd9bbf370c67ac8bf108 Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Wed, 30 Oct 2024 18:24:18 -0400
Subject: [PATCH] Update links to vllm perf validation doc (#3971)

* update links to vllm perf validation doc

* add PagedAttention to wordlist

(cherry picked from commit 0fe08d93d708fc1a19d5020c9f65669b692b15e7)

fix link
---
 .wordlist.txt                                 |  1 +
 .../llm-inference-frameworks.rst              |  6 ++----
 docs/how-to/rocm-for-ai/deploy-your-model.rst |  4 +---
 docs/how-to/tuning-guides/mi300x/workload.rst | 13 +++++--------
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index 17f11e77b..2b7b7eb70 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -276,6 +276,7 @@ OpenSSL
 OpenVX
 OpenXLA
 Oversubscription
+PagedAttention
 PCC
 PCI
 PCIe
diff --git a/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst b/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
index 3ee672353..84e839391 100644
--- a/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
+++ b/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
@@ -16,7 +16,7 @@ This section discusses how to implement `vLLM `_
 vLLM inference
 ==============
 
-vLLM is renowned for its paged attention algorithm that can reduce memory consumption and increase throughput thanks to
+vLLM is renowned for its PagedAttention algorithm that can reduce memory consumption and increase throughput thanks to
 its paging scheme. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token lengths of the
 models, the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths. This paged attention
 is also effective when multiple requests share the same key and value contents for a large value of beam search or
@@ -139,9 +139,7 @@ Refer to :ref:`mi300x-vllm-optimization` for performance optimization tips.
 
 ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
 on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
-format. For more information, see the guide to
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator `_
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _fine-tuning-llms-tgi:
 
diff --git a/docs/how-to/rocm-for-ai/deploy-your-model.rst b/docs/how-to/rocm-for-ai/deploy-your-model.rst
index 0435e83ee..7a96f32e0 100644
--- a/docs/how-to/rocm-for-ai/deploy-your-model.rst
+++ b/docs/how-to/rocm-for-ai/deploy-your-model.rst
@@ -46,9 +46,7 @@ Validating vLLM performance
 
 ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
 on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
-format. For more information, see the guide to
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator `_
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _rocm-for-ai-serve-hugging-face-tgi:
 
diff --git a/docs/how-to/tuning-guides/mi300x/workload.rst b/docs/how-to/tuning-guides/mi300x/workload.rst
index 6857eae1b..56ad0e98a 100644
--- a/docs/how-to/tuning-guides/mi300x/workload.rst
+++ b/docs/how-to/tuning-guides/mi300x/workload.rst
@@ -152,9 +152,7 @@ address any new bottlenecks that may emerge.
 
 ROCm provides a prebuilt optimized Docker image that has everything required to implement
 the tips in this section. It includes ROCm, vLLM, PyTorch, and tuning files in the CSV
-format. For more information, see the guide to
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator `_
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _mi300x-profiling-tools:
 
@@ -378,11 +376,10 @@ Refer to `vLLM documentation `_
-on the ROCm GitHub repository.
+ROCm provides a prebuilt optimized Docker image for validating the performance
+of LLM inference with vLLM on the MI300X accelerator. The Docker image includes
+ROCm, vLLM, PyTorch, and tuning files in the CSV format. For more information,
+see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 Maximize throughput
 -------------------