diff --git a/.wordlist.txt b/.wordlist.txt
index 17f11e77b..2b7b7eb70 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -276,6 +276,7 @@ OpenSSL
 OpenVX
 OpenXLA
 Oversubscription
+PagedAttention
 PCC
 PCI
 PCIe
diff --git a/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst b/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
index 3ee672353..84e839391 100644
--- a/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
+++ b/docs/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.rst
@@ -16,7 +16,7 @@ This section discusses how to implement `vLLM `_
 vLLM inference
 ==============
 
-vLLM is renowned for its paged attention algorithm that can reduce memory consumption and increase throughput thanks to
+vLLM is renowned for its PagedAttention algorithm that can reduce memory consumption and increase throughput thanks to
 its paging scheme. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token lengths of the
 models, the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths. This paged
 attention is also effective when multiple requests share the same key and value contents for a large value of beam search or
@@ -139,9 +139,7 @@ Refer to :ref:`mi300x-vllm-optimization` for performance optimization tips.
 
 ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
 on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
-format. For more information, see the guide to
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator `_
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _fine-tuning-llms-tgi:
 
diff --git a/docs/how-to/rocm-for-ai/deploy-your-model.rst b/docs/how-to/rocm-for-ai/deploy-your-model.rst
index 0435e83ee..7a96f32e0 100644
--- a/docs/how-to/rocm-for-ai/deploy-your-model.rst
+++ b/docs/how-to/rocm-for-ai/deploy-your-model.rst
@@ -46,9 +46,7 @@ Validating vLLM performance
 
 ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
 on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
-format. For more information, see the guide to
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator `_
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _rocm-for-ai-serve-hugging-face-tgi:
 
diff --git a/docs/how-to/tuning-guides/mi300x/workload.rst b/docs/how-to/tuning-guides/mi300x/workload.rst
index 6857eae1b..56ad0e98a 100644
--- a/docs/how-to/tuning-guides/mi300x/workload.rst
+++ b/docs/how-to/tuning-guides/mi300x/workload.rst
@@ -152,9 +152,7 @@ address any new bottlenecks that may emerge.
 
 ROCm provides a prebuilt optimized Docker image that has everything required to implement
 the tips in this section. It includes ROCm, vLLM, PyTorch, and tuning files in the CSV
-format. For more information, see the guide to
-`LLM inference performance validation with vLLM on the AMD Instinct™ MI300X accelerator `_
-on the ROCm GitHub repository.
+format. For more information, see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 .. _mi300x-profiling-tools:
 
@@ -378,11 +376,10 @@ Refer to `vLLM documentation `_
-on the ROCm GitHub repository.
+ROCm provides a prebuilt optimized Docker image for validating the performance
+of LLM inference with vLLM on the MI300X accelerator. The Docker image includes
+ROCm, vLLM, PyTorch, and tuning files in the CSV format. For more information,
+see :doc:`/how-to/performance-validation/mi300x/vllm-benchmark`.
 
 Maximize throughput
 -------------------