From 8d2d5abdaeee2dc2b519641105b934d93527fca9 Mon Sep 17 00:00:00 2001
From: peterjunpark
Date: Thu, 23 Oct 2025 11:51:55 -0400
Subject: [PATCH] add xref to vllm v1 optimization guide in workload.rst
 (#5560) (#5561)

(cherry picked from commit 90c1d9068f0f81ea2d8e71142db814d9a05b6781)
---
 .../inference-optimization/workload.rst       | 20 ++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst
index 7c4665113..53d84c63a 100644
--- a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst
+++ b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst
@@ -99,12 +99,14 @@ execution.
 
 .. seealso::
 
-   See :doc:`vllm-optimization`.
+   See :doc:`vllm-optimization` to learn more about vLLM performance
+   optimization techniques.
 
 .. _mi300x-auto-tune:
 
 Auto-tunable configurations
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 Auto-tunable configurations can significantly streamline performance
 optimization by automatically adjusting parameters based on workload
 characteristics. For example:
@@ -325,6 +327,22 @@ hardware counters are also included.
 
    ROCm Systems Profiler timeline trace example.
 
+vLLM performance optimization
+=============================
+
+vLLM is a high-throughput, memory-efficient inference and serving engine for
+large language models that has gained traction in the AI community for its
+performance and ease of use. See :doc:`vllm-optimization`, where you'll learn
+how to:
+
+* Enable AITER (AI Tensor Engine for ROCm) to speed up LLM inference.
+* Configure environment variables for optimal HIP, RCCL, and Quick Reduce performance.
+* Select the right attention backend for your workload (AITER MHA/MLA vs. Triton).
+* Choose parallelism strategies (tensor, pipeline, data, expert) for multi-GPU deployments.
+* Apply quantization (``FP8``/``FP4``) to reduce memory usage by 2-4× with minimal accuracy loss.
+* Tune engine arguments (batch size, memory utilization, graph modes) for your use case.
+* Benchmark and scale across single-node and multi-node configurations.
+
 .. _mi300x-tunableop:
 
 PyTorch TunableOp
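
To make the bullet list added by this patch concrete, here is a minimal sketch of how a few of those techniques combine in vLLM's offline Python API. The model name and parameter values are illustrative, and ``VLLM_ROCM_USE_AITER`` is assumed to be the AITER toggle on your vLLM build; :doc:`vllm-optimization` remains the authoritative reference.

.. code-block:: python

   import os

   # Assumed toggle for AITER kernels on ROCm builds of vLLM; set it
   # before vLLM is imported so the engine picks it up at init time.
   os.environ["VLLM_ROCM_USE_AITER"] = "1"

   from vllm import LLM, SamplingParams

   # Illustrative model and settings: tensor_parallel_size shards the
   # model across 2 GPUs, quantization="fp8" reduces weight memory, and
   # gpu_memory_utilization caps how much GPU memory the engine claims.
   llm = LLM(
       model="meta-llama/Llama-3.1-8B-Instruct",
       tensor_parallel_size=2,
       quantization="fp8",
       gpu_memory_utilization=0.9,
   )

   outputs = llm.generate(
       ["What does AITER stand for?"],
       SamplingParams(temperature=0.0, max_tokens=64),
   )
   print(outputs[0].outputs[0].text)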