add xref to vllm v1 optimization guide in workload.rst (#5560)
@@ -99,12 +99,14 @@ execution.
.. seealso::

   See :doc:`vllm-optimization` to learn more about vLLM performance
   optimization techniques.

.. _mi300x-auto-tune:

Auto-tunable configurations
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Auto-tunable configurations can significantly streamline performance
optimization by automatically adjusting parameters based on workload
characteristics. For example:
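
One such mechanism is PyTorch TunableOp (see :ref:`mi300x-tunableop`), which
benchmarks candidate GEMM implementations at runtime and reuses the fastest
one. The following is a minimal sketch of enabling it, assuming a ROCm build
of PyTorch; the matrix shape and results filename are only illustrative.

.. code-block:: python

   import os

   # Turn TunableOp on and choose where tuning results are persisted.
   os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
   os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # example path

   import torch

   a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
   b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

   # The first matmul of this shape triggers tuning; subsequent calls (and
   # later runs that load the results file) reuse the selected kernel.
   c = a @ b
   torch.cuda.synchronize()

Because the tuning results are persisted to the file, the one-time tuning cost
is not paid again on later runs.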
@@ -325,6 +327,22 @@ hardware counters are also included.

   ROCm Systems Profiler timeline trace example.

vLLM performance optimization
=============================

vLLM is a high-throughput and memory-efficient inference and serving engine for
large language models that has gained traction in the AI community for its
performance and ease of use. See :doc:`vllm-optimization`, where you'll learn
how to:

* Enable AITER (AI Tensor Engine for ROCm) to speed up LLM models.
* Configure environment variables for optimal HIP, RCCL, and Quick Reduce performance.
* Select the right attention backend for your workload (AITER MHA/MLA vs. Triton).
* Choose parallelism strategies (tensor, pipeline, data, expert) for multi-GPU deployments.
* Apply quantization (``FP8``/``FP4``) to reduce memory usage by 2-4× with minimal accuracy loss.
* Tune engine arguments (batch size, memory utilization, graph modes) for your use case.
* Benchmark and scale across single-node and multi-node configurations.
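
Several of these options surface directly through vLLM's Python API and
environment variables. The snippet below is a minimal sketch, assuming a ROCm
build of vLLM on an 8-GPU node; the model name is only a placeholder and
``VLLM_ROCM_USE_AITER`` is the usual opt-in switch for AITER kernels.

.. code-block:: python

   import os

   # Opt in to AITER kernels before the engine is created.
   os.environ["VLLM_ROCM_USE_AITER"] = "1"

   from vllm import LLM, SamplingParams

   llm = LLM(
       model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; use your model
       tensor_parallel_size=8,                     # one tensor-parallel rank per GPU
       quantization="fp8",                         # FP8 quantization
       gpu_memory_utilization=0.9,                 # fraction of VRAM given to the engine
   )

   outputs = llm.generate(
       ["Summarize the benefits of FP8 quantization."],
       SamplingParams(temperature=0.0, max_tokens=128),
   )
   print(outputs[0].outputs[0].text)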

.. _mi300x-tunableop:

PyTorch TunableOp