From 8d2d5abdaeee2dc2b519641105b934d93527fca9 Mon Sep 17 00:00:00 2001
From: peterjunpark
Date: Thu, 23 Oct 2025 11:51:55 -0400
Subject: [PATCH] add xref to vllm v1 optimization guide in workload.rst
 (#5560) (#5561)

(cherry picked from commit 90c1d9068f0f81ea2d8e71142db814d9a05b6781)
---
 .../inference-optimization/workload.rst       | 20 ++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst
index 7c4665113..53d84c63a 100644
--- a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst
+++ b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst
@@ -99,12 +99,14 @@ execution.
 
 .. seealso::
 
-   See :doc:`vllm-optimization`.
+   See :doc:`vllm-optimization` to learn more about vLLM performance
+   optimization techniques.
 
 .. _mi300x-auto-tune:
 
 Auto-tunable configurations
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 Auto-tunable configurations can significantly streamline performance
 optimization by automatically adjusting parameters based on workload
 characteristics. For example:
@@ -325,6 +327,22 @@ hardware counters are also included.
 
    ROCm Systems Profiler timeline trace example.
 
+vLLM performance optimization
+=============================
+
+vLLM is a high-throughput, memory-efficient inference and serving engine for
+large language models that has gained traction in the AI community for its
+performance and ease of use. See :doc:`vllm-optimization`, where you'll learn
+how to:
+
+* Enable AITER (AI Tensor Engine for ROCm) to speed up LLM inference.
+* Configure environment variables for optimal HIP, RCCL, and Quick Reduce performance.
+* Select the right attention backend for your workload (AITER MHA/MLA vs. Triton).
+* Choose parallelism strategies (tensor, pipeline, data, expert) for multi-GPU deployments.
+* Apply quantization (``FP8``/``FP4``) to reduce memory usage by 2-4× with minimal accuracy loss.
+* Tune engine arguments (batch size, memory utilization, graph modes) for your use case.
+* Benchmark and scale across single-node and multi-node configurations.
+
 .. _mi300x-tunableop:
 
 PyTorch TunableOp
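
To make the bullet list added by this patch concrete, here is a minimal sketch of how a few of those techniques combine in vLLM's offline Python API. The model name and parameter values are illustrative, and ``VLLM_ROCM_USE_AITER`` is assumed to be the AITER toggle on your vLLM build; :doc:`vllm-optimization` remains the authoritative reference.

.. code-block:: python

   import os

   # Assumed toggle for AITER kernels on ROCm builds of vLLM; set it
   # before vLLM is imported so the engine picks it up at init time.
   os.environ["VLLM_ROCM_USE_AITER"] = "1"

   from vllm import LLM, SamplingParams

   # Illustrative model and settings: tensor_parallel_size shards the
   # model across 2 GPUs, quantization="fp8" reduces weight memory, and
   # gpu_memory_utilization caps how much GPU memory the engine claims.
   llm = LLM(
       model="meta-llama/Llama-3.1-8B-Instruct",
       tensor_parallel_size=2,
       quantization="fp8",
       gpu_memory_utilization=0.9,
   )

   outputs = llm.generate(
       ["What does AITER stand for?"],
       SamplingParams(temperature=0.0, max_tokens=64),
   )
   print(outputs[0].outputs[0].text)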