add suggestions to vllm perf validation doc (#3968)
(cherry picked from commit f1fb476f6f)
@@ -27,18 +27,41 @@ With this Docker image, you can quickly validate the expected inference
 performance numbers on the MI300X accelerator. This topic also provides tips on
 optimizing performance with popular AI models.

+.. hlist::
+   :columns: 6
+
+   * Llama 3.1 8B
+
+   * Llama 3.1 70B
+
+   * Llama 3.1 405B
+
+   * Llama 2 7B
+
+   * Llama 2 70B
+
+   * Mixtral 8x7B
+
+   * Mixtral 8x22B
+
+   * Mistral 7B
+
+   * Qwen2 7B
+
+   * Qwen2 72B
+
+   * JAIS 13B
+
+   * JAIS 30B
+
 .. _vllm-benchmark-vllm:

 .. note::

-   vLLM is a toolkit and library for LLM inference and
-   serving. It deploys the PagedAttention algorithm, which reduces memory
-   consumption and increases throughput by leveraging dynamic key and value
-   allocation in GPU memory. vLLM also incorporates many LLM acceleration
-   and quantization algorithms. In addition, AMD implements high-performance
-   custom kernels and modules in vLLM to enhance performance further. See
-   :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for more
-   information.
+   vLLM is a toolkit and library for LLM inference and serving. AMD implements
+   high-performance custom kernels and modules in vLLM to enhance performance.
+   See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
+   more information.

 Getting started
 ===============
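For context on the note above: PagedAttention is applied automatically whenever vLLM loads a model, so the behavior it describes needs no extra configuration. A minimal offline-inference sketch against vLLM's public Python API (the sampling values are illustrative assumptions; the model name mirrors the Llama 3.1 8B entry used later in this doc):

.. code-block:: python

   # Sketch only: vLLM manages the KV cache in fixed-size blocks
   # (PagedAttention), allocating GPU memory on demand rather than up front.
   from vllm import LLM, SamplingParams

   prompts = ["The MI300X accelerator is"]
   sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

   llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")

   for output in llm.generate(prompts, sampling_params):
       print(output.outputs[0].text)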
@@ -111,6 +134,7 @@ Available models
 ----------------

 .. hlist::
+   :columns: 3

    * ``pyt_vllm_llama-3.1-8b``
@@ -308,8 +332,8 @@ Here are some examples of running the benchmark with various options.
 See :ref:`Options <vllm-benchmark-standalone-options>` for the list of
 options and their descriptions.

-Latency benchmark example
-^^^^^^^^^^^^^^^^^^^^^^^^^
+Example 1: latency benchmark
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Use this command to benchmark the latency of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types.

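The benchmark command for this example is truncated from the hunk. As a rough stand-in, single-request latency can be probed directly from vLLM's offline API; this is a sketch with assumed prompt and output lengths, not the report script the doc invokes:

.. code-block:: python

   import time

   from vllm import LLM, SamplingParams

   llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")
   params = SamplingParams(max_tokens=128, ignore_eos=True)  # fixed decode length
   prompt = "Explain PagedAttention briefly."

   llm.generate([prompt], params)  # warmup so one-time setup doesn't skew timing

   start = time.perf_counter()
   llm.generate([prompt], params)
   print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")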
@@ -324,8 +348,8 @@ Find the latency reports at:

 - ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv``

-Throughput benchmark example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Example 2: throughput benchmark
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Use this command to benchmark the throughput of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types.

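Likewise for throughput, a sketch that measures generated tokens per second over one batch through the offline API (the batch size and sequence lengths are assumptions, not the benchmark script's defaults):

.. code-block:: python

   import time

   from vllm import LLM, SamplingParams

   llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")
   params = SamplingParams(max_tokens=128, ignore_eos=True)
   prompts = ["Summarize the ROCm software stack."] * 64  # one batch of requests

   start = time.perf_counter()
   outputs = llm.generate(prompts, params)
   elapsed = time.perf_counter() - start

   generated = sum(len(o.outputs[0].token_ids) for o in outputs)
   print(f"throughput: {generated / elapsed:.1f} generated tokens/s")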
@@ -366,9 +390,6 @@ Further reading
-- To learn more about the options for latency and throughput benchmark scripts,
-  see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
-
 - For application performance optimization strategies for HPC and AI workloads,
   including inference with vLLM, see :doc:`/how-to/tuning-guides/mi300x/workload`.

 - To learn more about system settings and management practices to configure your system for
   MI300X accelerators, see :doc:`/how-to/system-optimization/mi300x`.