Docs: Add Device Major/Minor Versions to gpu-arch-spec.rst

Update vLLM benchmarking guide (#4347)

* update vllm-benchmark
  fix hlist overflow
  update standalone benchmarking options
  update list of models
  fix typo and model name
  unnecessary duplicate info
  update formatting
  update vllm benchmark guide
  - remove Llama 2 FP8
  - add Jais 13B
  - update commands
  update docker pull tag
  update MAD available models
  remove extra mad models not relevant to vllm
  update PyTorch version
  add changelog
  add model names to .wordlist.txt
* Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst
  Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
* Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst
  Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
* Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst
  Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
* fix typo
* update link
* fix link text
* change changelog to previous versions
* fix typo
* remove "for"
---------
Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
2026-01-09 22:58:17 -05:00 · 2025-02-13 14:24:00 +01:00 · 2025-02-05 17:18:35 -05:00 · 2025-02-05 16:40:31 -05:00 · 2025-02-05 14:46:16 -05:00
5 changed files with 278 additions and 106 deletions
--- a/.github/workflows/issue_retrieval.yml
+++ b/.github/workflows/issue_retrieval.yml
@@ -2,7 +2,7 @@ name: Issue retrieval

 on:
  issues:
-    types: [opened]
+    types: [opened, edited]

 jobs:
  auto-retrieve:
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -74,6 +74,7 @@ Conda
 ConnectX
 CuPy
 Dashboarding
+DBRX
 DDR
 DF
 DGEMM
@@ -92,6 +93,7 @@ DataFrame
 DataLoader
 DataParallel
 Debian
+DeepSeek
 DeepSpeed
 Dependabot
 Deprecations
@@ -129,6 +131,7 @@ GDS
 GEMM
 GEMMs
 GFortran
+Gemma
 GiB
 GIM
 GL
--- a/docs/about/license.md
+++ b/docs/about/license.md
@@ -62,7 +62,7 @@ additional licenses. Please review individual repositories for more information.
 | [rocJPEG](https://github.com/ROCm/rocJPEG/) | [MIT](https://github.com/ROCm/rocJPEG/blob/develop/LICENSE) |
 | [ROCK-Kernel-Driver](https://github.com/ROCm/ROCK-Kernel-Driver/) | [GPL 2.0 WITH Linux-syscall-note](https://github.com/ROCm/ROCK-Kernel-Driver/blob/master/COPYING) |
 | [rocminfo](https://github.com/ROCm/rocminfo/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocminfo/blob/amd-staging/License.txt) |
-| [ROCm Bandwidth Test](https://github.com/ROCm/rocm_bandwidth_test/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocm_bandwidth_test/blob/master/LICENSE.txt) |
+| [ROCm Bandwidth Test](https://github.com/ROCm/rocm_bandwidth_test/) | [MIT](https://github.com/ROCm/rocm_bandwidth_test/blob/master/LICENSE.txt) |
 | [ROCm CMake](https://github.com/ROCm/rocm-cmake/) | [MIT](https://github.com/ROCm/rocm-cmake/blob/develop/LICENSE) |
 | [ROCm Communication Collectives Library (RCCL)](https://github.com/ROCm/rccl/) | [Custom](https://github.com/ROCm/rccl/blob/develop/LICENSE.txt) |
 | [ROCm-Core](https://github.com/ROCm/rocm-core) | [MIT](https://github.com/ROCm/rocm-core/blob/master/copyright) |
--- a/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst
+++ b/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst
@@ -10,49 +10,22 @@ LLM inference performance validation on AMD Instinct MI300X
 .. _vllm-benchmark-unified-docker:

 The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
-a prebuilt, optimized environment designed for validating large language model
-(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This
-ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the
-MI300X accelerator and includes the following components:
+a prebuilt, optimized environment for validating large language model (LLM)
+inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM
+Docker image integrates vLLM and PyTorch tailored specifically for the MI300X
+accelerator and includes the following components:

-* `ROCm 6.2.1 <https://github.com/ROCm/ROCm>`_
+* `ROCm 6.3.1 <https://github.com/ROCm/ROCm>`_

-* `vLLM 0.6.4 <https://docs.vllm.ai/en/latest>`_
+* `vLLM 0.6.6 <https://docs.vllm.ai/en/latest>`_

-* `PyTorch 2.5.0 <https://github.com/pytorch/pytorch>`_
-
-* Tuning files (in CSV format)
+* `PyTorch 2.7.0 (2.7.0a0+git3a58512) <https://github.com/pytorch/pytorch>`_

 With this Docker image, you can quickly validate the expected inference
-performance numbers on the MI300X accelerator. This topic also provides tips on
-optimizing performance with popular AI models.
-
-.. hlist::
-   :columns: 6
-
-   * Llama 3.1 8B
-
-   * Llama 3.1 70B
-
-   * Llama 3.1 405B
-
-   * Llama 2 7B
-
-   * Llama 2 70B
-
-   * Mixtral 8x7B
-
-   * Mixtral 8x22B
-
-   * Mixtral 7B
-
-   * Qwen2 7B
-
-   * Qwen2 72B
-
-   * JAIS 13B
-
-   * JAIS 30B
+performance numbers for the MI300X accelerator. This topic also provides tips on
+optimizing performance with popular AI models. For more information, see the lists of
+:ref:`available models for MAD-integrated benchmarking <vllm-benchmark-mad-models>`
+and :ref:`standalone benchmarking <vllm-benchmark-standalone-options>`.

 .. _vllm-benchmark-vllm:

@@ -91,9 +64,9 @@ MI300X accelerator with the prebuilt vLLM Docker image.

   .. code-block:: shell

-      docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
+      docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6

-Once setup is complete, you can choose between two options to reproduce the
+Once the setup is complete, choose between two options to reproduce the
 benchmark results:

 -  :ref:`MAD-integrated benchmarking <vllm-benchmark-mad>`
@@ -130,45 +103,89 @@ Although the following models are preconfigured to collect latency and
 throughput performance data, you can also change the benchmarking parameters.
 Refer to the :ref:`Standalone benchmarking <vllm-benchmark-standalone>` section.

+.. _vllm-benchmark-mad-models:
+
 Available models
 ----------------

-.. hlist::
-   :columns: 3
+.. list-table::
+   :header-rows: 1
+   :widths: 2, 3

-   * ``pyt_vllm_llama-3.1-8b``
+   * - Model name
+     - Tag

-   * ``pyt_vllm_llama-3.1-70b``
+   * - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
+     - ``pyt_vllm_llama-3.1-8b``

-   * ``pyt_vllm_llama-3.1-405b``
+   * - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
+     - ``pyt_vllm_llama-3.1-70b``

-   * ``pyt_vllm_llama-2-7b``
+   * - `Llama 3.1 405B <https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct>`_
+     - ``pyt_vllm_llama-3.1-405b``

-   * ``pyt_vllm_llama-2-70b``
+   * - `Llama 3.2 11B Vision <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`_
+     - ``pyt_vllm_llama-3.2-11b-vision-instruct``

-   * ``pyt_vllm_mixtral-8x7b``
+   * - `Llama 2 7B <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`_
+     - ``pyt_vllm_llama-2-7b``

-   * ``pyt_vllm_mixtral-8x22b``
+   * - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`_
+     - ``pyt_vllm_llama-2-70b``

-   * ``pyt_vllm_mistral-7b``
+   * - `Mixtral MoE 8x7B <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`_
+     - ``pyt_vllm_mixtral-8x7b``

-   * ``pyt_vllm_qwen2-7b``
+   * - `Mixtral MoE 8x22B <https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1>`_
+     - ``pyt_vllm_mixtral-8x22b``

-   * ``pyt_vllm_qwen2-72b``
+   * - `Mistral 7B <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`_
+     - ``pyt_vllm_mistral-7b``

-   * ``pyt_vllm_jais-13b``
+   * - `Qwen2 7B <https://huggingface.co/Qwen/Qwen2-7B-Instruct>`_
+     - ``pyt_vllm_qwen2-7b``

-   * ``pyt_vllm_jais-30b``
+   * - `Qwen2 72B <https://huggingface.co/Qwen/Qwen2-72B-Instruct>`_
+     - ``pyt_vllm_qwen2-72b``

-   * ``pyt_vllm_llama-3.1-8b_fp8``
+   * - `JAIS 13B <https://huggingface.co/core42/jais-13b-chat>`_
+     - ``pyt_vllm_jais-13b``

-   * ``pyt_vllm_llama-3.1-70b_fp8``
+   * - `JAIS 30B <https://huggingface.co/core42/jais-30b-chat-v3>`_
+     - ``pyt_vllm_jais-30b``

-   * ``pyt_vllm_llama-3.1-405b_fp8``
+   * - `DBRX Instruct <https://huggingface.co/databricks/dbrx-instruct>`_
+     - ``pyt_vllm_dbrx-instruct``

-   * ``pyt_vllm_mixtral-8x7b_fp8``
+   * - `Gemma 2 27B <https://huggingface.co/google/gemma-2-27b>`_
+     - ``pyt_vllm_gemma-2-27b``

-   * ``pyt_vllm_mixtral-8x22b_fp8``
+   * - `C4AI Command R+ 08-2024 <https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024>`_
+     - ``pyt_vllm_c4ai-command-r-plus-08-2024``
+
+   * - `DeepSeek MoE 16B <https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat>`_
+     - ``pyt_vllm_deepseek-moe-16b-chat``
+
+   * - `Llama 3.1 70B FP8 <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`_
+     - ``pyt_vllm_llama-3.1-70b_fp8``
+
+   * - `Llama 3.1 405B FP8 <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`_
+     - ``pyt_vllm_llama-3.1-405b_fp8``
+
+   * - `Mixtral MoE 8x7B FP8 <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`_
+     - ``pyt_vllm_mixtral-8x7b_fp8``
+
+   * - `Mixtral MoE 8x22B FP8 <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`_
+     - ``pyt_vllm_mixtral-8x22b_fp8``
+
+   * - `Mistral 7B FP8 <https://huggingface.co/amd/Mistral-7B-v0.1-FP8-KV>`_
+     - ``pyt_vllm_mistral-7b_fp8``
+
+   * - `DBRX Instruct FP8 <https://huggingface.co/amd/dbrx-instruct-FP8-KV>`_
+     - ``pyt_vllm_dbrx_fp8``
+
+   * - `C4AI Command R+ 08-2024 FP8 <https://huggingface.co/amd/c4ai-command-r-plus-FP8-KV>`_
+     - ``pyt_vllm_command-r-plus_fp8``

 .. _vllm-benchmark-standalone:

@@ -181,8 +198,8 @@ snippet.

 .. code-block::

-   docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
-   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.4 rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
+   docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
+   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.6 rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6

 In the Docker container, clone the ROCm MAD repository and navigate to the
 benchmark scripts directory at ``~/MAD/scripts/vllm``.
@@ -224,8 +241,8 @@ See the :ref:`examples <vllm-benchmark-run-benchmark>` for more information.

 .. _vllm-benchmark-standalone-options:

-Options
-------
+Options and available models
+----------------------------

 .. list-table::
   :header-rows: 1
@@ -248,72 +265,100 @@ Options
     - Measure both throughput and latency

   * - ``$model_repo``
-     - ``meta-llama/Meta-Llama-3.1-8B-Instruct``
-     - Llama 3.1 8B
+     - ``meta-llama/Llama-3.1-8B-Instruct``
+     - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_

   * - (``float16``)
-     - ``meta-llama/Meta-Llama-3.1-70B-Instruct``
-     - Llama 3.1 70B
+     - ``meta-llama/Llama-3.1-70B-Instruct``
+     - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_

   * -
-     - ``meta-llama/Meta-Llama-3.1-405B-Instruct``
-     - Llama 3.1 405B
+     - ``meta-llama/Llama-3.1-405B-Instruct``
+     - `Llama 3.1 405B <https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct>`_
+
+   * -
+     - ``meta-llama/Llama-3.2-11B-Vision-Instruct``
+     - `Llama 3.2 11B Vision <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`_

   * -
     - ``meta-llama/Llama-2-7b-chat-hf``
-     - Llama 2 7B
+     - `Llama 2 7B <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`_

   * -
     - ``meta-llama/Llama-2-70b-chat-hf``
-     - Llama 2 70B
+     - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`_

   * -
     - ``mistralai/Mixtral-8x7B-Instruct-v0.1``
-     - Mixtral 8x7B
+     - `Mixtral MoE 8x7B <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`_

   * -
     - ``mistralai/Mixtral-8x22B-Instruct-v0.1``
-     - Mixtral 8x22B
+     - `Mixtral MoE 8x22B <https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1>`_

   * -
     - ``mistralai/Mistral-7B-Instruct-v0.3``
-     - Mixtral 7B
+     - `Mistral 7B <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`_

   * -
     - ``Qwen/Qwen2-7B-Instruct``
-     - Qwen2 7B
+     - `Qwen2 7B <https://huggingface.co/Qwen/Qwen2-7B-Instruct>`_

   * -
     - ``Qwen/Qwen2-72B-Instruct``
-     - Qwen2 72B
+     - `Qwen2 72B <https://huggingface.co/Qwen/Qwen2-72B-Instruct>`_

   * -
     - ``core42/jais-13b-chat``
-     - JAIS 13B
+     - `JAIS 13B <https://huggingface.co/core42/jais-13b-chat>`_

   * -
     - ``core42/jais-30b-chat-v3``
-     - JAIS 30B
-
-   * - ``$model_repo``
-     - ``amd/Meta-Llama-3.1-8B-Instruct-FP8-KV``
-     - Llama 3.1 8B
-
-   * - (``float8``)
-     - ``amd/Meta-Llama-3.1-70B-Instruct-FP8-KV``
-     - Llama 3.1 70B
+     - `JAIS 30B <https://huggingface.co/core42/jais-30b-chat-v3>`_

   * -
-     - ``amd/Meta-Llama-3.1-405B-Instruct-FP8-KV``
-     - Llama 3.1 405B
+     - ``databricks/dbrx-instruct``
+     - `DBRX Instruct <https://huggingface.co/databricks/dbrx-instruct>`_
+
+   * -
+     - ``google/gemma-2-27b``
+     - `Gemma 2 27B <https://huggingface.co/google/gemma-2-27b>`_
+
+   * -
+     - ``CohereForAI/c4ai-command-r-plus-08-2024``
+     - `C4AI Command R+ 08-2024 <https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024>`_
+
+   * -
+     - ``deepseek-ai/deepseek-moe-16b-chat``
+     - `DeepSeek MoE 16B <https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat>`_
+
+   * - ``$model_repo``
+     - ``amd/Llama-3.1-70B-Instruct-FP8-KV``
+     - `Llama 3.1 70B FP8 <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`_
+
+   * - (``float8``)
+     - ``amd/Llama-3.1-405B-Instruct-FP8-KV``
+     - `Llama 3.1 405B FP8 <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`_

   * -
     - ``amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV``
-     - Mixtral 8x7B
+     - `Mixtral MoE 8x7B FP8 <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`_

   * -
     - ``amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV``
-     - Mixtral 8x22B
+     - `Mixtral MoE 8x22B FP8 <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`_
+
+   * -
+     - ``amd/Mistral-7B-v0.1-FP8-KV``
+     - `Mistral 7B FP8 <https://huggingface.co/amd/Mistral-7B-v0.1-FP8-KV>`_
+
+   * -
+     - ``amd/dbrx-instruct-FP8-KV``
+     - `DBRX Instruct FP8 <https://huggingface.co/amd/dbrx-instruct-FP8-KV>`_
+
+   * -
+     - ``amd/c4ai-command-r-plus-FP8-KV``
+     - `C4AI Command R+ 08-2024 FP8 <https://huggingface.co/amd/c4ai-command-r-plus-FP8-KV>`_

   * - ``$num_gpu``
     - 1 or 8
@@ -335,34 +380,34 @@ options and their descriptions.
 Example 1: latency benchmark
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Use this command to benchmark the latency of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types.
+Use this command to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types.

 .. code-block::

-   ./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
-   ./vllm_benchmark_report.sh -s latency -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
+   ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
+   ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8

 Find the latency reports at:

- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv``
+- ``./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv``

- ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv``
+- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv``

 Example 2: throughput benchmark
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Use this command to benchmark the throughput of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types.
+Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types.

 .. code-block:: shell

-   ./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
-   ./vllm_benchmark_report.sh -s throughput -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
+   ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
+   ./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8

 Find the throughput reports at:

- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv``
+- ``./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv``

- ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv``
+- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv``

 .. raw:: html

@@ -394,7 +439,7 @@ Further reading
  MI300X accelerators, see :doc:`../../system-optimization/mi300x`.

 - To learn how to run LLM models from Hugging Face or your own model, see
-  :doc:`Using ROCm for AI <../index>`.
+  :doc:`Running models from Hugging Face <hugging-face-models>`.

 - To learn how to optimize inference on LLMs, see
  :doc:`Inference optimization <../inference-optimization/index>`.
@@ -402,6 +447,32 @@ Further reading
 - To learn how to fine-tune LLMs, see
  :doc:`Fine-tuning LLMs <../fine-tuning/index>`.

- To compare with the previous version of the ROCm vLLM Docker image for performance validation, refer to
-  `LLM inference performance validation on AMD Instinct MI300X (ROCm 6.2.0) <https://rocm.docs.amd.com/en/docs-6.2.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_.
+Previous versions
+=================

+This table lists previous versions of the ROCm vLLM Docker image for inference
+performance validation. For detailed information about available models for
+benchmarking, see the version-specific documentation.
+
+.. list-table::
+   :header-rows: 1
+   :stub-columns: 1
+
+   * - ROCm version
+     - vLLM version
+     - PyTorch version
+     - Resources
+
+   * - 6.2.1
+     - 0.6.4
+     - 2.5.0
+     - 
+       * `Documentation <https://rocm.docs.amd.com/en/docs-6.3.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_
+       * `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4/images/sha256-ccbb74cc9e7adecb8f7bdab9555f7ac6fc73adb580836c2a35ca96ff471890d8>`_
+
+   * - 6.2.0
+     - 0.4.3
+     - 2.4.0
+     -
+       * `Documentation <https://rocm.docs.amd.com/en/docs-6.2.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_
+       * `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50/images/sha256-9e4dd4788a794c3d346d7d0ba452ae5e92d39b8dfac438b2af8efdc7f15d22c0>`_
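The two benchmark examples in the guide follow one pattern: the script takes ``-s`` (scenario), ``-m`` (model repository), ``-g`` (GPU count), and ``-d`` (data type), and the report lands under ``./reports_<dtype>/summary/``, named after the last component of the model repository. A minimal Python sketch of that pattern follows. The helper names ``benchmark_command`` and ``report_path`` are hypothetical, the path layout is inferred only from the four report paths shown in the examples, and the report names produced by the ``all`` scenario are not shown in the guide, so they are not covered here.

```python
import shlex


def benchmark_command(scenario: str, model_repo: str, num_gpu: int, dtype: str) -> str:
    """Compose a vllm_benchmark_report.sh call with the -s/-m/-g/-d flags from the guide."""
    args = [
        "./vllm_benchmark_report.sh",
        "-s", scenario,
        "-m", model_repo,
        "-g", str(num_gpu),
        "-d", dtype,
    ]
    return shlex.join(args)


def report_path(scenario: str, model_repo: str, dtype: str) -> str:
    """Guess the summary CSV location (assumption: layout matches the guide's examples)."""
    # The examples name the report after the repository's last path component.
    model_name = model_repo.split("/")[-1]
    return f"./reports_{dtype}/summary/{model_name}_{scenario}_report.csv"


if __name__ == "__main__":
    for repo, dtype in [
        ("meta-llama/Llama-3.1-70B-Instruct", "float16"),
        ("amd/Llama-3.1-70B-Instruct-FP8-KV", "float8"),
    ]:
        print(benchmark_command("latency", repo, 8, dtype))
        print("  report:", report_path("latency", repo, dtype))
```

Keeping the command and the expected report path in one place makes it easier to loop over several models and collect the CSV files afterwards.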
--- a/docs/reference/gpu-arch-specs.rst
+++ b/docs/reference/gpu-arch-specs.rst
@@ -21,6 +21,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Model
          - Architecture
          - LLVM target name
+          - Device Major version
+          - Device Minor version
          - VRAM (GiB)
          - Compute Units
          - Wavefront Size
@@ -36,6 +38,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI325X
          - CDNA3
          - gfx942
+          - 9
+          - 4
          - 256
          - 304 (38 per XCD)
          - 64
@@ -51,6 +55,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI300X
          - CDNA3
          - gfx942
+          - 9
+          - 4
          - 192
          - 304 (38 per XCD)
          - 64
@@ -66,6 +72,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI300A
          - CDNA3
          - gfx942
+          - 9
+          - 4
          - 128
          - 228 (38 per XCD)
          - 64
@@ -81,6 +89,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI250X
          - CDNA2
          - gfx90a
+          - 9
+          - 0
          - 128
          - 220 (110 per GCD)
          - 64
@@ -96,6 +106,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI250
          - CDNA2
          - gfx90a
+          - 9
+          - 0
          - 128
          - 208 (104 per GCD)
          - 64
@@ -111,6 +123,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI210
          - CDNA2
          - gfx90a
+          - 9
+          - 0
          - 64
          - 104
          - 64
@@ -126,6 +140,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI100
          - CDNA
          - gfx908
+          - 9
+          - 0
          - 32
          - 120
          - 64
@@ -141,6 +157,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI60
          - GCN5.1
          - gfx906
+          - 9
+          - 0
          - 32
          - 64
          - 64
@@ -156,6 +174,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI50 (32GB)
          - GCN5.1
          - gfx906
+          - 9
+          - 0
          - 32
          - 60
          - 64
@@ -171,6 +191,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI50 (16GB)
          - GCN5.1
          - gfx906
+          - 9
+          - 0
          - 16
          - 60
          - 64
@@ -186,6 +208,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI25
          - GCN5.0
          - gfx900
+          - 9
+          - 0
          - 16 
          - 64
          - 64
@@ -201,6 +225,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI8
          - GCN3.0
          - gfx803
+          - 8
+          - 0
          - 4
          - 64
          - 64
@@ -216,6 +242,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - MI6
          - GCN4.0
          - gfx803
+          - 8
+          - 0
          - 16
          - 36
          - 64
@@ -238,6 +266,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Model
          - Architecture
          - LLVM target name
+          - Device Major version
+          - Device Minor version
          - VRAM (GiB)
          - Compute Units
          - Wavefront Size
@@ -254,6 +284,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO V710
          - RDNA3
          - gfx1101
+          - 11
+          - 0
          - 28
          - 54
          - 32
@@ -270,6 +302,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO W7900 Dual Slot
          - RDNA3
          - gfx1100
+          - 11
+          - 0
          - 48
          - 96
          - 32
@@ -286,6 +320,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO W7900
          - RDNA3
          - gfx1100
+          - 11
+          - 0
          - 48
          - 96
          - 32
@@ -302,6 +338,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO W7800
          - RDNA3
          - gfx1100
+          - 11
+          - 0
          - 32
          - 70
          - 32
@@ -318,6 +356,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO W7700
          - RDNA3
          - gfx1101
+          - 11
+          - 0
          - 16
          - 48
          - 32
@@ -334,6 +374,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO W6800
          - RDNA2
          - gfx1030
+          - 10
+          - 3
          - 32
          - 60
          - 32
@@ -350,6 +392,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO W6600
          - RDNA2
          - gfx1032
+          - 10
+          - 3
          - 8
          - 28
          - 32
@@ -366,6 +410,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon PRO V620
          - RDNA2
          - gfx1030
+          - 10
+          - 3
          - 32
          - 72
          - 32
@@ -382,6 +428,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon Pro W5500
          - RDNA
          - gfx1012
+          - 10
+          - 1
          - 8
          - 22
          - 32
@@ -398,6 +446,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon Pro VII
          - GCN5.1
          - gfx906
+          - 9
+          - 0
          - 16
          - 60
          - 64
@@ -421,6 +471,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Model
          - Architecture
          - LLVM target name
+          - Device Major version
+          - Device Minor version
          - VRAM (GiB)
          - Compute Units
          - Wavefront Size
@@ -437,6 +489,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 7900 XTX
          - RDNA3
          - gfx1100
+          - 11
+          - 0
          - 24
          - 96
          - 32
@@ -453,6 +507,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 7900 XT
          - RDNA3
          - gfx1100
+          - 11
+          - 0
          - 20
          - 84
          - 32
@@ -469,6 +525,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 7900 GRE
          - RDNA3
          - gfx1100
+          - 11
+          - 0
          - 16
          - 80
          - 32
@@ -485,6 +543,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 7800 XT
          - RDNA3
          - gfx1101
+          - 11
+          - 0
          - 16
          - 60
          - 32
@@ -501,6 +561,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 7700 XT
          - RDNA3
          - gfx1101
+          - 11
+          - 0
          - 12
          - 54
          - 32
@@ -517,6 +579,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 7600
          - RDNA3
          - gfx1102
+          - 11
+          - 0
          - 8
          - 32
          - 32
@@ -533,6 +597,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6950 XT
          - RDNA2
          - gfx1030
+          - 10
+          - 3
          - 16
          - 80
          - 32
@@ -549,6 +615,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6900 XT
          - RDNA2
          - gfx1030
+          - 10
+          - 3
          - 16
          - 80
          - 32
@@ -565,6 +633,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6800 XT
          - RDNA2
          - gfx1030
+          - 10
+          - 3
          - 16
          - 72
          - 32
@@ -581,6 +651,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6800
          - RDNA2
          - gfx1030
+          - 10
+          - 3
          - 16
          - 60
          - 32
@@ -597,6 +669,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6750 XT
          - RDNA2
          - gfx1031
+          - 10
+          - 3
          - 12
          - 40
          - 32
@@ -613,6 +687,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6700 XT
          - RDNA2
          - gfx1031
+          - 10
+          - 3
          - 12
          - 40
          - 32
@@ -630,6 +706,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - RDNA2
          - gfx1031
          - 10
+          - 3
+          - 10
          - 36
          - 32
          - 128
@@ -645,6 +723,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6650 XT
          - RDNA2
          - gfx1032
+          - 10
+          - 3
          - 8
          - 32
          - 32
@@ -661,6 +741,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6600 XT
          - RDNA2
          - gfx1032
+          - 10
+          - 3
          - 8
          - 32
          - 32
@@ -677,6 +759,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon RX 6600
          - RDNA2
          - gfx1032
+          - 10
+          - 3
          - 8
          - 28
          - 32
@@ -693,6 +777,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
          - Radeon VII
          - GCN5.1
          - gfx906
+          - 9
+          - 0
          - 16
          - 60
          - 64
@@ -710,7 +796,7 @@ Glossary
 ========

 For more information about the terms used, see the
-:ref:`specific documents and guides <gpu-arch-documentation>`, or 
+:ref:`specific documents and guides <gpu-arch-documentation>`, or
 :doc:`Understanding the HIP programming model<hip:understand/programming_model>`.

 **LLVM target name**
@@ -718,6 +804,18 @@ For more information about the terms used, see the
 Argument to pass to clang in ``--offload-arch`` to compile code for the given
 architecture.

+**Device major version**
+
+Indicates the core instruction set of the GPU architecture. For example, a value
+of 11 would correspond to Navi III (RDNA3).
+
+**Device minor version**
+
+Indicates a particular configuration, feature set, or variation within the group
+represented by the device major version. For example, different models within
+the same major version might have varying levels of support for certain features
+or optimizations.
+
 **VRAM**

 Amount of memory available on the GPU.
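The major and minor values added to the tables line up with the LLVM target names in the same rows: ``gfx942`` is listed as 9 and 4, ``gfx90a`` as 9 and 0, ``gfx1030`` as 10 and 3. The Python sketch below assumes the target name has the form ``gfx<major><minor><stepping>``, where minor and stepping are one hexadecimal digit each. That layout is an assumption checked only against the rows in this change, and the helper name ``split_gfx_target`` is hypothetical.

```python
import re

# Assumed layout: "gfx" + decimal major + one hex digit minor + one hex digit stepping.
_GFX_PATTERN = re.compile(r"^gfx(\d+)([0-9a-f])([0-9a-f])$")


def split_gfx_target(target: str) -> tuple[int, int, str]:
    """Split an LLVM target name into (major, minor, stepping)."""
    match = _GFX_PATTERN.match(target)
    if match is None:
        raise ValueError(f"unrecognized target name: {target!r}")
    major, minor, stepping = match.groups()
    # The stepping is kept as text because it can be a letter, as in gfx90a.
    return int(major), int(minor, 16), stepping


if __name__ == "__main__":
    for name in ["gfx942", "gfx90a", "gfx908", "gfx803", "gfx1030", "gfx1101"]:
        major, minor, stepping = split_gfx_target(name)
        print(f"{name}: major={major} minor={minor} stepping={stepping}")
```

The regular expression relies on backtracking: the major group first grabs every digit, then gives digits back until exactly two characters remain for minor and stepping.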
Author	SHA1	Message	Date
Adel Johar	1499f74c22	Docs: Add Device Major/Minor Versions to gpu-arch-spec.rst	2025-02-13 14:24:00 +01:00
Peter Park	2751a17cf0	Update vLLM benchmarking guide (#4347 ) * update vllm-benchmark fix hlist overflow update standalone benchmarking options update list of models fix typo and model name unnecessary duplicate info update formatting update vllm benchmark guide - remove Llama 2 FP8 - add Jais 13B - update commands update docker pull tag update MAD available models remove extra mad models not relevant to vllm update PyTorch version add changelog add model names to .wordlist.txt * Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst Co-authored-by: Pratik Basyal <pratik.basyal@amd.com> * fix typo * update link * fix link text * change changelog to previous versions * fix typo * remove "for" --------- Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>	2025-02-05 17:18:35 -05:00
Peter Park	9b0ae86b1b	Fix ROCm Bandwidth Test license type	2025-02-05 16:40:31 -05:00
harkgill-amd	16f7cb4c04	Update issue workflow to trigger on edit (#4346)	2025-02-05 14:46:16 -05:00