Compare commits

...

8 Commits

Author SHA1 Message Date
Adel Johar
1499f74c22 Docs: Add Device Major/Minor Versions to gpu-arch-spec.rst 2025-02-13 14:24:00 +01:00
Peter Park
2751a17cf0 Update vLLM benchmarking guide (#4347)
* update vllm-benchmark

fix hlist overflow

update standalone benchmarking options

update list of models

fix typo and model name

unnecessary duplicate info

update formatting

update vllm benchmark guide

- remove Llama 2 FP8
- add Jais 13B
- update commands

update docker pull tag

update MAD available models

remove extra mad models not relevant to vllm

update PyTorch version

add changelog

add model names to .wordlist.txt

* Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst

Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>

* Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst

Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>

* Update docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst

Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>

* fix typo

* update link

* fix link text

* change changelog to previous versions

* fix typo

* remove "for"

---------

Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
2025-02-05 17:18:35 -05:00
Peter Park
9b0ae86b1b Fix ROCm Bandwidth Test license type
2025-02-05 16:40:31 -05:00
harkgill-amd
16f7cb4c04 Update issue workflow to trigger on edit (#4346) 2025-02-05 14:46:16 -05:00
harkgill-amd
de007b6faf Update issue_retrieval.yml (#4342) 2025-02-05 13:21:44 -05:00
Daniel Su
aa1333269c Ex CI: add ROCM_PATH to rocBLAS (#4343) 2025-02-05 13:20:36 -05:00
Pratik Basyal
acb8f60304 Radeon support note updated in 6.3.2 (#4339) 2025-02-04 17:44:24 -05:00
Istvan Kiss
faa67965dd Precision support page update 2025-02-04 16:17:31 +01:00
9 changed files with 362 additions and 157 deletions

View File

@@ -59,14 +59,10 @@ jobs:
value: $(Build.BinariesDirectory)/rocm
- name: TENSILE_ROCM_ASSEMBLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
- name: CMAKE_CXX_COMPILER
value: $(Agent.BuildDirectory)/rocm/bin/hipcc
- name: TENSILE_ROCM_OFFLOAD_BUNDLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang-offload-bundler
- name: TENSILE_ROCM_PATH
value: $(Agent.BuildDirectory)/rocm/bin/hipcc
- name: PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin:$(Agent.BuildDirectory)/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
- name: ROCM_PATH
value: $(Agent.BuildDirectory)/rocm
- name: DAY_STRING
value: $[format('{0:ddMMyyyy}', pipeline.startTime)]
pool: ${{ variables.ULTRA_BUILD_POOL }}
@@ -154,9 +150,8 @@ jobs:
extraEnvVars:
- HIP_ROCCLR_HOME:::/home/user/workspace/rocm
- TENSILE_ROCM_ASSEMBLER_PATH:::/home/user/workspace/rocm/llvm/bin/amdclang
- CMAKE_CXX_COMPILER:::/home/user/workspace/rocm/bin/hipcc
- TENSILE_ROCM_OFFLOAD_BUNDLER_PATH:::/home/user/workspace/rocm/llvm/bin/clang-offload-bundler
- TENSILE_ROCM_PATH:::/home/user/workspace/rocm/bin/hipcc
- ROCM_PATH:::/home/user/workspace/rocm
extraCopyDirectories:
- deps

View File

@@ -64,10 +64,10 @@ jobs:
value: $(Build.BinariesDirectory)/rocm
- name: TENSILE_ROCM_ASSEMBLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang
- name: CMAKE_CXX_COMPILER
value: $(Agent.BuildDirectory)/rocm/bin/hipcc
- name: TENSILE_ROCM_OFFLOAD_BUNDLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang-offload-bundler
- name: ROCM_PATH
value: $(Agent.BuildDirectory)/rocm
pool: ${{ variables.MEDIUM_BUILD_POOL }}
workspace:
clean: all
@@ -96,8 +96,8 @@ jobs:
-DCMAKE_TOOLCHAIN_FILE=toolchain-linux.cmake
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm/llvm;$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/bin/amdclang
-DGPU_TARGETS=$(JOB_GPU_TARGET)
-DTensile_CODE_OBJECT_VERSION=default
-DTensile_LOGIC=asm_full
@@ -125,8 +125,8 @@ jobs:
extraEnvVars:
- HIP_ROCCLR_HOME:::/home/user/workspace/rocm
- TENSILE_ROCM_ASSEMBLER_PATH:::/home/user/workspace/rocm/llvm/bin/clang
- CMAKE_CXX_COMPILER:::/home/user/workspace/rocm/bin/hipcc
- TENSILE_ROCM_OFFLOAD_BUNDLER_PATH:::/home/user/workspace/rocm/llvm/bin/clang-offload-bundler
- ROCM_PATH:::/home/user/workspace/rocm
- job: rocBLAS_testing
dependsOn: rocBLAS

View File

@@ -2,7 +2,7 @@ name: Issue retrieval
on:
issues:
types: [opened]
types: [opened, edited]
jobs:
auto-retrieve:
@@ -15,7 +15,7 @@ jobs:
app_id: ${{ secrets.ACTION_APP_ID }}
private_key: ${{ secrets.ACTION_PEM }}
- name: 'Retrieve Issue'
uses: abhimeda/rocm_issue_management@main
uses: harkgill-amd/rocm_issue_management@main
with:
authentication-token: ${{ steps.generate_token.outputs.token }}
github-organization: 'ROCm'

View File

@@ -74,6 +74,7 @@ Conda
ConnectX
CuPy
Dashboarding
DBRX
DDR
DF
DGEMM
@@ -92,6 +93,7 @@ DataFrame
DataLoader
DataParallel
Debian
DeepSeek
DeepSpeed
Dependabot
Deprecations
@@ -129,6 +131,7 @@ GDS
GEMM
GEMMs
GFortran
Gemma
GiB
GIM
GL

View File

@@ -29,8 +29,7 @@ The release notes provide a summary of notable changes since the previous ROCm r
- [ROCm upcoming changes](#rocm-upcoming-changes)
```{note}
If you're using Radeon™ PRO or Radeon GPUs in a workstation setting with a
display connected, continue to use ROCm 6.2.3. See the [Use ROCm on Radeon GPUs](https://rocm.docs.amd.com/projects/radeon/en/latest/index.html)
If you're using Radeon™ PRO or Radeon GPUs in a workstation setting with a display connected, see the [Use ROCm on Radeon GPUs](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility/native_linux/native_linux_compatibility.html)
documentation to verify compatibility and system requirements.
```
## Release highlights

View File

@@ -62,7 +62,7 @@ additional licenses. Please review individual repositories for more information.
| [rocJPEG](https://github.com/ROCm/rocJPEG/) | [MIT](https://github.com/ROCm/rocJPEG/blob/develop/LICENSE) |
| [ROCK-Kernel-Driver](https://github.com/ROCm/ROCK-Kernel-Driver/) | [GPL 2.0 WITH Linux-syscall-note](https://github.com/ROCm/ROCK-Kernel-Driver/blob/master/COPYING) |
| [rocminfo](https://github.com/ROCm/rocminfo/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocminfo/blob/amd-staging/License.txt) |
| [ROCm Bandwidth Test](https://github.com/ROCm/rocm_bandwidth_test/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocm_bandwidth_test/blob/master/LICENSE.txt) |
| [ROCm Bandwidth Test](https://github.com/ROCm/rocm_bandwidth_test/) | [MIT](https://github.com/ROCm/rocm_bandwidth_test/blob/master/LICENSE.txt) |
| [ROCm CMake](https://github.com/ROCm/rocm-cmake/) | [MIT](https://github.com/ROCm/rocm-cmake/blob/develop/LICENSE) |
| [ROCm Communication Collectives Library (RCCL)](https://github.com/ROCm/rccl/) | [Custom](https://github.com/ROCm/rccl/blob/develop/LICENSE.txt) |
| [ROCm-Core](https://github.com/ROCm/rocm-core) | [MIT](https://github.com/ROCm/rocm-core/blob/master/copyright) |

View File

@@ -10,49 +10,22 @@ LLM inference performance validation on AMD Instinct MI300X
.. _vllm-benchmark-unified-docker:
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment designed for validating large language model
(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This
ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the
MI300X accelerator and includes the following components:
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for the MI300X
accelerator and includes the following components:
* `ROCm 6.2.1 <https://github.com/ROCm/ROCm>`_
* `ROCm 6.3.1 <https://github.com/ROCm/ROCm>`_
* `vLLM 0.6.4 <https://docs.vllm.ai/en/latest>`_
* `vLLM 0.6.6 <https://docs.vllm.ai/en/latest>`_
* `PyTorch 2.5.0 <https://github.com/pytorch/pytorch>`_
* Tuning files (in CSV format)
* `PyTorch 2.7.0 (2.7.0a0+git3a58512) <https://github.com/pytorch/pytorch>`_
With this Docker image, you can quickly validate the expected inference
performance numbers on the MI300X accelerator. This topic also provides tips on
optimizing performance with popular AI models.
.. hlist::
:columns: 6
* Llama 3.1 8B
* Llama 3.1 70B
* Llama 3.1 405B
* Llama 2 7B
* Llama 2 70B
* Mixtral 8x7B
* Mixtral 8x22B
* Mixtral 7B
* Qwen2 7B
* Qwen2 72B
* JAIS 13B
* JAIS 30B
performance numbers for the MI300X accelerator. This topic also provides tips on
optimizing performance with popular AI models. For more information, see the lists of
:ref:`available models for MAD-integrated benchmarking <vllm-benchmark-mad-models>`
and :ref:`standalone benchmarking <vllm-benchmark-standalone-options>`.
.. _vllm-benchmark-vllm:
@@ -91,9 +64,9 @@ MI300X accelerator with the prebuilt vLLM Docker image.
.. code-block:: shell
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
Once setup is complete, you can choose between two options to reproduce the
Once the setup is complete, choose between two options to reproduce the
benchmark results:
- :ref:`MAD-integrated benchmarking <vllm-benchmark-mad>`
@@ -130,45 +103,89 @@ Although the following models are preconfigured to collect latency and
throughput performance data, you can also change the benchmarking parameters.
Refer to the :ref:`Standalone benchmarking <vllm-benchmark-standalone>` section.
.. _vllm-benchmark-mad-models:
Available models
----------------
.. hlist::
:columns: 3
.. list-table::
:header-rows: 1
:widths: 2, 3
* ``pyt_vllm_llama-3.1-8b``
* - Model name
- Tag
* ``pyt_vllm_llama-3.1-70b``
* - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
- ``pyt_vllm_llama-3.1-8b``
* ``pyt_vllm_llama-3.1-405b``
* - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
- ``pyt_vllm_llama-3.1-70b``
* ``pyt_vllm_llama-2-7b``
* - `Llama 3.1 405B <https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct>`_
- ``pyt_vllm_llama-3.1-405b``
* ``pyt_vllm_llama-2-70b``
* - `Llama 3.2 11B Vision <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`_
- ``pyt_vllm_llama-3.2-11b-vision-instruct``
* ``pyt_vllm_mixtral-8x7b``
* - `Llama 2 7B <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`_
- ``pyt_vllm_llama-2-7b``
* ``pyt_vllm_mixtral-8x22b``
* - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`_
- ``pyt_vllm_llama-2-70b``
* ``pyt_vllm_mistral-7b``
* - `Mixtral MoE 8x7B <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`_
- ``pyt_vllm_mixtral-8x7b``
* ``pyt_vllm_qwen2-7b``
* - `Mixtral MoE 8x22B <https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1>`_
- ``pyt_vllm_mixtral-8x22b``
* ``pyt_vllm_qwen2-72b``
* - `Mistral 7B <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`_
- ``pyt_vllm_mistral-7b``
* ``pyt_vllm_jais-13b``
* - `Qwen2 7B <https://huggingface.co/Qwen/Qwen2-7B-Instruct>`_
- ``pyt_vllm_qwen2-7b``
* ``pyt_vllm_jais-30b``
* - `Qwen2 72B <https://huggingface.co/Qwen/Qwen2-72B-Instruct>`_
- ``pyt_vllm_qwen2-72b``
* ``pyt_vllm_llama-3.1-8b_fp8``
* - `JAIS 13B <https://huggingface.co/core42/jais-13b-chat>`_
- ``pyt_vllm_jais-13b``
* ``pyt_vllm_llama-3.1-70b_fp8``
* - `JAIS 30B <https://huggingface.co/core42/jais-30b-chat-v3>`_
- ``pyt_vllm_jais-30b``
* ``pyt_vllm_llama-3.1-405b_fp8``
* - `DBRX Instruct <https://huggingface.co/databricks/dbrx-instruct>`_
- ``pyt_vllm_dbrx-instruct``
* ``pyt_vllm_mixtral-8x7b_fp8``
* - `Gemma 2 27B <https://huggingface.co/google/gemma-2-27b>`_
- ``pyt_vllm_gemma-2-27b``
* ``pyt_vllm_mixtral-8x22b_fp8``
* - `C4AI Command R+ 08-2024 <https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024>`_
- ``pyt_vllm_c4ai-command-r-plus-08-2024``
* - `DeepSeek MoE 16B <https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat>`_
- ``pyt_vllm_deepseek-moe-16b-chat``
* - `Llama 3.1 70B FP8 <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`_
- ``pyt_vllm_llama-3.1-70b_fp8``
* - `Llama 3.1 405B FP8 <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`_
- ``pyt_vllm_llama-3.1-405b_fp8``
* - `Mixtral MoE 8x7B FP8 <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`_
- ``pyt_vllm_mixtral-8x7b_fp8``
* - `Mixtral MoE 8x22B FP8 <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`_
- ``pyt_vllm_mixtral-8x22b_fp8``
* - `Mistral 7B FP8 <https://huggingface.co/amd/Mistral-7B-v0.1-FP8-KV>`_
- ``pyt_vllm_mistral-7b_fp8``
* - `DBRX Instruct FP8 <https://huggingface.co/amd/dbrx-instruct-FP8-KV>`_
- ``pyt_vllm_dbrx_fp8``
* - `C4AI Command R+ 08-2024 FP8 <https://huggingface.co/amd/c4ai-command-r-plus-FP8-KV>`_
- ``pyt_vllm_command-r-plus_fp8``
.. _vllm-benchmark-standalone:
@@ -181,8 +198,8 @@ snippet.
.. code-block::
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.4 rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.6 rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ``~/MAD/scripts/vllm``.
@@ -224,8 +241,8 @@ See the :ref:`examples <vllm-benchmark-run-benchmark>` for more information.
.. _vllm-benchmark-standalone-options:
Options
-------
Options and available models
----------------------------
.. list-table::
:header-rows: 1
@@ -248,72 +265,100 @@ Options
- Measure both throughput and latency
* - ``$model_repo``
- ``meta-llama/Meta-Llama-3.1-8B-Instruct``
- Llama 3.1 8B
- ``meta-llama/Llama-3.1-8B-Instruct``
- `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
* - (``float16``)
- ``meta-llama/Meta-Llama-3.1-70B-Instruct``
- Llama 3.1 70B
- ``meta-llama/Llama-3.1-70B-Instruct``
- `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
* -
- ``meta-llama/Meta-Llama-3.1-405B-Instruct``
- Llama 3.1 405B
- ``meta-llama/Llama-3.1-405B-Instruct``
- `Llama 3.1 405B <https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct>`_
* -
- ``meta-llama/Llama-3.2-11B-Vision-Instruct``
- `Llama 3.2 11B Vision <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`_
* -
- ``meta-llama/Llama-2-7b-chat-hf``
- Llama 2 7B
- `Llama 2 7B <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`_
* -
- ``meta-llama/Llama-2-70b-chat-hf``
- Llama 2 70B
- `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`_
* -
- ``mistralai/Mixtral-8x7B-Instruct-v0.1``
- Mixtral 8x7B
- `Mixtral MoE 8x7B <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`_
* -
- ``mistralai/Mixtral-8x22B-Instruct-v0.1``
- Mixtral 8x22B
- `Mixtral MoE 8x22B <https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1>`_
* -
- ``mistralai/Mistral-7B-Instruct-v0.3``
- Mixtral 7B
- `Mistral 7B <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`_
* -
- ``Qwen/Qwen2-7B-Instruct``
- Qwen2 7B
- `Qwen2 7B <https://huggingface.co/Qwen/Qwen2-7B-Instruct>`_
* -
- ``Qwen/Qwen2-72B-Instruct``
- Qwen2 72B
- `Qwen2 72B <https://huggingface.co/Qwen/Qwen2-72B-Instruct>`_
* -
- ``core42/jais-13b-chat``
- JAIS 13B
- `JAIS 13B <https://huggingface.co/core42/jais-13b-chat>`_
* -
- ``core42/jais-30b-chat-v3``
- JAIS 30B
* - ``$model_repo``
- ``amd/Meta-Llama-3.1-8B-Instruct-FP8-KV``
- Llama 3.1 8B
* - (``float8``)
- ``amd/Meta-Llama-3.1-70B-Instruct-FP8-KV``
- Llama 3.1 70B
- `JAIS 30B <https://huggingface.co/core42/jais-30b-chat-v3>`_
* -
- ``amd/Meta-Llama-3.1-405B-Instruct-FP8-KV``
- Llama 3.1 405B
- ``databricks/dbrx-instruct``
- `DBRX Instruct <https://huggingface.co/databricks/dbrx-instruct>`_
* -
- ``google/gemma-2-27b``
- `Gemma 2 27B <https://huggingface.co/google/gemma-2-27b>`_
* -
- ``CohereForAI/c4ai-command-r-plus-08-2024``
- `C4AI Command R+ 08-2024 <https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024>`_
* -
- ``deepseek-ai/deepseek-moe-16b-chat``
- `DeepSeek MoE 16B <https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat>`_
* - ``$model_repo``
- ``amd/Llama-3.1-70B-Instruct-FP8-KV``
- `Llama 3.1 70B FP8 <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`_
* - (``float8``)
- ``amd/Llama-3.1-405B-Instruct-FP8-KV``
- `Llama 3.1 405B FP8 <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`_
* -
- ``amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV``
- Mixtral 8x7B
- `Mixtral MoE 8x7B FP8 <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`_
* -
- ``amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV``
- Mixtral 8x22B
- `Mixtral MoE 8x22B FP8 <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`_
* -
- ``amd/Mistral-7B-v0.1-FP8-KV``
- `Mistral 7B FP8 <https://huggingface.co/amd/Mistral-7B-v0.1-FP8-KV>`_
* -
- ``amd/dbrx-instruct-FP8-KV``
- `DBRX Instruct FP8 <https://huggingface.co/amd/dbrx-instruct-FP8-KV>`_
* -
- ``amd/c4ai-command-r-plus-FP8-KV``
- `C4AI Command R+ 08-2024 FP8 <https://huggingface.co/amd/c4ai-command-r-plus-FP8-KV>`_
* - ``$num_gpu``
- 1 or 8
@@ -335,34 +380,34 @@ options and their descriptions.
Example 1: latency benchmark
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use this command to benchmark the latency of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types.
Use this command to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types.
.. code-block::
./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
./vllm_benchmark_report.sh -s latency -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8
Find the latency reports at:
- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv``
- ``./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv``
- ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv``
- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv``
Example 2: throughput benchmark
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use this command to benchmark the throughput of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types.
Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types.
.. code-block:: shell
./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16
./vllm_benchmark_report.sh -s throughput -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8
./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8
Find the throughput reports at:
- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv``
- ``./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv``
- ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv``
- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv``
.. raw:: html
@@ -394,7 +439,7 @@ Further reading
MI300X accelerators, see :doc:`../../system-optimization/mi300x`.
- To learn how to run LLM models from Hugging Face or your own model, see
:doc:`Using ROCm for AI <../index>`.
:doc:`Running models from Hugging Face <hugging-face-models>`.
- To learn how to optimize inference on LLMs, see
:doc:`Inference optimization <../inference-optimization/index>`.
@@ -402,6 +447,32 @@ Further reading
- To learn how to fine-tune LLMs, see
:doc:`Fine-tuning LLMs <../fine-tuning/index>`.
- To compare with the previous version of the ROCm vLLM Docker image for performance validation, refer to
`LLM inference performance validation on AMD Instinct MI300X (ROCm 6.2.0) <https://rocm.docs.amd.com/en/docs-6.2.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_.
Previous versions
=================
This table lists previous versions of the ROCm vLLM Docker image for inference
performance validation. For detailed information about available models for
benchmarking, see the version-specific documentation.
.. list-table::
:header-rows: 1
:stub-columns: 1
* - ROCm version
- vLLM version
- PyTorch version
- Resources
* - 6.2.1
- 0.6.4
- 2.5.0
-
* `Documentation <https://rocm.docs.amd.com/en/docs-6.3.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_
* `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4/images/sha256-ccbb74cc9e7adecb8f7bdab9555f7ac6fc73adb580836c2a35ca96ff471890d8>`_
* - 6.2.0
- 0.4.3
- 2.4.0
-
* `Documentation <https://rocm.docs.amd.com/en/docs-6.2.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_
* `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50/images/sha256-9e4dd4788a794c3d346d7d0ba452ae5e92d39b8dfac438b2af8efdc7f15d22c0>`_

View File

@@ -21,6 +21,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Model
- Architecture
- LLVM target name
- Device Major version
- Device Minor version
- VRAM (GiB)
- Compute Units
- Wavefront Size
@@ -36,6 +38,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI325X
- CDNA3
- gfx942
- 9
- 4
- 256
- 304 (38 per XCD)
- 64
@@ -51,6 +55,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI300X
- CDNA3
- gfx942
- 9
- 4
- 192
- 304 (38 per XCD)
- 64
@@ -66,6 +72,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI300A
- CDNA3
- gfx942
- 9
- 4
- 128
- 228 (38 per XCD)
- 64
@@ -81,6 +89,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI250X
- CDNA2
- gfx90a
- 9
- 0
- 128
- 220 (110 per GCD)
- 64
@@ -96,6 +106,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI250
- CDNA2
- gfx90a
- 9
- 0
- 128
- 208 (104 per GCD)
- 64
@@ -111,6 +123,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI210
- CDNA2
- gfx90a
- 9
- 0
- 64
- 104
- 64
@@ -126,6 +140,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI100
- CDNA
- gfx908
- 9
- 0
- 32
- 120
- 64
@@ -141,6 +157,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI60
- GCN5.1
- gfx906
- 9
- 0
- 32
- 64
- 64
@@ -156,6 +174,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI50 (32GB)
- GCN5.1
- gfx906
- 9
- 0
- 32
- 60
- 64
@@ -171,6 +191,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI50 (16GB)
- GCN5.1
- gfx906
- 9
- 0
- 16
- 60
- 64
@@ -186,6 +208,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI25
- GCN5.0
- gfx900
- 9
- 0
- 16
- 64
- 64
@@ -201,6 +225,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI8
- GCN3.0
- gfx803
- 8
- 0
- 4
- 64
- 64
@@ -216,6 +242,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- MI6
- GCN4.0
- gfx803
- 8
- 0
- 16
- 36
- 64
@@ -238,6 +266,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Model
- Architecture
- LLVM target name
- Device Major version
- Device Minor version
- VRAM (GiB)
- Compute Units
- Wavefront Size
@@ -254,6 +284,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO V710
- RDNA3
- gfx1101
- 11
- 0
- 28
- 54
- 32
@@ -270,6 +302,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO W7900 Dual Slot
- RDNA3
- gfx1100
- 11
- 0
- 48
- 96
- 32
@@ -286,6 +320,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO W7900
- RDNA3
- gfx1100
- 11
- 0
- 48
- 96
- 32
@@ -302,6 +338,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO W7800
- RDNA3
- gfx1100
- 11
- 0
- 32
- 70
- 32
@@ -318,6 +356,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO W7700
- RDNA3
- gfx1101
- 11
- 0
- 16
- 48
- 32
@@ -334,6 +374,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO W6800
- RDNA2
- gfx1030
- 10
- 3
- 32
- 60
- 32
@@ -350,6 +392,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO W6600
- RDNA2
- gfx1032
- 10
- 3
- 8
- 28
- 32
@@ -366,6 +410,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon PRO V620
- RDNA2
- gfx1030
- 10
- 3
- 32
- 72
- 32
@@ -382,6 +428,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon Pro W5500
- RDNA
- gfx1012
- 10
- 1
- 8
- 22
- 32
@@ -398,6 +446,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon Pro VII
- GCN5.1
- gfx906
- 9
- 0
- 16
- 60
- 64
@@ -421,6 +471,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Model
- Architecture
- LLVM target name
- Device Major version
- Device Minor version
- VRAM (GiB)
- Compute Units
- Wavefront Size
@@ -437,6 +489,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 7900 XTX
- RDNA3
- gfx1100
- 11
- 0
- 24
- 96
- 32
@@ -453,6 +507,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 7900 XT
- RDNA3
- gfx1100
- 11
- 0
- 20
- 84
- 32
@@ -469,6 +525,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 7900 GRE
- RDNA3
- gfx1100
- 11
- 0
- 16
- 80
- 32
@@ -485,6 +543,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 7800 XT
- RDNA3
- gfx1101
- 11
- 0
- 16
- 60
- 32
@@ -501,6 +561,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 7700 XT
- RDNA3
- gfx1101
- 11
- 0
- 12
- 54
- 32
@@ -517,6 +579,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 7600
- RDNA3
- gfx1102
- 11
- 0
- 8
- 32
- 32
@@ -533,6 +597,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6950 XT
- RDNA2
- gfx1030
- 10
- 3
- 16
- 80
- 32
@@ -549,6 +615,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6900 XT
- RDNA2
- gfx1030
- 10
- 3
- 16
- 80
- 32
@@ -565,6 +633,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6800 XT
- RDNA2
- gfx1030
- 10
- 3
- 16
- 72
- 32
@@ -581,6 +651,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6800
- RDNA2
- gfx1030
- 10
- 3
- 16
- 60
- 32
@@ -597,6 +669,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6750 XT
- RDNA2
- gfx1031
- 10
- 3
- 12
- 40
- 32
@@ -613,6 +687,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6700 XT
- RDNA2
- gfx1031
- 10
- 3
- 12
- 40
- 32
@@ -630,6 +706,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- RDNA2
- gfx1031
- 10
- 3
- 10
- 36
- 32
- 128
@@ -645,6 +723,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6650 XT
- RDNA2
- gfx1032
- 10
- 3
- 8
- 32
- 32
@@ -661,6 +741,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6600 XT
- RDNA2
- gfx1032
- 10
- 3
- 8
- 32
- 32
@@ -677,6 +759,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon RX 6600
- RDNA2
- gfx1032
- 10
- 3
- 8
- 28
- 32
@@ -693,6 +777,8 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- Radeon VII
- GCN5.1
- gfx906
- 9
- 0
- 16
- 60
- 64
@@ -710,7 +796,7 @@ Glossary
========
For more information about the terms used, see the
:ref:`specific documents and guides <gpu-arch-documentation>`, or
:ref:`specific documents and guides <gpu-arch-documentation>`, or
:doc:`Understanding the HIP programming model<hip:understand/programming_model>`.
**LLVM target name**
@@ -718,6 +804,18 @@ For more information about the terms used, see the
Argument to pass to clang in ``--offload-arch`` to compile code for the given
architecture.
**Device major version**
Indicates the core instruction set of the GPU architecture. For example, a value
of 11 would correspond to Navi III (RDNA3).
**Device minor version**
Indicates a particular configuration, feature set, or variation within the group
represented by the device major version. For example, different models within
the same major version might have varying levels of support for certain features
or optimizations.
**VRAM**
Amount of memory available on the GPU.
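
The new **Device major version** and **Device minor version** entries map directly
onto fields the HIP runtime already reports. As a minimal sketch (assuming only the
standard HIP runtime API, where ``hipDeviceProp_t`` exposes ``major``, ``minor``,
and ``gcnArchName``), the table values can be cross-checked at runtime like this:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>

   int main() {
       int count = 0;
       if (hipGetDeviceCount(&count) != hipSuccess) return 1;
       for (int dev = 0; dev < count; ++dev) {
           hipDeviceProp_t prop;
           if (hipGetDeviceProperties(&prop, dev) != hipSuccess) continue;
           // prop.major / prop.minor are the device major/minor versions added
           // to the tables above; prop.gcnArchName is the LLVM target name.
           std::printf("device %d: %s, version %d.%d, arch %s\n",
                       dev, prop.name, prop.major, prop.minor, prop.gcnArchName);
       }
       return 0;
   }

On a gfx942 device such as the MI300X, this should report version 9.4, matching
the table above.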

View File

@@ -1,19 +1,23 @@
.. meta::
:description: Supported data types in ROCm
:keywords: int8, float8, float8 (E4M3), float8 (E5M2), bfloat8, float16, half, bfloat16, tensorfloat32, float,
float32, float64, double, AMD, ROCm, AMDGPU
:description: Supported data types of AMD GPUs and libraries in ROCm.
:keywords: precision, data types, HIP types, int8, float8, float8 (E4M3),
float8 (E5M2), bfloat8, float16, half, bfloat16, tensorfloat32,
float, float32, float64, double, AMD data types, HIP data types,
ROCm precision, ROCm data types
*************************************************************
Precision support
Data types and precision support
*************************************************************
Use the following sections to identify data types and HIP types ROCm™ supports.
This topic lists the supported data types of AMD GPUs and ROCm libraries.
Corresponding :doc:`HIP <hip:index>` data types are also noted.
Integral types
==========================================
The signed and unsigned integral types that are supported by ROCm are listed in the following table,
together with their corresponding HIP type and a short description.
The signed and unsigned integral types supported by ROCm are listed in
the following table, along with their corresponding HIP type and a short
description.
.. list-table::
@@ -46,8 +50,8 @@ together with their corresponding HIP type and a short description.
Floating-point types
==========================================
The floating-point types that are supported by ROCm are listed in the following table, together with
their corresponding HIP type and a short description.
The floating-point types supported by ROCm are listed in the following
table, along with their corresponding HIP type and a short description.
.. image:: ../data/about/compatibility/floating-point-data-types.png
:alt: Supported floating-point types
@@ -63,43 +67,62 @@ their corresponding HIP type and a short description.
*
- float8 (E4M3)
- ``-``
- An 8-bit floating-point number that mostly follows IEEE-754 conventions and **S1E4M3** bit layout, as described in `8-bit Numerical Formats for Deep Neural Networks <https://arxiv.org/abs/2206.02915>`_ , with expanded range and with no infinity or signed zero. NaN is represented as negative zero.
- An 8-bit floating-point number that mostly follows IEEE-754 conventions
and **S1E4M3** bit layout, as described in `8-bit Numerical Formats for Deep Neural Networks <https://arxiv.org/abs/2206.02915>`_ ,
with expanded range and no infinity or signed zero. NaN is
represented as negative zero.
*
- float8 (E5M2)
- ``-``
- An 8-bit floating-point number mostly following IEEE-754 conventions and **S1E5M2** bit layout, as described in `8-bit Numerical Formats for Deep Neural Networks <https://arxiv.org/abs/2206.02915>`_ , with expanded range and with no infinity or signed zero. NaN is represented as negative zero.
- An 8-bit floating-point number mostly following IEEE-754 conventions and
**S1E5M2** bit layout, as described in `8-bit Numerical Formats for Deep Neural Networks <https://arxiv.org/abs/2206.02915>`_ ,
with expanded range and no infinity or signed zero. NaN is
represented as negative zero.
*
- float16
- ``half``
- A 16-bit floating-point number that conforms to the IEEE 754-2008 half-precision storage format.
- A 16-bit floating-point number that conforms to the IEEE 754-2008
half-precision storage format.
*
- bfloat16
- ``bfloat16``
- A shortened 16-bit version of the IEEE 754 single-precision storage format.
- A shortened 16-bit version of the IEEE 754 single-precision storage
format.
*
- tensorfloat32
- ``-``
- A floating-point number that occupies 32 bits or less of storage, providing improved range compared to half (16-bit) format, at (potentially) greater throughput than single-precision (32-bit) formats.
- A floating-point number that occupies 32 bits or less of storage,
providing improved range compared to half (16-bit) format, at
(potentially) greater throughput than single-precision (32-bit) formats.
*
- float32
- ``float``
- A 32-bit floating-point number that conforms to the IEEE 754 single-precision storage format.
- A 32-bit floating-point number that conforms to the IEEE 754
single-precision storage format.
*
- float64
- ``double``
- A 64-bit floating-point number that conforms to the IEEE 754 double-precision storage format.
- A 64-bit floating-point number that conforms to the IEEE 754
double-precision storage format.
.. note::
* The float8 and tensorfloat32 types are internal types used in calculations in Matrix Cores and can be stored in any type of the same size.
* The encodings for FP8 (E5M2) and FP8 (E4M3) that are natively supported by MI300 differ from the FP8 (E5M2) and FP8 (E4M3) encodings used in H100 (`FP8 Formats for Deep Learning <https://arxiv.org/abs/2209.05433>`_).
* The float8 and tensorfloat32 types are internal types used in calculations
in Matrix Cores and can be stored in any type of the same size.
* The encodings for FP8 (E5M2) and FP8 (E4M3) that the
MI300 series natively supports differ from the FP8 (E5M2) and FP8 (E4M3)
encodings used in NVIDIA H100
(`FP8 Formats for Deep Learning <https://arxiv.org/abs/2209.05433>`_).
* In some AMD documents and articles, float8 (E5M2) is referred to as bfloat8.
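
As an illustrative sketch (not part of the page itself), the following HIP kernel
shows the usual handling pattern for the 16-bit types listed above: load
``float16`` values, widen to ``float`` for arithmetic, and narrow back for
storage. It assumes the standard ``__half`` type and conversion intrinsics from
``<hip/hip_fp16.h>``:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <hip/hip_fp16.h>

   __global__ void scale_half(const __half* in, __half* out, float scale, int n) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           float v = __half2float(in[i]);    // widen float16 -> float32
           out[i] = __float2half(v * scale); // narrow back for storage
       }
   }

The same structure applies to ``bfloat16`` using the conversions provided by
``<hip/hip_bfloat16.h>``.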
ROCm support icons
==========================================
In the following sections, we use icons to represent the level of support. These icons, described in the
following table, are also used on the library data type support pages.
In the following sections, icons represent the level of support. These
icons, described in the following table, are also used in the library data type
support pages.
.. list-table::
:header-rows: 1
@@ -121,14 +144,27 @@ following table, are also used on the library data type support pages.
.. note::
* Full support means that the type is supported natively or with hardware emulation.
* Native support means that the operations for that type are implemented in hardware. Types that are not natively supported are emulated with the available hardware. The performance of non-natively supported types can differ from the full instruction throughput rate. For example, 16-bit integer operations can be performed on the 32-bit integer ALUs at full rate; however, 64-bit integer operations might need several instructions on the 32-bit integer ALUs.
* Any type can be emulated by software, but this page does not cover such cases.
* Full support means that the type is supported natively or with hardware
emulation.
Hardware type support
* Native support means that the operations for that type are implemented in
hardware. Types that are not natively supported are emulated with the
available hardware. The performance of non-natively supported types can
differ from the full instruction throughput rate. For example, 16-bit
integer operations can be performed on the 32-bit integer ALUs at full rate;
however, 64-bit integer operations might need several instructions on the
32-bit integer ALUs.
* Any type can be emulated by software, but this page does not cover such
cases.
Hardware data type support
==========================================
AMD GPU hardware support for data types is listed in the following tables.
The following tables provide information about AMD Instinct accelerators' support
for various data types. The MI200 series GPUs, which include MI210, MI250, and
MI250X, are based on the CDNA2 architecture. The MI300 series GPUs, consisting
of MI300A, MI300X, and MI325X, are built on the CDNA3 architecture.
Compute units support
-------------------------------------------------------------------------------
@@ -375,21 +411,23 @@ The following table lists data type support for atomic operations.
.. note::
For cases that are not natively supported, you can emulate atomic operations using software.
Software-emulated atomic operations have high negative performance impact when they frequently
access the same memory address.
You can emulate atomic operations using software for cases that are not
natively supported. Software-emulated atomic operations have a high negative
performance impact when they frequently access the same memory address.
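
To make the emulation cost concrete, here is a minimal sketch of the standard
compare-and-swap retry loop used when an atomic operation has no native
instruction; the contention behavior it exhibits is exactly what the note above
warns about. It assumes HIP's ``atomicCAS`` on ``unsigned long long`` and the
``__double_as_longlong``/``__longlong_as_double`` bit-cast intrinsics:

.. code-block:: cpp

   __device__ double atomicMaxDouble(double* address, double val) {
       unsigned long long* addr =
           reinterpret_cast<unsigned long long*>(address);
       unsigned long long old = *addr, assumed;
       do {
           assumed = old;
           if (__longlong_as_double(assumed) >= val) break;  // nothing to do
           // Another thread may have updated *address since the read;
           // atomicCAS only succeeds if the value is still `assumed`.
           old = atomicCAS(addr, assumed, __double_as_longlong(val));
       } while (assumed != old);
       return __longlong_as_double(old);
   }

Each failed CAS forces a re-read and retry, so frequent updates to the same
address serialize and degrade performance.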
Data Type support in ROCm Libraries
Data type support in ROCm libraries
==========================================
ROCm library support for int8, float8 (E4M3), float8 (E5M2), int16, float16, bfloat16, int32,
tensorfloat32, float32, int64, and float64 is listed in the following tables.
ROCm library support for int8, float8 (E4M3), float8 (E5M2), int16, float16,
bfloat16, int32, tensorfloat32, float32, int64, and float64 is listed in the
following tables.
Libraries input/output type support
-------------------------------------------------------------------------------
The following tables list ROCm library support for specific input and output data types. For a detailed
description, refer to the corresponding library data type support page.
The following tables list ROCm library support for specific input and output
data types. Refer to the corresponding library data type support page for a
detailed description.
.. tab-set::
@@ -516,8 +554,9 @@ description, refer to the corresponding library data type support page.
Libraries internal calculations type support
-------------------------------------------------------------------------------
The following tables list ROCm library support for specific internal data types. For a detailed
description, refer to the corresponding library data type support page.
The following tables list ROCm library support for specific internal data types.
Refer to the corresponding library data type support page for a detailed
description.
.. tab-set::