Explicitly use gfortran

Is the omp Fortran backend coming from OpenBLAS?
Add aomp
2026-01-17 02:28:04 -05:00 · 2026-01-15 17:19:09 -07:00 · 2026-01-15 17:06:17 -07:00 · 2026-01-15 16:46:18 -07:00 · 2026-01-15 16:32:30 -07:00 · 2026-01-15 16:20:39 -07:00
16 changed files with 912 additions and 585 deletions
--- a/.azuredevops/components/HIP.yml
+++ b/.azuredevops/components/HIP.yml
@@ -34,6 +34,7 @@ parameters:
  default:
    - cmake
    - libnuma-dev
+    - libsimde-dev
    - mesa-common-dev
    - ninja-build
    - ocl-icd-libopencl1
--- a/.azuredevops/components/hipSPARSE.yml
+++ b/.azuredevops/components/hipSPARSE.yml
@@ -32,7 +32,6 @@ parameters:
 - name: aptPackages
  type: object
  default:
-    - cmake
    - gfortran
    - git
    - libboost-program-options-dev
@@ -42,6 +41,7 @@ parameters:
 - name: rocmDependencies
  type: object
  default:
+    - aomp
    - clr
    - llvm-project
    - rocminfo
@@ -51,6 +51,7 @@ parameters:
 - name: rocmTestDependencies
  type: object
  default:
+    - aomp
    - clr
    - llvm-project
    - hipBLAS-common
@@ -103,6 +104,7 @@ jobs:
      parameters:
        aptPackages: ${{ parameters.aptPackages }}
        packageManager: ${{ job.packageManager }}
+    - template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-custom.yml
    - template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
    - template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
      parameters:
@@ -128,6 +130,7 @@ jobs:
          -DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/vendor
          -DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
          -DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/bin/amdclang
+          -DCMAKE_Fortran_COMPILER=gfortran
          -DCMAKE_BUILD_TYPE=Release
          -DBUILD_CLIENTS_TESTS=ON
          -DBUILD_CLIENTS_SAMPLES=OFF
--- a/docs/compatibility/compatibility-matrix-historical-6.0.csv
+++ b/docs/compatibility/compatibility-matrix-historical-6.0.csv
@@ -37,7 +37,7 @@ ROCm Version,7.1.1,7.1.0,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6
      :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
      :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_,N/A,N/A,N/A,2.4.0,2.4.0,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
      :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
-      :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
+      :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,2.51.1,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
      :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_,N/A,N/A,N/A,b6652,b6356,b6356,b6356,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
      :doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
      `ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.23.1,1.22.0,1.22.0,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
--- a/docs/compatibility/compatibility-matrix.rst
+++ b/docs/compatibility/compatibility-matrix.rst
@@ -157,8 +157,8 @@ compatibility and system requirements.

 .. [#os-compatibility] Some operating systems are supported on limited GPUs. For detailed information, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
 .. [#gpu-compatibility] Some GPUs have limited operating system support. For detailed information, see the latest :ref:`supported_GPUs`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-gpus>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-gpus>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-gpus>`__.
-.. [#dgl_compat] DGL is supported only on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
-.. [#llama-cpp_compat] llama.cpp is supported only on ROCm 7.0.0 and ROCm 6.4.x.
+.. [#dgl_compat] DGL is only supported on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
+.. [#llama-cpp_compat] llama.cpp is only supported on ROCm 7.0.0 and ROCm 6.4.x.
 .. [#mi325x_KVM] For AMD Instinct MI325X KVM SR-IOV users, do not use AMD GPU Driver (amdgpu) 30.20.0.
 .. [#driver_patch] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
 .. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
@@ -204,13 +204,13 @@ Expand for full historical view of:
   .. [#os-compatibility-past-60] Some operating systems are supported on limited GPUs. For detailed information, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
   .. [#gpu-compatibility-past-60] Some GPUs have limited operating system support. For detailed information, see the latest :ref:`supported_GPUs`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-gpus>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-gpus>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-gpus>`__.
   .. [#tf-mi350-past-60] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead.
-   .. [#verl_compat-past-60] verl is supported only on ROCm 7.0.0 and 6.2.0.
-   .. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is supported only on ROCm 6.3.0.
-   .. [#dgl_compat-past-60] DGL is supported only on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
-   .. [#megablocks_compat-past-60] Megablocks is supported only on ROCm 6.3.0.
-   .. [#ray_compat-past-60] Ray is supported only on ROCm 6.4.1.
-   .. [#llama-cpp_compat-past-60] llama.cpp is supported only on ROCm 7.0.0 and 6.4.x.
-   .. [#flashinfer_compat-past-60] FlashInfer is supported only on ROCm 6.4.1.
+   .. [#verl_compat-past-60] verl is only supported on ROCm 7.0.0 and 6.2.0.
+   .. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is only supported on ROCm 6.3.0.
+   .. [#dgl_compat-past-60] DGL is only supported on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
+   .. [#megablocks_compat-past-60] Megablocks is only supported on ROCm 6.3.0.
+   .. [#ray_compat-past-60] Ray is only supported on ROCm 7.0.0 and 6.4.1.
+   .. [#llama-cpp_compat-past-60] llama.cpp is only supported on ROCm 7.0.0 and 6.4.x.
+   .. [#flashinfer_compat-past-60] FlashInfer is only supported on ROCm 6.4.1.
   .. [#mi325x_KVM-past-60] For AMD Instinct MI325X KVM SR-IOV users, do not use AMD GPU Driver (amdgpu) 30.20.0.
   .. [#driver_patch-past-60] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
   .. [#kfd_support-past-60] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
--- a/docs/compatibility/ml-compatibility/dgl-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/dgl-compatibility.rst
@@ -36,63 +36,9 @@ Support overview
  - You can also consult the upstream `Installation guide <https://www.dgl.ai/pages/start.html>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
-
-DGL is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__, 
-`ROCm 6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__, and `ROCm 6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
-
-Supported devices
--------------------------------------------------------------------------------
-
-**Officially Supported**: AMD Instinct™ MI300X, MI250X
-
-.. _dgl-recommendations:
-
-Use cases and recommendations
-================================================================================
-
-DGL can be used for Graph Learning, and building popular graph models like  
-GAT, GCN, and GraphSage. Using these models, a variety of use cases are supported:
-
-- Recommender systems
-- Network Optimization and Analysis
-- 1D (Temporal) and 2D (Image) Classification
-- Drug Discovery
-
-For use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
-where you can search for DGL examples and best practices to optimize your workloads on AMD GPUs.
-
-* Although multiple use cases of DGL have been tested and verified, a few have been  
-  outlined in the `DGL in the Real World: Running GNNs on Real Use Cases 
-  <https://rocm.blogs.amd.com/artificial-intelligence/dgl_blog2/README.html>`__ blog 
-  post, which walks through four real-world graph neural network (GNN) workloads 
-  implemented with the Deep Graph Library on ROCm. It covers tasks ranging from 
-  heterogeneous e-commerce graphs and multiplex networks (GATNE) to molecular graph 
-  regression (GNN-FiLM) and EEG-based neurological diagnosis (EEG-GCNN). For each use 
-  case, the authors detail: the dataset and task, how DGL is used, and their experience 
-  porting to ROCm. It is shown that DGL codebases often run without modification, with 
-  seamless integration of graph operations, message passing, sampling, and convolution. 
-
-* The `Graph Neural Networks (GNNs) at Scale: DGL with ROCm on AMD Hardware 
-  <https://rocm.blogs.amd.com/artificial-intelligence/why-graph-neural/README.html>`__ 
-  blog post introduces the Deep Graph Library (DGL) and its enablement on the AMD ROCm platform, 
-  bringing high-performance graph neural network (GNN) training to AMD GPUs. DGL bridges 
-  the gap between dense tensor frameworks and the irregular nature of graph data through a 
-  graph-first, message-passing abstraction. Its design ensures scalability, flexibility, and 
-  interoperability across frameworks like PyTorch and TensorFlow. AMD’s ROCm integration 
-  enables DGL to run efficiently on HIP-based GPUs, supported by prebuilt Docker containers 
-  and open-source repositories. This marks a major step in AMD's mission to advance open, 
-  scalable AI ecosystems beyond traditional architectures.
-
-You can pre-process datasets and begin training on AMD GPUs through:
-
-* Single-GPU training/inference
-* Multi-GPU training
-
 .. _dgl-docker-compat:

-Docker image compatibility
+Compatibility matrix
 ================================================================================

 .. |docker-icon| raw:: html
@@ -114,6 +60,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - PyTorch
      - Ubuntu
      - Python
+      - GPU

    * - .. raw:: html

@@ -124,6 +71,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.8.0 <https://github.com/pytorch/pytorch/releases/tag/v2.8.0>`__
      - 24.04
      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
+      - MI300X, MI250X

    * - .. raw:: html

@@ -134,6 +82,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.6.0 <https://github.com/pytorch/pytorch/releases/tag/v2.6.0>`__
      - 24.04
      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
+      - MI300X, MI250X

    * - .. raw:: html

@@ -144,6 +93,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.7.1 <https://github.com/pytorch/pytorch/releases/tag/v2.7.1>`__
      - 22.04
      - `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
+      - MI300X, MI250X

    * - .. raw:: html

@@ -154,6 +104,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.6.0 <https://github.com/pytorch/pytorch/releases/tag/v2.6.0>`__
      - 24.04
      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
+      - MI300X, MI250X

    * - .. raw:: html

@@ -164,6 +115,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.6.0 <https://github.com/pytorch/pytorch/releases/tag/v2.6.0>`__
      - 24.04
      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
+      - MI300X, MI250X

    * - .. raw:: html

@@ -174,7 +126,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.4.1 <https://github.com/pytorch/pytorch/releases/tag/v2.4.1>`__
      - 24.04
      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
-
+      - MI300X, MI250X

    * - .. raw:: html

@@ -185,7 +137,7 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.4.1 <https://github.com/pytorch/pytorch/releases/tag/v2.4.1>`__
      - 22.04
      - `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
-
+      - MI300X, MI250X

    * - .. raw:: html

@@ -196,7 +148,10 @@ Click the |docker-icon| to view the image on Docker Hub.
      - `2.3.0 <https://github.com/pytorch/pytorch/releases/tag/v2.3.0>`__
      - 22.04
      - `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
-      
+      - MI300X, MI250X
+
+
+.. _dgl-key-rocm-libraries:

 Key ROCm libraries for DGL
 ================================================================================
@@ -310,8 +265,9 @@ If you prefer to build it yourself, ensure the following dependencies are instal
        multiplication (GEMM) and accumulation operations with mixed precision
        support.

+.. _dgl-supported-features-latest:

-Supported features
+Supported features with ROCm 7.0.0
 ================================================================================

 Many functions and methods available upstream are also supported in DGL on ROCm.
@@ -335,14 +291,17 @@ Instead of listing them all, support is grouped into the following categories to
 * DGL Sparse
 * GraphBolt

-Unsupported features
+.. _dgl-unsupported-features-latest:
+
+Unsupported features with ROCm 7.0.0
 ================================================================================

 * TF32 Support (only supported for PyTorch 2.7 and above)
 * Kineto/ROCTracer integration

+.. _dgl-unsupported-functions:

-Unsupported functions
+Unsupported functions with ROCm 7.0.0
 ================================================================================

 * ``bfs``
@@ -355,6 +314,50 @@ Unsupported functions
 * ``sample_labors_noprob``
 * ``sparse_admin``

+.. _dgl-recommendations:
+
+Use cases and recommendations
+================================================================================
+
+DGL can be used for Graph Learning, and building popular graph models like  
+GAT, GCN, and GraphSage. Using these models, a variety of use cases are supported:
+
+- Recommender systems
+- Network Optimization and Analysis
+- 1D (Temporal) and 2D (Image) Classification
+- Drug Discovery
+
+For use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
+where you can search for DGL examples and best practices to optimize your workloads on AMD GPUs.
+
+* Although multiple use cases of DGL have been tested and verified, a few have been  
+  outlined in the `DGL in the Real World: Running GNNs on Real Use Cases 
+  <https://rocm.blogs.amd.com/artificial-intelligence/dgl_blog2/README.html>`__ blog 
+  post, which walks through four real-world graph neural network (GNN) workloads 
+  implemented with the Deep Graph Library on ROCm. It covers tasks ranging from 
+  heterogeneous e-commerce graphs and multiplex networks (GATNE) to molecular graph 
+  regression (GNN-FiLM) and EEG-based neurological diagnosis (EEG-GCNN). For each use 
+  case, the authors detail: the dataset and task, how DGL is used, and their experience 
+  porting to ROCm. It is shown that DGL codebases often run without modification, with 
+  seamless integration of graph operations, message passing, sampling, and convolution. 
+
+* The `Graph Neural Networks (GNNs) at Scale: DGL with ROCm on AMD Hardware 
+  <https://rocm.blogs.amd.com/artificial-intelligence/why-graph-neural/README.html>`__ 
+  blog post introduces the Deep Graph Library (DGL) and its enablement on the AMD ROCm platform, 
+  bringing high-performance graph neural network (GNN) training to AMD GPUs. DGL bridges 
+  the gap between dense tensor frameworks and the irregular nature of graph data through a 
+  graph-first, message-passing abstraction. Its design ensures scalability, flexibility, and 
+  interoperability across frameworks like PyTorch and TensorFlow. AMD’s ROCm integration 
+  enables DGL to run efficiently on HIP-based GPUs, supported by prebuilt Docker containers 
+  and open-source repositories. This marks a major step in AMD's mission to advance open, 
+  scalable AI ecosystems beyond traditional architectures.
+
+You can pre-process datasets and begin training on AMD GPUs through:
+
+* Single-GPU training/inference
+* Multi-GPU training
+
+
 Previous versions
 ===============================================================================
 See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/dgl-history` to find documentation for previous releases
--- a/docs/compatibility/ml-compatibility/flashinfer-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/flashinfer-compatibility.rst
@@ -42,38 +42,9 @@ Support overview
  - You can also consult the upstream `Installation guide <https://docs.flashinfer.ai/installation.html>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
-
-FlashInfer is supported on `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
-
-Supported devices
--------------------------------------------------------------------------------
-
-**Officially Supported**: AMD Instinct™ MI300X
-
-
-.. _flashinfer-recommendations:
-
-Use cases and recommendations
-================================================================================
-
-This release of FlashInfer on ROCm provides the decode functionality for LLM inferencing.
-In the decode phase, tokens are generated sequentially, with the model predicting each new 
-token based on the previously generated tokens and the input context.
-
-FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense 
-attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs.
-
-Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm 
-also implements cascade attention from upstream to reduce memory usage. 
-
-For currently supported use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
-where you can search for examples and best practices to optimize your workloads on AMD GPUs.
-
 .. _flashinfer-docker-compat:

-Docker image compatibility
+Compatibility matrix
 ================================================================================

 .. |docker-icon| raw:: html
@@ -95,6 +66,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - PyTorch
      - Ubuntu
      - Python
+      - GPU

    * - .. raw:: html

@@ -104,5 +76,23 @@ Click |docker-icon| to view the image on Docker Hub.
      - `2.7.1 <https://github.com/ROCm/pytorch/releases/tag/v2.7.1>`__
      - 24.04
      - `3.12 <https://www.python.org/downloads/release/python-3129/>`__
+      - MI300X

+.. _flashinfer-recommendations:
+
+Use cases and recommendations
+================================================================================
+
+The release of FlashInfer on ROCm provides the decode functionality for LLM inferencing.
+In the decode phase, tokens are generated sequentially, with the model predicting each new 
+token based on the previously generated tokens and the input context.
+
+FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense 
+attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs.
+
+Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm 
+also implements cascade attention from upstream to reduce memory usage. 
+
+For currently supported use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
+where you can search for examples and best practices to optimize your workloads on AMD GPUs.

--- a/docs/compatibility/ml-compatibility/llama-cpp-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/llama-cpp-compatibility.rst
@@ -36,47 +36,9 @@ Support overview
  - You can also consult the upstream `Installation guide <https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
-
-llama.cpp is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__ and 
-`ROCm 6.4.x <https://repo.radeon.com/rocm/apt/6.4/>`__.
-
-Supported devices
--------------------------------------------------------------------------------
-
-**Officially Supported**: AMD Instinct™ MI325X, MI300X, MI210
-
-Use cases and recommendations
-================================================================================
-
-llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:
-
-- Plain C/C++ implementation with no external dependencies
-- Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
-- Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running large language models (LLMs) on AMD GPUs (graphics processing units)
-- CPU (central processing unit) + GPU (graphics processing unit) hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory)
-
-llama.cpp is also used in a range of real-world applications, including:
-
-- Games such as `Lucy's Labyrinth <https://github.com/MorganRO8/Lucys_Labyrinth>`__:
-  A simple maze game where AI-controlled agents attempt to trick the player.
-- Tools such as `Styled Lines <https://marketplace.unity.com/packages/tools/ai-ml-integration/style-text-webgl-ios-stand-alone-llm-llama-cpp-wrapper-292902>`__:
-  A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.
-- Various other AI applications use llama.cpp as their inference engine;  
-  for a detailed list, see the `user interfaces (UIs) section <https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#description>`__.
-
-For more use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
-where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.
-
-- The `Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration <https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html>`__ 
-  blog post outlines how the open-source llama.cpp framework enables efficient LLM inference—including interactive inference with ``llama-cli``, 
-  server deployment with ``llama-server``, GGUF model preparation and quantization, performance benchmarking, and optimizations tailored for 
-  AMD Instinct GPUs within the ROCm ecosystem. 
-
 .. _llama-cpp-docker-compat:

-Docker image compatibility
+Compatibility matrix
 ================================================================================

 .. |docker-icon| raw:: html
@@ -106,6 +68,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - llama.cpp
      - ROCm
      - Ubuntu
+      - GPU

    * - .. raw:: html

@@ -119,6 +82,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6652 <https://github.com/ROCm/llama.cpp/tree/release/b6652>`__
      - `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
      - 24.04
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -132,6 +96,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6652 <https://github.com/ROCm/llama.cpp/tree/release/b6652>`__
      - `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
      - 22.04
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -145,6 +110,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
      - `6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__
      - 24.04
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -158,7 +124,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
      - `6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__
      - 22.04
-
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -172,6 +138,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
      - `6.4.2 <https://repo.radeon.com/rocm/apt/6.4.2/>`__
      - 24.04
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -185,7 +152,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
      - `6.4.2 <https://repo.radeon.com/rocm/apt/6.4.2/>`__
      - 22.04
-
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -199,6 +166,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
      - `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
      - 24.04
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -212,6 +180,7 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
      - `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
      - 22.04
+      - MI325X, MI300X, MI210

    * - .. raw:: html

@@ -225,7 +194,9 @@ Click |docker-icon| to view the image on Docker Hub.
      - `b5997 <https://github.com/ROCm/llama.cpp/tree/release/b5997>`__
      - `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__
      - 24.04
+      - MI300X, MI210

+.. _llama-cpp-key-rocm-libraries:

 Key ROCm libraries for llama.cpp
 ================================================================================
@@ -268,6 +239,36 @@ your corresponding ROCm version.
      - Can be used to enhance the flash attention performance on AMD compute, by enabling
        the flag during compile time.

+.. _llama-cpp-uses-recommendations:
+
+Use cases and recommendations
+================================================================================
+
+llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:
+
+- Plain C/C++ implementation with no external dependencies
+- Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
+- Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running large language models (LLMs) on AMD GPUs (graphics processing units)
+- CPU (central processing unit) + GPU (graphics processing unit) hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory)
+
+llama.cpp is also used in a range of real-world applications, including:
+
+- Games such as `Lucy's Labyrinth <https://github.com/MorganRO8/Lucys_Labyrinth>`__:
+  A simple maze game where AI-controlled agents attempt to trick the player.
+- Tools such as `Styled Lines <https://marketplace.unity.com/packages/tools/ai-ml-integration/style-text-webgl-ios-stand-alone-llm-llama-cpp-wrapper-292902>`__:
+  A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.
+- Various other AI applications use llama.cpp as their inference engine;  
+  for a detailed list, see the `user interfaces (UIs) section <https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#description>`__.
+
+For more use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
+where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.
+
+- The `Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration <https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html>`__ 
+  blog post outlines how the open-source llama.cpp framework enables efficient LLM inference—including interactive inference with ``llama-cli``, 
+  server deployment with ``llama-server``, GGUF model preparation and quantization, performance benchmarking, and optimizations tailored for 
+  AMD Instinct GPUs within the ROCm ecosystem. 
+
+
 Previous versions
 ===============================================================================
 See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/llama-cpp-history` to find documentation for previous releases
--- a/docs/compatibility/ml-compatibility/megablocks-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/megablocks-compatibility.rst
@@ -33,19 +33,44 @@ Support overview
  - You can also consult the upstream `Installation guide <https://github.com/databricks/megablocks>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
+.. _megablocks-docker-compat:

-Megablocks is supported on `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`__.
+Compatibility matrix
+================================================================================

-Supported devices
--------------------------------------------------------------------------------
+.. |docker-icon| raw:: html

- **Officially Supported**: AMD Instinct™ MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct™ MI250X, MI210
+   <i class="fab fa-docker"></i>

-Supported models and features
--------------------------------------------------------------------------------
+AMD validates and publishes `Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`__
+with ROCm backends on Docker Hub. The following Docker image tag and associated
+inventories represent the latest available Megablocks version from the official Docker Hub. 
+Click |docker-icon| to view the image on Docker Hub.
+
+.. list-table:: 
+    :header-rows: 1
+    :class: docker-image-compatibility
+
+    * - Docker image
+      - ROCm
+      - Megablocks
+      - PyTorch
+      - Ubuntu
+      - Python
+      - GPU
+
+    * - .. raw:: html
+
+           <a href="https://hub.docker.com/layers/rocm/megablocks/megablocks-0.7.0_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-372ff89b96599019b8f5f9db469c84add2529b713456781fa62eb9a148659ab4"><i class="fab fa-docker fa-lg"></i> rocm/megablocks</a>
+      - `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
+      - `0.7.0 <https://github.com/databricks/megablocks/releases/tag/v0.7.0>`_
+      - `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
+      - 24.04
+      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
+      - MI300X
+
+Supported models and features with ROCm 6.3.0
+================================================================================

 This section summarizes the Megablocks features supported by ROCm.

@@ -77,38 +102,3 @@ It features how to pre-process datasets and how to begin pre-training on AMD GPU
 * Single-GPU pre-training
 * Multi-GPU pre-training

-.. _megablocks-docker-compat:
-
-Docker image compatibility
-================================================================================
-
-.. |docker-icon| raw:: html
-
-   <i class="fab fa-docker"></i>
-
-AMD validates and publishes `Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`__
-with ROCm backends on Docker Hub. The following Docker image tag and associated
-inventories represent the latest available Megablocks version from the official Docker Hub. 
-Click |docker-icon| to view the image on Docker Hub.
-
-.. list-table:: 
-    :header-rows: 1
-    :class: docker-image-compatibility
-
-    * - Docker image
-      - ROCm
-      - Megablocks
-      - PyTorch
-      - Ubuntu
-      - Python
-
-    * - .. raw:: html
-
-           <a href="https://hub.docker.com/layers/rocm/megablocks/megablocks-0.7.0_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-372ff89b96599019b8f5f9db469c84add2529b713456781fa62eb9a148659ab4"><i class="fab fa-docker fa-lg"></i> rocm/megablocks</a>
-      - `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
-      - `0.7.0 <https://github.com/databricks/megablocks/releases/tag/v0.7.0>`_
-      - `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
-      - 24.04
-      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
-
-
--- a/docs/compatibility/ml-compatibility/ray-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/ray-compatibility.rst
@@ -12,8 +12,8 @@ Ray compatibility

 Ray is a unified framework for scaling AI and Python applications from your laptop 
 to a full cluster, without changing your code. Ray consists of `a core distributed 
-runtime  <https://docs.ray.io/en/latest/ray-core/walkthrough.html>`_ and a set of 
-`AI libraries <https://docs.ray.io/en/latest/ray-air/getting-started.html>`_ for 
+runtime  <https://docs.ray.io/en/latest/ray-core/walkthrough.html>`__ and a set of 
+`AI libraries <https://docs.ray.io/en/latest/ray-air/getting-started.html>`__ for 
 simplifying machine learning computations.

 Ray is a general-purpose framework that runs many types of workloads efficiently. 
@@ -29,25 +29,57 @@ Support overview
 - To get started and install Ray on ROCm, use the prebuilt :ref:`Docker image <ray-docker-compat>`, 
  which includes ROCm, Ray, and all required dependencies.

-  - The Docker image provided is based on the upstream Ray `Daily Release (Nightly) wheels 
-    <https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies>`__ 
-    corresponding to commit `005c372 <https://github.com/ray-project/ray/commit/005c372262e050d5745f475e22e64305fa07f8b8>`__.
-
-  - See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>` 
+  - See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>`
    for installation and setup instructions.

  - You can also consult the upstream `Installation guide <https://docs.ray.io/en/latest/ray-overview/installation.html>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
+.. _ray-docker-compat:

-Ray is supported on `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
+Compatibility matrix
+================================================================================

-Supported devices
--------------------------------------------------------------------------------
+.. |docker-icon| raw:: html

-**Officially Supported**: AMD Instinct™ MI300X, MI210
+   <i class="fab fa-docker"></i>
+
+AMD validates and publishes `ROCm Ray Docker images <https://hub.docker.com/r/rocm/ray/tags>`__
+with ROCm backends on Docker Hub. The following Docker image tags and
+associated inventories represent the latest Ray version from the official Docker Hub.
+Click |docker-icon| to view the image on Docker Hub.
+
+.. list-table::
+    :header-rows: 1
+    :class: docker-image-compatibility
+
+    * - Docker image
+      - ROCm
+      - Ray
+      - PyTorch
+      - Ubuntu
+      - Python
+      - GPU
+
+    * - .. raw:: html
+
+           <a href="https://hub.docker.com/layers/rocm/ray/ray-2.51.1_rocm7.0.0_ubuntu22.04_py3.12_pytorch2.9.0/images/sha256-a02f6766b4ba406f88fd7e85707ec86c04b569834d869a08043ec9bcbd672168"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
+      - `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
+      - `2.51.1 <https://github.com/ROCm/ray/tree/release/2.51.1>`__
+      - 2.9.0a0+git1c57644
+      - 22.04
+      - `3.12.12 <https://www.python.org/downloads/release/python-31212/>`__
+      - MI300X
+
+    * - .. raw:: html
+
+           <a href="https://hub.docker.com/layers/rocm/ray/ray-2.48.0.post0_rocm6.4.1_ubuntu24.04_py3.12_pytorch2.6.0/images/sha256-0d166fe6bdced38338c78eedfb96eff92655fb797da3478a62dd636365133cc0"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
+      - `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
+      - `2.48.0.post0 <https://github.com/ROCm/ray/tree/release/2.48.0.post0>`__
+      - 2.6.0+git684f6f2
+      - 24.04
+      - `3.12.10 <https://www.python.org/downloads/release/python-31210/>`__
+      - MI300X, MI210

 Use cases and recommendations
 ================================================================================
@@ -76,36 +108,7 @@ topic <https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#accel
 of the Ray core documentation and refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__, 
 where you can search for Ray examples and best practices to optimize your workloads on AMD GPUs.

-.. _ray-docker-compat:
-
-Docker image compatibility
-================================================================================
-
-.. |docker-icon| raw:: html
-
-   <i class="fab fa-docker"></i>
-
-AMD validates and publishes ready-made `ROCm Ray Docker images <https://hub.docker.com/r/rocm/ray/tags>`__
-with ROCm backends on Docker Hub. The following Docker image tags and
-associated inventories represent the latest Ray version from the official Docker Hub.
-Click the |docker-icon| icon to view the image on Docker Hub.
-
-.. list-table::
-    :header-rows: 1
-    :class: docker-image-compatibility
-
-    * - Docker image
-      - ROCm
-      - Ray
-      - Pytorch
-      - Ubuntu
-      - Python
-
-    * - .. raw:: html
-
-           <a href="https://hub.docker.com/layers/rocm/ray/ray-2.48.0.post0_rocm6.4.1_ubuntu24.04_py3.12_pytorch2.6.0/images/sha256-0d166fe6bdced38338c78eedfb96eff92655fb797da3478a62dd636365133cc0"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
-      - `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
-      - `2.48.0.post0 <https://github.com/ROCm/ray/tree/release/2.48.0.post0>`_
-      - 2.6.0+git684f6f2
-      - 24.04
-      - `3.12.10 <https://www.python.org/downloads/release/python-31210/>`_
+Previous versions
+===============================================================================
+See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/ray-history` to find documentation for previous releases
+of the ``ROCm/ray`` Docker image.
--- a/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
@@ -35,19 +35,45 @@ Support overview
  - You can also consult the upstream `Installation guide <https://github.com/NVIDIA/Megatron-LM>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
+.. _megatron-lm-docker-compat:

-Stanford Megatron-LM is supported on `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`__.
+Compatibility matrix
+================================================================================

-Supported devices
--------------------------------------------------------------------------------
+.. |docker-icon| raw:: html

- **Officially Supported**: AMD Instinct™ MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct™ MI250X, MI210
+   <i class="fab fa-docker"></i>

-Supported models and features
--------------------------------------------------------------------------------
+AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/stanford-megatron-lm/tags>`_
+with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated
+inventories represent the latest Stanford Megatron-LM version from the official Docker Hub.
+Click |docker-icon| to view the image on Docker Hub.
+
+.. list-table:: 
+    :header-rows: 1
+    :class: docker-image-compatibility
+
+    * - Docker image
+      - ROCm
+      - Stanford Megatron-LM
+      - PyTorch
+      - Ubuntu
+      - Python
+      - GPU
+
+    * - .. raw:: html
+
+           <a href="https://hub.docker.com/layers/rocm/stanford-megatron-lm/stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-070556f078be10888a1421a2cb4f48c29f28b02bfeddae02588d1f7fc02a96a6"><i class="fab fa-docker fa-lg"></i> rocm/stanford-megatron-lm</a>
+
+      - `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
+      - `85f95ae <https://github.com/stanford-futuredata/Megatron-LM/commit/85f95aef3b648075fe6f291c86714fdcbd9cd1f5>`_
+      - `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
+      - 24.04
+      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
+      - MI300X
+
+Supported models and features with ROCm 6.3.0
+================================================================================

 This section details models & features that are supported by the ROCm version on Stanford Megatron-LM.

@@ -88,41 +114,3 @@ It features how to pre-process datasets and how to begin pre-training on AMD GPU

 * Single-GPU pre-training
 * Multi-GPU pre-training
-
-.. _megatron-lm-docker-compat:
-
-Docker image compatibility
-================================================================================
-
-.. |docker-icon| raw:: html
-
-   <i class="fab fa-docker"></i>
-
-AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/stanford-megatron-lm/tags>`_
-with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated
-inventories represent the latest Stanford Megatron-LM version from the official Docker Hub.
-Click |docker-icon| to view the image on Docker Hub.
-
-.. list-table:: 
-    :header-rows: 1
-    :class: docker-image-compatibility
-
-    * - Docker image
-      - ROCm
-      - Stanford Megatron-LM
-      - PyTorch
-      - Ubuntu
-      - Python
-
-    * - .. raw:: html
-
-           <a href="https://hub.docker.com/layers/rocm/stanford-megatron-lm/stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-070556f078be10888a1421a2cb4f48c29f28b02bfeddae02588d1f7fc02a96a6"><i class="fab fa-docker fa-lg"></i></a>
-
-      - `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
-      - `85f95ae <https://github.com/stanford-futuredata/Megatron-LM/commit/85f95aef3b648075fe6f291c86714fdcbd9cd1f5>`_
-      - `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
-      - 24.04
-      - `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
-
-      
-
--- a/docs/compatibility/ml-compatibility/verl-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/verl-compatibility.rst
@@ -37,67 +37,9 @@ Support overview
  - You can also consult the upstream `verl documentation <https://verl.readthedocs.io/en/latest/>`__ 
    for additional context.

-Version support
--------------------------------------------------------------------------------
-
-verl is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__ and
-`ROCm 6.2.0 <https://repo.radeon.com/rocm/apt/6.2/>`__.
-
-Supported devices
--------------------------------------------------------------------------------
-
-**Officially Supported**: AMD Instinct™ MI300X
-
-.. _verl-recommendations:
-
-Use cases and recommendations
-================================================================================
-
-* The benefits of verl in large-scale reinforcement learning from human feedback 
-  (RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD 
-  GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__ 
-  blog. The blog post outlines how the Volcano Engine Reinforcement Learning 
-  (verl) framework integrates with the AMD ROCm platform to optimize training on 
-  AMD Instinct™ GPUs. The guide details the process of building a Docker image, 
-  setting up single-node and multi-node training environments, and highlights 
-  performance benchmarks demonstrating improved throughput and convergence accuracy. 
-  This resource serves as a comprehensive starting point for deploying verl on AMD GPUs, 
-  facilitating efficient RLHF training workflows.
-
-.. _verl-supported_features:
-
-Supported features
-===============================================================================
-
-The following table shows verl on ROCm support for GPU-accelerated modules.
-
-.. list-table::
-    :header-rows: 1
-
-    * - Module
-      - Description
-      - verl version
-      - ROCm version
-    * - ``FSDP``
-      - Training engine
-      - 
-       * 0.6.0
-       * 0.3.0.post0
-      - 
-       * 7.0.0
-       * 6.2.0
-    * - ``vllm``
-      - Inference engine
-      - 
-       * 0.6.0
-       * 0.3.0.post0
-      - 
-       * 7.0.0
-       * 6.2.0
-
 .. _verl-docker-compat:

-Docker image compatibility
+Compatibility matrix
 ================================================================================

 .. |docker-icon| raw:: html
@@ -120,6 +62,7 @@ Click |docker-icon| to view the image on Docker Hub.
     - PyTorch
     - Python
     - vllm
+     - GPU

   * - .. raw:: html

@@ -130,6 +73,7 @@ Click |docker-icon| to view the image on Docker Hub.
     - `2.9.0 <https://github.com/ROCm/pytorch/tree/release/2.9-rocm7.x-gfx115x>`__
     - `3.12.11 <https://www.python.org/downloads/release/python-31211/>`__
     - `0.11.0 <https://github.com/vllm-project/vllm/releases/tag/v0.11.0>`__
+     - MI300X

   * - .. raw:: html

@@ -140,7 +84,33 @@ Click |docker-icon| to view the image on Docker Hub.
     - `2.5.0 <https://github.com/ROCm/pytorch/tree/release/2.5>`__
     - `3.9.19 <https://www.python.org/downloads/release/python-3919/>`__
     - `0.6.3 <https://github.com/vllm-project/vllm/releases/tag/v0.6.3>`__
+     - MI300X

+.. _verl-supported_features:
+
+Supported modules with verl on ROCm
+===============================================================================
+
+The following GPU-accelerated modules are supported with verl on ROCm:
+
+- ``FSDP``: Training engine
+- ``vllm``: Inference engine
+
+.. _verl-recommendations:
+
+Use cases and recommendations
+================================================================================
+
+* The benefits of verl in large-scale reinforcement learning from human feedback 
+  (RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD 
+  GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__ 
+  blog. The blog post outlines how the Volcano Engine Reinforcement Learning 
+  (verl) framework integrates with the AMD ROCm platform to optimize training on 
+  AMD Instinct™ GPUs. The guide details the process of building a Docker image, 
+  setting up single-node and multi-node training environments, and highlights 
+  performance benchmarks demonstrating improved throughput and convergence accuracy. 
+  This resource serves as a comprehensive starting point for deploying verl on AMD GPUs, 
+  facilitating efficient RLHF training workflows.

 Previous versions
 ===============================================================================
--- a/docs/data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml
+++ b/docs/data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml
@@ -8,6 +8,303 @@ dockers:
      hipBLASLt: 1.0.0
    dockerfile:
      commit: 8398684622109c806a35d660647060b0b9910663
+
+configs:
+  default:
+    ## DeepSeek AITER MLA currently only supports --block-size 1
+    - &deepseek-r1-serving
+      benchmark: serving
+      model: deepseek-ai/DeepSeek-R1-0528
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1 8 32 128
+      extra_args:
+        async-scheduling: True
+        block-size: 1
+    ## gpt-oss requires AITER unified attention and performs best with block-size 64 and FULL_AND_PIECEWISE cudagraph mode
+    - &gpt-oss-120b-serving
+      benchmark: serving
+      model: openai/gpt-oss-120b
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1 8 32 128
+      env:
+        VLLM_ROCM_USE_AITER_MHA: 0
+        VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION: 1
+      extra_args:
+        async-scheduling: True
+        block-size: 64
+        compilation-config: '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\"}'
+    - &llama-3-serving
+      benchmark: serving
+      model:
+        meta-llama/Llama-3.1-405B-Instruct
+        amd/Llama-3.1-405B-Instruct-FP8-KV
+        meta-llama/Llama-3.3-70B-Instruct
+        amd/Llama-3.3-70B-Instruct-FP8-KV
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1 8 32 128
+      extra_args:
+        async-scheduling: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    ## Llama 3.x MXFP4 (gfx950 only)
+    - &llama-3-mxfp4-serving
+      benchmark: serving
+      model:
+        amd/Llama-3.1-405B-Instruct-MXFP4-Preview
+        amd/Llama-3.3-70B-Instruct-MXFP4-Preview
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1 8 32 128
+      extra_args:
+        async-scheduling: True
+    ## Llama 4 currently does not support full cudagraph or attn fusion
+    - &llama-4-fp8-serving
+      benchmark: serving
+      model:
+        meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1 8 32 128
+      extra_args:
+        async-scheduling: True
+        compilation-config: '{\"cudagraph_mode\":\"PIECEWISE\",\"pass_config\":{\"enable_attn_fusion\":false}}'
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    - &mixtral-8x22b-serving
+      benchmark: serving
+      model:
+        mistralai/Mixtral-8x22B-Instruct-v0.1
+        amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1 8 32 128
+      extra_args:
+        async-scheduling: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+  extended:
+    ## gpt-oss requires AITER unified attention and performs best with block-size 64 and FULL_AND_PIECEWISE cudagraph mode
+    - &gpt-oss-20b-serving
+      benchmark: serving
+      model:
+        openai/gpt-oss-20b
+      tp: 1
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1
+      env:
+        VLLM_ROCM_USE_AITER_MHA: 0
+        VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION: 1
+      extra_args:
+        async-scheduling: True
+        block-size: 64
+        compilation-config: '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\"}'
+    - &llama-3-8b-phi-4-qwen3-serving
+      benchmark: serving
+      model:
+        meta-llama/Llama-3.1-8B-Instruct
+        amd/Llama-3.1-8B-Instruct-FP8-KV
+        microsoft/phi-4
+        Qwen/Qwen3-8B
+        Qwen/Qwen3-32B
+        Qwen/Qwen3-30B-A3B-Thinking-2507
+        Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
+      tp: 1
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1
+      extra_args:
+        async-scheduling: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+
+    - &llama-2-70b-serving
+      benchmark: serving
+      model:
+        meta-llama/Llama-2-70b-chat-hf
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1
+      extra_args:
+        async-scheduling: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    ## Llama 4 currently does not support full cudagraph or attn fusion
+    - &llama-4-serving
+      benchmark: serving
+      model:
+        meta-llama/Llama-4-Scout-17B-16E-Instruct
+        meta-llama/Llama-4-Maverick-17B-128E-Instruct
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1
+      extra_args:
+        async-scheduling: True
+        compilation-config: '{\"cudagraph_mode\":\"PIECEWISE\",\"pass_config\":{\"enable_attn_fusion\":false}}'
+      arch_overrides:
+        gfx942:
+          dtype: float16
+
+    - &mixtral-8x7b-serving
+      benchmark: serving
+      model:
+        mistralai/Mixtral-8x7B-Instruct-v0.1
+        amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1
+      extra_args:
+        async-scheduling: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    ## Qwen 235B requires --enable-expert-parallel with tp 8
+    - &qwen3-235b-a22b-serving
+      benchmark: serving
+      model:
+        Qwen/Qwen3-235B-A22B-Thinking-2507
+        Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
+      tp: 8
+      inp: 1024
+      out: 1024
+      dtype: auto
+      max_concurrency: 1
+      extra_args:
+        async-scheduling: True
+        enable-expert-parallel: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+  accuracy:
+    ## DeepSeek AITER MLA currently only supports --block-size 1
+    - &deepseek-r1-accuracy
+      benchmark: accuracy
+      model: deepseek-ai/DeepSeek-R1-0528
+      tp: 8
+      dtype: auto
+      extra_args:
+        async-scheduling: True
+        block-size: 1
+      bench_args:
+        apply_chat_template: True
+    ## gpt-oss requires AITER unified attention and performs best with block-size 64 and FULL_AND_PIECEWISE cudagraph mode
+    - &gpt-oss-120b-accuracy
+      benchmark: accuracy
+      model: openai/gpt-oss-120b
+      tp: 8
+      dtype: auto
+      env:
+        VLLM_ROCM_USE_AITER_MHA: 0
+        VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION: 1
+      extra_args:
+        async-scheduling: True
+        block-size: 64
+        compilation-config: '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\"}'
+      bench_args:
+        apply_chat_template: True
+    ## Llama 3.x bf16 and fp8 perform better with --dtype float16 on gfx942
+    - &llama-3-accuracy
+      benchmark: accuracy
+      model:
+        meta-llama/Llama-3.1-405B-Instruct
+        amd/Llama-3.1-405B-Instruct-FP8-KV
+        meta-llama/Llama-3.3-70B-Instruct
+        amd/Llama-3.3-70B-Instruct-FP8-KV
+      tp: 8
+      dtype: auto
+      extra_args:
+        async-scheduling: True
+      bench_args:
+        apply_chat_template: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    ## Llama 3.x MXFP4 (gfx950 only)
+    - &llama-3-mxfp4-accuracy
+      benchmark: accuracy
+      model:
+        amd/Llama-3.1-405B-Instruct-MXFP4-Preview
+        amd/Llama-3.3-70B-Instruct-MXFP4-Preview
+      tp: 8
+      dtype: auto
+      extra_args:
+        async-scheduling: True
+      bench_args:
+        apply_chat_template: True
+    ## Llama 4 currently does not support full cudagraph or attn fusion
+    - &llama-4-fp8-accuracy
+      benchmark: accuracy
+      model:
+        meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
+      tp: 8
+      dtype: auto
+      extra_args:
+        async-scheduling: True
+        compilation-config: '{\"cudagraph_mode\":\"PIECEWISE\",\"pass_config\":{\"enable_attn_fusion\":false}}'
+      bench_args:
+        apply_chat_template: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    ## Mistral models require --tokenizer-mode mistral for correct decoding
+    - &mixtral-8x22b-accuracy
+      benchmark: accuracy
+      model:
+        mistralai/Mixtral-8x22B-Instruct-v0.1
+        amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
+      tp: 8
+      dtype: auto
+      extra_args:
+        async-scheduling: True
+      bench_args:
+        apply_chat_template: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+    ## Qwen 235B requires --enable-expert-parallel with tp 8
+    - &qwen3-235b-a22b-accuracy
+      benchmark: accuracy
+      model:
+        Qwen/Qwen3-235B-A22B-Thinking-2507
+        Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
+      dtype: auto
+      extra_args:
+        async-scheduling: True
+        enable-expert-parallel: True
+      bench_args:
+        apply_chat_template: True
+      arch_overrides:
+        gfx942:
+          dtype: float16
+
 model_groups:
  - group: Meta Llama
    tag: llama
@@ -18,132 +315,139 @@ model_groups:
        url: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 4096
-          max_model_len: 4096
+          serving: *llama-2-70b-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 4096
+            max_model_len: 4096
      - model: Llama 3.1 8B
        mad_tag: pyt_vllm_llama-3.1-8b
        model_repo: meta-llama/Llama-3.1-8B-Instruct
        url: https://huggingface.co/meta-llama/Llama-3.1-8B
        precision: float16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.1 8B FP8
        mad_tag: pyt_vllm_llama-3.1-8b_fp8
        model_repo: amd/Llama-3.1-8B-Instruct-FP8-KV
        url: https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV
        precision: float8
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.1 405B
        mad_tag: pyt_vllm_llama-3.1-405b
        model_repo: meta-llama/Llama-3.1-405B-Instruct
        url: https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-serving
+          accuracy: *llama-3-accuracy
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.1 405B FP8
        mad_tag: pyt_vllm_llama-3.1-405b_fp8
        model_repo: amd/Llama-3.1-405B-Instruct-FP8-KV
        url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-serving
+          accuracy: *llama-3-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.1 405B MXFP4
        mad_tag: pyt_vllm_llama-3.1-405b_fp4
        model_repo: amd/Llama-3.1-405B-Instruct-MXFP4-Preview
        url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-MXFP4-Preview
        precision: float4
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-mxfp4-serving
+          accuracy: *llama-3-mxfp4-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.3 70B
        mad_tag: pyt_vllm_llama-3.3-70b
        model_repo: meta-llama/Llama-3.3-70B-Instruct
        url: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-serving
+          accuracy: *llama-3-accuracy
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.3 70B FP8
        mad_tag: pyt_vllm_llama-3.3-70b_fp8
        model_repo: amd/Llama-3.3-70B-Instruct-FP8-KV
        url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-serving
+          accuracy: *llama-3-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 3.3 70B MXFP4
        mad_tag: pyt_vllm_llama-3.3-70b_fp4
        model_repo: amd/Llama-3.3-70B-Instruct-MXFP4-Preview
        url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview
        precision: float4
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-3-mxfp4-serving
+          accuracy: *llama-3-mxfp4-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
      - model: Llama 4 Scout 17Bx16E
        mad_tag: pyt_vllm_llama-4-scout-17b-16e
        model_repo: meta-llama/Llama-4-Scout-17B-16E-Instruct
        url: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 32768
-          max_model_len: 8192
+          serving: *llama-4-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 32768
+            max_model_len: 8192
      - model: Llama 4 Maverick 17Bx128E
        mad_tag: pyt_vllm_llama-4-maverick-17b-128e
        model_repo: meta-llama/Llama-4-Maverick-17B-128E-Instruct
        url: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 32768
-          max_model_len: 8192
+          serving: *llama-4-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 32768
+            max_model_len: 8192
      - model: Llama 4 Maverick 17Bx128E FP8
        mad_tag: pyt_vllm_llama-4-maverick-17b-128e_fp8
        model_repo: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
        url: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *llama-4-fp8-serving
+          accuracy: *llama-4-fp8-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
  - group: DeepSeek
    tag: deepseek
    models:
@@ -153,12 +457,12 @@ model_groups:
        url: https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_seqs: 1024
-          max_num_batched_tokens: 131072
-          max_model_len: 8192
+          serving: *deepseek-r1-serving
+          accuracy: *deepseek-r1-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 131072
+            max_model_len: 8192
  - group: OpenAI GPT OSS
    tag: gpt-oss
    models:
@@ -168,22 +472,23 @@ model_groups:
        url: https://huggingface.co/openai/gpt-oss-20b
        precision: bfloat16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 8192
-          max_model_len: 8192
+          serving: *gpt-oss-20b-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 8192
+            max_model_len: 8192
      - model: GPT OSS 120B
        mad_tag: pyt_vllm_gpt-oss-120b
        model_repo: openai/gpt-oss-120b
        url: https://huggingface.co/openai/gpt-oss-120b
        precision: bfloat16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 8192
-          max_model_len: 8192
+          serving: *gpt-oss-120b-serving
+          accuracy: *gpt-oss-120b-accuracy
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 8192
+            max_model_len: 8192
  - group: Mistral AI
    tag: mistral
    models:
@@ -193,44 +498,46 @@ model_groups:
        url: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 32768
-          max_model_len: 8192
+          serving: *mixtral-8x7b-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 32768
+            max_model_len: 8192
      - model: Mixtral MoE 8x7B FP8
        mad_tag: pyt_vllm_mixtral-8x7b_fp8
        model_repo: amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
        url: https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 32768
-          max_model_len: 8192
+          serving: *mixtral-8x7b-serving
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 32768
+            max_model_len: 8192
      - model: Mixtral MoE 8x22B
        mad_tag: pyt_vllm_mixtral-8x22b
        model_repo: mistralai/Mixtral-8x22B-Instruct-v0.1
        url: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 65536
-          max_model_len: 8192
+          serving: *mixtral-8x22b-serving
+          accuracy: *mixtral-8x22b-accuracy
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 65536
+            max_model_len: 8192
      - model: Mixtral MoE 8x22B FP8
        mad_tag: pyt_vllm_mixtral-8x22b_fp8
        model_repo: amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
        url: https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 65536
-          max_model_len: 8192
+          serving: *mixtral-8x22b-serving
+          accuracy: *mixtral-8x22b-accuracy
+          ex:
+            kv_cache_dtype: fp8
+            max_num_batched_tokens: 65536
+            max_model_len: 8192
  - group: Qwen
    tag: qwen
    models:
@@ -240,66 +547,68 @@ model_groups:
        url: https://huggingface.co/Qwen/Qwen3-8B
        precision: float16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 40960
-          max_model_len: 8192
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 40960
+            max_model_len: 8192
      - model: Qwen3 32B
        mad_tag: pyt_vllm_qwen3-32b
-        model_repo: Qwen/Qwen3-32b
+        model_repo: Qwen/Qwen3-32B
        url: https://huggingface.co/Qwen/Qwen3-32B
        precision: float16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 40960
-          max_model_len: 8192
-      - model: Qwen3 30B A3B
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 40960
+            max_model_len: 8192
+      - model: Qwen3 30B A3B Thinking
        mad_tag: pyt_vllm_qwen3-30b-a3b
-        model_repo: Qwen/Qwen3-30B-A3B
-        url: https://huggingface.co/Qwen/Qwen3-30B-A3B
+        model_repo: Qwen/Qwen3-30B-A3B-Thinking-2507
+        url: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
        precision: float16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 40960
-          max_model_len: 8192
-      - model: Qwen3 30B A3B FP8
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 40960
+            max_model_len: 8192
+      - model: Qwen3 30B A3B Thinking FP8
        mad_tag: pyt_vllm_qwen3-30b-a3b_fp8
-        model_repo: Qwen/Qwen3-30B-A3B-FP8
-        url: https://huggingface.co/Qwen/Qwen3-30B-A3B-FP8
+        model_repo: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
+        url: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
        precision: float16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 40960
-          max_model_len: 8192
-      - model: Qwen3 235B A22B
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 40960
+            max_model_len: 8192
+      - model: Qwen3 235B A22B Thinking
        mad_tag: pyt_vllm_qwen3-235b-a22b
-        model_repo: Qwen/Qwen3-235B-A22B
-        url: https://huggingface.co/Qwen/Qwen3-235B-A22B
+        model_repo: Qwen/Qwen3-235B-A22B-Thinking-2507
+        url: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
        precision: float16
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 40960
-          max_model_len: 8192
-      - model: Qwen3 235B A22B FP8
+          serving: *qwen3-235b-a22b-serving
+          accuracy: *qwen3-235b-a22b-accuracy
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 40960
+            max_model_len: 8192
+      - model: Qwen3 235B A22B Thinking FP8
        mad_tag: pyt_vllm_qwen3-235b-a22b_fp8
-        model_repo: Qwen/Qwen3-235B-A22B-FP8
-        url: https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8
+        model_repo: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
+        url: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
        precision: float8
        config:
-          tp: 8
-          dtype: auto
-          kv_cache_dtype: fp8
-          max_num_batched_tokens: 40960
-          max_model_len: 8192
+          serving: *qwen3-235b-a22b-serving
+          accuracy: *qwen3-235b-a22b-accuracy
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 40960
+            max_model_len: 8192
  - group: Microsoft Phi
    tag: phi
    models:
@@ -309,8 +618,8 @@ model_groups:
        url: https://huggingface.co/microsoft/phi-4
        precision: float16
        config:
-          tp: 1
-          dtype: auto
-          kv_cache_dtype: auto
-          max_num_batched_tokens: 16384
-          max_model_len: 8192
+          serving: *llama-3-8b-phi-4-qwen3-serving
+          ex:
+            kv_cache_dtype: auto
+            max_num_batched_tokens: 16384
+            max_model_len: 8192
--- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst
+++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst
@@ -189,6 +189,10 @@ Benchmarking
   {% for model_group in model_groups %}
      {% for model in model_group.models %}

+      {% set serv_config = model.config.serving %}
+      {% set acc_config = model.config.accuracy %}
+      {% set ex_config = model.config.ex %}
+
   .. container:: model-doc {{model.mad_tag}}

      .. tab-set::
@@ -283,108 +287,173 @@ Benchmarking
                   --name test \
                   {{ docker.pull_tag }}

-            .. rubric:: Throughput command
+            .. rubric:: Run the inference benchmarks

-            Use the following command to start the throughput benchmark.
+            .. tab-set::

-            .. code-block:: shell
+               .. tab-item:: Latency command

-               model={{ model.model_repo }}
-               tp={{ model.config.tp }}
-               num_prompts={{ model.config.num_prompts | default(1024) }}
-               in={{ model.config.in | default(128) }}
-               out={{ model.config.in | default(128) }}
-               dtype={{ model.config.dtype | default("auto") }}
-               kv_cache_dtype={{ model.config.kv_cache_dtype }}
-               max_num_seqs={{ model.config.max_num_seqs | default(1024) }}
-               max_num_batched_tokens={{ model.config.max_num_batched_tokens }}
-               max_model_len={{ model.config.max_model_len }}
+                  Use the following command to start the latency benchmark.

-               vllm bench throughput --model $model \
-                   -tp $tp \
-                   --num-prompts $num_prompts \
-                   --input-len $in \
-                   --output-len $out \
-                   --dtype $dtype \
-                   --kv-cache-dtype $kv_cache_dtype \
-                   --max-num-seqs $max_num_seqs \
-                   --max-num-batched-tokens $max_num_batched_tokens \
-                   --max-model-len $max_model_len \
-                   --trust-remote-code \
-                   --output-json ${model}_throughput.json \
-                   --gpu-memory-utilization {{ model.config.gpu_memory_utilization | default(0.9) }}
+                  .. code-block:: shell

-            .. rubric:: Serving command
+                     model={{ model.model_repo }}
+                     tp={{ serv_config.tp }}
+                     batch_size=16
+                     in={{ serv_config.inp | default(1024) }}
+                     out={{ serv_config.out | default(1024) }}
+                     dtype={{ serv_config.dtype | default("auto") }}
+                     kv_cache_dtype={{ ex_config.kv_cache_dtype | default("auto") }}
+                     max_num_seqs={{ ex_config.max_num_seqs | default(1024) }}
+                     max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
+                     max_model_len={{ ex_config.max_model_len }}

-            1. Start the server using the following command:
+                     vllm bench latency --model $model \
+                         -tp $tp \
+                         --batch-size $batch_size \
+                         --input-len $in \
+                         --output-len $out \
+                         --dtype $dtype \
+                         --kv-cache-dtype $kv_cache_dtype \
+                         --max-num-seqs $max_num_seqs \
+                         --max-num-batched-tokens $max_num_batched_tokens \
+                         --max-model-len $max_model_len \
+                         --output-json ${model}_latency.json

-               .. code-block:: shell
+               .. tab-item:: Throughput command

-                  model={{ model.model_repo }}
-                  tp={{ model.config.tp }}
-                  dtype={{ model.config.dtype }}
-                  kv_cache_dtype={{ model.config.kv_cache_dtype }}
-                  max_num_seqs=256
-                  max_num_batched_tokens={{ model.config.max_num_batched_tokens }}
-                  max_model_len={{ model.config.max_model_len }}
+                  Use the following command to start the throughput benchmark.

-                  vllm serve $model \
-                      -tp $tp \
-                      --dtype $dtype \
-                      --kv-cache-dtype $kv_cache_dtype \
-                      --max-num-seqs $max_num_seqs \
-                      --max-num-batched-tokens $max_num_batched_tokens \
-                      --max-model-len $max_model_len \
-                      --no-enable-prefix-caching \
-                      --swap-space 16 \
-                      --disable-log-requests \
-                      --trust-remote-code \
-                      --gpu-memory-utilization 0.9
+                  .. code-block:: shell

-               Wait until the model has loaded and the server is ready to accept requests.
+                     model={{ model.model_repo }}
+                     tp={{ serv_config.tp }}
+                     num_prompts={{ model.config.num_prompts | default(1024) }}
+                     in={{ serv_config.inp | default(1024) }}
+                     out={{ serv_config.out | default(1024) }}
+                     dtype={{ serv_config.dtype | default("auto") }}
+                     kv_cache_dtype={{ ex_config.kv_cache_dtype | default("auto") }}
+                     max_num_seqs={{ ex_config.max_num_seqs | default(1024) }}
+                     max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
+                     max_model_len={{ ex_config.max_model_len }}

-            2. On another terminal on the same machine, run the benchmark:
+                     vllm bench throughput --model $model \
+                         -tp $tp \
+                         --num-prompts $num_prompts \
+                         --input-len $in \
+                         --output-len $out \
+                         --dtype $dtype \
+                         --kv-cache-dtype $kv_cache_dtype \
+                         --max-num-seqs $max_num_seqs \
+                         --max-num-batched-tokens $max_num_batched_tokens \
+                         --max-model-len $max_model_len \
+                         --trust-remote-code \
+                         --output-json ${model}_throughput.json \
+                         --gpu-memory-utilization {{ model.config.gpu_memory_utilization | default(0.9) }}

-               .. code-block:: shell
+               .. tab-item:: Serving command

-                  # Connect to the container
-                  docker exec -it test bash
+                  1. Start the server using the following command:

-                  # Wait for the server to start
-                  until curl -s http://localhost:8000/v1/models; do sleep 30; done
+                     .. code-block:: shell

-                  # Run the benchmark
-                  model={{ model.model_repo }}
-                  max_concurrency=1
-                  num_prompts=10
-                  in=128
-                  out=128
-                  vllm bench serve --model $model \
-                      --percentile-metrics "ttft,tpot,itl,e2el" \
-                      --dataset-name random \
-                      --ignore-eos \
-                      --max-concurrency $max_concurrency \
-                      --num-prompts $num_prompts \
-                      --random-input-len $in \
-                      --random-output-len $out \
-                      --trust-remote-code \
-                      --save-result \
-                      --result-filename ${model}_serving.json
+                        model={{ model.model_repo }}
+                        tp={{ serv_config.tp }}
+                        dtype={{ serv_config.dtype }}
+                        kv_cache_dtype={{ ex_config.kv_cache_dtype }}
+                        max_num_seqs=1024
+                        max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
+                        max_model_len={{ ex_config.max_model_len }}

-            .. note::
+                        vllm serve $model \
+                            -tp $tp \
+                            --dtype $dtype \
+                            --kv-cache-dtype $kv_cache_dtype \
+                            --max-num-seqs $max_num_seqs \
+                            --max-num-batched-tokens $max_num_batched_tokens \
+                            --max-model-len $max_model_len \
+                            --no-enable-prefix-caching \
+                            --swap-space 16 \
+                            --disable-log-requests

-               For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B,
-               try adding ``export VLLM_ROCM_USE_AITER=1`` to your commands.
+                     Wait until the model has loaded and the server is ready to accept requests.

-               If you encounter the following error, pass your access-authorized Hugging
-               Face token to the gated models.
+                  2. On another terminal on the same machine, run the benchmark:

-               .. code-block::
+                     .. code-block:: shell

-                  OSError: You are trying to access a gated repo.
+                        # Connect to the container
+                        docker exec -it test bash

-                  # pass your HF_TOKEN
-                  export HF_TOKEN=$your_personal_hf_token
+                        # Wait for the server to start
+                        until curl -s http://localhost:8000/v1/models; do sleep 30; done
+
+                        # Run the benchmark
+                        model={{ model.model_repo }}
+                        max_concurrency=1
+                        num_prompts=10
+                        in={{ serv_config.inp | default("1024") }}
+                        out={{ serv_config.out | default("1024") }}
+                        vllm bench serve --model $model \
+                            --percentile-metrics "ttft,tpot,itl,e2el" \
+                            --dataset-name random \
+                            --ignore-eos \
+                            --max-concurrency $max_concurrency \
+                            --num-prompts $num_prompts \
+                            --random-input-len $in \
+                            --random-output-len $out \
+                            --trust-remote-code \
+                            --save-result \
+                            --result-filename ${model}_serving.json
+
+               {% if acc_config %}
+               .. tab-item:: Accuracy command
+
+                  1. Start the server using the following command:
+
+                     .. code-block:: shell
+
+                        model={{ model.model_repo }}
+                        tp={{ acc_config.tp }}
+                        dtype={{ acc_config.dtype }}
+                        kv_cache_dtype={{ ex_config.kv_cache_dtype }}
+                        max_num_seqs=1024
+                        max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
+                        max_model_len={{ ex_config.max_model_len }}
+
+                        vllm serve $model \
+                            -tp $tp \
+                            --dtype $dtype \
+                            --kv-cache-dtype $kv_cache_dtype \
+                            --max-num-seqs $max_num_seqs \
+                            --max-num-batched-tokens $max_num_batched_tokens \
+                            --max-model-len $max_model_len \
+                            --no-enable-prefix-caching \
+                            --swap-space 16 \
+                            --disable-log-requests
+
+                     Wait until the model has loaded and the server is ready to accept requests.
+
+                  2. On another terminal on the same machine, run the benchmark:
+
+                     .. code-block:: shell
+
+                        # Connect to the container
+                        docker exec -it test bash
+
+                        # Wait for the server to start
+                        until curl -s http://localhost:8000/v1/models; do sleep 30; done
+
+                        # Install lm-eval
+                        pip install "lm-eval[api]"
+
+                        # Run the benchmark
+                        model={{ acc_config.model }}
+                        lm_eval --model local-completions \
+                            --model_args model=$model,max_gen_toks=2048,num_concurrent=256,max_retries=10,base_url=http://localhost:8000/v1/completions \
+                            --tasks gsm8k --limit 250 --output_path ./tmp
+
+               {% endif %}

            .. raw:: html

--- a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst
@@ -285,7 +285,7 @@ tweak some configurations (such as batch sizes).

                     .. code-block:: shell

-                        EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
+                        EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
                        bash examples/run_pretrain.sh

                  .. tab-item:: MI325X
--- a/docs/sphinx/requirements.in
+++ b/docs/sphinx/requirements.in
@@ -1,4 +1,4 @@
-rocm-docs-core==1.31.1
+rocm-docs-core==1.31.2
 sphinx-reredirects
 sphinx-sitemap
 sphinxcontrib.datatemplates==0.11.0
--- a/docs/sphinx/requirements.txt
+++ b/docs/sphinx/requirements.txt
@@ -19,11 +19,11 @@ babel==2.17.0
    # via
    #   pydata-sphinx-theme
    #   sphinx
-beautifulsoup4==4.14.2
+beautifulsoup4==4.14.3
    # via pydata-sphinx-theme
 breathe==4.36.0
    # via rocm-docs-core
-certifi==2025.11.12
+certifi==2026.1.4
    # via requests
 cffi==2.0.0
    # via
@@ -39,7 +39,7 @@ comm==0.2.3
    # via ipykernel
 cryptography==46.0.3
    # via pyjwt
-debugpy==1.8.17
+debugpy==1.8.19
    # via ipykernel
 decorator==5.2.1
    # via ipython
@@ -60,21 +60,21 @@ fastjsonschema==2.21.2
    #   rocm-docs-core
 gitdb==4.0.12
    # via gitpython
-gitpython==3.1.45
+gitpython==3.1.46
    # via rocm-docs-core
-greenlet==3.2.4
+greenlet==3.3.0
    # via sqlalchemy
 idna==3.11
    # via requests
 imagesize==1.4.1
    # via sphinx
-importlib-metadata==8.7.0
+importlib-metadata==8.7.1
    # via
    #   jupyter-cache
    #   myst-nb
 ipykernel==7.1.0
    # via myst-nb
-ipython==8.37.0
+ipython==8.38.0
    # via
    #   ipykernel
    #   myst-nb
@@ -84,13 +84,13 @@ jinja2==3.1.6
    # via
    #   myst-parser
    #   sphinx
-jsonschema==4.25.1
+jsonschema==4.26.0
    # via nbformat
 jsonschema-specifications==2025.9.1
    # via jsonschema
 jupyter-cache==1.0.1
    # via myst-nb
-jupyter-client==8.6.3
+jupyter-client==8.8.0
    # via
    #   ipykernel
    #   nbclient
@@ -118,7 +118,7 @@ myst-nb==1.3.0
    # via rocm-docs-core
 myst-parser==4.0.1
    # via myst-nb
-nbclient==0.10.2
+nbclient==0.10.4
    # via
    #   jupyter-cache
    #   myst-nb
@@ -138,11 +138,11 @@ parso==0.8.5
    # via jedi
 pexpect==4.9.0
    # via ipython
-platformdirs==4.5.0
+platformdirs==4.5.1
    # via jupyter-core
 prompt-toolkit==3.0.52
    # via ipython
-psutil==7.1.3
+psutil==7.2.1
    # via ipykernel
 ptyprocess==0.7.0
    # via pexpect
@@ -188,9 +188,9 @@ requests==2.32.5
    # via
    #   pygithub
    #   sphinx
-rocm-docs-core==1.31.1
+rocm-docs-core==1.31.2
    # via -r requirements.in
-rpds-py==0.29.0
+rpds-py==0.30.0
    # via
    #   jsonschema
    #   referencing
@@ -200,7 +200,7 @@ smmap==5.0.2
    # via gitdb
 snowballstemmer==3.0.1
    # via sphinx
-soupsieve==2.8
+soupsieve==2.8.1
    # via beautifulsoup4
 sphinx==8.1.3
    # via
@@ -250,15 +250,15 @@ sphinxcontrib-runcmd==0.2.0
    # via sphinxcontrib-datatemplates
 sphinxcontrib-serializinghtml==2.0.0
    # via sphinx
-sqlalchemy==2.0.44
+sqlalchemy==2.0.45
    # via jupyter-cache
 stack-data==0.6.3
    # via ipython
 tabulate==0.9.0
    # via jupyter-cache
-tomli==2.3.0
+tomli==2.4.0
    # via sphinx
-tornado==6.5.2
+tornado==6.5.4
    # via
    #   ipykernel
    #   jupyter-client
@@ -282,7 +282,7 @@ typing-extensions==4.15.0
    #   pygithub
    #   referencing
    #   sqlalchemy
-urllib3==2.5.0
+urllib3==2.6.3
    # via
    #   pygithub
    #   requests
Author	SHA1	Message	Date
David Dixon	6e5194a9ba	Explicitly use gfortran	2026-01-15 17:19:09 -07:00
David Dixon	b5a455bd71	Is the omp fortran backend coming from openblas	2026-01-15 17:06:17 -07:00
David Dixon	957b556e75	add aomp	2026-01-15 16:46:18 -07:00
David Dixon	1eac108411	Correct placement of cmake template	2026-01-15 16:32:30 -07:00
David Dixon	4b888d0025	fix sequencing	2026-01-15 16:20:39 -07:00
David Dixon	ea22ab2d7a	Use newer cmake version	2026-01-15 16:05:46 -07:00
peterjunpark	a745e45dcb	Doc update for vLLM refactor #5855	2026-01-15 11:21:38 -05:00
alexxu-amd	8beac1891f	update requirements.txt (#5851)	2026-01-14 16:55:26 -05:00
anisha-amd	773f5de407	Docs: Ray release 25.12 and compatibility version format standardization (#5845)	2026-01-08 12:09:11 -05:00
dependabot[bot]	b297ced032	Bump urllib3 from 2.5.0 to 2.6.3 in /docs/sphinx (#5842) Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.5.0 to 2.6.3. - [Release notes](https://github.com/urllib3/urllib3/releases) - [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst) - [Commits](https://github.com/urllib3/urllib3/compare/2.5.0...2.6.3) --- updated-dependencies: - dependency-name: urllib3 dependency-version: 2.6.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-01-08 08:22:01 -05:00
peterjunpark	2dc22ca890	fix(primus-pytorch.rst): FP8 config instead of BF16 (#5839)	2026-01-07 13:49:31 -05:00
Joseph Macaranas	85102079ed	[External CI] Add SIMDe dev package to HIP runtime pipeline (#5838)	2026-01-07 11:00:38 -05:00