Mirror of https://github.com/ROCm/ROCm.git, synced 2026-02-01 01:45:18 -05:00

Compare commits: 14 commits on branch `hipdnn` (yugang-amd)

* 69360270f5
* 2da4c460ad
* d1165b7359
* f6652b4fad
* aa864ee964
* 0a834eff9e
* 380898f4a8
* fc6332c6b3
* a85735d430
* c6bf8d2054
* 76570de120
* fbd90eccfc
* 0c74bc889f
* 599328c44e
@@ -62,14 +62,9 @@ parameters:
- name: rocmDependencies
  type: object
  default:
    - AMDMIGraphX
    - clr
    - half
    - hipBLAS-common
    - hipBLASLt
    - llvm-project
    - MIOpen
    - rocBLAS
    - rocDecode
    - rocm-cmake
    - rocminfo
@@ -82,12 +77,7 @@ parameters:
    - aomp
    - clr
    - half
    - hipBLAS-common
    - hipBLASLt
    - llvm-project
    - MIOpen
    - rocBLAS
    - rocprofiler-register
    - ROCR-Runtime
    - roctracer
    - rpp

@@ -71,6 +71,7 @@ parameters:
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
  - job: ${{ parameters.componentName }}_build_${{ job.target }}
    timeoutInMinutes: 120
    variables:
    - group: common
    - template: /.azuredevops/variables-global.yml

@@ -47,8 +47,10 @@ parameters:
  type: object
  default:
    - nanobind>=2.0.0
    - numpy
    - pytest
    - pytest-cov
    - torch
- name: rocmDependencies
  type: object
  default:
@@ -101,8 +103,7 @@ jobs:
    - template: /.azuredevops/variables-global.yml
    - name: ROCM_PATH
      value: $(Agent.BuildDirectory)/rocm
    pool:
      vmImage: ${{ variables.BASE_BUILD_POOL }}
    pool: ${{ variables.MEDIUM_BUILD_POOL }}
    ${{ if eq(job.os, 'almalinux8') }}:
      container:
        image: rocmexternalcicd.azurecr.io/manylinux228:latest
@@ -239,7 +240,7 @@ jobs:
        targetType: inline
        workingDirectory: build
        script: |
          cmake --build . --target origami-tests origami_python -- -j$(nproc)
          cmake --build . --target origami-tests _pyorigami -- -j$(nproc)
    - template: ${{ variables.CI_TEMPLATE_PATH }}/steps/gpu-diagnostics.yml
    # Run tests using CTest (discovers and runs both C++ and Python tests)
    - template: ${{ variables.CI_TEMPLATE_PATH }}/steps/test.yml
@@ -39,6 +39,7 @@ autograd
Backported
BARs
BatchNorm
BKC
BLAS
BMC
BabelStream
@@ -53,6 +54,7 @@ CDNA
CGUI
CHTML
CIFAR
CNP
CLI
CLion
CMake
@@ -96,6 +98,7 @@ Dashboarding
Dataloading
dataflows
DBRX
DCQCN
DDR
DF
DGEMM
@@ -110,8 +113,10 @@ DMA
DOMContentLoaded
DNN
DNNL
DOCA
DPM
DRI
DSCP
DW
DWORD
Dask
@@ -127,7 +132,9 @@ Deprecations
DevCap
DirectX
Disaggregated
disagg
disaggregated
disaggregation
Dockerfile
Dockerized
Doxygen
@@ -179,6 +186,8 @@ GFLOPS
GFortran
GFXIP
GGUF
GID
Gbps
Gemma
GiB
GIM
@@ -248,6 +257,7 @@ IOP
IOPS
IOPM
IOV
IPs
IRQ
ISA
ISV
@@ -312,6 +322,7 @@ MNIST
MPI
MPT
MSVC
MTU
mul
MVAPICH
MVFFR
@@ -334,6 +345,7 @@ MLA
MosaicML
MoEs
Mooncake
MoRI
Mpops
Multicore
multihost
@@ -403,16 +415,21 @@ PEQT
PIL
PILImage
PJRT
PLDM
POR
PRNG
PRs
PSID
PTPC
PaLM
Pageable
PeerDirect
Pensando
PerfDb
Perfetto
PipelineParallel
PnP
Pollara
PowerEdge
PowerShell
Pretrained
@@ -424,6 +441,7 @@ Pytest
PyTorch
QPS
Qcycles
QoS
Qwen
RAII
RAS
@@ -457,6 +475,7 @@ RPP
RST
RW
Radeon
Redfish
RelWithDebInfo
Req
Rickle
@@ -724,6 +743,7 @@ enqueue
env
epilog
etcetera
eth
ethernet
exascale
executables
@@ -819,6 +839,7 @@ llvm
lm
localscratch
logits
loopback
lossy
macOS
matchers
@@ -844,6 +865,7 @@ nanoGPT
NCS
NOP
NVLink
netplan
num
numref
ocl
@@ -911,6 +933,7 @@ rc
rccl
rdc
rdma
reachability
reStructuredText
redirections
refactorization
@@ -980,6 +1003,7 @@ shader
sharding
sigmoid
sles
slurm
sm
smi
softmax
@@ -199,7 +199,7 @@ for a complete overview of this release.
* fftw_execute_dft_c2r
* fftwf_execute_dft_c2r

### **HIPIFY** (22.2.0)
### **HIPIFY** (22.0.0)

#### Added

@@ -279,6 +279,10 @@ for a complete overview of this release.

* Updated clang/llvm to AMD clang version 22.0.0 (equivalent to LLVM 22.0.0 with additional out-of-tree patches).

#### Upcoming changes

* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It’s recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn’t any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.

### **MIGraphX** (2.15.0)

#### Added

RELEASE.md
@@ -139,7 +139,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
</td>
</tr>
<tr>
<td>MI300X</td>
<td>MI300X<a href="#footnote1"><sup>[2]</sup></a></td>
<td>01.25.03.12</td>
<td rowspan="6" style="vertical-align: middle;">
30.30.0<br>
@@ -180,6 +180,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
</div>

<p id="footnote1">[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.</p>
<p id="footnote1">[2]: For AMD Instinct MI300X KVM SR-IOV with Multi-VF (8 VF) support requires a compatible firmware BKC bundle which will be released in coming months.</p>

#### Node power management for multi-GPU nodes added

@@ -245,7 +246,7 @@ New Stream Management API `hipStreamCopyAttributes` is implemented for CUDA Pari

The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through a RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication.

In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/latest/install.html).
In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/docs-7.2.0/install.html).

### Software-managed plan cache support for hipTensor

@@ -285,7 +286,7 @@ MIGraphX has the following enhancements:

### AMDGPU wavefront size macro removal

The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize).
The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/hip_cpp_language_extensions.html#warpsize).
For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of `__AMDGCN_WAVEFRONT_SIZE` or `__AMDGCN_WAVEFRONT_SIZE__` can be replaced with a user-defined macro or `constexpr` variable with the wavefront size(s) for the target hardware. For example:

```
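// Illustrative sketch only -- the diff excerpt truncates the release notes' actual example here.
// Pick a compile-time wavefront size for the targets you build for: CDNA targets
// (for example, gfx90a or gfx942) use wave64, while RDNA targets default to wave32.
#if defined(__gfx90a__) || defined(__gfx942__)
constexpr int myWavefrontSize = 64;
#else
constexpr int myWavefrontSize = 32;
#endif
```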

@@ -349,13 +350,13 @@ The ROCm Offline Installer Creator 7.2.0 includes the following features and imp
* Fixes for Oracle Linux 10.0 ROCm and driver minimum mode installer creation.
* Added support for creating an offline installer for Oracle Linux 8, 9, and 10, where the kernel version of the target OS differs from the host OS creating the installer.

See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-offline-installer.html) for more information.
See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) for more information.

### ROCm Runfile Installer updates

The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues.

For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-runfile-installer.html).
For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html).

### Expansion of the ROCm examples repository

@@ -375,7 +376,7 @@ Usage examples are now available for the [ROCgdb](https://github.com/ROCm/rocm-e

ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.

* The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/latest/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/latest/index.html) set continues to provide detailed information, tutorials, and reference content.
* The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/docs-7.2.0/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/index.html) set continues to provide detailed information, tutorials, and reference content.

* The HIP Programming Guide section includes a new topic titled [“Understanding GPU performance”](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/performance_optimization.html). It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: [Performance guidelines](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/performance_guidelines.html) and [Hardware implementation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/hardware_implementation.html).

@@ -653,7 +654,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
<th rowspan="5"></th>
<th rowspan="5">Development</th>
<td><a href="https://rocm.docs.amd.com/projects/HIPIFY/en/docs-7.2.0/index.html">HIPIFY</a></td>
<td>20.0.0 ⇒ <a href="#hipify-22-2-0">22.2.0</td>
<td>20.0.0 ⇒ <a href="#hipify-22-0-0">22.0.0</td>
<td><a href="https://github.com/ROCm/HIPIFY/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>

@@ -913,7 +914,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
* fftw_execute_dft_c2r
* fftwf_execute_dft_c2r

### **HIPIFY** (22.2.0)
### **HIPIFY** (22.0.0)

#### Added

@@ -995,7 +996,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid

#### Upcoming changes

* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It’s recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn’t any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/docs-7.2.0/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/docs-7.2.0/index.html). It’s recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn’t any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.

### **MIGraphX** (2.15.0)

@@ -1432,7 +1433,11 @@ python3 -m pip install --user .
`sudo` might be required. Use flag `--break-system-packages` if `pip un/installation` fails.
```

For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release.
For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release. See [GitHub issue #5875](https://github.com/ROCm/ROCm/issues/5875).

### Intermittent errors when running JAX workloads

You might experience intermittent errors or segmentation faults when running JAX workloads. The issue is currently under investigation and will be addressed in an upcoming ROCm release. See [GitHub issue #5878](https://github.com/ROCm/ROCm/issues/5878).

### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs

@@ -1477,7 +1482,7 @@ The following changes to the ROCm software stack are anticipated for future rele

### ROCm Offline Installer Creator deprecation

The ROCm Offline Installer Creator is deprecated with the ROCm 7.2.0 release. Equivalent installation capabilities are available through the ROCm Runfile Installer, a self-extracting installer that is not based on OS package managers. This installer will be removed in a future release.
The [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) is deprecated with the ROCm 7.2.0 release and will be removed in a future release. Equivalent installation capabilities are available through the [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html), a self-extracting installer that is not based on OS package managers.

### ROCm SMI deprecation
default.xml
@@ -1,13 +1,12 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <remote name="rocm-org" fetch="https://github.com/ROCm/" />
  <default revision="refs/tags/rocm-7.1.1"
  <default revision="refs/tags/rocm-7.2.0"
           remote="rocm-org"
           sync-c="true"
           sync-j="4" />
  <!--list of projects for ROCm-->
  <project name="ROCK-Kernel-Driver" />
  <project name="amdsmi" />
  <project name="rocm_bandwidth_test" />
  <project name="rocm-examples" />
  <!--HIP Projects-->
@@ -25,30 +24,16 @@
  <project groups="mathlibs" name="MIVisionX" />
  <project groups="mathlibs" name="ROCmValidationSuite" />
  <project groups="mathlibs" name="composable_kernel" />
  <project groups="mathlibs" name="hipSOLVER" />
  <project groups="mathlibs" name="hipTensor" />
  <project groups="mathlibs" name="hipfort" />
  <project groups="mathlibs" name="rccl" />
  <project groups="mathlibs" name="rocAL" />
  <project groups="mathlibs" name="rocALUTION" />
  <project groups="mathlibs" name="rocDecode" />
  <project groups="mathlibs" name="rocJPEG" />
  <!-- The following components have been migrated to rocm-libraries:
       hipBLAS-common hipBLAS hipBLASLt hipCUB
       hipFFT hipRAND hipSPARSE hipSPARSELt
       MIOpen rocBLAS rocFFT rocPRIM rocRAND
       rocSPARSE rocThrust Tensile -->
  <project groups="mathlibs" name="rocm-libraries" />
  <!-- The following components have been migrated to rocm-systems:
       aqlprofile clr hip hip-tests hipother
       rdc rocm-core rocm_smi_lib rocminfo rocprofiler-compute
       rocprofiler-register rocprofiler-sdk rocprofiler-systems
       rocprofiler rocr-runtime roctracer -->
  <project groups="mathlibs" name="rocm-systems" />
  <project groups="mathlibs" name="rocPyDecode" />
  <project groups="mathlibs" name="rocSOLVER" />
  <project groups="mathlibs" name="rocSHMEM" />
  <project groups="mathlibs" name="rocWMMA" />
  <project groups="mathlibs" name="rocm-cmake" />
  <project groups="mathlibs" name="rpp" />
  <project groups="mathlibs" name="TransferBench" />
@@ -56,4 +41,4 @@
  <project name="aomp" path="openmp-extras/aomp" />
  <project name="aomp-extras" path="openmp-extras/aomp-extras" />
  <project name="flang" path="openmp-extras/flang" />
</manifest>
@@ -39,6 +39,7 @@ additional licenses. Please review individual repositories for more information.
| [hipBLASLt](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt/) | [MIT](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipblaslt/LICENSE.md) |
| [HIPCC](https://github.com/ROCm/llvm-project/tree/amd-staging/amd/hipcc) | [MIT](https://github.com/ROCm/llvm-project/blob/amd-staging/amd/hipcc/LICENSE.txt) |
| [hipCUB](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipcub/) | [Custom](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipcub/LICENSE.txt) |
| [hipDNN](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipdnn/) | [MIT](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipdnn/LICENSE.md) |
| [hipFFT](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipfft/) | [MIT](https://github.com/ROCm/rocm-libraries/blob/develop/projects/hipfft/LICENSE.md) |
| [hipfort](https://github.com/ROCm/hipfort/) | [MIT](https://github.com/ROCm/hipfort/blob/develop/LICENSE) |
| [HIPIFY](https://github.com/ROCm/HIPIFY/) | [MIT](https://github.com/ROCm/HIPIFY/blob/amd-staging/LICENSE.txt) |
@@ -171,6 +171,7 @@ Operating systems, kernel and Glibc versions
*********************************************

For detailed information on the operating systems supported on ROCm 7.2.0 and the associated kernel and Glibc versions, see the latest :ref:`supported_distributions`. For version-specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.

.. note::

   * See `Red Hat Enterprise Linux Release Dates <https://access.redhat.com/articles/3078>`_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL).
@@ -163,7 +163,6 @@ article_pages = [
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]},
@@ -193,11 +192,16 @@ article_pages = [
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.11.1-20251103", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.10", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.11", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.12", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.13", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]},
@@ -130,7 +130,7 @@ After loading the model in this way, the model is fully ready to use the resourc

torchtune for fine-tuning and inference
=============================================

`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
`torchtune <https://meta-pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs.

#. Install torchtune using pip.
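
   A minimal sketch of this step (assuming the standard PyPI package name):

   .. code-block:: bash

      pip install torchtune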

@@ -25,6 +25,5 @@ In this guide, you'll learn how to use ROCm for AI:

- :doc:`Inference optimization <inference-optimization/index>`

To learn about ROCm for HPC applications and scientific computing, see
:doc:`../rocm-for-hpc/index`.
@@ -0,0 +1,904 @@

# SGLang distributed inference with MoRI

This document provides a comprehensive guide for deploying a high-performance
SGLang distributed inference serving environment on an AMD Instinct MI355X GPU
cluster, utilizing the [MoRI (Modular RDMA
Interface)](https://github.com/rocm/mori) communication backend for optimized
inter-node collective operations. It also includes systematic instructions for
benchmarking 1P2D (1 prefill, 2 decode; 3 nodes) configurations using automated
scripts.

## Prerequisites

The following configuration is required to implement this setup:

* **Nodes:** A minimum of three GPU nodes (virtual machines or physical
  machines) for wide expert parallelism (EP) evaluation.
* **GPUs:** 8x AMD Instinct MI355X GPU cards per node.
* **Networking:** 8x AMD Pensando™ Pollara 400 AI NICs per node, providing
  a dedicated 1:1 mapping between GPUs and network interfaces for optimal
  inter-node communication.
* **Orchestration:** A Slurm cluster with at least three nodes: one for the
  prefill service and two for decode services (EP16).

## System configuration

This section outlines the infrastructure setup required to support your AMD
Instinct MI355X cluster. It covers essential procedures for verifying software
baselines and firmware versions, configuring the AMD Pensando Pollara 400 AI
NICs for high-bandwidth networking, and applying thermal and Quality of Service
(QoS) tunings to ensure a stable, lossless RDMA fabric.

(sglang-mori-verify-baseline)=

### Verify baseline software

The following table outlines the validated software stack. Use the provided
shell commands to verify the environment on each node before proceeding.

| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 22.04.5 LTS | `cat /etc/os-release` |
| **Kernel** | 5.15.0-163-generic | `uname -r` |
| **ROCm** | 7.1.1 | `amd-smi version` |
| **PLDM bundle (firmware)** | 01.25.16.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **AI NIC Firmware** | 1.117.5.a.45 | `sudo nicctl show version firmware` |
| **AI NIC Driver** | 25.11.1.001 | `dkms status` |

### Verify best known configuration (BKC)

The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain the best performance and compatibility.

While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.

To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:

1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
   firmware inventory.

   ```bash
   # Define BMC connection variables
   BMC_IP="<BMC_IP>"
   AUTH="<username>:<password>"

   # Query active BKC bundle version
   curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
     -u "${AUTH}" -k | json_pp

   # Query active IFWI (Integrated Firmware Image)
   curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
     -u "${AUTH}" -k | json_pp
   ```

### Run basic system health checks

Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI355X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi355x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, PCIe Gen5 link stability, and so on.
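
As a quick spot check before running the full acceptance suite, the following commands (an illustrative sketch, not part of the official checklist) surface a few of these items:

```bash
cat /proc/cmdline                                # confirm the required kernel boot arguments are present
free -g | awk '/Mem:/ {print $2 " GiB total"}'   # check total system memory against the minimum threshold
sudo lspci -d 1002: -vv | grep -i 'LnkSta:'      # inspect PCIe link speed/width for AMD devices
```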

### Install AMD Pensando Pollara 400 AI NIC drivers

For detailed instructions on upgrading the firmware and installing drivers for
the AMD Pensando Pollara 400 AI NIC, refer to the [AMD Instinct System
Acceptance
Guide](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/network/nic-installation.html#amd-pensando-pollara-400-ai-nic).
After installation, verify the active firmware version on all NICs to ensure it
matches the software baseline. See [Verify baseline software](#sglang-mori-verify-baseline).

To display the current firmware version for all AI NICs, use the following command.

```bash
sudo nicctl show version firmware
```

### Configure thermal management (fan speed)

For systems equipped with 400G optics, standard fan profiles are often
insufficient for maintaining stable operating temperatures. To prevent thermal
throttling or optics failure, the system fans must be set to `FullSpeed`.

* Requirement: A fan speed of approximately 25,000 RPM is required to maintain
  the AI NIC modules at an optimal operating temperature (~50°C).

* Constraint: Default profiles (typically around 4,000 RPM) and "Performance IO"
  settings (around 9,000 RPM) do not provide adequate airflow for 400G optical
  transceivers.
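
To read the current fan speeds locally before and after changing the profile, `ipmitool` can query the BMC's standard sensor records (an optional check; it assumes `ipmitool` is installed and the BMC exposes fan sensors):

```bash
sudo apt install -y ipmitool
sudo ipmitool sdr type fan   # readings should approach ~25,000 RPM in FullSpeed mode
```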

#### Configure fan speed via Redfish (Supermicro)

Run the following command to set the fan mode to `FullSpeed` through the BMC:

```bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"

# Set Fan Mode to FullSpeed
curl -X PATCH "https://${BMC_IP}/redfish/v1/Managers/1/Oem/Supermicro/FanMode" \
  -k -u "${AUTH}" \
  -H "Content-Type: application/json" \
  -d '{"Mode": "FullSpeed"}'
```

### Configure your backend network (netplan)

Configure the backend NICs for high-bandwidth inter-node communication. Suppose
the node's eight network interface controllers (NICs) are `benic1p1` to
`benic8p1`. Each NIC must have its own subnet that is disjoint from the others.
Each node needs a unique IP address on each subnet. You should use the same
final octet in each subnet for a given node. For example, one node would have
the addresses `192.168.1.36`, `192.168.2.36`, and so on. Another node would
have `192.168.1.37`, `192.168.2.37`, and so on. Ensure MTU is set to `9000`.

```{note}
Ensure you identify the correct interface names for your system using `ip link`
before applying this configuration.
```

For example, your `/etc/netplan/70-backend.yaml` should look like the
following:

```yaml
network:
  ethernets:
    benic8p1:
      addresses:
        - 192.168.8.38/31
      match:
        macaddress: 04:90:81:2a:34:08
      mtu: 9000
      routes:
        - table: 108
          to: 0.0.0.0/0
          via: 192.168.8.39
      routing-policy:
        - from: 192.168.8.38
          table: 108
      set-name: benic8p1
    benic7p1:
      addresses:
        - 192.168.7.38/31
      match:
        macaddress: 04:90:81:2b:82:40
      mtu: 9000
      routes:
        - table: 107
          to: 0.0.0.0/0
          via: 192.168.7.39
      routing-policy:
        - from: 192.168.7.38
          table: 107
      set-name: benic7p1
    benic6p1:
      addresses:
        - 192.168.6.38/31
      match:
        macaddress: 04:90:81:30:c9:30
      mtu: 9000
      routes:
        - table: 106
          to: 0.0.0.0/0
          via: 192.168.6.39
      routing-policy:
        - from: 192.168.6.38
          table: 106
      set-name: benic6p1
    benic5p1:
      addresses:
        - 192.168.5.38/31
      match:
        macaddress: 04:90:81:2a:23:40
      mtu: 9000
      routes:
        - table: 105
          to: 0.0.0.0/0
          via: 192.168.5.39
      routing-policy:
        - from: 192.168.5.38
          table: 105
      set-name: benic5p1
    benic4p1:
      addresses:
        - 192.168.4.38/31
      match:
        macaddress: 04:90:81:2d:69:60
      mtu: 9000
      routes:
        - table: 104
          to: 0.0.0.0/0
          via: 192.168.4.39
      routing-policy:
        - from: 192.168.4.38
          table: 104
      set-name: benic4p1
    benic3p1:
      addresses:
        - 192.168.3.38/31
      match:
        macaddress: 04:90:81:2a:2c:40
      mtu: 9000
      routes:
        - table: 103
          to: 0.0.0.0/0
          via: 192.168.3.39
      routing-policy:
        - from: 192.168.3.38
          table: 103
      set-name: benic3p1
    benic2p1:
      addresses:
        - 192.168.2.38/31
      match:
        macaddress: 04:90:81:30:d5:30
      mtu: 9000
      routes:
        - table: 102
          to: 0.0.0.0/0
          via: 192.168.2.39
      routing-policy:
        - from: 192.168.2.38
          table: 102
      set-name: benic2p1
    benic1p1:
      addresses:
        - 192.168.1.38/31
      match:
        macaddress: 04:90:81:30:e4:00
      mtu: 9000
      routes:
        - table: 101
          to: 0.0.0.0/0
          via: 192.168.1.39
      routing-policy:
        - from: 192.168.1.38
          table: 101
      set-name: benic1p1
```

To apply the configuration, use the following command.

```bash
sudo netplan apply
```

To verify your configuration, use the following command.

```bash
sudo apt install -y net-tools && ip -br a
```

### Configure Quality of Service (QoS) and Congestion Control (DCQCN)

To ensure lossless communication and optimal performance for RDMA traffic, the
network must be configured with specific QoS and Data Center Quantized
Congestion Notification (DCQCN) settings.

The following configuration:

* Enables RX and TX Pause frames on the ports.
* Maps DSCP 24 (data) to Q3 and DSCP 46 (CNP) to Q6, and all other DSCP values to Q0.
* Enables PFC for Q3.
* Schedules 99% of bandwidth to Q3 and 1% to Q0, with strict priority for Q6.

#### Configure DCQCN

Create and run a `/nfsdata/enable_dcqcn.sh` script to initialize congestion
control parameters.

```bash
#!/bin/bash

TOKEN_BUCKET_SIZE=800000
AI_RATE=160
ALPHA_UPDATE_INTERVAL=1
ALPHA_UPDATE_G=512
INITIAL_ALPHA_VALUE=64
RATE_INCREASE_BYTE_COUNT=431068
HAI_RATE=300
RATE_REDUCE_MONITOR_PERIOD=1
RATE_INCREASE_THRESHOLD=1
RATE_INCREASE_INTERVAL=1
CNP_DSCP=46

ROCE_DEVICES=$(ibv_devices | grep ionic_ | awk '{print $1}' | paste -sd " ")
for roce_dev in $ROCE_DEVICES
do
    sudo nicctl update dcqcn -r $roce_dev -i 1 \
        --token-bucket-size $TOKEN_BUCKET_SIZE \
        --ai-rate $AI_RATE \
        --alpha-update-interval $ALPHA_UPDATE_INTERVAL \
        --alpha-update-g $ALPHA_UPDATE_G \
        --initial-alpha-value $INITIAL_ALPHA_VALUE \
        --rate-increase-byte-count $RATE_INCREASE_BYTE_COUNT \
        --hai-rate $HAI_RATE \
        --rate-reduce-monitor-period $RATE_REDUCE_MONITOR_PERIOD \
        --rate-increase-threshold $RATE_INCREASE_THRESHOLD \
        --rate-increase-interval $RATE_INCREASE_INTERVAL \
        --cnp-dscp $CNP_DSCP
done
```

#### Configure QoS and PFC

Create and run `/nfsdata/qos.sh` to set up traffic classes and scheduling.

```bash
#!/bin/bash
# qos.sh

# Enable PFC and Auto-negotiation on all ports
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port -p $i --pause-type pfc --rx-pause enable --tx-pause enable; done
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port --port $i --auto-neg enable; done

# Define Priorities
cts_dscp=46
cts_prio=6
data_dscp=24
data_prio=3
default_prio=0
cnp_dscp=46
cnp_prio=6

sudo nicctl update qos pfc --priority 0 --no-drop disable
sudo nicctl update qos dscp-to-purpose --dscp 48 --purpose none
sudo nicctl update qos dscp-to-purpose --dscp 46 --purpose none
sudo nicctl update qos --classification-type pcp
sudo nicctl update qos --classification-type dscp
sudo nicctl update qos dscp-to-priority --dscp 0-63 --priority 0
sudo nicctl update qos dscp-to-priority --dscp 0-23,25-45,47-63 --priority $default_prio
sudo nicctl update qos dscp-to-priority --dscp $cts_dscp --priority $cts_prio
sudo nicctl update qos dscp-to-priority --dscp $data_dscp --priority $data_prio
sudo nicctl update qos dscp-to-priority --dscp $cnp_dscp --priority $cnp_prio
sudo nicctl update qos pfc --priority $data_prio --no-drop enable
sudo nicctl update qos scheduling --priority $data_prio,$default_prio,$cts_prio --dwrr 99,1,0 --rate-limit 0,0,10
```

#### Verify your configuration

Verify the configuration using `nicctl`.

* Verify QoS classification:

  ```bash
  sudo nicctl show qos
  ```

  Expected QoS output:

  ```bash
  NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)

  Port : 04908130-a7a0-4242-4242-000011010000

  Classification type : DSCP

  DSCP-to-priority :
      DSCP bitmap : 0xffffbffffeffffff ==> priority : 0
      DSCP bitmap : 0x0000000001000000 ==> priority : 3
      DSCP bitmap : 0x0000400000000000 ==> priority : 6
      DSCP : 0-23, 25-45, 47-63 ==> priority : 0
      DSCP : 24 ==> priority : 3
      DSCP : 46 ==> priority : 6
  ```

* Verify DCQCN and scheduling:

  ```bash
  sudo nicctl show dcqcn
  ```

  Expected DCQCN and scheduling output:

  ```bash
  NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
  ------------------------------------------------------------------------------------------

  Lif id : 43000070-0100-0000-4242-04908130a7a0
  ROCE device : ionic_7
  DCQCN profile id : 1
  Status : Enabled
  Rate increase in AI phase : 160
  Rate increase byte count : 431068
  Alpha update G value : 512
  Alpha update interval : 1
  Rate increase in HAI phase : 300
  Initial alpha value : 64
  Rate reduce monitor period : 1
  Rate increase threshold : 1
  Rate increase interval : 1
  Token bucket size : 800000
  DSCP value used for CNP : 46

  PFC :
      PFC priority bitmap : 0x8
      PFC no-drop priorities : 3

  Scheduling :
  --------------------------------------------
  Priority   Scheduling   Bandwidth   Rate-limit
             Type         (in %age)   (in Gbps)
  --------------------------------------------
  0          DWRR         1           N/A
  3          DWRR         99          N/A
  6          strict       N/A         10
  ```

### Configure your network file system (NFS)

Setting up a shared NFS volume facilitates centralized storage for models,
recipes, and logs across the cluster. Use the following commands to install the
necessary client tools and mount the remote directory.

```{important}
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
server details and desired local mount path.
```

```bash
sudo apt update && sudo apt install -y nfs-common
sudo mkdir -p /mount/point
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
```
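
To confirm the share is mounted, a quick sanity check:

```bash
df -hT /mount/point   # should list the NFS export with filesystem type nfs4
```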

## Software installation

Next, install the core compute stack required to operate the AMD Instinct GPUs.
The following steps guide you through deploying the ROCm software stack and the
necessary kernel-mode drivers to enable hardware acceleration and optimize the
environment for distributed inference workloads.

### Install ROCm

Use the following commands to quickly install ROCm 7.1.1 on Ubuntu 22.04:

```bash
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm
```

For detailed installation instructions, refer to the [ROCm 7.1.1
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#rocm-installation).
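
After installation, a quick sanity check confirms the stack is visible (the expected `gfx950` target for MI355X is an assumption based on the GPU family):

```bash
amd-smi version          # should report the ROCm 7.1.1 stack
rocminfo | grep -i gfx   # should list eight gfx950 GPU agents per MI355X node
```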

### Install AMD GPU Driver (amdgpu)

Use the following commands to quickly install the AMD GPU Driver (ROCm 7.1.1)
on Ubuntu 22.04:

```bash
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
```

For detailed installation instructions, refer to the [ROCm 7.1.1
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#amdgpu-driver-installation).

## Network verification and testing

Before deploying the inference engine, validate the health and performance of
the cluster interconnects.

### Verify network connectivity

Verify that all network interfaces are reachable across the cluster nodes.
Assuming `eth0` is the management interface and `benic1p1` through `benic8p1` are the
dedicated RoCE backend interfaces, use the following loop to test reachability
to a remote node (for instance, a target node with host IP suffix `.38`).

```bash
# Ping the remote node's address on each of the eight RoCE subnets
for i in {1..8}; do ping -c 1 192.168.${i}.38; done
```
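
A slightly more defensive variant of the same loop (an illustrative sketch) reports only the addresses that fail:

```bash
for i in {1..8}; do
  ping -c 1 -W 2 192.168.${i}.38 > /dev/null || echo "192.168.${i}.38: unreachable"
done
```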

### Validate your RDMA setup

Confirm that all eight RDMA network interfaces are in the `UP` state and
correctly configured with the required MTU and GID settings.

#### Verify link status, MTU, NIC temperature, and NIC speed

```bash
sudo nicctl show port
```

The output should look something like this:

```bash
-------------------------------------------------------------------------------------

NIC : 42424650-4c32-3531-3530-314343000000 (0000:f6:00.0)

Port : 04908132-5d88-4242-4242-000011010000 (eth1/1)
Spec:
    Ifindex : 0x11010000
    Type : ETH
    speed : 400G
    Admin state : UP
    FEC type : RS
    Pause type : PFC
    Number of lanes : 4
    MTU : 9216
    TX pause : enabled
    RX pause : enabled
    Auto negotiation : enabled
Status:
    Physical port : 1
    Operational status : UP
    Link FSM state : UP
    FEC type : RS
    Cable type : Fiber
    Number of lanes : 4
    speed : 400G
    Auto negotiation : disabled
    MAC ID : 0
    MAC channel : 0
    MAC address : 04:90:81:32:5d:88
    Transceiver type : QSFP_CMIS
    Transceiver state : SPROM-READ
    Transceiver PID : QSFP-400G-DR4
    Transceiver temperature (in C) : 45
    Transceiver warning temperature (in C) : 75
    Transceiver alarm temperature (in C) : 80
-------------------------------------------------------------------------------------
```

#### Verify GID

Ensure each device has a valid GID mapped to its assigned IP address.

```bash
ibv_devinfo -v | grep GID
```

The output should look something like this:

```bash
GID[ 0]: fe80::690:81ff:fe30:a7a0, RoCE v2
GID[ 1]: ::ffff:192.168.7.36, RoCE v2
```

### Run RDMA bandwidth benchmarks

Verify the inter-node RDMA performance to ensure the network fabric can
saturate the link bandwidth.

#### Install RDMA performance tools

To get started, build the ROCm-optimized `rdma-perftest` test suite from
source:

```bash
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
git clone https://github.com/ROCm/rdma-perftest
cd rdma-perftest/
./autogen.sh
./configure --enable-rocm --with-rocm=/opt/rocm
make -j$(nproc)
sudo make install
```

#### Run a bandwidth test (GPU memory)

Perform a bandwidth test using ROCm GPU memory between two nodes. One acts as
a server and the other acts as a client. Replace `<SERVER_IP>` with the
appropriate IP.

```bash
# On Server Node
./ib_write_bw --use_rocm=0 -d ionic_0 --report_gbits -a

# On Client Node
./ib_write_bw --use_rocm=0 -d ionic_0 --report_gbits -a <SERVER_IP>
```

## SGLang serving and MoRI unit tests

### Install Docker Engine

Install the Docker engine to manage the containerized SGLang and MoRI serving
environments.

```bash
sudo apt update && sudo apt install -y docker.io
sudo usermod -aG docker "$USER"
```
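
Log out and back in (or run `newgrp docker`) so the group change takes effect, then verify the daemon is reachable:

```bash
docker info --format '{{.ServerVersion}}'
```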

### Launch the serving container

Deploy the SGLang MoRI serving container on each node.

```bash
CONTAINER_NAME=sglang_mori
IMAGE_NAME=rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-0113

docker run -it \
  --rm \
  --device /dev/dri --device /dev/kfd --device=/dev/infiniband \
  --network host --ipc host \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --name ${CONTAINER_NAME} \
  ${IMAGE_NAME} /bin/bash
```

### Run MoRI inter-node unit tests

Before starting the SGLang service, run the MoRI unit test to verify that the
inter-node communication backend is correctly configured.

The MoRI unit test uses two nodes as a minimal validation before running the full
1P2D (3-node) benchmark.

The key configuration variables are:

* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization (for example, `eth2`).
* `<MASTER_IP>`: The IP address of the primary node's backend interface.

```{note}
You can find reference performance data in the [ROCm/MoRI
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
```

```bash
# Set up environment inside the container
export PYTHONPATH=/app/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>

# Node 0 (Primary)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr="<MASTER_IP>" --master_port=1234 \
  examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
  --cmd bench --kernel-type v1

# Node 1 (Secondary)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr="<MASTER_IP>" --master_port=1234 \
  examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
  --cmd bench --kernel-type v1
```

## End-to-end 1P2D performance testing

This section guides you through running distributed inference benchmarks using
the SGLang disagg recipe. For implementation details, refer to the
[SGLang Disaggregation
Recipe](https://github.com/billishyahao/sglang_disagg/blob/9n_cluster/README.md).

### Download the model and set up your run environment

This performance test supports the following models:

* [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)
* [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
* [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

To set up your environment and download the models using the Hugging Face CLI,
use the following commands. Modify the `huggingface-cli download` command
to download the desired model.

```bash
# Set up a virtual environment and install the Hugging Face CLI
sudo apt update && sudo apt install -y python3-venv
python3 -m venv ~/venvs/hf
source ~/venvs/hf/bin/activate
pip install huggingface_hub

# Download the model to the shared NFS mount point
# Replace 'deepseek-ai/DeepSeek-R1-0528' with your desired model
huggingface-cli download --token <your_hf_token> \
  deepseek-ai/DeepSeek-R1-0528 \
  --local-dir /mount/point/models/DeepSeek-R1
```

### Clone the SGLang disaggregation recipe

Clone the SGLang disaggregation repository to the shared file system and switch
to the appropriate branch:

```bash
git clone https://github.com/billishyahao/sglang_disagg.git
cd sglang_disagg
git checkout 9n_cluster
```

```{note}
In the 1P2D configuration, the prefill service and benchmark process run on the
same node, while the remaining nodes handle decode services.
```

### Configure InfiniBand devices

Identify and configure the available InfiniBand devices.

1. List available devices using the following command.

   ```bash
   ibv_devinfo -l
   ```

   Example output:

   ```bash
   8 HCAs found:
       ionic_0
       ionic_1
       ionic_2
       ionic_3
       ionic_4
       ionic_5
       ionic_6
       ionic_7
   ```

2. Update environment variables. Edit `set_env_vars.sh` and add the
   comma-separated list of your system's IB devices. For example:

   ```bash
   export IBDEVICES=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7
   ```
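
   Alternatively, you can generate the list instead of typing it, reusing the same device discovery as the DCQCN script above (a hypothetical convenience, not part of the recipe):

   ```bash
   export IBDEVICES=$(ibv_devices | awk '/ionic_/ {print $1}' | paste -sd ,)
   ```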

### Configure the script and submit the job

1. To set the required configuration parameters, update the following
   environment variables in `run_submit_disagg.sh` to match your cluster setup:

   ```bash
   # SLURM Job Configuration
   export SLURM_ACCOUNT="amd"       # The account name for SLURM job accounting and resource allocation
   export SLURM_PARTITION="compute" # The specific cluster partition (queue) to submit the job to
   export TIME_LIMIT="24:00:00"     # Maximum wall time for the job (Hours:Minutes:Seconds)

   # Model Configuration
   export MODEL_PATH="/nfsdata"     # Base directory where the model weights are stored
   export MODEL_NAME="DeepSeek-R1"  # Specific model directory name (joined with MODEL_PATH)
   export CONTAINER_IMAGE="rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-1224" # Docker image to use for the environment

   # Cluster Topology (Disaggregation Setup)
   export PREFILL_NODES=1           # Number of prefill nodes
   export PREFILL_WORKERS=1         # Number of prefill workers
   export DECODE_NODES=2            # Number of decode nodes
   export DECODE_WORKERS=2          # Number of decode workers

   # Benchmark/Workload Parameters
   export ISL=1024                  # Input Sequence Length (number of tokens in the prompt)
   export OSL=1024                  # Output Sequence Length (number of tokens to generate)
   export CONCURRENCIES="2048"      # Total number of concurrent requests to simulate in the benchmark. The value can be "32,64,128"
   export REQUEST_RATE="inf"        # Requests per second. "inf" means send all requests immediately

   # Parallelism Strategies
   export PREFILL_ENABLE_EP=true    # Enable Expert Parallelism (EP) for the prefill phase
   export PREFILL_ENABLE_DP=true    # Enable Data Parallelism (DP) for the prefill phase
   export DECODE_ENABLE_EP=true     # Enable Expert Parallelism (EP) for the decode phase
   export DECODE_ENABLE_DP=true     # Enable Data Parallelism (DP) for the decode phase
   ```

2. Submit the batch job to the Slurm cluster:

   ```bash
   bash ./run_submit_disagg.sh
   ```

### Log file analysis

1. After submission, retrieve the SLURM job ID:

   ```bash
   squeue
   ```

   Example output:

   ```bash
   JOBID PARTITION NAME USER  ST TIME     NODES NODELIST(REASON)
   123   compute   1p2d alice R  00:10:32 4     node[01-04]
   ```

2. A directory named `slurm_job-$SLURM_JOB_ID` is created in `/tmp` on each
   participating node. The directory contains:

   | Log file | Description |
   | :--------| :-----------|
   | `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` | Main service log per node |
   | `decode_NODE${NODE_RANK}.log` | SGLang decode service details |
   | `prefill_NODE${NODE_RANK}.log` | SGLang prefill service details |

3. The benchmark results are displayed in
   `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log`. Key metrics include:

   ```{note}
   The following benchmark utility output is provided for reference only and
   should not be used to compare performance. See the
   [InferenceMAX](https://inferencemax.semianalysis.com/) website for validated
   performance results.
   ```

   ```bash
   ============ Serving Benchmark Result ============
   Successful requests:                     20480
   Benchmark duration (s):                  1194.25
   Total input tokens:                      20971520
   Total generated tokens:                  20971520
   Request throughput (req/s):              17.15
   Output token throughput (tok/s):         17560.38
   Total Token throughput (tok/s):          35120.76
   ---------------Time to First Token----------------
   Mean TTFT (ms):                          21601.77
   Median TTFT (ms):                        24525.21
   P99 TTFT (ms):                           85417.53
   -----Time per Output Token (excl. 1st token)------
   Mean TPOT (ms):                          92.41
   Median TPOT (ms):                        85.46
   P99 TPOT (ms):                           138.67
   ---------------Inter-token Latency----------------
   Mean ITL (ms):                           92.41
   Median ITL (ms):                         74.76
   P99 ITL (ms):                            263.07
   ----------------End-to-end Latency----------------
   Mean E2EL (ms):                          116133.48
   Median E2EL (ms):                        110349.39
   P99 E2EL (ms):                           227243.97
   ==================================================
   ```
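
   To pull this summary out of the log quickly (an illustrative one-liner; the job ID `123` matches the hypothetical `squeue` example above):

   ```bash
   grep -A 30 'Serving Benchmark Result' /tmp/slurm_job-123/pd_sglang_bench_serving.sh_NODE0.log
   ```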
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
The following section outlines common issues and their solutions.
|
||||
|
||||
### Bandwidth test fails with error
|
||||
|
||||
1. Use ROCm-optimized `rdma-perftest`, not the generic `perftest`
|
||||
|
||||
```bash
|
||||
which ib_write_bw
|
||||
```
|
||||
|
||||
2. Confirm the `SERVER_IP` is accessible.

   ```bash
   ping <SERVER_IP>
   ```

3. Check system logs; use `dmesg` for kernel-level errors.

   ```bash
   sudo dmesg -T | grep -iE 'error|warn|fail|exception'
   ```

### Slurm job fails

Common causes and solutions for Slurm job submission failures include:

1. Shared storage access:
   * Verify that both `sglang_disagg` and model directories are located in a shared NFS mount accessible to all compute nodes.
   * Ensure proper permissions: `chmod -R 755 /shared/path/sglang_disagg /shared/path/models`

2. Log analysis:
   * Examine `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` on each participating node for detailed error messages.
   * Check for common issues like missing dependencies, GPU allocation failures, or network connectivity problems.

3. Configuration validation:
   * Verify SLURM parameters in `run_submit_disagg.sh`:
     * `SLURM_ACCOUNT`: Ensure your account has access to the cluster
     * `SLURM_PARTITION`: Confirm the partition exists and is accessible
     * `MODEL_PATH`: Check that the path is correct and accessible from compute nodes
     * `MODEL_NAME`: Verify the model subdirectory exists within `MODEL_PATH`
   * Use `sinfo` to check partition and node availability, as shown in the sketch below.

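To confirm a job's final state and why it failed, the standard Slurm client tools are usually enough (a minimal sketch; `sacct` requires job accounting to be enabled on your cluster, and `<jobid>` is the ID reported by `squeue`):

```bash
# Check partition and node availability
sinfo

# Show the final state and exit code of a completed or failed job
sacct -j <jobid> --format=JobID,JobName,State,ExitCode
```
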
@@ -0,0 +1,627 @@
# vLLM distributed inference with MoRI

This document provides a comprehensive guide for setting up a high-performance
vLLM serving environment on an AMD Instinct MI300X or MI325X GPU cluster using
the [MoRI (Modular RDMA Interface)](https://github.com/rocm/mori) communication
backend. It also includes detailed instructions on how to reproduce the
benchmark results published in the AMD ROCm blog [Practical, Fault-Robust
Distributed Inference for DeepSeek on AMD
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).

## Prerequisites

The following hardware configuration is required to implement this setup:

* **Nodes**: A minimum of two GPU nodes (virtual machines or physical machines)
  for wide expert parallelism (EP) evaluation.
* **GPUs**: 8x AMD Instinct MI300X/MI325X GPU cards per node.
* **Networking**: 8x NVIDIA Mellanox ConnectX-7 (CX7) NICs per node, providing
  a dedicated 1:1 mapping between GPUs and network interfaces for optimal
  inter-node communication (a quick count check follows this list).

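You can sanity-check the GPU and NIC counts on each node before proceeding (a quick check; device naming may differ on your system):

```bash
# List the AMD GPUs visible to ROCm (expect 8 per node)
amd-smi list

# List the RDMA-capable devices (expect 8 CX7 ports, typically mlx5_0..mlx5_7)
ibv_devices
```
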
## System configuration

This section outlines the infrastructure steps required to prepare your cluster for
high-performance AI workloads. It covers validating your system's software
baselines and firmware versions, configuring high-bandwidth backend networking
for inter-node communication, and establishing shared storage to ensure
a synchronized distributed computing environment.

### Verify baseline software

This setup has been validated using the **AI/ML Ready Image (ROCm 7-based)** on
Digital Ocean AMD GPU Droplets. The following table outlines the software
stack versions and the shell commands used to verify them:

| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 24.04.3 LTS | `cat /etc/os-release` |
| **Kernel** | 6.8.0-87-generic | `uname -r` |
| **ROCm** | 7.0.2 | `amd-smi version` |
| **PLDM bundle (firmware) for MI300X** | 01.25.03.12 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **PLDM bundle (firmware) for MI325X** | 01.25.03.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **CX7 Firmware** | 28.46.3048 | `ethtool -i <interface>` |
| **CX7 Driver** | 24.10-3.2.5 | `dkms status` |
| **DOCA** | 2.9.3 | `dpkg -l \| grep doca` |

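To capture all of these in one pass, for example when comparing nodes, you can run the table's commands together (a convenience sketch using only the commands listed above):

```bash
# Collect the baseline software versions in one shot
cat /etc/os-release | head -2
uname -r
amd-smi version
dkms status
dpkg -l | grep doca
```
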
### Verify best known configuration (BKC)

The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain the best performance and compatibility.

While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.

To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:

1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
   firmware inventory.

```bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"

# Query active BKC bundle version
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
  -u "${AUTH}" -k | json_pp

# Query active IFWI (Integrated Firmware Image)
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
  -u "${AUTH}" -k | json_pp
```

### Run basic system health checks

Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI300X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi300x.html#basic-health-checks)
or [MI325X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi325x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, and PCIe Gen5 link stability.

### Configure your backend network (netplan)

Configure the backend NICs for high-bandwidth inter-node communication. Suppose
the eight backend network interface controllers (NICs) are `eth2` through
`eth9`. Each NIC must have its own subnet that is disjoint from the others. For
example, `eth2` could use `192.168.50.0/24`, `eth3` could use
`192.168.51.0/24`, and so on. Each node needs a unique IP address on each
subnet. You should use the same final octet in each subnet for a given node.
For example, one node would have the addresses `192.168.50.2`, `192.168.51.2`,
and so on. Another node might have `192.168.50.3`, `192.168.51.3`, and so on.
Ensure the MTU is set to `4200`.

```{note}
Ensure you identify the correct interface names for your system using `ip link`
before applying this configuration.
```

For example, your `/etc/netplan/50-backend.yaml` might include something like
the following:

```yaml
eth2:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.50.2/24
  mtu: 4200
eth3:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.51.2/24
  mtu: 4200
eth4:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.52.2/24
  mtu: 4200
eth5:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.53.2/24
  mtu: 4200
eth6:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.54.2/24
  mtu: 4200
eth7:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.55.2/24
  mtu: 4200
eth8:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.56.2/24
  mtu: 4200
eth9:
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.57.2/24
  mtu: 4200
```

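The eight stanzas differ only in the interface name and subnet, so if your naming matches the example, you can generate them with a short loop instead of writing them by hand (a sketch assuming the `eth2` through `eth9` naming and the `.2` host octet shown above):

```bash
# Print the eth2..eth9 stanzas for /etc/netplan/50-backend.yaml
for i in $(seq 0 7); do
cat <<EOF
eth$((i+2)):
  dhcp4: false
  dhcp6: false
  link-local: []
  addresses:
    - 192.168.5${i}.2/24
  mtu: 4200
EOF
done
```
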
To apply the configuration, use the following command.

```bash
sudo netplan apply
```

To verify your configuration, use the following command.

```bash
ip -br a
```

### Configure your network file system (NFS)

Setting up a shared NFS volume provides centralized storage for models,
recipes, and logs across the cluster. Use the following commands to install the
necessary client tools and mount the remote directory.

```{important}
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
server details and desired local mount path.
```

```bash
sudo apt update && sudo apt install -y nfs-common
sudo mkdir -p /mount/point
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
```

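To confirm the share is mounted and the `fstab` entry is well formed, a quick check (generic commands, nothing specific to this setup):

```bash
findmnt /mount/point   # Shows the NFS source and mount options if mounted
df -h /mount/point     # Confirms the server's capacity is visible
sudo mount -a          # Re-parses /etc/fstab; errors indicate a bad entry
```
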
### Configure static hostname resolution for backend initialization (optional)

If the high-speed RDMA/IB interfaces are used for the initial distributed
coordination (such as `MASTER_ADDR`), you must configure static hostname
resolution. This ensures that cluster host names resolve to the backend network
IPs rather than the management or local loopback addresses.

Follow these steps to configure static hostname resolution:

1. Edit `/etc/hosts` on all nodes (for example, using `sudo vim /etc/hosts`).
2. Add the backend IP and hostname mappings.
3. Comment out any default local mappings (such as `127.0.1.1`) for the current
   hostname to avoid resolution conflicts.

For example, your `/etc/hosts` entries might look like:

```text
# Map host names to backend network IPs
192.168.50.2 mori_test_01
192.168.50.3 mori_test_02

# Comment out the default entry to ensure resolution via the backend IP
# 127.0.1.1 mori_test_01 mori_test_01
```

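To verify that resolution now goes through the backend network, `getent` consults the same resolver order the runtime will use (host names here match the example above):

```bash
getent hosts mori_test_01   # Should print 192.168.50.2, not 127.0.1.1
getent hosts mori_test_02   # Should print 192.168.50.3
```
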
## Software installation

Next, install the essential software stack required to operate the AMD Instinct
GPUs and high-speed networking components. Follow these steps to deploy the
NVIDIA DOCA drivers for the Mellanox ConnectX-7 NICs, the ROCm software stack, and
the necessary kernel modules to enable hardware acceleration.

### Install CX7 driver and firmware

1. Download and install the `DOCA 2.9.3` driver following the instructions in
   [NVIDIA DOCA 2.9.3
   Downloads](https://developer.nvidia.com/doca-2-9-3-download-archive?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-all&Distribution=Ubuntu&version=24.04&installer_type=deb_local).

2. Download the appropriate firmware for your hardware PSID from the [NVIDIA
   official website](https://network.nvidia.com/support/firmware/connectx7/)
   and flash the device.

3. To verify driver and firmware versions, use the following command. Replace
   `<IB Device>` with your specific backend interface.

   ```bash
   ethtool -i <IB Device>
   ```

### Install ROCm

Use the following commands to quickly install ROCm 7.0.2 on Ubuntu 24.04:

```bash
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm
```

For detailed installation instructions, refer to the [ROCm 7.0.2
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#rocm-installation).

### Install AMD GPU Driver (amdgpu)

Use the following commands to quickly install the AMD GPU Driver (ROCm 7.0.2) on Ubuntu 24.04:

```bash
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
```

For detailed installation instructions, refer to the [ROCm 7.0.2
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#amdgpu-driver-installation).

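After installing `amdgpu-dkms`, a reboot is typically required for the new module to load. You can then confirm the driver is active (generic checks, not specific to this release):

```bash
dkms status | grep amdgpu   # The amdgpu module should be listed as installed
lsmod | grep amdgpu         # Confirms the kernel module is loaded
```
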
## Network verification and testing

Before deploying the inference engine, validate the health and performance of
the cluster interconnects.

### Verify network connectivity

Verify that all network interfaces are reachable across the cluster nodes.
Assuming `eth0` is the management interface, `eth1` is for the VPC, and `eth2`
through `eth9` are the dedicated RoCE backend interfaces, use the following
loop to test reachability to a remote node (for instance, a target node with
host IP suffix `.3`).

```bash
# Test connectivity for RoCE subnets 192.168.50.x through 192.168.57.x
for i in {0..7}; do ping -c 1 192.168.5${i}.3; done
```

### Validate your RDMA setup

Confirm that all eight RDMA network interfaces are in the `UP` state. Verify the MTU
setting of `4096` and ensure each device has a valid GID mapped to its assigned
IP address.

```bash
ibv_devinfo -v
```

The output should look something like this:

```bash
hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             28.46.3048
    ...
    board_id:           MT_0000000838
    phys_port_cnt:      1
    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         0
        port_lid:       0
        port_lmc:       0x00
        link_layer:     Ethernet
    ...
    GID[  0]:           fe80:0000:0000:0000:d894:24ff:fe4a:96e2, RoCE v1
    GID[  1]:           fe80::d894:24ff:fe4a:96e2, RoCE v2
    GID[  2]:           0000:0000:0000:0000:0000:ffff:c0a8:3903, RoCE v1
    GID[  3]:           ::ffff:192.168.57.3, RoCE v2
```

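To scan all eight devices at once rather than paging through the full `ibv_devinfo -v` output, a filter like this surfaces just the fields called out above:

```bash
# Summarize state, MTU, and link layer for every RDMA device
ibv_devinfo | grep -E 'hca_id|state|active_mtu|link_layer'
```
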
### Run RDMA bandwidth benchmarks

Verify the inter-node RDMA performance to ensure the network fabric can
saturate the link bandwidth.

#### Install RDMA Performance Tools

To get started, build the ROCm-optimized `rdma-perftest` test suite from
source:

```bash
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
git clone https://github.com/ROCm/rdma-perftest
cd rdma-perftest/
./autogen.sh
./configure --enable-rocm --with-rocm=/opt/rocm
make -j$(nproc)
sudo make install
```

#### Run a bandwidth test (GPU memory)

Perform a bandwidth test using ROCm GPU memory between two nodes. One acts
as a server and the other acts as a client. For 400G interfaces, the expected
peak throughput is approximately 390 Gbps. Replace `<SERVER_IP>` with the
appropriate IP.

```bash
# On Server Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a

# On Client Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a <SERVER_IP>
```

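The commands above exercise a single NIC (`mlx5_0`) with GPU 0. To sweep all eight GPU/NIC pairs in parallel, you can vary the device, the ROCm device index, and the TCP control port together (a sketch; it assumes the `mlx5_0`..`mlx5_7` naming shown earlier and that `--use_rocm` takes the GPU index, as in the single-device example):

```bash
# On the server node: one listener per NIC, each on its own control port
for i in $(seq 0 7); do
  ./ib_write_bw --use_rocm=$i -d mlx5_$i --report_gbits -a -p $((18515 + i)) &
done
wait

# On the client node: connect each NIC to its matching listener
for i in $(seq 0 7); do
  ./ib_write_bw --use_rocm=$i -d mlx5_$i --report_gbits -a -p $((18515 + i)) <SERVER_IP> &
done
wait
```
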
## vLLM serving and MoRI unit tests

### Install Docker Engine

Install the Docker engine to manage the containerized vLLM and MoRI serving
environments.

```bash
sudo apt update && sudo apt install -y docker.io
```

### Download the DeepSeek PTPC model

This guide uses the
[DeepSeek-R1-FP8-Dynamic](https://huggingface.co/EmbeddedLLM/deepseek-r1-FP8-Dynamic)
model optimized for PTPC. Use the following commands to install the Hugging
Face CLI and download the model to your shared NFS directory:

```bash
# Set up a virtual environment and install the Hugging Face CLI
sudo apt update && sudo apt install -y python3-venv
python3 -m venv ~/venvs/hf
source ~/venvs/hf/bin/activate
pip install huggingface_hub

# Download the model to the shared NFS mount point
huggingface-cli download --token <your_hf_token> \
  EmbeddedLLM/deepseek-r1-FP8-Dynamic \
  --local-dir /mount/point/models/EmbeddedLLM/deepseek-r1-FP8-Dynamic
```

### Launch the serving container

Deploy the vLLM MoRI serving Docker container on each node.

```bash
CONTAINER_NAME=vllm_mori
IMAGE_NAME=aigmkt/vllm:mori_rocm6.4.1_20251105

docker run -it \
  --rm \
  --device /dev/dri --device /dev/kfd --device /dev/infiniband \
  --network host --ipc host \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  -v /mount/point/models:/models \
  --shm-size 128G \
  --name ${CONTAINER_NAME} \
  ${IMAGE_NAME} /bin/bash
```

### Run MoRI inter-node unit tests

Before starting the vLLM service, run the MoRI unit test to verify that the
inter-node communication backend is correctly configured.

The key configuration variables are:

* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization, such as `eth2`.
* `<MASTER_IP>`: The IP address of the primary node's backend interface.

```{note}
You can find reference performance data in the [ROCm/MoRI
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
```

```bash
# Set up environment inside the container
cd /app/mori
export PYTHONPATH=/app/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>

# Node 0 (Primary)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr="<MASTER_IP>" --master_port=1234 \
  examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
  --cmd bench --kernel-type v1

# Node 1 (Secondary)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr="<MASTER_IP>" --master_port=1234 \
  examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
  --cmd bench --kernel-type v1
```

### Deploy and serve the model

To deploy DeepSeek-R1 (PTPC) with Expert Parallelism 16 (EP16) across two
nodes, use the following serving scripts.

#### Create serving scripts

Create the following scripts inside the container on each node.

* Node 0 (master node): `ep16_node0.sh`

  ```bash
  #!/bin/bash

  # Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
  export VLLM_ROCM_USE_AITER=1
  export VLLM_ROCM_USE_AITER_MOE=1
  export VLLM_LOGGING_LEVEL=INFO
  export VLLM_USE_V1=1
  export VLLM_ROCM_USE_AITER_MLA=1
  export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
  export VLLM_ALL2ALL_BACKEND=mori

  vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
    -dp 16 \
    --enable-expert-parallel \
    --data-parallel-size-local 8 \
    --data-parallel-address ${IP} \
    --data-parallel-rpc-port 1212 \
    --served-model-name deepseek \
    --port 8777 \
    --block-size 1 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 4096 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
    --cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --trust-remote-code 2>&1 | tee serving_node0_ep16.log
  ```

* Node 1: `ep16_node1.sh`

  ```bash
  #!/bin/bash

  # Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
  export VLLM_ROCM_USE_AITER=1
  export VLLM_ROCM_USE_AITER_MOE=1
  export VLLM_LOGGING_LEVEL=INFO
  export VLLM_USE_V1=1
  export VLLM_ROCM_USE_AITER_MLA=1
  export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
  export VLLM_ALL2ALL_BACKEND=mori

  vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
    -dp 16 \
    --enable-expert-parallel \
    --headless \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address ${IP} \
    --data-parallel-rpc-port 1212 \
    --served-model-name deepseek \
    --port 8777 \
    --block-size 1 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 4096 \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
    --cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --trust-remote-code 2>&1 | tee serving_node1_ep16.log
  ```

#### Run the serving scripts

Run the scripts on each node to launch the distributed serving instance.
Replace `<MASTER_IP>` with the backend network IP of Node 0.

```bash
# On Node 0 (Primary)
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
IP=<MASTER_IP> bash ep16_node0.sh

# On Node 1 (Secondary)
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
IP=<MASTER_IP> bash ep16_node1.sh
```

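Once both nodes finish loading, you can confirm the API endpoint is live from any node before starting a benchmark (the port matches `--port 8777` and the model name matches `--served-model-name` in the serving scripts above):

```bash
# Expect a JSON listing containing the "deepseek" model
curl http://<MASTER_IP>:8777/v1/models
```
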
## Reproducing performance

This section details how to reproduce the performance metrics published in the
AMD ROCm blog [Practical, Fault-Robust Distributed Inference for DeepSeek on
AMD
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).

### Configuration for EP16 (16 GPUs)

To achieve the reported throughput, expert parallelism 16 (EP16) is used across
the decode nodes.

#### Benchmark target

* Decode throughput: ~12.4k output tokens/s per node.

### Performance reproduction commands

Use the following configurations to reproduce the published performance metrics.

#### Decode benchmark

To reproduce the 12.4k output tokens/s, use the following configuration:

```bash
#!/bin/bash

MAX_CONCURRENCY=${1:-3072}
TIMES=2
NUM_PROMPTS=$((MAX_CONCURRENCY*TIMES))
vllm bench serve \
  --max-concurrency $MAX_CONCURRENCY \
  --num-prompts $NUM_PROMPTS \
  --model /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
  --served-model-name deepseek \
  --port 8777 \
  --ignore-eos \
  --trust-remote-code \
  --dataset-name random \
  --seed 2025 \
  --random-input-len 2048 \
  --random-output-len 1024 2>&1 | tee bench_decode_${MAX_CONCURRENCY}_isl_2k_osl_1k.log
```

To calculate the per-node throughput for comparison with the blog data, take
the reported **Peak output token throughput (tok/s)** from the benchmark
results and divide it by the total number of nodes in the cluster. For example,
on this two-node EP16 setup, a reported ~24.8k tok/s corresponds to the ~12.4k
output tokens/s per-node target above.

## Troubleshooting

The following section outlines common issues and their solutions.

### Bandwidth test fails with error

1. Use the ROCm-optimized `rdma-perftest`, not the generic `perftest`.

   ```bash
   which ib_write_bw
   ```

2. Confirm the `SERVER_IP` is accessible.

   ```bash
   ping <SERVER_IP>
   ```

3. Check system logs; use `dmesg` for kernel-level errors.

   ```bash
   sudo dmesg -T | grep -iE 'error|warn|fail|exception'
   ```

### vLLM EP16 with MoRI backend fails to launch

1. If the log stalls at `Waiting for init message from front-end.`, check connectivity to the `IP` address. Disable the firewall/SELinux or allow traffic on port `1212`, as shown in the sketch below.

2. Verify server name resolution. Ensure server names are correctly mapped in `/etc/hosts`.

3. Confirm that the environment variable `GLOO_SOCKET_IFNAME` is set before running the vLLM serving script.

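To check the RPC port from the secondary node, standard tools suffice (`1212` matches `--data-parallel-rpc-port` in the serving scripts):

```bash
nc -zv <MASTER_IP> 1212   # Succeeds only if the port is reachable
sudo ufw allow 1212/tcp   # Ubuntu's default firewall, if it is enabled
```
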
@@ -26,6 +26,12 @@ training, fine-tuning, and inference. It leverages popular machine learning fram

- :doc:`SGLang inference performance testing <benchmark-docker/sglang>`

- :doc:`vLLM distributed inference with MoRI <benchmark-docker/vllm-mori-distributed>`

- :doc:`SGLang distributed inference with MoRI <benchmark-docker/sglang-mori-distributed>`

- :doc:`SGLang distributed inference with Mooncake <benchmark-docker/sglang-distributed>`

- :doc:`xDiT diffusion inference <xdit-diffusion-inference>`

- :doc:`Deploying your model <deploy-your-model>`

@@ -18,6 +18,7 @@
(artificial-intelligence-apis)=

* {doc}`Composable Kernel <composable_kernel:index>`
* {doc}`hipDNN <hipdnn:index>`
* {doc}`MIGraphX <amdmigraphx:index>`
* {doc}`MIOpen <miopen:index>`
* {doc}`MIVisionX <mivisionx:index>`

@@ -119,6 +119,10 @@ subtrees:
  title: PyTorch inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
  title: SGLang inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed.md
  title: vLLM distributed inference with MoRI
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed.md
  title: SGLang distributed inference with MoRI
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  title: SGLang distributed inference with Mooncake
- file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst

@@ -36,6 +36,7 @@ Machine Learning & Computer Vision
   :header: "Component", "Description"

   ":doc:`Composable Kernel <composable_kernel:index>`", "Provides a programming model for writing performance critical kernels for machine learning workloads across multiple architectures"
   ":doc:`hipDNN <hipdnn:index>`", "A graph-based deep learning library that enables multi-operation fusion for improved performance on AMD GPUs."
   ":doc:`MIGraphX <amdmigraphx:index>`", "Graph inference engine that accelerates machine learning model inference"
   ":doc:`MIOpen <miopen:index>`", "An open source deep-learning library"
   ":doc:`MIVisionX <mivisionx:index>`", "Set of comprehensive computer vision and machine learning libraries, utilities, and applications"

tools/rocm-build/rocm-7.2.0.xml
@@ -0,0 +1,44 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <remote name="rocm-org" fetch="https://github.com/ROCm/" />
  <default revision="refs/tags/rocm-7.2.0"
           remote="rocm-org"
           sync-c="true"
           sync-j="4" />
  <!--list of projects for ROCm-->
  <project name="ROCK-Kernel-Driver" />
  <project name="rocm_bandwidth_test" />
  <project name="rocm-examples" />
  <!--HIP Projects-->
  <project name="HIPIFY" />
  <!-- The following projects are all associated with the AMDGPU LLVM compiler -->
  <project name="half" />
  <project name="llvm-project" />
  <project name="spirv-llvm-translator" />
  <!-- gdb projects -->
  <project name="ROCdbgapi" />
  <project name="ROCgdb" />
  <project name="rocr_debug_agent" />
  <!-- ROCm Libraries -->
  <project groups="mathlibs" name="AMDMIGraphX" />
  <project groups="mathlibs" name="MIVisionX" />
  <project groups="mathlibs" name="ROCmValidationSuite" />
  <project groups="mathlibs" name="composable_kernel" />
  <project groups="mathlibs" name="hipfort" />
  <project groups="mathlibs" name="rccl" />
  <project groups="mathlibs" name="rocAL" />
  <project groups="mathlibs" name="rocALUTION" />
  <project groups="mathlibs" name="rocDecode" />
  <project groups="mathlibs" name="rocJPEG" />
  <project groups="mathlibs" name="rocm-libraries" />
  <project groups="mathlibs" name="rocm-systems" />
  <project groups="mathlibs" name="rocPyDecode" />
  <project groups="mathlibs" name="rocSHMEM" />
  <project groups="mathlibs" name="rocm-cmake" />
  <project groups="mathlibs" name="rpp" />
  <project groups="mathlibs" name="TransferBench" />
  <!-- Projects for OpenMP-Extras -->
  <project name="aomp" path="openmp-extras/aomp" />
  <project name="aomp-extras" path="openmp-extras/aomp-extras" />
  <project name="flang" path="openmp-extras/flang" />
</manifest>