Compare commits

...

17 Commits

Author SHA1 Message Date
peterjunpark
811188dc13 Update Primus docs for 26.1 release (#5911) (#5918)
* archive previous versions

update conf

fix

fix docker hub url

fix

* update history pages

* update docker info

* update configs

* update primus commit

(cherry picked from commit d8b6ee47e3)
2026-01-30 12:54:26 -05:00
peterjunpark
ec36bc9971 Publish vLLM / SGLang + MoRI distributed inference cookbooks (#5912) (#5913)
* add recipes

* clean up

update

clean up

fix

* update sglang docker instructions

docker image tag
add user to docker group

fix

* update pldm/bkc

* update pldm/bkc

* add bkc note

* update bkc notes

* update article info

* update wordlist

* fix linting issues

* fix linting issues

* fix linting

* fix ref

(cherry picked from commit d1165b7359)
2026-01-29 11:42:03 -05:00
Pratik Basyal
cd208e7d74 PLDM Note change 720 (#5894)
* Note change

* Minor change
2026-01-23 10:32:00 -05:00
Pratik Basyal
af8ea73581 720 reference link update and note fixes [Develop] (#5883) (#5884)
* Links updated to 7.2.0

* COmpatibility note fixed
2026-01-22 12:21:46 -05:00
Pratik Basyal
f1c86d7d29 720 Post GA Known Issues update (#5879)
* 7.2.0 Known issues and PLDM table updated (#5877)

* Known issues and PLDM table updated

* JAX workload known issues added

* Minor changes

* Minor update
2026-01-21 17:29:18 -05:00
Alex Xu
370816001e Merge branch 'roc-7.2.x' into docs/7.2.0 2026-01-21 15:29:08 -05:00
Swati Rawat
d5994da509 Merge pull request #5872 from SwRaw/swaraw_cherrypick
Cherrypicking replacement of rocm-smi with amd-smi from ROCm internal
2026-01-21 19:10:51 +05:30
srawat
c02f86c0e7 Update prerequisite-system-validation.rst 2026-01-21 17:43:10 +05:30
srawat
d3523c24d3 replace rocm-smi reference with amd-smi 2026-01-21 17:40:26 +05:30
Swati Rawat
1980239b81 Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:31:41 +05:30
Swati Rawat
c75fd6f532 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:31:05 +05:30
Swati Rawat
72cb598190 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:30:33 +05:30
Swati Rawat
9b55b77aaa Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:29:45 +05:30
Swati Rawat
8267303e1d Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:29:04 +05:30
Swati Rawat
86d2c4e891 Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:28:23 +05:30
srawat
2977e35330 Update single-gpu-fine-tuning-and-inference.rst 2026-01-21 17:27:13 +05:30
srawat
e95955f572 Update multi-gpu-fine-tuning-and-inference.rst 2026-01-21 17:27:13 +05:30
33 changed files with 5406 additions and 246 deletions

View File

@@ -39,6 +39,7 @@ autograd
Backported Backported
BARs BARs
BatchNorm BatchNorm
BKC
BLAS BLAS
BMC BMC
BabelStream BabelStream
@@ -53,6 +54,7 @@ CDNA
CGUI CGUI
CHTML CHTML
CIFAR CIFAR
CNP
CLI CLI
CLion CLion
CMake CMake
@@ -96,6 +98,7 @@ Dashboarding
Dataloading Dataloading
dataflows dataflows
DBRX DBRX
DCQCN
DDR DDR
DF DF
DGEMM DGEMM
@@ -110,8 +113,10 @@ DMA
DOMContentLoaded DOMContentLoaded
DNN DNN
DNNL DNNL
DOCA
DPM DPM
DRI DRI
DSCP
DW DW
DWORD DWORD
Dask Dask
@@ -127,7 +132,9 @@ Deprecations
DevCap DevCap
DirectX DirectX
Disaggregated Disaggregated
disagg
disaggregated disaggregated
disaggregation
Dockerfile Dockerfile
Dockerized Dockerized
Doxygen Doxygen
@@ -179,6 +186,8 @@ GFLOPS
GFortran GFortran
GFXIP GFXIP
GGUF GGUF
GID
Gbps
Gemma Gemma
GiB GiB
GIM GIM
@@ -248,6 +257,7 @@ IOP
IOPS IOPS
IOPM IOPM
IOV IOV
IPs
IRQ IRQ
ISA ISA
ISV ISV
@@ -312,6 +322,7 @@ MNIST
MPI MPI
MPT MPT
MSVC MSVC
MTU
mul mul
MVAPICH MVAPICH
MVFFR MVFFR
@@ -334,6 +345,7 @@ MLA
MosaicML MosaicML
MoEs MoEs
Mooncake Mooncake
MoRI
Mpops Mpops
Multicore Multicore
multihost multihost
@@ -403,16 +415,21 @@ PEQT
PIL PIL
PILImage PILImage
PJRT PJRT
PLDM
POR POR
PRNG PRNG
PRs PRs
PSID
PTPC
PaLM PaLM
Pageable Pageable
PeerDirect PeerDirect
Pensando
PerfDb PerfDb
Perfetto Perfetto
PipelineParallel PipelineParallel
PnP PnP
Pollara
PowerEdge PowerEdge
PowerShell PowerShell
Pretrained Pretrained
@@ -424,6 +441,7 @@ Pytest
PyTorch PyTorch
QPS QPS
Qcycles Qcycles
QoS
Qwen Qwen
RAII RAII
RAS RAS
@@ -457,6 +475,7 @@ RPP
RST RST
RW RW
Radeon Radeon
Redfish
RelWithDebInfo RelWithDebInfo
Req Req
Rickle Rickle
@@ -724,6 +743,7 @@ enqueue
env env
epilog epilog
etcetera etcetera
eth
ethernet ethernet
exascale exascale
executables executables
@@ -819,6 +839,7 @@ llvm
lm lm
localscratch localscratch
logits logits
loopback
lossy lossy
macOS macOS
matchers matchers
@@ -844,6 +865,7 @@ nanoGPT
NCS NCS
NOP NOP
NVLink NVLink
netplan
num num
numref numref
ocl ocl
@@ -911,6 +933,7 @@ rc
rccl rccl
rdc rdc
rdma rdma
reachability
reStructuredText reStructuredText
redirections redirections
refactorization refactorization
@@ -980,6 +1003,7 @@ shader
sharding sharding
sigmoid sigmoid
sles sles
slurm
sm sm
smi smi
softmax softmax

View File

@@ -199,7 +199,7 @@ for a complete overview of this release.
* fftw_execute_dft_c2r * fftw_execute_dft_c2r
* fftwf_execute_dft_c2r * fftwf_execute_dft_c2r
### **HIPIFY** (22.2.0) ### **HIPIFY** (22.0.0)
#### Added #### Added
@@ -279,6 +279,10 @@ for a complete overview of this release.
* Updated clang/llvm to AMD clang version 22.0.0 (equivalent to LLVM 22.0.0 with additional out-of-tree patches). * Updated clang/llvm to AMD clang version 22.0.0 (equivalent to LLVM 22.0.0 with additional out-of-tree patches).
#### Upcoming changes
* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It's recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn't any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
### **MIGraphX** (2.15.0) ### **MIGraphX** (2.15.0)
#### Added #### Added

View File

@@ -139,7 +139,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
</td> </td>
</tr> </tr>
<tr> <tr>
<td>MI300X</td> <td>MI300X<a href="#footnote2"><sup>[2]</sup></a></td>
<td>01.25.03.12</td> <td>01.25.03.12</td>
<td rowspan="6" style="vertical-align: middle;"> <td rowspan="6" style="vertical-align: middle;">
30.30.0<br> 30.30.0<br>
@@ -180,6 +180,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
</div> </div>
<p id="footnote1">[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.</p> <p id="footnote1">[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.</p>
<p id="footnote1">[2]: For AMD Instinct MI300X KVM SR-IOV with Multi-VF (8 VF) support requires a compatible firmware BKC bundle which will be released in coming months.</p>
#### Node power management for multi-GPU nodes added #### Node power management for multi-GPU nodes added
@@ -245,7 +246,7 @@ New Stream Management API `hipStreamCopyAttributes` is implemented for CUDA Pari
The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through a RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication. The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through a RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication.
In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/latest/install.html). In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/docs-7.2.0/install.html).
### Software-managed plan cache support for hipTensor ### Software-managed plan cache support for hipTensor
@@ -285,7 +286,7 @@ MIGraphX has the following enhancements:
### AMDGPU wavefront size macro removal ### AMDGPU wavefront size macro removal
The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize). The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/hip_cpp_language_extensions.html#warpsize).
For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of `__AMDGCN_WAVEFRONT_SIZE` or `__AMDGCN_WAVEFRONT_SIZE__` can be replaced with a user-defined macro or `constexpr` variable with the wavefront size(s) for the target hardware. For example: For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of `__AMDGCN_WAVEFRONT_SIZE` or `__AMDGCN_WAVEFRONT_SIZE__` can be replaced with a user-defined macro or `constexpr` variable with the wavefront size(s) for the target hardware. For example:
``` ```
@@ -349,13 +350,13 @@ The ROCm Offline Installer Creator 7.2.0 includes the following features and imp
* Fixes for Oracle Linux 10.0 ROCm and driver minimum mode installer creation. * Fixes for Oracle Linux 10.0 ROCm and driver minimum mode installer creation.
* Added support for creating an offline installer for Oracle Linux 8, 9, and 10, where the kernel version of the target OS differs from the host OS creating the installer. * Added support for creating an offline installer for Oracle Linux 8, 9, and 10, where the kernel version of the target OS differs from the host OS creating the installer.
See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-offline-installer.html) for more information. See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) for more information.
### ROCm Runfile Installer updates ### ROCm Runfile Installer updates
The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues. The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues.
For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-runfile-installer.html). For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html).
### Expansion of the ROCm examples repository ### Expansion of the ROCm examples repository
@@ -375,7 +376,7 @@ Usage examples are now available for the [ROCgdb](https://github.com/ROCm/rocm-e
ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases. ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.
* The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/latest/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/latest/index.html) set continues to provide detailed information, tutorials, and reference content. * The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/docs-7.2.0/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/index.html) set continues to provide detailed information, tutorials, and reference content.
* The HIP Programming Guide section includes a new topic titled [“Understanding GPU performance”](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/performance_optimization.html). It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: [Performance guidelines](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/performance_guidelines.html) and [Hardware implementation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/hardware_implementation.html). * The HIP Programming Guide section includes a new topic titled [“Understanding GPU performance”](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/performance_optimization.html). It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: [Performance guidelines](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/performance_guidelines.html) and [Hardware implementation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/hardware_implementation.html).
@@ -913,7 +914,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
* fftw_execute_dft_c2r * fftw_execute_dft_c2r
* fftwf_execute_dft_c2r * fftwf_execute_dft_c2r
### **HIPIFY** (22.2.0) ### **HIPIFY** (22.0.0)
#### Added #### Added
@@ -995,7 +996,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
#### Upcoming changes #### Upcoming changes
* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It's recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn't any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang. * As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/docs-7.2.0/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/docs-7.2.0/index.html). It's recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn't any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
### **MIGraphX** (2.15.0) ### **MIGraphX** (2.15.0)
@@ -1432,7 +1433,11 @@ python3 -m pip install --user .
`sudo` might be required. Use flag `--break-system-packages` if `pip un/installation` fails. `sudo` might be required. Use flag `--break-system-packages` if `pip un/installation` fails.
``` ```
For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release. For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release. See [GitHub issue #5875](https://github.com/ROCm/ROCm/issues/5875).
### Intermittent errors when running JAX workloads
You might experience intermittent errors or segmentation faults when running JAX workloads. The issue is currently under investigation and will be addressed in an upcoming ROCm release. See [GitHub issue #5878](https://github.com/ROCm/ROCm/issues/5878).
### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs ### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs
@@ -1477,7 +1482,7 @@ The following changes to the ROCm software stack are anticipated for future rele
### ROCm Offline Installer Creator deprecation ### ROCm Offline Installer Creator deprecation
The ROCm Offline Installer Creator is deprecated with the ROCm 7.2.0 release. Equivalent installation capabilities are available through the ROCm Runfile Installer, a self-extracting installer that is not based on OS package managers. This installer will be removed in a future release. The [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) is deprecated with the ROCm 7.2.0 release and will be removed in a future release. Equivalent installation capabilities are available through the [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html), a self-extracting installer that is not based on OS package managers.
### ROCm SMI deprecation ### ROCm SMI deprecation

View File

@@ -171,6 +171,7 @@ Operating systems, kernel and Glibc versions
********************************************* *********************************************
For detailed information on operating system supported on ROCm 7.2.0 and associated Kernel and Glibc version, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__. For detailed information on operating system supported on ROCm 7.2.0 and associated Kernel and Glibc version, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
.. note:: .. note::
* See `Red Hat Enterprise Linux Release Dates <https://access.redhat.com/articles/3078>`_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL). * See `Red Hat Enterprise Linux Release Dates <https://access.redhat.com/articles/3078>`_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL).

View File

@@ -138,12 +138,14 @@ article_pages = [
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-megatron", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-megatron", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3", "os": ["linux"]},
@@ -154,16 +156,17 @@ article_pages = [
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]},
@@ -193,11 +196,16 @@ article_pages = [
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.11.1-20251103", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.11.1-20251103", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.11", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.12", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.12", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.13", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.13", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]},

View File

@@ -1,15 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.10 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
Primus: 0.3.0
Primus Turbo: 0.1.1
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
Triton: 3.4.0 Triton: 3.4.0
RCCL: 2.27.7 RCCL: 2.27.7
model_groups: model_groups:

View File

@@ -0,0 +1,47 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
Triton: 3.4.0
RCCL: 2.27.7
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: pyt_megatron_lm_train_llama-3.3-70b
- model: Llama 3.1 8B
mad_tag: pyt_megatron_lm_train_llama-3.1-8b
- model: Llama 3.1 70B
mad_tag: pyt_megatron_lm_train_llama-3.1-70b
- model: Llama 2 7B
mad_tag: pyt_megatron_lm_train_llama-2-7b
- model: Llama 2 70B
mad_tag: pyt_megatron_lm_train_llama-2-70b
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V3 (proxy)
mad_tag: pyt_megatron_lm_train_deepseek-v3-proxy
- model: DeepSeek-V2-Lite
mad_tag: pyt_megatron_lm_train_deepseek-v2-lite-16b
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: pyt_megatron_lm_train_mixtral-8x7b
- model: Mixtral 8x22B (proxy)
mad_tag: pyt_megatron_lm_train_mixtral-8x22b-proxy
- group: Qwen
tag: qwen
models:
- model: Qwen 2.5 7B
mad_tag: pyt_megatron_lm_train_qwen2.5-7b
- model: Qwen 2.5 72B
mad_tag: pyt_megatron_lm_train_qwen2.5-72b

View File

@@ -0,0 +1,58 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
Triton: 3.4.0
RCCL: 2.27.7
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: primus_pyt_megatron_lm_train_llama-3.3-70b
config_name: llama3.3_70B-pretrain.yaml
- model: Llama 3.1 70B
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
config_name: llama3.1_70B-pretrain.yaml
- model: Llama 3.1 8B
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-8b
config_name: llama3.1_8B-pretrain.yaml
- model: Llama 2 7B
mad_tag: primus_pyt_megatron_lm_train_llama-2-7b
config_name: llama2_7B-pretrain.yaml
- model: Llama 2 70B
mad_tag: primus_pyt_megatron_lm_train_llama-2-70b
config_name: llama2_70B-pretrain.yaml
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V3 (proxy)
mad_tag: primus_pyt_megatron_lm_train_deepseek-v3-proxy
config_name: deepseek_v3-pretrain.yaml
- model: DeepSeek-V2-Lite
mad_tag: primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
config_name: deepseek_v2_lite-pretrain.yaml
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: primus_pyt_megatron_lm_train_mixtral-8x7b
config_name: mixtral_8x7B_v0.1-pretrain.yaml
- model: Mixtral 8x22B (proxy)
mad_tag: primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
config_name: mixtral_8x22B_v0.1-pretrain.yaml
- group: Qwen
tag: qwen
models:
- model: Qwen 2.5 7B
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-7b
config_name: primus_qwen2.5_7B-pretrain.yaml
- model: Qwen 2.5 72B
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-72b
config_name: qwen2.5_72B-pretrain.yaml

View File

@@ -0,0 +1,32 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.1 8B
mad_tag: primus_pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
- model: Llama 3.1 70B
mad_tag: primus_pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B
precision: BF16
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek V3 16B
mad_tag: primus_pyt_train_deepseek-v3-16b
model_repo: DeepSeek-V3
url: https://huggingface.co/deepseek-ai/DeepSeek-V3
precision: BF16

View File

@@ -0,0 +1,195 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 4 Scout 17B-16E
mad_tag: pyt_train_llama-4-scout-17b-16e
model_repo: Llama-4-17B_16E
url: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.3 70B
mad_tag: pyt_train_llama-3.3-70b
model_repo: Llama-3.3-70B
url: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
precision: BF16
training_modes: [finetune_fw, finetune_lora, finetune_qlora]
- model: Llama 3.2 1B
mad_tag: pyt_train_llama-3.2-1b
model_repo: Llama-3.2-1B
url: https://huggingface.co/meta-llama/Llama-3.2-1B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.2 3B
mad_tag: pyt_train_llama-3.2-3b
model_repo: Llama-3.2-3B
url: https://huggingface.co/meta-llama/Llama-3.2-3B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.2 Vision 11B
mad_tag: pyt_train_llama-3.2-vision-11b
model_repo: Llama-3.2-Vision-11B
url: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
precision: BF16
training_modes: [finetune_fw]
- model: Llama 3.2 Vision 90B
mad_tag: pyt_train_llama-3.2-vision-90b
model_repo: Llama-3.2-Vision-90B
url: https://huggingface.co/meta-llama/Llama-3.2-90B-Vision
precision: BF16
training_modes: [finetune_fw]
- model: Llama 3.1 8B
mad_tag: pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
training_modes: [pretrain, finetune_fw, finetune_lora, HF_pretrain]
- model: Llama 3.1 70B
mad_tag: pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
precision: BF16
training_modes: [pretrain, finetune_fw, finetune_lora]
- model: Llama 3.1 405B
mad_tag: pyt_train_llama-3.1-405b
model_repo: Llama-3.1-405B
url: https://huggingface.co/meta-llama/Llama-3.1-405B
precision: BF16
training_modes: [finetune_qlora]
- model: Llama 3 8B
mad_tag: pyt_train_llama-3-8b
model_repo: Llama-3-8B
url: https://huggingface.co/meta-llama/Meta-Llama-3-8B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3 70B
mad_tag: pyt_train_llama-3-70b
model_repo: Llama-3-70B
url: https://huggingface.co/meta-llama/Meta-Llama-3-70B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 2 7B
mad_tag: pyt_train_llama-2-7b
model_repo: Llama-2-7B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_fw, finetune_lora, finetune_qlora]
- model: Llama 2 13B
mad_tag: pyt_train_llama-2-13b
model_repo: Llama-2-13B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 2 70B
mad_tag: pyt_train_llama-2-70b
model_repo: Llama-2-70B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_lora, finetune_qlora]
- group: OpenAI
tag: openai
models:
- model: GPT OSS 20B
mad_tag: pyt_train_gpt_oss_20b
model_repo: GPT-OSS-20B
url: https://huggingface.co/openai/gpt-oss-20b
precision: BF16
training_modes: [HF_finetune_lora]
- model: GPT OSS 120B
mad_tag: pyt_train_gpt_oss_120b
model_repo: GPT-OSS-120B
url: https://huggingface.co/openai/gpt-oss-120b
precision: BF16
training_modes: [HF_finetune_lora]
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek V2 16B
mad_tag: primus_pyt_train_deepseek-v2
model_repo: DeepSeek-V2
url: https://huggingface.co/deepseek-ai/DeepSeek-V2
precision: BF16
training_modes: [pretrain]
- group: Qwen
tag: qwen
models:
- model: Qwen 3 8B
mad_tag: pyt_train_qwen3-8b
model_repo: Qwen3-8B
url: https://huggingface.co/Qwen/Qwen3-8B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Qwen 3 32B
mad_tag: pyt_train_qwen3-32b
model_repo: Qwen3-32
url: https://huggingface.co/Qwen/Qwen3-32B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2.5 32B
mad_tag: pyt_train_qwen2.5-32b
model_repo: Qwen2.5-32B
url: https://huggingface.co/Qwen/Qwen2.5-32B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2.5 72B
mad_tag: pyt_train_qwen2.5-72b
model_repo: Qwen2.5-72B
url: https://huggingface.co/Qwen/Qwen2.5-72B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2 1.5B
mad_tag: pyt_train_qwen2-1.5b
model_repo: Qwen2-1.5B
url: https://huggingface.co/Qwen/Qwen2-1.5B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Qwen 2 7B
mad_tag: pyt_train_qwen2-7b
model_repo: Qwen2-7B
url: https://huggingface.co/Qwen/Qwen2-7B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- group: Stable Diffusion
tag: sd
models:
- model: Stable Diffusion XL
mad_tag: pyt_huggingface_stable_diffusion_xl_2k_lora_finetuning
model_repo: SDXL
url: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
precision: BF16
training_modes: [posttrain]
- group: Flux
tag: flux
models:
- model: FLUX.1-dev
mad_tag: pyt_train_flux
model_repo: Flux
url: https://huggingface.co/black-forest-labs/FLUX.1-dev
precision: BF16
training_modes: [posttrain]
- group: NCF
tag: ncf
models:
- model: NCF
mad_tag: pyt_ncf_training
model_repo:
url: https://github.com/ROCm/FluxBenchmark
precision: FP32
- group: DLRM
tag: dlrm
models:
- model: DLRM v2
mad_tag: pyt_train_dlrm
model_repo: DLRM
url: https://github.com/AMD-AGI/DLRMBenchmark
training_modes: [pretrain]

View File

@@ -1,13 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.11 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
Triton: 3.4.0 Triton: 3.4.0
RCCL: 2.27.7 RCCL: 2.27.7
model_groups: model_groups:

View File

@@ -1,13 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.11 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
model_groups: model_groups:
- group: Meta Llama - group: Meta Llama
tag: llama tag: llama

View File

@@ -1,15 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.10 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
Primus: 0.3.0
Primus Turbo: 0.1.1
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
model_groups: model_groups:
- group: Meta Llama - group: Meta Llama
tag: llama tag: llama

View File

@@ -130,7 +130,7 @@ After loading the model in this way, the model is fully ready to use the resourc
torchtune for fine-tuning and inference torchtune for fine-tuning and inference
============================================= =============================================
`torchtune <https://meta-pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU `torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs. model fine-tuning and inference with LLMs.
#. Install torchtune using pip. #. Install torchtune using pip.

View File

@@ -25,6 +25,5 @@ In this guide, you'll learn how to use ROCm for AI:
- :doc:`Inference optimization <inference-optimization/index>` - :doc:`Inference optimization <inference-optimization/index>`
To learn about ROCm for HPC applications and scientific computing, see To learn about ROCm for HPC applications and scientific computing, see
:doc:`../rocm-for-hpc/index`. :doc:`../rocm-for-hpc/index`.

View File

@@ -0,0 +1,904 @@
# SGLang distributed inference with MoRI
This document provides a comprehensive guide for deploying a high-performance
SGLang distributed inference serving environment on an AMD Instinct MI355X GPU
cluster, utilizing the [MoRI (Modular RDMA
Interface)](https://github.com/rocm/mori) communication backend for optimized
inter-node collective operations. It also includes systematic instructions for
benchmarking 1P2D (1 prefill 2 decode, 3 nodes) configurations using automated
scripts.
## Prerequisites
The following configuration is required to implement this setup:
* **Nodes:** A minimum of three GPU nodes (virtual or physical machines) for
  wide expert parallelism (EP) evaluation.
* **GPUs:** 8x AMD Instinct MI355X GPUs per node.
* **Networking:** 8x AMD Pensando™ Pollara 400 AI NICs per node, providing
  a dedicated 1:1 mapping between GPUs and network interfaces for optimal
  inter-node communication.
* **Orchestration:** A Slurm cluster with at least three nodes -- one for the
  prefill service and two for the decode services (EP16). A quick way to
  confirm this layout is sketched after this list.
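To confirm the cluster layout up front, you can query Slurm for the node inventory and count the GPUs that ROCm exposes on each node. This is a minimal sketch, assuming the default partition, that `amd-smi` is installed on every node, and that `amd-smi list` prefixes each device entry with `GPU`; adjust node counts and options to match your site.
```bash
# List cluster nodes with their state and GRES (GPU) configuration
sinfo -N -o "%N %T %G"
# Count the GPUs visible to ROCm on each of the three nodes (expect 8 per node)
srun --nodes=3 --ntasks-per-node=1 bash -c 'echo "$(hostname): $(amd-smi list | grep -c "^GPU") GPUs"'
```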
## System configuration
This section outlines the infrastructure setup required to support your AMD
Instinct MI355X cluster. It covers essential procedures for verifying software
baselines and firmware versions, configuring the AMD Pensando Pollara 400 AI
NICs for high-bandwidth networking, and applying thermal and Quality of Service
(QoS) tunings to ensure a stable, lossless RDMA fabric.
(sglang-mori-verify-baseline)=
### Verify baseline software
The following table outlines the validated software stack. Use the provided
shell commands to verify the environment on each node before proceeding.
| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 22.04.5 LTS | `cat /etc/os-release` |
| **Kernel** | 5.15.0-163-generic | `uname -r` |
| **ROCm** | 7.1.1 | `amd-smi version` |
| **PLDM bundle (firmware)** | 01.25.16.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **AI NIC Firmware** | 1.117.5.a.45 | `dkms status` |
| **AI NIC Driver** | 25.11.1.001 | `dkms status` |
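For convenience, the table's verification commands can be run in one pass on each node. This is a minimal sketch; the expected values are the ones listed in the table above.
```bash
# Spot-check one node against the validated baseline
grep PRETTY_NAME /etc/os-release      # expect Ubuntu 22.04.5 LTS
uname -r                              # expect 5.15.0-163-generic
amd-smi version                       # ROCm version should report 7.1.1
dkms status                           # AI NIC driver and firmware modules should match the table
```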
### Verify best known configuration (BKC)
The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain best performance and compatibility.
While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.
To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:
1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
firmware inventory.
``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"
# Query active BKC bundle version
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
-u "${AUTH}" -k | json_pp
# Query active IFWI (Integrated Firmware Image)
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
-u "${AUTH}" -k | json_pp
```
### Run basic system health checks
Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI355X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi355x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, PCIe Gen5 link stability, and so on.
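A few of these items can be spot-checked from the shell before running the full acceptance procedure. The following is a minimal sketch only; the authoritative expected values are in the linked health-check guide.
```bash
# Kernel boot arguments currently in effect
cat /proc/cmdline
# Total system memory
free -h | awk '/^Mem:/ {print "Total memory: " $2}'
# PCIe link status of AMD devices (vendor ID 0x1002); look for stable 32 GT/s (Gen5) links
sudo lspci -d 1002: -vvv | grep -E "LnkSta:"
```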
### Install AMD Pensando Pollara 400 AI NIC drivers
For detailed instructions on upgrading the firmware and installing drivers for
the AMD Pensando Pollara 400 AI NIC, refer to the [AMD Instinct System
Acceptance
Guide](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/network/nic-installation.html#amd-pensando-pollara-400-ai-nic).
After installation, verify the active firmware version on all NICs to ensure it
matches the software baseline. See [Verify baseline software](#sglang-mori-verify-baseline).
To display the current firmware version for all AI NICs, use the following command.
```bash
sudo nicctl show version firmware
```
### Configure thermal management (fan speed)
For systems equipped with 400G optics, standard fan profiles are often
insufficient for maintaining stable operating temperatures. To prevent thermal
throttling or optics failure, the system fans must be set to `FullSpeed`.
* Requirement: A fan speed of approximately 25,000 RPM is required to maintain
the AI NIC modules at an optimal operating temperature (~50°C).
* Constraint: Default profiles (typically around 4,000 RPM) and "Performance IO"
settings (around 9,000 RPM) do not provide adequate airflow for 400G optical
transceivers.
#### Configure fan speed via Redfish (Supermicro)
Run the following command to set the fan mode to `FullSpeed` through the BMC:
``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"
# Set Fan Mode to FullSpeed
curl -X PATCH "https://${BMC_IP}/redfish/v1/Managers/1/Oem/Supermicro/FanMode" \
-k -u "${AUTH}" \
-H "Content-Type: application/json" \
-d '{"Mode": "FullSpeed"}'
```
### Configure your backend network (netplan)
Configure the backend NICs for high-bandwidth inter-node communication. Suppose
the node's eight network interface controllers (NICs), one per GPU, are
`benic1p1` through `benic8p1`. Each NIC must have its own subnet that is
disjoint from the others, and each node needs a unique IP address on each
subnet. Use the same final octet across all subnets for a given node. For
example, one node would have the addresses `192.168.1.36`, `192.168.2.36`, and
so on, while another node would have `192.168.1.37`, `192.168.2.37`, and so on.
Ensure the MTU is set to `9000`.
```{note}
Ensure you identify the correct interface names for your system using `ip link`
before applying this configuration.
```
For example, your `/etc/netplan/70-backend.yaml` should look like the
following:
```yaml
network:
ethernets:
benic8p1:
addresses:
- 192.168.8.38/31
match:
macaddress: 04:90:81:2a:34:08
mtu: 9000
routes:
- table: 108
to: 0.0.0.0/0
via: 192.168.8.39
routing-policy:
- from: 192.168.8.38
table: 108
set-name: benic8p1
benic7p1:
addresses:
- 192.168.7.38/31
match:
macaddress: 04:90:81:2b:82:40
mtu: 9000
routes:
- table: 107
to: 0.0.0.0/0
via: 192.168.7.39
routing-policy:
- from: 192.168.7.38
table: 107
set-name: benic7p1
benic6p1:
addresses:
- 192.168.6.38/31
match:
macaddress: 04:90:81:30:c9:30
mtu: 9000
routes:
- table: 106
to: 0.0.0.0/0
via: 192.168.6.39
routing-policy:
- from: 192.168.6.38
table: 106
set-name: benic6p1
benic5p1:
addresses:
- 192.168.5.38/31
match:
macaddress: 04:90:81:2a:23:40
mtu: 9000
routes:
- table: 105
to: 0.0.0.0/0
via: 192.168.5.39
routing-policy:
- from: 192.168.5.38
table: 105
set-name: benic5p1
benic4p1:
addresses:
- 192.168.4.38/31
match:
macaddress: 04:90:81:2d:69:60
mtu: 9000
routes:
- table: 104
to: 0.0.0.0/0
via: 192.168.4.39
routing-policy:
- from: 192.168.4.38
table: 104
set-name: benic4p1
benic3p1:
addresses:
- 192.168.3.38/31
match:
macaddress: 04:90:81:2a:2c:40
mtu: 9000
routes:
- table: 103
to: 0.0.0.0/0
via: 192.168.3.39
routing-policy:
- from: 192.168.3.38
table: 103
set-name: benic3p1
benic2p1:
addresses:
- 192.168.2.38/31
match:
macaddress: 04:90:81:30:d5:30
mtu: 9000
routes:
- table: 102
to: 0.0.0.0/0
via: 192.168.2.39
routing-policy:
- from: 192.168.2.38
table: 102
set-name: benic2p1
benic1p1:
addresses:
- 192.168.1.38/31
match:
macaddress: 04:90:81:30:e4:00
mtu: 9000
routes:
- table: 101
to: 0.0.0.0/0
via: 192.168.1.39
routing-policy:
- from: 192.168.1.38
table: 101
set-name: benic1p1
```
To apply the configuration, use the following command.
```bash
sudo netplan apply
```
To verify your configuration, use the following command.
```bash
sudo apt install -y net-tools && ip -br a
```
### Configure Quality of Service (QoS) and Congestion Control (DCQCN)
To ensure lossless communication and optimal performance for RDMA traffic, the
network must be configured with specific QoS and Data Center Quantized
Congestion Notification (DCQCN) settings.
The following configuration achieves these goals:
* Enables RX and TX pause frames on the ports.
* Maps DSCP 24 (data) to Q3 and DSCP 46 (CNP) to Q6; all other DSCP values map to Q0.
* Enables PFC for Q3.
* Scheduling: 99% of bandwidth to Q3, 1% to Q0, and strict priority for Q6.
#### Configure DCQCN
Create and run a `/nfsdata/enable_dcqcn.sh` script to initialize congestion
control parameters.
``` bash
#!/bin/bash
TOKEN_BUCKET_SIZE=800000
AI_RATE=160
ALPHA_UPDATE_INTERVAL=1
ALPHA_UPDATE_G=512
INITIAL_ALPHA_VALUE=64
RATE_INCREASE_BYTE_COUNT=431068
HAI_RATE=300
RATE_REDUCE_MONITOR_PERIOD=1
RATE_INCREASE_THRESHOLD=1
RATE_INCREASE_INTERVAL=1
CNP_DSCP=46
ROCE_DEVICES=$(ibv_devices | grep ionic_ | awk '{print $1}' | paste -sd " ")
for roce_dev in $ROCE_DEVICES
do
sudo nicctl update dcqcn -r $roce_dev -i 1 \
--token-bucket-size $TOKEN_BUCKET_SIZE \
--ai-rate $AI_RATE \
--alpha-update-interval $ALPHA_UPDATE_INTERVAL \
--alpha-update-g $ALPHA_UPDATE_G \
--initial-alpha-value $INITIAL_ALPHA_VALUE \
--rate-increase-byte-count $RATE_INCREASE_BYTE_COUNT \
--hai-rate $HAI_RATE \
--rate-reduce-monitor-period $RATE_REDUCE_MONITOR_PERIOD \
--rate-increase-threshold $RATE_INCREASE_THRESHOLD \
--rate-increase-interval $RATE_INCREASE_INTERVAL \
--cnp-dscp $CNP_DSCP
done
```
#### Configure QoS and PFC
Create and run `/nfsdata/qos.sh` to set up traffic classes and scheduling.
``` bash
#!/bin/bash
# qos.sh
# Enable PFC and Auto-negotiation on all ports
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port -p $i --pause-type pfc --rx-pause enable --tx-pause enable; done
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port --port $i --auto-neg enable; done
# Define Priorities
cts_dscp=46
cts_prio=6
data_dscp=24
data_prio=3
default_prio=0
cnp_dscp=46
cnp_prio=6
sudo nicctl update qos pfc --priority 0 --no-drop disable
sudo nicctl update qos dscp-to-purpose --dscp 48 --purpose none
sudo nicctl update qos dscp-to-purpose --dscp 46 --purpose none
sudo nicctl update qos --classification-type pcp
sudo nicctl update qos --classification-type dscp
sudo nicctl update qos dscp-to-priority --dscp 0-63 --priority 0
sudo nicctl update qos dscp-to-priority --dscp 0-23,25-45,47-63 --priority $default_prio
sudo nicctl update qos dscp-to-priority --dscp $cts_dscp --priority $cts_prio
sudo nicctl update qos dscp-to-priority --dscp $data_dscp --priority $data_prio
sudo nicctl update qos dscp-to-priority --dscp $cnp_dscp --priority $cnp_prio
sudo nicctl update qos pfc --priority $data_prio --no-drop enable
sudo nicctl update qos scheduling --priority $data_prio,$default_prio,$cts_prio --dwrr 99,1,0 --rate-limit 0,0,10
```
#### Verify your configuration
Verify the configuration using `nicctl`.
* Verify QoS classification:
``` bash
sudo nicctl show qos
```
Expected QoS output:
``` bash
NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
Port : 04908130-a7a0-4242-4242-000011010000
Classification type : DSCP
DSCP-to-priority :
DSCP bitmap : 0xffffbffffeffffff ==> priority : 0
DSCP bitmap : 0x0000000001000000 ==> priority : 3
DSCP bitmap : 0x0000400000000000 ==> priority : 6
DSCP : 0-23, 25-45, 47-63 ==> priority : 0
DSCP : 24 ==> priority : 3
DSCP : 46 ==> priority : 6
```
* Verify DCQCN and scheduling:
``` bash
sudo nicctl show dcqcn
```
Expected DCQCN and scheduling output:
``` bash
NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
------------------------------------------------------------------------------------------
Lif id : 43000070-0100-0000-4242-04908130a7a0
ROCE device : ionic_7
DCQCN profile id : 1
Status : Enabled
Rate increase in AI phase : 160
Rate increase byte count : 431068
Alpha update G value : 512
Alpha update interval : 1
Rate increase in HAI phase : 300
Initial alpha value : 64
Rate reduce monitor period : 1
Rate increase threshold : 1
Rate increase interval : 1
Token bucket size : 800000
DSCP value used for CNP : 46
PFC :
PFC priority bitmap : 0x8
PFC no-drop priorities : 3
Scheduling :
--------------------------------------------
Priority Scheduling Bandwidth Rate-limit
Type (in %age) (in Gbps)
--------------------------------------------
0 DWRR 1 N/A
3 DWRR 99 N/A
6 strict N/A 10
```
### Configure your network file system (NFS)
Setting up a shared NFS volume facilitates centralized storage for models,
recipes, and logs across the cluster. Use the following commands to install the
necessary client tools and mount the remote directory.
```{important}
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
server details and desired local mount path.
```
``` bash
sudo apt update && sudo apt install -y nfs-common
sudo mkdir -p /mount/point
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
```
## Software installation
Next, install the core compute stack required to operate the AMD Instinct GPUs.
The following steps guide you through deploying the ROCm software stack and the
necessary kernel-mode drivers to enable hardware acceleration and optimize the
environment for distributed inference workloads.
### Install ROCm
Use the following commands to quickly install ROCm 7.1.1 on Ubuntu 22.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm
```
For detailed installation instructions, refer to the [ROCm 7.1.1
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#rocm-installation).
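To confirm the user-space installation before moving on to the driver, a quick check might look like the following. The `.info/version` file is created by the ROCm packages:
```bash
# Report the installed ROCm release and confirm the core package is present.
cat /opt/rocm/.info/version
dpkg -l | grep rocm-core
```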
### Install AMD GPU Driver (amdgpu)
Use the following commands to quickly install the AMD GPU Driver (ROCm 7.1.1)
on Ubuntu 22.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
```
For detailed installation instructions, refer to the [ROCm 7.1.1
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#amdgpu-driver-installation).
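The new kernel module only becomes active after a reboot (or a module reload). A minimal post-install check, once the node is back up, might look like this:
```bash
sudo reboot
# After the node comes back up:
dkms status | grep amdgpu   # The amdgpu module is built for the running kernel
lsmod | grep amdgpu         # The module is loaded
amd-smi list                # The GPUs are enumerated
```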
## Network verification and testing
Before deploying the inference engine, validate the health and performance of
the cluster interconnects.
### Verify network connectivity
Verify that all network interfaces are reachable across the cluster nodes.
Assuming `eth0` is the management interface, and `benic1p1` through `benic8p1` are the
dedicated RoCE backend interfaces, use the following loop to test reachability
to a remote node (for instance, a target node with host IP suffix `.38`).
```bash
# From node A (host suffix .37), ping node B (host suffix .38) on each RoCE subnet 192.168.1.0/24 through 192.168.8.0/24
for i in {1..8}; do ping -c 1 192.168.${i}.38; done
```
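If you prefer a compact pass/fail summary instead of the full ping output, a small wrapper such as the following can help (same subnets and target suffix as above):
```bash
# Report reachability of the remote node (.38) on each RoCE subnet.
for i in {1..8}; do
  if ping -c 1 -W 1 "192.168.${i}.38" > /dev/null 2>&1; then
    echo "192.168.${i}.0/24: OK"
  else
    echo "192.168.${i}.0/24: UNREACHABLE"
  fi
done
```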
### Validate your RDMA setup
Confirm that all eight RDMA network interfaces are in the `UP` state and
correctly configured with the required MTU and GID settings.
#### Verify link status, MTU, NIC temperature, and NIC speed
```bash
sudo nicctl show port
```
The output should look something like this:
```bash
-------------------------------------------------------------------------------------
NIC : 42424650-4c32-3531-3530-314343000000 (0000:f6:00.0)
Port : 04908132-5d88-4242-4242-000011010000 (eth1/1)
Spec:
Ifindex : 0x11010000
Type : ETH
speed : 400G
Admin state : UP
FEC type : RS
Pause type : PFC
Number of lanes : 4
MTU : 9216
TX pause : enabled
RX pause : enabled
Auto negotiation : enabled
Status:
Physical port : 1
Operational status : UP
Link FSM state : UP
FEC type : RS
Cable type : Fiber
Number of lanes : 4
speed : 400G
Auto negotiation : disabled
MAC ID : 0
MAC channel : 0
MAC address : 04:90:81:32:5d:88
Transceiver type : QSFP_CMIS
Transceiver state : SPROM-READ
Transceiver PID : QSFP-400G-DR4
Transceiver temperature (in C) : 45
Transceiver warning temperature (in C) : 75
Transceiver alarm temperature (in C) : 80
-------------------------------------------------------------------------------------
```
#### Verify GID
Ensure each device has a valid GID mapped to its assigned IP address.
```bash
ibv_devinfo -v | grep GID
```
The output should look something like this:
```bash
GID[ 0]: fe80::690:81ff:fe30:a7a0, RoCE v2
GID[ 1]: ::ffff:192.168.7.36, RoCE v2
```
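To check every RDMA device at once instead of scanning the full `ibv_devinfo` output, you can loop over the device list and print only the RoCE v2 GID entries. A minimal sketch:
```bash
# Print the RoCE v2 GIDs for each RDMA device; each should include its backend IP.
for dev in $(ibv_devices | awk 'NR>2 {print $1}'); do
  echo "== $dev =="
  ibv_devinfo -v -d "$dev" | grep "RoCE v2"
done
```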
### Run RDMA bandwidth benchmarks
Verify the inter-node RDMA performance to ensure the network fabric can
saturate the link bandwidth.
#### Install RDMA performance tools
To get started, build the ROCm-optimized `rdma-perftest` test suite from
source:
```bash
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
git clone https://github.com/ROCm/rdma-perftest
cd rdma-perftest/
./autogen.sh
./configure --enable-rocm --with-rocm=/opt/rocm
make -j$(nproc)
sudo make install
```
#### Run a bandwidth test (GPU memory)
Perform a bandwidth test using ROCm GPU memory between two nodes. One acts as
a server and the other acts as a client. Replace `<SERVER_IP>` with the
appropriate IP, and replace the `-d` argument with one of your RDMA devices (as
listed by `ibv_devinfo -l`).
```bash
# On Server Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a
# On Client Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a <SERVER_IP>
```
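To exercise all eight NIC/GPU pairs rather than a single device, you can launch one server/client pair per device on distinct control ports. This is a sketch that assumes GPU `i` is paired with RDMA device `ionic_i`; adjust the device names (see `ibv_devinfo -l`) and GPU indices to match your topology:
```bash
# On the server node: one ib_write_bw instance per device, each on its own TCP control port.
for i in {0..7}; do
  ./ib_write_bw --use_rocm=$i -d ionic_$i -p $((18515 + i)) --report_gbits -a &
done; wait
# On the client node: connect each instance to the matching port on the server.
for i in {0..7}; do
  ./ib_write_bw --use_rocm=$i -d ionic_$i -p $((18515 + i)) --report_gbits -a <SERVER_IP> &
done; wait
```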
## SGLang serving and MoRI unit tests
### Install Docker Engine
Install the Docker engine to manage the containerized vLLM and MoRI serving
environments.
```bash
sudo apt update && sudo apt install -y docker.io
sudo usermod -aG docker "$USER"
```
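The group change takes effect on the next login. A quick way to apply and verify it in the current shell:
```bash
# Apply the new group membership without logging out (alternatively, log out and back in).
newgrp docker
# Confirm that the current user can reach the Docker daemon.
docker info --format '{{.ServerVersion}}'
```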
### Launch the serving container
Deploy the SGLang MoRI serving container on each node.
```bash
CONTAINER_NAME=sglang_mori
IMAGE_NAME=rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-0113
docker run -it \
--rm \
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
--shm-size 128G \
--name ${CONTAINER_NAME} \
${IMAGE_NAME} /bin/bash
```
### Run MoRI inter-node unit tests
Before starting the vLLM service, run the MoRI unit test to verify that the
inter-node communication backend is correctly configured.
The MoRI unit test uses two nodes as a minimal validation step before running
the full 1P2D (three-node) benchmark.
The key configuration variables are:
* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization, such as `eth2`.
* `<MASTER_IP>`: The IP address of the primary node's backend interface.
```{note}
You can find reference performance data in the [ROCm/MoRI
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
```
```bash
# Set up environment inside the container
cd /app/mori  # Assumes the MoRI repo lives at /app/mori, matching the PYTHONPATH below
export PYTHONPATH=/app/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
# Node 0 (Primary)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
# Node 1 (Secondary)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
```
## End-to-end 1P2D performance testing
This section guides you through running distributed inference benchmarks using
the SGLang disagg recipe. For detailed implementation details, refer to the
[SGLang Disaggregation
Recipe](https://github.com/billishyahao/sglang_disagg/blob/9n_cluster/README.md).
### Download the model and setup your run environment
This performance test supports the following models:
* [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)
* [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
* [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)
To set up your environment and download the models using the Hugging Face CLI,
use the following commands. Modify the `huggingface-cli download` command
to download the desired model.
```bash
# Set up a virtual environment and install the Hugging Face CLI
sudo apt update && sudo apt install -y python3-venv
python3 -m venv ~/venvs/hf
source ~/venvs/hf/bin/activate
pip install huggingface_hub
# Download the model to the shared NFS mount point
# Replace 'deepseek-ai/DeepSeek-R1-0528' with your desired model
huggingface-cli download --token <your_hf_token> \
deepseek-ai/DeepSeek-R1-0528 \
--local-dir /mount/point/models/DeepSeek-R1
```
### Clone the SGLang disaggregation recipe
Clone the SGLang disaggregation repository to the shared file system and switch
to the appropriate branch:
```bash
git clone https://github.com/billishyahao/sglang_disagg.git
cd sglang_disagg
git checkout 9n_cluster
```
```{note}
In the 1P2D configuration, the prefill service and benchmark process run on the
same node, while remaining nodes handle decode services.
```
### Configure InfiniBand devices
Identify and configure the available InfiniBand devices.
1. List available devices using the following command.
```bash
ibv_devinfo -l
```
Example output:
```bash
8 HCAs found:
ionic_0
ionic_1
ionic_2
ionic_3
ionic_4
ionic_5
ionic_6
ionic_7
```
2. Update environment variables. Edit `set_env_vars.sh` and add the
comma-separated list of your system's IB devices. For example:
```bash
export IBDEVICES=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7
```
### Configure the script and submit the job
1. To set the required configuration parameters, update the following
environment variables in `run_submit_disagg.sh` to match your cluster setup:
```bash
# SLURM Job Configuration
export SLURM_ACCOUNT="amd" # The account name for SLURM job accounting and resource allocation
export SLURM_PARTITION="compute" # The specific cluster partition (queue) to submit the job to
export TIME_LIMIT="24:00:00" # Maximum wall time for the job (Hours:Minutes:Seconds)
# Model Configuration
export MODEL_PATH="/nfsdata" # Base directory where the model weights are stored
export MODEL_NAME="DeepSeek-R1" # Specific model directory name (joined with MODEL_PATH)
export CONTAINER_IMAGE="rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-1224" # Docker image to use for the environment
# Cluster Topology (Disaggregation Setup)
export PREFILL_NODES=1 # Number of prefill nodes
export PREFILL_WORKERS=1 # Number of prefill workers
export DECODE_NODES=2 # Number of decode nodes
export DECODE_WORKERS=2 # Number of decode workers
# Benchmark/Workload Parameters
export ISL=1024 # Input Sequence Length (number of tokens in the prompt)
export OSL=1024 # Output Sequence Length (number of tokens to generate)
export CONCURRENCIES="2048" # Total number of concurrent requests to simulate in the benchmark. Can be a comma-separated list, such as "32,64,128"
export REQUEST_RATE="inf" # Request per second rate. "inf" means send all requests immediately
# Parallelism Strategies
export PREFILL_ENABLE_EP=true # Enable Expert Parallelism (EP) for the prefill phase
export PREFILL_ENABLE_DP=true # Enable Data Parallelism (DP) for the prefill phase
export DECODE_ENABLE_EP=true # Enable Expert Parallelism (EP) for the decode phase
export DECODE_ENABLE_DP=true # Enable Data Parallelism (DP) for the decode phase
```
2. Submit the batch job to the Slurm cluster:
```bash
bash ./run_submit_disagg.sh
```
### Log file analysis
1. After submission, retrieve the SLURM job ID:
```bash
squeue
```
Example output:
```bash
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123 compute 1p2d alice R 00:10:32 4 node[01-04]
```
2. A directory named `slurm_job-$SLURM_JOB_ID` is created in `/tmp` on each
participating node. The directory contains:
| Log File | Description |
| :--------| :-----------|
| `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` | Main service log per node |
| `decode_NODE${NODE_RANK}.log` | SGLang decode service details |
| `prefill_NODE${NODE_RANK}.log` | SGLang prefill service details |
3. The benchmark results are displayed in
   `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log`. Key metrics include the following (a quick log-extraction command follows this list):
```{note}
The following benchmark utility output is provided for reference only and
should not be used to compare performance. See the
[InferenceMAX](https://inferencemax.semianalysis.com/) website for validated
performance results.
```
``` bash
============ Serving Benchmark Result ============
Successful requests: 20480
Benchmark duration (s): 1194.25
Total input tokens: 20971520
Total generated tokens: 20971520
Request throughput (req/s): 17.15
Output token throughput (tok/s): 17560.38
Total Token throughput (tok/s): 35120.76
---------------Time to First Token----------------
Mean TTFT (ms): 21601.77
Median TTFT (ms): 24525.21
P99 TTFT (ms): 85417.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 92.41
Median TPOT (ms): 85.46
P99 TPOT (ms): 138.67
---------------Inter-token Latency----------------
Mean ITL (ms): 92.41
Median ITL (ms): 74.76
P99 ITL (ms): 263.07
----------------End-to-end Latency----------------
Mean E2EL (ms): 116133.48
Median E2EL (ms): 110349.39
P99 E2EL (ms): 227243.97
==================================================
```
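To pull just the headline metrics out of the log without scrolling through the full service output, a quick `grep` such as the following can help. Replace `<JOB_ID>` with your SLURM job ID; `NODE0` assumes the benchmark ran on node rank 0:
```bash
grep -E "Request throughput|token throughput|Mean TTFT|Mean TPOT|Mean ITL|Mean E2EL" \
  /tmp/slurm_job-<JOB_ID>/pd_sglang_bench_serving.sh_NODE0.log
```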
## Troubleshooting
The following section outlines common issues and their solutions.
### Bandwidth test fails with error
1. Use the ROCm-optimized `rdma-perftest` build, not the generic `perftest`:
```bash
which ib_write_bw
```
2. Confirm that `<SERVER_IP>` is reachable:
```bash
ping <SERVER_IP>
```
3. Check the system logs. Use `dmesg` to look for kernel-level errors:
``` bash
sudo dmesg -T | grep -iE 'error|warn|fail|exception'
```
### Slurm job fails
Common causes and solutions for Slurm job submission failures include:
1. Shared storage access:
* Verify that both `sglang_disagg` and model directories are located in a shared NFS mount accessible to all compute nodes.
* Ensure proper permissions: `chmod -R 755 /shared/path/sglang_disagg /shared/path/models`
2. Log analysis:
* Examine `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` on each participating node for detailed error messages.
* Check for common issues like missing dependencies, GPU allocation failures, or network connectivity problems.
3. Configuration validation:
* Verify SLURM parameters in `run_submit_disagg.sh`:
* `SLURM_ACCOUNT`: Ensure your account has access to the cluster
* `SLURM_PARTITION`: Confirm the partition exists and is accessible
* `MODEL_PATH`: Check that the path is correct and accessible from compute nodes
* `MODEL_NAME`: Verify the model subdirectory exists within `MODEL_PATH`
* Use `sinfo` to check partition and node availability.

View File

@@ -0,0 +1,627 @@
# vLLM distributed inference with MoRI
This document provides a comprehensive guide for setting up a high-performance
vLLM serving environment on an AMD Instinct MI300X or MI325X GPU cluster using
the [MoRI (Modular RDMA Interface)](https://github.com/rocm/mori) communication
backend. It also includes detailed instructions on how to reproduce the
benchmark results published in the AMD ROCm blog [Practical, Fault-Robust
Distributed Inference for DeepSeek on AMD
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).
## Prerequisites
The following hardware configuration is required to implement this setup:
* **Nodes**: A minimum of two GPU nodes (virtual machines or physical machines)
for wide expert parallelism (EP) evaluation.
* **GPUs**: 8x AMD Instinct MI300X/MI325X GPU cards per node.
* **Networking**: 8x NVIDIA Mellanox ConnectX-7 (CX7) NICs per node, providing
a dedicated 1:1 mapping between GPUs and network interfaces for optimal
inter-node communication.
## System configuration
This section outlines infrastructure steps required to prepare your cluster for
high-performance AI workloads. It covers validating your system's software
baselines and firmware versions, configuring high-bandwidth backend networking
for inter-node communication, and establishing shared storage to ensure
a synchronized distributed computing environment.
### Verify baseline software
This setup has been validated using the **AI/ML Ready Image (ROCm 7-based)** on
Digital Ocean AMD GPU Droplets. The following table outlines the software
stack versions and appropriate shell commands for verification:
| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 24.04.3 LTS | `cat /etc/os-release` |
| **Kernel** | 6.8.0-87-generic | `uname -r` |
| **ROCm** | 7.0.2 | `amd-smi version` |
| **PLDM bundle (firmware) for MI300X** | 01.25.03.12 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **PLDM bundle (firmware) for MI325X** | 01.25.03.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **CX7 Firmware** | 28.46.3048 | `ethtool -i <IB device>` |
| **CX7 Driver** | 24.10-3.2.5 | `dkms status` |
| **DOCA** | 2.9.3 | `dpkg -l \| grep doca` |
### Verify best known configuration (BKC)
The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain best performance and compatibility.
While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.
To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:
1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
   firmware inventory; an optional `jq` shortcut for extracting just the version strings follows this list.
``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"
# Query active BKC bundle version
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
-u "${AUTH}" -k | json_pp
# Query active IFWI (Integrated Firmware Image)
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
-u "${AUTH}" -k | json_pp
```
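If `jq` is installed, you can extract just the version strings instead of inspecting the full JSON payloads. This assumes the standard Redfish `Version` property on the firmware inventory resources:
```bash
# Print only the active BKC bundle and IFWI versions.
curl -s -k -u "${AUTH}" "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" | jq -r '.Version'
curl -s -k -u "${AUTH}" "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" | jq -r '.Version'
```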
### Run basic system health checks
Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI300X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi300x.html#basic-health-checks)
or [MI325X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi325x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, PCIe Gen5 link stability, and so on.
### Configure your backend network (netplan)
Configure the backend NICs for high-bandwidth inter-node communication. Suppose
the GPUs' eight network interface controllers (NICs) are `eth2` through `eth9`. Each NIC
must have its own subnet that is disjoint from the others. For example, `eth2`
could use `192.168.50.0/24`, `eth3` could use `192.168.51.0/24`, and so on.
Each node needs a unique IP address on each subnet. You should use the same
final octet in each subnet for a given node. For example, one node would have
the addresses `192.168.50.2`, `192.168.51.2`, and so on. Another node might
have `192.168.50.3`, `192.168.51.3`, and so on. Ensure MTU is set to `4200`.
```{note}
Ensure you identify the correct interface names for your system using `ip link`
before applying this configuration.
```
For example, your `/etc/netplan/50-backend.yaml` might include something like
the following:
```yaml
eth2:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.50.2/24
mtu: 4200
eth3:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.51.2/24
mtu: 4200
eth4:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.52.2/24
mtu: 4200
eth5:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.53.2/24
mtu: 4200
eth6:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.54.2/24
mtu: 4200
eth7:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.55.2/24
mtu: 4200
eth8:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.56.2/24
mtu: 4200
eth9:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.57.2/24
mtu: 4200
```
To apply the configuration, use the following command.
```bash
sudo netplan apply
```
To verify your configuration, use the following command.
```bash
sudo apt install -y net-tools && ip -br a
```
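You can also confirm that each backend interface reports the expected state and MTU. A minimal sketch, assuming the interfaces are `eth2` through `eth9`:
```bash
# Print the interface name, MTU, and operational state for each backend NIC.
for i in $(seq 2 9); do
  ip link show "eth$i" | awk 'NR==1 {print $2, "mtu", $5, "state", $9}'
done
```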
### Configure your network file system (NFS)
Setting up a shared NFS volume facilitates centralized storage for models,
recipes, and logs across the cluster. Use the following commands to install the
necessary client tools and mount the remote directory.
```{important}
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
server details and desired local mount path.
```
``` bash
sudo apt update && sudo apt install -y nfs-common
sudo mkdir -p /mount/point
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
```
### Configure static hostname resolution for backend initialization (optional)
If the high-speed RDMA/IB interfaces are used for the initial distributed
coordination (such as `MASTER_ADDR`), you must configure static hostname
resolution. This ensures that cluster host names resolve to the backend network
IPs rather than the management or local loopback addresses.
Follow these steps to configure static hostname resolution:
1. Edit `/etc/hosts` on all nodes: for example, using `sudo vim /etc/hosts`.
2. Add the backend IP and hostname mappings.
3. Comment out any default local mappings (such as `127.0.1.1`) for the current
hostname to avoid resolution conflicts.
For example, your `/etc/hosts` entries might look like:
```text
# Map host names to backend network IPs
192.168.50.2 mori_test_01
192.168.50.3 mori_test_02
# Comment out the default entry to ensure resolution via the backend IP
# 127.0.1.1 mori_test_01 mori_test_01
```
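To confirm the entries take effect, resolve the host names and check that they map to the backend IPs rather than the loopback address (host names as in the example above):
```bash
# Both names should resolve to their 192.168.50.x backend addresses.
getent hosts mori_test_01 mori_test_02
# Optionally confirm reachability over the backend network.
ping -c 1 mori_test_02
```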
## Software installation
Next, install the essential software stack required to operate the AMD Instinct
GPUs and high-speed networking components. Follow these steps to deploy the
NVIDIA DOCA drivers for Mellanox ConnectX-7 NICs, the ROCm software stack, and
the necessary kernel modules to enable hardware acceleration.
### Install CX7 driver and firmware
1. Download and install the `DOCA 2.9.3` driver following the instructions in
[NVIDIA DOCA 2.9.3
Downloads](https://developer.nvidia.com/doca-2-9-3-download-archive?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-all&Distribution=Ubuntu&version=24.04&installer_type=deb_local).
2. Download the appropriate firmware for your hardware PSID from the [NVIDIA
official website](https://network.nvidia.com/support/firmware/connectx7/)
and flash the device.
3. To verify driver and firmware versions, use the following command. Replace
`IB Device` with your specific backend interface.
```bash
ethtool -i <IB Device>
```
### Install ROCm
Use the following commands to quickly install ROCm 7.0.2 on Ubuntu 24.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm
```
For detailed installation instructions, refer to the [ROCm 7.0.2
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#rocm-installation).
### Install AMD GPU Driver (amdgpu)
Use the following commands to quickly install the AMD GPU Driver (ROCm 7.0.2) on Ubuntu 24.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
```
For detailed installation instructions, refer to the [ROCm 7.0.2
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#amdgpu-driver-installation).
## Network verification and testing
Before deploying the inference engine, validate the health and performance of
the cluster interconnects.
### Verify network connectivity
Verify that all network interfaces are reachable across the cluster nodes.
Assuming `eth0` is the management interface, `eth1` is for the VPC, and `eth2`
through `eth9` are the dedicated RoCE backend interfaces, use the following
loop to test reachability to a remote node (for instance, a target node with
host IP suffix `.3`).
```bash
# Test connectivity for RoCE subnets 192.168.50.x through 192.168.57.x
for i in {0..7}; do ping -c 1 192.168.5${i}.3; done
```
### Validate your RDMA setup
Confirm that all eight RDMA network interfaces are in the `UP` state. Verify the MTU
setting of `4096` and ensure each device has a valid GID mapped to its assigned
IP address.
``` bash
ibv_devinfo -v
```
The output should look something like this:
``` bash
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.46.3048
...
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
...
GID[ 0]: fe80:0000:0000:0000:d894:24ff:fe4a:96e2, RoCE v1
GID[ 1]: fe80::d894:24ff:fe4a:96e2, RoCE v2
GID[ 2]: 0000:0000:0000:0000:0000:ffff:c0a8:3903, RoCE v1
GID[ 3]: ::ffff:192.168.57.3, RoCE v2
```
### Run RDMA bandwidth benchmarks
Verify the inter-node RDMA performance to ensure the network fabric can
saturate the link bandwidth.
#### Install RDMA performance tools
To get started, build the ROCm-optimized `rdma-perftest` test suite from
source:
```bash
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
git clone https://github.com/ROCm/rdma-perftest
cd rdma-perftest/
./autogen.sh
./configure --enable-rocm --with-rocm=/opt/rocm
make -j$(nproc)
sudo make install
```
#### Run a bandwidth test (GPU memory)
Perform a bandwidth test using ROCm GPU memory between two nodes. One acts
as a server and the other acts as a client. For 400G interfaces, the expected
peak throughput is approximately 390 Gbps. Replace `<SERVER_IP>` with the
appropriate IP.
```bash
# On Server Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a
# On Client Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a <SERVER_IP>
```
## vLLM serving and MoRI unit tests
### Install Docker Engine
Install the Docker engine to manage the containerized vLLM and MoRI serving
environments.
```bash
sudo apt update && sudo apt install -y docker.io
```
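Optionally, add your user to the `docker` group so the serving container can be launched without `sudo` (log out and back in, or run `newgrp docker`, for the change to take effect):
```bash
sudo usermod -aG docker "$USER"
```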
### Download the DeepSeek PTPC model
This guide uses the
[DeepSeek-R1-FP8-Dynamic](https://huggingface.co/EmbeddedLLM/deepseek-r1-FP8-Dynamic)
model optimized for PTPC. Use the following commands to install the Hugging
Face CLI and download the model to your shared NFS directory:
```bash
# Set up a virtual environment and install the Hugging Face CLI
sudo apt update && sudo apt install -y python3-venv
python3 -m venv ~/venvs/hf
source ~/venvs/hf/bin/activate
pip install huggingface_hub
# Download the model to the shared NFS mount point
huggingface-cli download --token <your_hf_token> \
EmbeddedLLM/deepseek-r1-FP8-Dynamic \
--local-dir /mount/point/models/EmbeddedLLM/deepseek-r1-FP8-Dynamic
```
### Launch the serving container
Deploy the vLLM MoRI serving Docker container on each node.
```bash
CONTAINER_NAME=vllm_mori
IMAGE_NAME=aigmkt/vllm:mori_rocm6.4.1_20251105
docker run -it \
--rm \
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v /mount/point/models:/models \
--shm-size 128G \
--name ${CONTAINER_NAME} \
${IMAGE_NAME} /bin/bash
```
### Run MoRI inter-node unit tests
Before starting the vLLM service, run the MoRI unit test to verify that the
inter-node communication backend is correctly configured.
The key configuration variables are:
* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization, such as `eth2`.
* `<MASTER_IP>`: The IP address of the primary node's backend interface.
```{note}
You can find reference performance data in the [ROCm/MoRI
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
```
```bash
# Set up environment inside the container
cd /app/mori
export PYTHONPATH=/app/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
# Node 0 (Primary)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
# Node 1 (Secondary)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
```
### Deploy and serve the model
To deploy DeepSeek-R1 (PTPC) with Expert Parallelism 16 (EP16) across two
nodes, use the following serving scripts.
#### Create serving scripts
Create the following scripts inside the container on each node.
* Node 0 (master node): `ep16_node0.sh`
```bash
#!/bin/bash
# Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_LOGGING_LEVEL=INFO
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ALL2ALL_BACKEND=mori
vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
-dp 16 \
--enable-expert-parallel \
--data-parallel-size-local 8 \
--data-parallel-address ${IP} \
--data-parallel-rpc-port 1212 \
--served-model-name deepseek \
--port 8777 \
--block-size 1 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--max-num-seqs 4096 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
--cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching \
--trust-remote-code 2>&1 | tee serving_node0_ep16.log
```
* Node 1: `ep16_node1.sh`
```bash
#!/bin/bash
# Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_LOGGING_LEVEL=INFO
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ALL2ALL_BACKEND=mori
vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
-dp 16 \
--enable-expert-parallel \
--headless \
--data-parallel-size-local 8 \
--data-parallel-start-rank 8 \
--data-parallel-address ${IP} \
--data-parallel-rpc-port 1212 \
--served-model-name deepseek \
--port 8777 \
--block-size 1 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--max-num-seqs 4096 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
--cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching \
--trust-remote-code 2>&1 | tee serving_node1_ep16.log
```
#### Run the serving scripts
Run the scripts on each node to launch the distributed serving instance.
Replace `<MASTER_IP>` with the backend network IP of Node 0.
```bash
# On Node 0 (Primary)
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
IP=<MASTER_IP> bash ep16_node0.sh
# On Node 1 (Secondary)
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
IP=<MASTER_IP> bash ep16_node1.sh
```
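Once both nodes finish loading the model, you can sanity-check the endpoint. This is a minimal sketch assuming the OpenAI-compatible server is reachable on port `8777` with the served model name `deepseek`, as configured in the scripts above:
```bash
# List the served models to confirm the API server is up.
curl http://<MASTER_IP>:8777/v1/models
# Send a small completion request against the served model name.
curl http://<MASTER_IP>:8777/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek", "prompt": "Hello, my name is", "max_tokens": 16}'
```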
## Reproducing performance
This section details how to reproduce the performance metrics published in the
AMD ROCm Blog: [Practical, Fault-Robust Distributed Inference for DeepSeek on
AMD
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).
### Configuration for EP16 (16 GPUs)
To achieve the reported throughput, expert parallelism 16 (EP16) is used across
the decode nodes.
#### Benchmark target
* Decode throughput: ~12.4k output tokens/s per node.
### Performance reproduction commands
Use the following configurations to reproduce published performance metrics.
#### Decode benchmark
To reproduce the 12.4k output tokens/s, use the following configuration:
```bash
#!/bin/bash
MAX_CONCURRENCY=${1:-3072}
TIMES=2
NUM_PROMPTS=$((MAX_CONCURRENCY*TIMES))
vllm bench serve \
--max-concurrency $MAX_CONCURRENCY \
--num-prompts $NUM_PROMPTS \
--model /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
--served-model-name deepseek \
--port 8777 \
--ignore-eos \
--trust-remote-code \
--dataset-name random \
--seed 2025 \
--random-input-len 2048 \
--random-output-len 1024 2>&1 | tee bench_decode_${MAX_CONCURRENCY}_isl_2k_osl_1k.log
```
To calculate the per-node throughput for comparison with the blog data, take
the reported **Peak output token throughput (tok/s)** from the benchmark
results and divide it by the total number of nodes in the cluster.
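For example, assuming a two-node EP16 deployment and that your benchmark log contains a `Peak output token throughput (tok/s)` line, a quick (hypothetical) helper might look like this, using the log file name from the `tee` target above:
```bash
# Divide the peak output token throughput by the number of nodes (2 in this setup).
NODES=2
grep "Peak output token throughput" bench_decode_3072_isl_2k_osl_1k.log | \
  awk -F':' -v n="$NODES" '{gsub(/[^0-9.]/, "", $2); printf "Per-node throughput: %.1f tok/s\n", $2/n}'
```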
## Troubleshooting
The following section outlines common issues and their solutions.
### Bandwidth test fails with error
1. Use ROCm-optimized `rdma-perftest`, not the generic `perftest`.
``` bash
which ib_write_bw
```
2. Confirm the `SERVER_IP` is accessible.
``` bash
ping <SERVER_IP>
```
3. Check the system logs. Use `dmesg` to look for kernel-level errors.
``` bash
sudo dmesg -T | grep -iE 'error|warn|fail|exception'
```
### vLLM EP 16 with MoRI backend fails to launch
1. If you see the error `Waiting for init message from front-end.`, check connectivity to the address set in `IP`. Disable the firewall/SELinux or allow traffic on port `1212`.
2. Verify server name resolution. Ensure server names are correctly mapped in `/etc/hosts`.
3. Confirm that the environment variable `GLOO_SOCKET_IFNAME` is set before running the vLLM serving script.

View File

@@ -26,6 +26,12 @@ training, fine-tuning, and inference. It leverages popular machine learning fram
- :doc:`SGLang inference performance testing <benchmark-docker/sglang>` - :doc:`SGLang inference performance testing <benchmark-docker/sglang>`
- :doc:`vLLM distributed inference with MoRI <benchmark-docker/vllm-mori-distributed>`
- :doc:`SGLang distributed inference with MoRI <benchmark-docker/sglang-mori-distributed>`
- :doc:`SGLang distributed inference with Mooncake <benchmark-docker/sglang-distributed>`
- :doc:`xDiT diffusion inference <xdit-diffusion-inference>` - :doc:`xDiT diffusion inference <xdit-diffusion-inference>`
- :doc:`Deploying your model <deploy-your-model>` - :doc:`Deploying your model <deploy-your-model>`

View File

@@ -52,7 +52,7 @@ accelerate training workloads:
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-megatron-lm-model-support-v25.11: .. _amd-megatron-lm-model-support-v26.01:
Supported models Supported models
================ ================
@@ -97,7 +97,7 @@ accelerate training workloads:
Some models, such as Llama, require an external license agreement through Some models, such as Llama, require an external license agreement through
a third party (for example, Meta). a third party (for example, Meta).
.. _amd-megatron-lm-performance-measurements-v25.11: .. _amd-megatron-lm-performance-measurements-v26.01:
Performance measurements Performance measurements
======================== ========================
@@ -129,7 +129,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your <rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration. system's configuration.
.. _mi300x-amd-megatron-lm-training-v25.11: .. _mi300x-amd-megatron-lm-training-v26.01:
Environment setup Environment setup
================= =================
@@ -138,7 +138,7 @@ Use the following instructions to set up the environment, configure the script t
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image. image.
.. _amd-megatron-lm-requirements-v25.11: .. _amd-megatron-lm-requirements-v26.01:
Download the Docker image Download the Docker image
------------------------- -------------------------
@@ -190,7 +190,7 @@ Download the Docker image
The Docker container hosts a verified commit of The Docker container hosts a verified commit of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__.
.. _amd-megatron-lm-environment-setup-v25.11: .. _amd-megatron-lm-environment-setup-v26.01:
Configuration Configuration
============= =============
@@ -200,39 +200,39 @@ Configuration
Update the ``train_llama3.sh`` configuration script in the ``examples/llama`` Update the ``train_llama3.sh`` configuration script in the ``examples/llama``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b .. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b
Update the ``train_llama2.sh`` configuration script in the ``examples/llama`` Update the ``train_llama2.sh`` configuration script in the ``examples/llama``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy .. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy
Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3`` Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b .. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b
Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2`` Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy .. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy
Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral`` Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. note:: .. note::
See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v25.11>` for more information on configuration options. See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v26.01>` for more information on configuration options.
Multi-node configuration Multi-node configuration
------------------------ ------------------------
@@ -240,7 +240,7 @@ Multi-node configuration
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands. training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands.
.. _amd-megatron-lm-tokenizer-v25.11: .. _amd-megatron-lm-tokenizer-v26.01:
Tokenizer Tokenizer
--------- ---------
@@ -377,7 +377,7 @@ Download the dataset
``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer. ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer.
Remember to either pre-download the tokenizer or setup Hugging Face access Remember to either pre-download the tokenizer or setup Hugging Face access
otherwise when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v25.11>` section. otherwise when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v26.01>` section.
.. note:: .. note::
@@ -479,13 +479,13 @@ Download the dataset
Ensure that the files are accessible inside the Docker container. Ensure that the files are accessible inside the Docker container.
.. _amd-megatron-lm-run-training-v25.11: .. _amd-megatron-lm-run-training-v26.01:
Run training Run training
============ ============
Use the following example commands to set up the environment, configure Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v25.11>`, and run training on :ref:`key options <amd-megatron-lm-benchmark-test-vars-v26.01>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment. MI300X Series GPUs with the AMD Megatron-LM environment.
Before starting training, export the following environment variables. Before starting training, export the following environment variables.
@@ -920,7 +920,7 @@ Single node training
RECOMPUTE_ACTIVATIONS=full \ RECOMPUTE_ACTIVATIONS=full \
CKPT_FORMAT=torch_dist CKPT_FORMAT=torch_dist
.. _amd-megatron-lm-multi-node-examples-v25.11: .. _amd-megatron-lm-multi-node-examples-v26.01:
Multi-node training examples Multi-node training examples
---------------------------- ----------------------------
@@ -971,7 +971,7 @@ training on 16 nodes, try the following command:
sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh
.. _amd-megatron-lm-benchmark-test-vars-v25.11: .. _amd-megatron-lm-benchmark-test-vars-v26.01:
Key options Key options
----------- -----------

View File

@@ -16,14 +16,23 @@ previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https
- Components - Components
- Resources - Resources
* - v25.11 * - v26.1 (latest)
- -
* ROCm 7.1.0 * ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1 * PyTorch 2.10.0.dev20251112+rocm7.1
- -
* :doc:`Primus Megatron documentation <../primus-megatron>` * :doc:`Primus Megatron documentation <../primus-megatron>`
* :doc:`Megatron-LM (legacy) documentation <../megatron-lm>` * :doc:`Megatron-LM (legacy) documentation <../megatron-lm>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
* - v25.11
-
* ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1
-
* :doc:`Primus Megatron documentation <primus-megatron-v25.11>`
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.10>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
* - v25.10 * - v25.10
- -

View File

@@ -37,7 +37,7 @@ GPUs containing essential components, including PyTorch, ROCm libraries, and
Megatron-LM utilities. It contains the following software components to Megatron-LM utilities. It contains the following software components to
accelerate training workloads: accelerate training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -146,7 +146,7 @@ image.
Download the Docker image Download the Docker image
------------------------- -------------------------
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
1. Use the following command to pull the Docker image from Docker Hub. 1. Use the following command to pull the Docker image from Docker Hub.
@@ -811,7 +811,7 @@ Single node training
Note that DeepSeek-V2-Lite is experiencing instability due to GPU memory access fault Note that DeepSeek-V2-Lite is experiencing instability due to GPU memory access fault
for large iterations. for large iterations.
For stability, it's recommended to use Primus for this workload. For stability, it's recommended to use Primus for this workload.
See :doc:`primus-megatron`. See :doc:`../primus-megatron`.
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b .. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b

View File

@@ -25,10 +25,10 @@ model training. Performance acceleration is powered by `Primus Turbo
<https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be <https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__. deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks, The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including Megatron-LM and :doc:`torchtitan <primus-pytorch>`. including Megatron-LM and :doc:`torchtitan <../primus-pytorch>`.
Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM
training <megatron-lm>` workflow. To learn how to migrate workloads from training <../megatron-lm>` workflow. To learn how to migrate workloads from
Megatron-LM to Primus with Megatron, see Megatron-LM to Primus with Megatron, see
:doc:`megatron-lm-primus-migration-guide`. :doc:`megatron-lm-primus-migration-guide`.
@@ -36,7 +36,7 @@ AMD provides a ready-to-use Docker images for MI355X, MI350X,
MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and
Megatron-LM. Megatron-LM.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -63,7 +63,7 @@ The following models are pre-optimized for performance on AMD Instinct GPUs.
Some instructions, commands, and training examples in this documentation Some instructions, commands, and training examples in this documentation
might vary by model -- select one to get started. might vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. raw:: html .. raw:: html
@@ -120,7 +120,7 @@ system's configuration.
Environment setup Environment setup
================= =================
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
Use the following instructions to set up the environment, configure the script to train models, and Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on AMD Instinct GPUs. reproduce the benchmark results on AMD Instinct GPUs.
@@ -129,7 +129,7 @@ Environment setup
Pull the Docker image Pull the Docker image
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
@@ -175,7 +175,7 @@ Configuration
Primus defines a training configuration in YAML for each model in Primus defines a training configuration in YAML for each model in
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__. `examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
{% for model_group in model_groups %} {% for model_group in model_groups %}
@@ -805,7 +805,7 @@ To run training on multiple nodes, you can use the
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__ `run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__
to launch the multi-node workload. Use the following steps to setup your environment: to launch the multi-node workload. Use the following steps to setup your environment:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
.. code-block:: shell .. code-block:: shell

View File

@@ -24,17 +24,17 @@ Primus now supports the PyTorch torchtitan backend.
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be <https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__. deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks, The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including torchtitan and :doc:`Megatron-LM <primus-megatron>`. including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
Primus with the PyTorch torchtitan backend is designed to replace the Primus with the PyTorch torchtitan backend is designed to replace the
:doc:`ROCm PyTorch training <pytorch-training>` workflow. See :doc:`ROCm PyTorch training <../pytorch-training>` workflow. See
:doc:`pytorch-training` to see steps to run workloads without Primus. :doc:`../pytorch-training` to see steps to run workloads without Primus.
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
MI300X GPUs containing essential components for Primus and PyTorch training MI300X GPUs containing essential components for Primus and PyTorch training
with Primus Turbo optimizations. with Primus Turbo optimizations.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -61,7 +61,7 @@ The following models are pre-optimized for performance on the AMD Instinct MI325
Some instructions, commands, and training recommendations in this documentation might Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started. vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. raw:: html .. raw:: html
@@ -96,7 +96,7 @@ vary by model -- select one to get started.
.. seealso:: .. seealso::
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models, For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see the documentation :doc:`pytorch-training` (without Primus) see the documentation :doc:`../pytorch-training` (without Primus)
.. _amd-primus-pytorch-performance-measurements-v2510: .. _amd-primus-pytorch-performance-measurements-v2510:
@@ -122,7 +122,7 @@ doesnt test configurations and run conditions outside those described.
Pull the Docker image Pull the Docker image
===================== =====================
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
Use the following command to pull the Docker image from Docker Hub. Use the following command to pull the Docker image from Docker Hub.
@@ -134,11 +134,11 @@ Run training
============ ============
Once the setup is complete, choose between the following two workflows to start benchmarking training. Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus). For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
tweak some configurations (such as batch sizes). tweak some configurations (such as batch sizes).
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}

View File

@@ -0,0 +1,422 @@
:orphan:
.. meta::
:description: How to train a model using PyTorch for ROCm.
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
****************************************
Training a model with Primus and PyTorch
****************************************
.. caution::
This documentation does not reflect the latest version of ROCm Primus PyTorch training
performance benchmark documentation. See :doc:`../primus-pytorch` for the latest version.
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
LLM training framework designed to streamline training. It streamlines LLM
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
Primus now supports the PyTorch torchtitan backend.
.. note::
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
Primus with the PyTorch torchtitan backend is designed to replace the
:doc:`ROCm PyTorch training <../pytorch-training>` workflow. See
:doc:`../pytorch-training` to see steps to run workloads without Primus.
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
MI300X GPUs containing essential components for Primus and PyTorch training
with Primus Turbo optimizations.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
.. tab-set::
.. tab-item:: {{ data.docker.pull_tag }}
:sync: {{ data.docker.pull_tag }}
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in data.docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-primus-pytorch-model-support-v25.11:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. seealso::
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see :doc:`../pytorch-training` (without Primus).
.. _amd-primus-pytorch-performance-measurements-v25.11:
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.
Pull the Docker image
=====================
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ data.docker.pull_tag }}
Run training
============
Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
tweak some configurations (such as batch sizes).
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
{% set docker = data.docker %}
{% set model_groups = data.model_groups %}
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
using one node with the {{ model.precision }} data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{ model.mad_tag }} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
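For a quick look at the collected results on the host, you can pretty-print the CSV --
a minimal sketch, assuming the standard ``column`` utility is available on your host
(the exact column names depend on your MAD version).
.. code-block:: shell
column -s, -t < ~/MAD/perf.csv | less -S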
{% endfor %}
{% endfor %}
.. tab-item:: Primus benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
.. rubric:: Download the Docker image and required packages
1. Pull the ``{{ docker.pull_tag }}`` Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ docker.pull_tag }}
2. Run the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
docker start training_env
docker exec -it training_env bash
The Docker container hosts verified commit ``c4c083de`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository.
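If you want to confirm which Primus commit your container provides, you can check from
inside the container -- a minimal sketch, assuming ``git`` is available in the image.
.. code-block:: shell
cd /workspace/Primus
git log -1 --oneline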
.. rubric:: Prepare training datasets and dependencies
The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
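Optionally, verify that the token is picked up before starting a long run -- a minimal
sketch, assuming the ``huggingface-cli`` tool is installed in the container.
.. code-block:: shell
huggingface-cli whoami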
.. rubric:: Pretraining
To get started, navigate to the ``Primus`` directory in your container.
.. code-block:: shell
cd /workspace/Primus
Now, to start the pretraining benchmark, use the ``run_pretrain.sh`` script
included with Primus, passing the appropriate options.
.. rubric:: Benchmarking examples
.. container:: model-doc primus_pyt_train_llama-3.1-8b
Use the following command to train Llama 3.1 8B with BF16 precision using Primus with the torchtitan backend.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 6
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
To train Llama 3.1 8B with FP8 precision, use the following command.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 7
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. container:: model-doc primus_pyt_train_llama-3.1-70b
Use the following command to train Llama 3.1 70B with BF16 precision using Primus with the torchtitan backend.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 6
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
To train Llama 3.1 70B with FP8 precision, use the following command.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 5
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. container:: model-doc primus_pyt_train_deepseek-v3-16b
Use the following command to train DeepSeek V3 16B with BF16 precision using Primus with the torchtitan backend.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 10
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
bash examples/run_pretrain.sh
{% endfor %}
{% endfor %}
Further reading
===============
- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`pytorch-training-history` to find documentation for previous releases
of the ``ROCm/pytorch-training`` Docker image.

View File

@@ -16,21 +16,30 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
- Components - Components
- Resources - Resources
* - v26.1 (latest)
-
* ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1
-
* :doc:`Primus PyTorch training documentation <../primus-pytorch>`
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
* - v25.11 * - v25.11
- -
* ROCm 7.1.0 * ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1 * PyTorch 2.10.0.dev20251112+rocm7.1
- -
* :doc:`Primus PyTorch Training documentation <../primus-pytorch>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.11>`
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.11>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
* - v25.10 * - v25.10
- -
* ROCm 7.1.0 * ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1 * PyTorch 2.10.0.dev20251112+rocm7.1
- -
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.10>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.10>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.10>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.10>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__
@@ -40,7 +49,7 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
* Primus 0.3.0 * Primus 0.3.0
* PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 * PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
- -
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.9>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.9>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.9>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.9>`
* `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__ * `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__
* `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__ * `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__
@@ -50,7 +59,7 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
* ROCm 6.4.3 * ROCm 6.4.3
* PyTorch 2.8.0a0+gitd06a406 * PyTorch 2.8.0a0+gitd06a406
- -
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.8>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.8>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__

View File

@@ -30,7 +30,7 @@ environment for fine-tuning and pretraining a model on AMD Instinct MI325X
and MI300X GPUs. It includes the following software components to accelerate and MI300X GPUs. It includes the following software components to accelerate
training workloads: training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -58,7 +58,7 @@ MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and
training recommendations in this documentation might vary by model -- select training recommendations in this documentation might vary by model -- select
one to get started. one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. raw:: html .. raw:: html
@@ -94,7 +94,7 @@ one to get started.
The following table lists supported training modes per model. The following table lists supported training modes per model.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. dropdown:: Supported training modes .. dropdown:: Supported training modes
@@ -164,7 +164,7 @@ doesnt test configurations and run conditions outside those described.
Run training Run training
============ ============
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}

View File

@@ -0,0 +1,669 @@
:orphan:
.. meta::
:description: How to train a model using PyTorch for ROCm.
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
**************************************
Training a model with PyTorch on ROCm
**************************************
.. caution::
This documentation does not reflect the latest version of ROCm PyTorch training
performance benchmark documentation. See :doc:`../pytorch-training` for the latest version.
.. note::
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
See :doc:`../primus-pytorch` for details.
PyTorch is an open-source machine learning framework that is widely used for
model training with GPU-optimized components for transformer-based models.
The PyTorch for ROCm training Docker image provides a prebuilt optimized
environment for fine-tuning and pretraining a model on AMD Instinct MI325X
and MI300X GPUs. It includes the following software components to accelerate
training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
.. tab-set::
.. tab-item:: {{ data.docker.pull_tag }}
:sync: {{ data.docker.pull_tag }}
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in data.docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-pytorch-training-model-support-v25.11:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct
MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and
training recommendations in this documentation might vary by model -- select
one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. _amd-pytorch-training-supported-training-modes-v25.11:
The following table lists supported training modes per model.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. dropdown:: Supported training modes
.. list-table::
:header-rows: 1
* - Model
- Supported training modes
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if model.training_modes %}
* - {{ model.model }}
- ``{{ model.training_modes | join('``, ``') }}``
{% endif %}
{% endfor %}
{% endfor %}
.. note::
Some model and fine-tuning combinations are not listed. This is
because the `upstream torchtune repository <https://github.com/pytorch/torchtune>`__
doesn't provide default YAML configurations for them.
For advanced usage, you can create a custom configuration to enable
unlisted fine-tuning methods by using an existing file in the
``/workspace/torchtune/recipes/configs`` directory as a template.
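As a sketch of that advanced workflow (the config and recipe names below are illustrative --
pick whichever file under ``/workspace/torchtune/recipes/configs`` is closest to your use case),
copy a template, edit it, and launch it with the matching torchtune recipe.
.. code-block:: shell
cd /workspace/torchtune
cp recipes/configs/llama3_1/8B_lora_single_device.yaml my_custom_config.yaml
# Edit my_custom_config.yaml, then run it with the corresponding recipe
tune run lora_finetune_single_device --config ./my_custom_config.yaml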
.. _amd-pytorch-training-performance-measurements-v25.11:
Performance measurements
========================
To evaluate performance, the
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
page provides reference throughput and latency measurements for training
popular AI models.
.. note::
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.
Run training
============
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
{% set docker = data.docker %}
{% set model_groups = data.model_groups %}
Once the setup is complete, choose between two options to start benchmarking training:
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
using one node with the {{ model.precision }} data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{ model.mad_tag }} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
{% endfor %}
{% endfor %}
.. tab-item:: Standalone benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following commands are tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
{% endfor %}
{% endfor %}
.. rubric:: Download the Docker image and required packages
1. Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ docker.pull_tag }}
2. Launch the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
docker start training_env
docker exec -it training_env bash
3. In the Docker container, clone the `<https://github.com/ROCm/MAD>`__
repository and navigate to the benchmark scripts directory
``/workspace/MAD/scripts/pytorch_train``.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train
.. rubric:: Prepare training datasets and dependencies
1. The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
2. Run the setup script to install libraries and datasets needed for benchmarking.
.. code-block:: shell
./pytorch_benchmark_setup.sh
.. container:: model-doc pyt_train_llama-3.1-8b
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``accelerate``
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
.. container:: model-doc pyt_train_llama-3.1-70b
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
* - ``torchdata``
- `TorchData <https://meta-pytorch.org/data/beta/index.html#torchdata>`__
* - ``tomli``
- `Tomli <https://pypi.org/project/tomli/>`__
* - ``tiktoken``
- `tiktoken <https://github.com/openai/tiktoken>`__
* - ``blobfile``
- `blobfile <https://pypi.org/project/blobfile/>`__
* - ``tabulate``
- `tabulate <https://pypi.org/project/tabulate/>`__
* - ``wandb``
- `Weights & Biases <https://github.com/wandb/wandb>`__
* - ``sentencepiece``
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
* - ``tensorboard``
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
.. container:: model-doc pyt_train_flux
``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``accelerate``
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`__ 3.2.0
* - ``sentencepiece``
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
* - ``tensorboard``
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
* - ``csvkit``
- `csvkit <https://csvkit.readthedocs.io/en/latest/>`__ 2.0.1
* - ``deepspeed``
- `DeepSpeed <https://github.com/deepspeedai/DeepSpeed>`__ 0.16.2
* - ``diffusers``
- `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`__ 0.31.0
* - ``GitPython``
- `GitPython <https://github.com/gitpython-developers/GitPython>`__ 3.1.44
* - ``opencv-python-headless``
- `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`__ 4.10.0.84
* - ``peft``
- `PEFT <https://huggingface.co/docs/peft/en/index>`__ 0.14.0
* - ``protobuf``
- `Protocol Buffers <https://github.com/protocolbuffers/protobuf>`__ 5.29.2
* - ``pytest``
- `PyTest <https://docs.pytest.org/en/stable/>`__ 8.3.4
* - ``python-dotenv``
- `python-dotenv <https://pypi.org/project/python-dotenv/>`__ 1.0.1
* - ``seaborn``
- `Seaborn <https://seaborn.pydata.org/>`__ 0.13.2
* - ``transformers``
- `Transformers <https://huggingface.co/docs/transformers/en/index>`__ 4.47.0
``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face:
* `frank-chieng/chinese_architecture_siheyuan <https://huggingface.co/datasets/frank-chieng/chinese_architecture_siheyuan>`__
{% for model_group in model_groups %}
{% for model in model_group.models %}
{% set training_modes = model.training_modes %}
{% set training_mode_descs = {
"pretrain": "Benchmark pre-training.",
"HF_pretrain": "Llama 3.1 8B pre-training with FP8 precision."
} %}
{% set available_modes = training_modes | select("in", ["pretrain", "HF_pretrain"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Pretraining
To start the pre-training benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
{% if model.mad_tag == "pyt_train_dlrm" %}
1. Go to the DLRM directory.
.. code-block:: shell
cd /workspace/DLRMBenchmark
2. To run the single node training benchmark for DLRM-v2 with TF32 precision,
run the following script.
.. code-block:: shell
./launch_training_single_node.sh
To run with MAD within the Docker container, use the following command.
.. code-block:: shell
./pytorch_benchmark_report.sh -t pretrain -m DLRM
{% else %}
.. code-block:: shell
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
{% if model.mad_tag == "pyt_train_flux" %}
.. container:: model-doc {{ model.mad_tag }}
.. note::
Currently, FLUX models are not supported out of the box in this Docker image.
To use FLUX, refer to the ``rocm/pytorch-training`` Docker documentation: :doc:`pytorch-training-v25.6`.
Occasionally, downloading the Flux dataset might fail. If this happens,
manually download it from Hugging Face at
`black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
and save it to ``/workspace/FluxBenchmark``. This ensures that the test script can access
the required dataset.
{% endif %}
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Between 2048 and 8192. 8192 by default.
- Sequence length for the language model.
{% endif %}
{% endif %}
{% set training_modes = model.training_modes %}
{% set training_mode_descs = {
"posttrain": "Benchmark post-training.",
} %}
{% set available_modes = training_modes | select("in", ["posttrain"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Post-training
To start the post-training benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
.. code-block:: shell
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Between 2048 and 8192. 8192 by default.
- Sequence length for the language model.
{% endif %}
{% set training_mode_descs = {
"finetune_fw": "Full weight fine-tuning (BF16 and FP8 supported).",
"finetune_lora": "LoRA fine-tuning (BF16 supported).",
"finetune_qlora": "QLoRA fine-tuning (BF16 supported).",
"HF_finetune_lora": "LoRA fine-tuning with Hugging Face PEFT.",
} %}
{% set available_modes = training_modes | select("in", ["finetune_fw", "finetune_lora", "finetune_qlora", "HF_finetune_lora"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Fine-tuning
To start the fine-tuning benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions,
as well as the :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v25.11>`.
.. code-block:: shell
./pytorch_benchmark_report.sh -t $training_mode \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if "finetune_fw" in available_modes %} or ``FP8``{% endif %}
- All models support BF16.{% if "finetune_fw" in available_modes %} FP8 is only available for full weight fine-tuning.{% endif %}
* - ``$sequence_length``
- Between 2048 and 16384.
- Sequence length for the language model.
{% if model.mad_tag in ["pyt_train_llama3.2-vision-11b", "pyt_train_llama-3.2-vision-90b"] %}
.. note::
For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B),
use the following torchtune commit for compatibility:
.. code-block:: shell
git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e
{% elif model.mad_tag in ["pyt_train_llama-2-7b", "pyt_train_llama-2-13b", "pyt_train_llama-2-70b"] %}
.. note::
You might encounter the following error with Llama 2: ``ValueError: seq_len (16384) of
input tensor should be smaller than max_seq_len (4096)``.
This error indicates that an input sequence is longer than the model's maximum context window.
Ensure your tokenized input does not exceed the model's ``max_seq_len`` (4096
tokens in this case). You can resolve this by truncating the input or splitting
it into smaller chunks before passing it to the model.
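When using this benchmark script, you can also avoid the error by requesting a
sequence length within the model's context window -- a minimal sketch (the training
mode follows the options table above; pick any mode supported by the model).
.. code-block:: shell
./pytorch_benchmark_report.sh -t $training_mode \
-m {{ model.model_repo }} \
-p BF16 \
-s 2048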
Note on reproducibility: The results in this guide are based on
commit ``b4c98ac`` from the upstream
`<https://github.com/pytorch/torchtune>`__ repository. For the
latest updates, you can use the main branch.
{% endif %}
{% endif %}
{% endfor %}
{% endfor %}
.. rubric:: Benchmarking examples
For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.
.. _amd-pytorch-training-multinode-examples-v25.11:
Multi-node training
-------------------
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training. See :ref:`rocm-for-ai-multi-node-setup-pyt-train-example` for example Slurm run commands.
Pre-training
~~~~~~~~~~~~
Multi-node training with torchtitan is supported. The provided SLURM script is pre-configured for Llama 3 70B.
To launch the training job on a SLURM cluster for Llama 3 70B, run the following commands from the MAD repository.
.. code-block:: shell
# In the MAD repository
cd scripts/pytorch_train
sbatch run_slurm_train.sh
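After submitting the job, you can monitor it with standard Slurm commands -- for
example (adjust to your cluster's conventions).
.. code-block:: shell
squeue -u $USER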
Fine-tuning
~~~~~~~~~~~
Multi-node training with torchtune is supported. The provided SLURM script is pre-configured for Llama 3.3 70B.
To launch the training job on a SLURM cluster for Llama 3.3 70B, run the following commands from the MAD repository.
.. code-block:: shell
huggingface-cli login # Get access to HF Llama model space
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally
# In the MAD repository
cd scripts/pytorch_train
sbatch Torchtune_Multinode.sh
.. note::
Information regarding benchmark setup:
* By default, Llama 3.3 70B is fine-tuned using ``alpaca_dataset``.
* You can adjust the torchtune `YAML configuration file
<https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_full_multinode.yaml>`__
if you're using a different model.
* The number of nodes and other parameters can be tuned in the SLURM script ``Torchtune_Multinode.sh``.
* Set the ``mounting_paths`` inside the SLURM script.
Once the run is finished, you can find the log files in the ``result_torchtune/`` directory.
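For example, to inspect the most recent run -- a minimal sketch (log file names depend
on your job configuration).
.. code-block:: shell
ls -lt result_torchtune/
tail -n 50 "result_torchtune/$(ls -t result_torchtune/ | head -n 1)"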
Further reading
===============
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`pytorch-training-history` to find documentation for previous releases
of the ``ROCm/pytorch-training`` Docker image.

View File

@@ -47,7 +47,7 @@ Megatron-LM.
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-primus-megatron-lm-model-support-v25.11: .. _amd-primus-megatron-lm-model-support-v26.01:
Supported models Supported models
================ ================
@@ -108,7 +108,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your <rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration. system's configuration.
.. _mi300x-amd-primus-megatron-lm-training-v25.11: .. _mi300x-amd-primus-megatron-lm-training-v26.01:
Environment setup Environment setup
================= =================
@@ -118,7 +118,7 @@ Environment setup
Use the following instructions to set up the environment, configure the script to train models, and Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on AMD Instinct GPUs. reproduce the benchmark results on AMD Instinct GPUs.
.. _amd-primus-megatron-lm-requirements-v25.11: .. _amd-primus-megatron-lm-requirements-v26.01:
Pull the Docker image Pull the Docker image
@@ -157,16 +157,16 @@ Pull the Docker image
docker start primus_training_env docker start primus_training_env
docker exec -it primus_training_env bash docker exec -it primus_training_env bash
The Docker container hosts verified commit ``c4c083de`` of the `Primus The Docker container hosts verified commit ``9c529cd4`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository. <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167>`__ repository.
.. _amd-primus-megatron-lm-environment-setup-v25.11: .. _amd-primus-megatron-lm-environment-setup-v26.01:
Configuration Configuration
============= =============
Primus defines a training configuration in YAML for each model in Primus defines a training configuration in YAML for each model in
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/examples/megatron/configs>`__. `examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/examples/megatron/configs>`__.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
@@ -207,7 +207,7 @@ You can use either mock data or real data for training.
Ensure that the files are accessible inside the Docker container. Ensure that the files are accessible inside the Docker container.
.. _amd-primus-megatron-lm-tokenizer-v25.11: .. _amd-primus-megatron-lm-tokenizer-v26.01:
Tokenizer Tokenizer
--------- ---------
@@ -220,15 +220,7 @@ right permissions to access the tokenizer for each model.
# Export your HF_TOKEN in the workspace # Export your HF_TOKEN in the workspace
export HF_TOKEN=<your_hftoken> export HF_TOKEN=<your_hftoken>
.. note:: .. _amd-primus-megatron-lm-run-training-v26.01:
In Primus, each model uses a tokenizer from Hugging Face. For example, Llama
3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and
``tokenizer_type: Llama3Tokenizer`` defined in the `llama3.1-8B model
<https://github.com/AMD-AGI/Primus/blob/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs/llama3.1_8B-pretrain.yaml>`__
definition.
.. _amd-primus-megatron-lm-run-training-v25.11:
Run training Run training
============ ============
@@ -252,7 +244,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.3 70B. The following run commands are tailored to Llama 3.3 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 3.3 70B BF16, run: To run pre-training for Llama 3.3 70B BF16, run:
@@ -263,8 +255,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.3_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -276,14 +270,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.3_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b .. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 8B. The following run commands are tailored to Llama 3.1 8B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 3.1 8B FP8, run: To run pre-training for Llama 3.1 8B FP8, run:
@@ -294,8 +290,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -307,8 +305,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
For Llama 3.1 8B BF16, use the following command: For Llama 3.1 8B BF16, use the following command:
@@ -319,8 +319,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -332,14 +334,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b .. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 70B. The following run commands are tailored to Llama 3.1 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 3.1 70B BF16, run: To run pre-training for Llama 3.1 70B BF16, run:
@@ -350,8 +354,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -363,8 +369,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml
To run the training on a single node for Llama 3.1 70B FP8, use the following command. To run the training on a single node for Llama 3.1 70B FP8, use the following command.
@@ -381,8 +389,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -394,18 +404,20 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh \ --log_file /tmp/primus_llama3.1_70B_fp8_proxy.log \
--train_iters 50 \ -- train pretrain \
--num_layers 40 \ --config examples/megatron/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
--fp8 hybrid \ --train_iters 50 \
--no_fp8_weight_transpose_cache true --num_layers 40 \
--fp8 hybrid \
--no_fp8_weight_transpose_cache true
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b .. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 7B. The following run commands are tailored to Llama 2 7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 2 7B FP8, run: To run pre-training for Llama 2 7B FP8, run:
@@ -416,8 +428,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama2_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama2_7B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -429,8 +443,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama2_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama2_7B-FP8-pretrain.yaml
To run pre-training for Llama 2 7B BF16, run: To run pre-training for Llama 2 7B BF16, run:
@@ -441,8 +457,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama2_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama2_7B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -454,14 +472,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b .. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 70B. The following run commands are tailored to Llama 2 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 2 70B BF16, run: To run pre-training for Llama 2 70B BF16, run:
@@ -472,8 +492,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama2_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama2_70B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -485,14 +507,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama2_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama2_70B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy .. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to DeepSeek-V3. The following run commands are tailored to DeepSeek-V3.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy, To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy,
use the following command: use the following command:
@@ -504,13 +528,15 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_deepseek_v3_proxy.log \
--num_layers 3 \ -- train pretrain \
--moe_layer_freq 1 \ --config examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml \
--train_iters 50 \ --num_layers 3 \
--micro_batch_size 8 \ --moe_layer_freq 1 \
--global_batch_size 64 --train_iters 50 \
--micro_batch_size 8 \
--global_batch_size 64
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -522,17 +548,21 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/deepseek_v3-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_deepseek_v3_proxy.log \
--num_layers 3 \ -- train pretrain \
--moe_layer_freq 1 \ --config examples/megatron/configs/MI300X/deepseek_v3-BF16-pretrain.yaml \
--train_iters 50 --num_layers 3 \
--moe_layer_freq 1 \
--micro_batch_size 3 \
--global_batch_size 192 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b .. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to DeepSeek-V2-Lite. The following run commands are tailored to DeepSeek-V2-Lite.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16, To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16,
use the following command: use the following command:
@@ -544,8 +574,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v2_lite.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -557,14 +589,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v2_lite.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b .. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x7B. The following run commands are tailored to Mixtral 8x7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Mixtral 8x7B (MoE with expert parallel), To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
use the following command: use the following command:
@@ -576,8 +610,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/mixtral_8x7B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_mixtral_8x7B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/mixtral_8x7B_v0.1-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -589,15 +625,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_mixtral_8x7B.log \
--train_iters 50 -- train pretrain \
--config examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy .. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x22B. The following run commands are tailored to Mixtral 8x22B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy, To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy,
use the following command: use the following command:
@@ -609,8 +646,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_mixtral_8x22B_proxy.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/mixtral_8x22B_v0.1-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -622,19 +661,21 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_mixtral_8x22B_proxy.log \
--train_iters 50 \ -- train pretrain \
--num_layers 4 \ --config examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \
--pipeline_model_parallel_size 1 \ --num_layers 4 \
--micro_batch_size 1 \ --pipeline_model_parallel_size 1 \
--global_batch_size 16 --micro_batch_size 1 \
--global_batch_size 16 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b .. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 7B. The following run commands are tailored to Qwen 2.5 7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Qwen 2.5 7B BF16, use the following To run training on a single node for Qwen 2.5 7B BF16, use the following
command: command:
@@ -646,8 +687,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/qwen2.5_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/qwen2.5_7B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -659,8 +702,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml
For FP8, use the following command. For FP8, use the following command.
@@ -671,8 +716,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/qwen2.5_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/qwen2.5_7B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -684,14 +731,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/qwen2.5_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/qwen2.5_7B-FP8-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b .. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 72B. The following run commands are tailored to Qwen 2.5 72B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Qwen 2.5 72B BF16, use the following command. To run training on a single node for Qwen 2.5 72B BF16, use the following command.
@@ -702,11 +751,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/qwen2.5_72B-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_qwen2.5_72B.log \
--train_iters 50 \ -- train pretrain \
--micro_batch_size 16 \ --config examples/megatron/configs/MI355X/qwen2.5_72B-BF16-pretrain.yaml
--global_batch_size 256
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -718,10 +766,12 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_72B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml
.. _amd-primus-megatron-multi-node-examples-v25.11: .. _amd-primus-megatron-multi-node-examples-v26.01:
Multi-node training examples Multi-node training examples
---------------------------- ----------------------------
@@ -730,7 +780,7 @@ Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure y
training. training.
To run training on multiple nodes, you can use the To run training on multiple nodes, you can use the
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__ `run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/9c529cd4a934a68a880ede036c3e97b792e38167/examples/run_slurm_pretrain.sh>`__
to launch the multi-node workload. Use the following steps to set up your environment: to launch the multi-node workload. Use the following steps to set up your environment:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
@@ -763,13 +813,13 @@ to launch the multi-node workload. Use the following steps to setup your environ
* If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect them. However, since NICs can vary across clusters, it is recommended to explicitly export the NCCL parameters for your cluster (see the example after this list). * If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect them. However, since NICs can vary across clusters, it is recommended to explicitly export the NCCL parameters for your cluster (see the example after this list).
* To find your network interface, you can use ``ip a``. * To find your network interface, you can use ``ip a``.
* To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices. * To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices.
* Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v25.11`) as appropriate. * Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v26.01`) as appropriate.
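For reference, a minimal sketch of the exports described in the list above. The interface and HCA names are placeholders, not values from this guide -- substitute what ``ip a`` and ``ibv_devices`` report on your cluster:

.. code-block:: shell

   # Placeholder NIC and RDMA device names -- replace with the values
   # reported by `ip a` and `ibv_devices` on your nodes.
   export NCCL_SOCKET_IFNAME=ens51np0
   export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3

   # Required by the multi-node launcher; use your own image tag and token.
   export DOCKER_IMAGE=<training Docker image>
   export HF_TOKEN=<your Hugging Face token>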
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b .. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 8B. The following run commands are tailored to Llama 3.1 8B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 3.1 8B FP8 on 8 nodes, run: To train Llama 3.1 8B FP8 on 8 nodes, run:
@@ -786,7 +836,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 7B. The following run commands are tailored to Llama 2 7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 2 7B FP8 on 8 nodes, run: To train Llama 2 7B FP8 on 8 nodes, run:
@@ -803,7 +853,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 70B. The following run commands are tailored to Llama 3.1 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 3.1 70B FP8 on 8 nodes, run: To train Llama 3.1 70B FP8 on 8 nodes, run:
@@ -833,7 +883,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 70B. The following run commands are tailored to Llama 2 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 2 70B FP8 on 8 nodes, run: To train Llama 2 70B FP8 on 8 nodes, run:
@@ -863,7 +913,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.3 70B. The following run commands are tailored to Llama 3.3 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 3.3 70B FP8 on 8 nodes, run: To train Llama 3.3 70B FP8 on 8 nodes, run:
@@ -893,7 +943,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x7B. The following run commands are tailored to Mixtral 8x7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Mixtral 8x7B BF16 on 8 nodes, run: To train Mixtral 8x7B BF16 on 8 nodes, run:
@@ -911,7 +961,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 72B. The following run commands are tailored to Qwen 2.5 72B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Qwen2.5 72B FP8 on 8 nodes, run: To train Qwen2.5 72B FP8 on 8 nodes, run:
@@ -926,7 +976,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
--global_batch_size 512 \ --global_batch_size 512 \
--recompute_num_layers 80 \ --recompute_num_layers 80 \
.. _amd-primus-megatron-lm-benchmark-test-vars-v25.11: .. _amd-primus-megatron-lm-benchmark-test-vars-v26.01:
Key options Key options
----------- -----------

View File

@@ -45,7 +45,7 @@ with Primus Turbo optimizations.
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-primus-pytorch-model-support-v25.11: .. _amd-primus-pytorch-model-support-v26.01:
Supported models Supported models
================ ================
@@ -91,7 +91,7 @@ vary by model -- select one to get started.
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models, For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see the :doc:`pytorch-training` documentation (without Primus). see the :doc:`pytorch-training` documentation (without Primus).
.. _amd-primus-pytorch-performance-measurements-v25.11: .. _amd-primus-pytorch-performance-measurements-v26.01:
System validation System validation
================= =================
@@ -146,7 +146,7 @@ tweak some configurations (such as batch sizes).
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}. The following run command is tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local 1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine. directory and install the required packages on the host machine.
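As a sketch, the clone-and-install step usually looks like the following; the requirements file name is an assumption, so defer to the MAD repository README if it differs:

.. code-block:: shell

   git clone https://github.com/ROCm/MAD
   cd MAD
   pip install -r requirements.txt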
@@ -184,7 +184,7 @@ tweak some configurations (such as batch sizes).
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}. The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
.. rubric:: Download the Docker image and required packages .. rubric:: Download the Docker image and required packages
@@ -220,8 +220,8 @@ tweak some configurations (such as batch sizes).
docker start training_env docker start training_env
docker exec -it training_env bash docker exec -it training_env bash
The Docker container hosts verified commit ``c4c083de`` of the `Primus The Docker container hosts verified commit ``9c529cd4`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository. <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/>`__ repository.
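To confirm the Primus commit inside a running container (a quick check, assuming the repository is checked out at ``/workspace/Primus`` as used in the run commands above):

.. code-block:: shell

   cd /workspace/Primus
   git rev-parse HEAD   # expect the hash to start with 9c529cd4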
.. rubric:: Prepare training datasets and dependencies .. rubric:: Prepare training datasets and dependencies
@@ -257,24 +257,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 6 --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
--training.local_batch_size 6
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
To train Llama 3.1 8B with FP8 precision, use the following command. To train Llama 3.1 8B with FP8 precision, use the following command.
@@ -285,24 +292,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 7 --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
--training.local_batch_size 7
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
.. container:: model-doc primus_pyt_train_llama-3.1-70b .. container:: model-doc primus_pyt_train_llama-3.1-70b
@@ -315,24 +329,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 6 --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
--training.local_batch_size 6
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml
To train Llama 3.1 70B with FP8 precision, use the following command. To train Llama 3.1 70B with FP8 precision, use the following command.
@@ -343,24 +364,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 5 --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
--training.local_batch_size 5
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml
.. container:: model-doc primus_pyt_train_deepseek-v3-16b .. container:: model-doc primus_pyt_train_deepseek-v3-16b
@@ -373,24 +401,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v3_16b.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 10 --log_file /tmp/primus_deepseek_v3_16b.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
--training.local_batch_size 10
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v3_16b.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml
{% endfor %} {% endfor %}
{% endfor %} {% endfor %}

View File

@@ -43,7 +43,7 @@ training workloads:
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-pytorch-training-model-support-v25.11: .. _amd-pytorch-training-model-support-v26.01:
Supported models Supported models
================ ================
@@ -85,7 +85,7 @@ one to get started.
</div> </div>
</div> </div>
.. _amd-pytorch-training-supported-training-modes-v25.11: .. _amd-pytorch-training-supported-training-modes-v26.01:
The following table lists supported training modes per model. The following table lists supported training modes per model.
@@ -120,7 +120,7 @@ The following table lists supported training modes per model.
unlisted fine-tuning methods by using an existing file in the unlisted fine-tuning methods by using an existing file in the
``/workspace/torchtune/recipes/configs`` directory as a template. ``/workspace/torchtune/recipes/configs`` directory as a template.
.. _amd-pytorch-training-performance-measurements-v25.11: .. _amd-pytorch-training-performance-measurements-v26.01:
Performance measurements Performance measurements
======================== ========================
@@ -176,7 +176,7 @@ Run training
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}. The following run command is tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model. See :ref:`amd-pytorch-training-model-support-v26.01` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local 1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine. directory and install the required packages on the host machine.
@@ -214,7 +214,7 @@ Run training
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following commands are tailored to {{ model.model }}. The following commands are tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model. See :ref:`amd-pytorch-training-model-support-v26.01` to switch to another available model.
{% endfor %} {% endfor %}
{% endfor %} {% endfor %}
@@ -409,6 +409,10 @@ Run training
{% if model.mad_tag == "pyt_train_dlrm" %} {% if model.mad_tag == "pyt_train_dlrm" %}
.. note::
DLRM is supported on MI300X, MI325X, MI350X, and MI355X GPUs.
1. Go to the DLRM directory. 1. Go to the DLRM directory.
.. code-block:: shell .. code-block:: shell
@@ -532,7 +536,7 @@ Run training
To start the fine-tuning benchmark, use the following command with the To start the fine-tuning benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions. appropriate options. See the following list of options and their descriptions.
See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v25.11>`. See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v26.01>`.
.. code-block:: shell .. code-block:: shell
@@ -597,7 +601,7 @@ Run training
For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__. For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.
.. _amd-pytorch-training-multinode-examples-v25.11: .. _amd-pytorch-training-multinode-examples-v26.01:
Multi-node training Multi-node training
------------------- -------------------

View File

@@ -119,6 +119,10 @@ subtrees:
title: PyTorch inference performance testing title: PyTorch inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
title: SGLang inference performance testing title: SGLang inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed.md
title: vLLM distributed inference with MoRI
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed.md
title: SGLang distributed inference with MoRI
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
title: SGLang distributed inference with Mooncake title: SGLang distributed inference with Mooncake
- file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst - file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst