Merge branch 'develop' into docs/xdit-diffusion-v25-12

Spelling, added 'js'
Simplify yaml file and cleanup main rst page.
2026-01-09 14:48:06 -05:00 · 2025-12-16 12:56:10 +01:00 · 2025-12-15 11:58:20 +01:00 · 2025-12-15 11:53:46 +01:00 · 2025-12-15 08:26:54 +01:00 · 2025-12-12 15:53:51 +01:00
14 changed files with 568 additions and 34 deletions
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -36,6 +36,7 @@ Andrej
 Arb
 Autocast
 autograd
+Backported
 BARs
 BatchNorm
 BLAS
@@ -138,6 +139,7 @@ ESXi
 EP
 EoS
 etcd
+equalto
 fas
 FBGEMM
 FiLM
@@ -202,9 +204,11 @@ GenAI
 GenZ
 GitHub
 Gitpod
+hardcoded
 HBM
 HCA
 HGX
+HLO
 HIPCC
 hipDataType
 HIPExtension
@@ -226,6 +230,8 @@ href
 Hyperparameters
 HybridEngine
 Huggingface
+Hunyuan
+HunyuanVideo
 IB
 ICD
 ICT
@@ -258,6 +264,7 @@ Ioffe
 JAX's
 JAXLIB
 Jinja
+js
 JSON
 Jupyter
 KFD
@@ -329,6 +336,7 @@ MoEs
 Mooncake
 Mpops
 Multicore
+multihost
 Multithreaded
 mx
 MXFP
@@ -541,6 +549,7 @@ UAC
 UC
 UCC
 UCX
+ud
 UE
 UIF
 UMC
@@ -852,6 +861,7 @@ pallas
 parallelization
 parallelizing
 param
+params
 parameterization
 passthrough
 pe
@@ -898,6 +908,7 @@ querySelectorAll
 queueing
 qwen
 radeon
+rc
 rccl
 rdc
 rdma
@@ -959,6 +970,7 @@ scalability
 scalable
 scipy
 seealso
+selectattr
 selectedTag
 sendmsg
 seqs
@@ -1020,6 +1032,7 @@ uncacheable
 uncorrectable
 underoptimized
 unhandled
+unfused
 uninstallation
 unmapped
 unsqueeze
@@ -1062,6 +1075,8 @@ writebacks
 wrreq
 wzo
 xargs
+xdit
+xDiT
 xGMI
 xPacked
 xz
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -39,7 +39,11 @@ for a complete overview of this release.
  - VMs were incorrectly reporting `AMDSMI_STATUS_API_FAILED` when unable to get the power cap within the `amdsmi_get_power_info`.
  - The API now returns `N/A` or `UINT_MAX` for values that can't be retrieved, instead of failing.

- Fixed output for `amd-smi xgmi -l --json`.  
+- Fixed output for `amd-smi xgmi -l --json`.
+
+```{note}
+See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md#amd_smi_lib-for-rocm-711) for details, examples, and in-depth descriptions.
+```

 ### **Composable Kernel** (1.1.0)

--- a/RELEASE.md
+++ b/RELEASE.md
@@ -100,12 +100,13 @@ firmware, AMD GPU drivers, and the ROCm user space software.
              01.25.16.03<br>
              01.25.15.04
          </td>
-          <td rowspan="2" style="vertical-align: middle;">
+          <td>
              30.20.1<br>
              30.20.0<br>
              30.10.2<br>
              30.10.1<br>
-              30.10</td>
+              30.10
+            </td>
          <td rowspan="3" style="vertical-align: middle;">8.6.0.K</td>
      </tr>
      <tr>
@@ -114,6 +115,13 @@ firmware, AMD GPU drivers, and the ROCm user space software.
              01.25.16.03<br>
              01.25.15.04
          </td>
+          <td>
+              30.20.1<br>
+              30.20.0<br>
+              30.10.2<br>
+              30.10.1<br>
+              30.10
+            </td>
      </tr>
      <tr>
          <td>MI325X<a href="#footnote1"><sup>[1]</sup></a></td>
@@ -674,7 +682,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
 - Fixed output for `amd-smi xgmi -l --json`.  

 ```{note}
-See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md#amd_smi_lib-for-rocm-710) for details, examples, and in-depth descriptions.
+See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md#amd_smi_lib-for-rocm-711) for details, examples, and in-depth descriptions.
 ```

 ### **Composable Kernel** (1.1.0)
@@ -831,7 +839,7 @@ issues related to individual components, review the [Detailed component changes]

 ### RCCL performance degradation on AMD Instinct MI300X GPU with AMD Pollara AI NIC

-If you’re using RCCL on AMD Instinct MI300X GPUs with AMD Pollara AI NIC, you might observe performance degradation for specific collectives and message sizes. The affected collectives are `Scatter`, `AllToAll`, and `AlltoAllv`. It's recommended to avoid using RCCL packaged with ROCm 7.1.1. As a workaround, use the {fab}`github`[RCCL `develop` branch](https://github.com/ROCm/rccl/tree/develop), which contains the fix and will be included in a future ROCm release.
+If you’re using RCCL on AMD Instinct MI300X GPUs with AMD Pollara AI NIC, you might observe performance degradation for specific collectives and message sizes. The affected collectives are `Scatter`, `AllToAll`, and `AlltoAllv`. It's recommended to avoid using RCCL packaged with ROCm 7.1.1. As a workaround, use the {fab}`github`[RCCL `develop` branch](https://github.com/ROCm/rccl/tree/develop), which contains the fix and will be included in a future ROCm release. See [GitHub issue #5717](https://github.com/ROCm/ROCm/issues/5717).

 ### Segmentation fault in training models using TensorFlow 2.20.0 Docker images

@@ -839,7 +847,7 @@ Training models `tf2_tfm_resnet50_fp16_train` and `tf2_tfm_resnet50_fp32_train`
 might fail with a segmentation fault when run on the TensorFlow 2.20.0 Docker
 image with ROCm 7.1.1. As a workaround, use TensorFlow 2.19.x Docker image for
 training the models in ROCm 7.1.1. This issue will be fixed in a future ROCm
-release.
+release. See [GitHub issue #5718](https://github.com/ROCm/ROCm/issues/5718).

 ### AMD SMI CLI triggers repeated kernel errors on GPUs with partitioning support

@@ -858,27 +866,19 @@ amdgpu 0000:15:00.0: amdgpu: renderD153 partition 1 not valid!
 These repeated kernel logs can clutter the system logs and may cause
 unnecessary concern about GPU health. However, this is a non-functional issue
 and does not affect AMD SMI functionality or GPU performance. This issue will
-be fixed in a future ROCm release.
+be fixed in a future ROCm release. See [GitHub issue #5720](https://github.com/ROCm/ROCm/issues/5720).

 ### Excessive bad page logs in AMD GPU Driver (amdgpu)

-Due to partial data corruption of Electrically Erasable Programmable Read-Only Memory (EEPROM) and limited error handling in the AMD GPU Driver(amdgpu), excessive log output might result when querying the reliability, availability, and serviceability (RAS) bad pages. This issue will be fixed in a future AMD GPU Driver(amdgpu) and ROCm release.
+Due to partial data corruption in the Electrically Erasable Programmable Read-Only Memory (EEPROM) and limited error handling in the AMD GPU Driver (amdgpu), excessive log output might occur when querying the reliability, availability, and serviceability (RAS) bad pages. This issue will be fixed in a future AMD GPU Driver (amdgpu) and ROCm release. See [GitHub issue #5719](https://github.com/ROCm/ROCm/issues/5719).

-### OpenBLAS runtime dependency for hipblastlt-test and hipblaslt-bench
+### Incorrect results in gemm_ex operations for rocBLAS and hipBLAS

-Running `hipblaslt-test` or `hipblaslt-bench` without installing the OpenBLAS development package results in the following error:
-```
-libopenblas.so.0: cannot open shared object file: No such file or directory
-```
-As a workaround, first install `libopenblas-dev` or `libopenblas-deve`, depending on the package manager used. The issue will be fixed in a future ROCm release. See [GitHub issue #5639](https://github.com/ROCm/ROCm/issues/5639).
+Some `gemm_ex` operations with 8-bit input data types (`int8`, `float8`, `bfloat8`) for specific matrix dimensions (K = 1 and number of workgroups > 1) might yield incorrect results. The issue results from incorrect tailloop code that fails to consider the workgroup index when calculating the valid element size. The issue will be fixed in a future ROCm release. See [GitHub issue #5722](https://github.com/ROCm/ROCm/issues/5722).

-### Reduced precision in gemm_ex operations for rocBLAS and hipBLAS
+### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs

-Some `gemm_ex` operations with `half` or `f32_r` data types might yield 16-bit precision results instead of the expected 32-bit precision when matrix dimensions are m=1 or n=1. The issue results from the optimization that enables `_ex` APIs to use lower precision multiples. It limits the high-precision matrix operations performed in PyTorch with rocBLAS and hipBLAS. The issue will be fixed in a future ROCm release. See [GitHub issue #5640](https://github.com/ROCm/ROCm/issues/5640).
-
-### RCCL profiler plugin failure with AllToAll operations
-
-The RCCL profiler plugin `librccl-profiler.so` might fail with a segmentation fault during `AllToAll` collective operations due to improperly assigned point-to-point task function pointers. This leads to invalid memory access and prevents profiling of `AllToAll` performance. Other operations, like `AllReduce`, are unaffected. It's recommended to avoid using the RCCL profiler plugin with `AllToAll` operations until the fix is available. This issue is resolved in the {fab}`github`[RCCL `develop` branch](https://github.com/ROCm/rccl/tree/develop) and will be part of a future ROCm release. See [GitHub issue #5653](https://github.com/ROCm/ROCm/issues/5653).
+If you’re using hipBLASLt on AMD Instinct MI325X GPUs for large FP8 GEMM operations (such as 9728x8192x65536), you might observe a noticeable performance variation. The issue is currently under investigation and will be fixed in a future ROCm release. See [GitHub issue #5734](https://github.com/ROCm/ROCm/issues/5734).

 ## ROCm resolved issues

--- a/docs/compatibility/compatibility-matrix-historical-6.0.csv
+++ b/docs/compatibility/compatibility-matrix-historical-6.0.csv
@@ -30,7 +30,7 @@ ROCm Version,7.1.1,7.1.0,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6
      ,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908
      ,,,,,,,,,,,,,,,,,,,,,,
      FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,,,
-      :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8, 2.7","2.8, 2.7, 2.6","2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
+      :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8","2.8, 2.7, 2.6","2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
      :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
      :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.7.1,0.7.1,0.6.0,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
      :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A,N/A
--- a/docs/compatibility/compatibility-matrix.rst
+++ b/docs/compatibility/compatibility-matrix.rst
@@ -54,7 +54,7 @@ compatibility and system requirements.
      ,gfx908,gfx908,gfx908
      ,,,
      FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix:,,
-      :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8, 2.7","2.8, 2.7, 2.6","2.6, 2.5, 2.4, 2.3"
+      :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8","2.8, 2.7, 2.6","2.6, 2.5, 2.4, 2.3"
      :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.18.1, 2.17.1, 2.16.2"
      :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.7.1,0.7.1,0.4.35
      :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,2.4.0
--- a/docs/compatibility/ml-compatibility/jax-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/jax-compatibility.rst
@@ -269,6 +269,33 @@ For a complete and up-to-date list of JAX public modules (for example, ``jax.num
  JAX API modules are maintained by the JAX project and is subject to change.
  Refer to the official Jax documentation for the most up-to-date information.

+Key features and enhancements for ROCm 7.1
+===============================================================================
+
+- Enabled compilation of multihost HLO runner Python bindings.
+
+  - Backported multihost HLO runner bindings and some related changes to
+    :code:`FunctionalHloRunner`.
+
+  - Added :code:`requirements_lock_3_12` to enable building for Python 3.12.
+
+- Removed hardcoded NHWC convolution layout for ``fp16`` precision to address the performance drops for ``fp16`` precision on gfx12xx GPUs.
+
+
+- ROCprofiler-SDK integration:
+
+  - Integrated ROCprofiler-SDK (v3) into XLA to improve profiling of GPU events
+    and support both time-based and step-based profiling.
+
+  - Added unit tests for :code:`rocm_collector` and :code:`rocm_tracer`.
+
+- Added Triton unsupported conversion from ``f8E4M3FNUZ`` to ``fp16`` with
+  rounding mode.
+
+- Introduced :code:`CudnnFusedConvDecomposer` to revert fused convolutions
+  when :code:`ConvAlgorithmPicker` fails to find a fused algorithm, and removed
+  unfused fallback paths from :code:`RocmFusedConvRunner`.
+
 Key features and enhancements for ROCm 7.0
 ===============================================================================

--- a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst
+++ b/docs/compatibility/ml-compatibility/pytorch-compatibility.rst
@@ -401,25 +401,25 @@ with ROCm.

 Key features and enhancements for PyTorch 2.9 with ROCm 7.1.1
 ================================================================================
- Scaled Dot Product Attention (SDPA) upgraded to use AOTriton version 0.11b
+- Scaled Dot Product Attention (SDPA) upgraded to use AOTriton version 0.11b.

- Default hipBLASLt support enabled for gfx908 architecture on ROCm 6.3 and later
+- Default hipBLASLt support enabled for gfx908 architecture on ROCm 6.3 and later.

- MIOpen now supports channels last memory format for 3D convolutions and batch normalization
+- MIOpen now supports channels last memory format for 3D convolutions and batch normalization.

- NHWC convolution operations in MIOpen optimized by eliminating unnecessary transpose operations
+- NHWC convolution operations in MIOpen optimized by eliminating unnecessary transpose operations.

- Improved tensor.item() performance by removing redundant synchronization
+- Improved tensor.item() performance by removing redundant synchronization.

- Enhanced performance for element-wise operations and reduction kernels
+- Enhanced performance for element-wise operations and reduction kernels.

- Added support for grouped GEMM operations through fbgemm_gpu generative AI components
+- Added support for grouped GEMM operations through fbgemm_gpu generative AI components.

- Resolved device error in Inductor when using CUDA graph trees with HIP
+- Resolved device error in Inductor when using CUDA graph trees with HIP.

- Corrected logsumexp scaling in AOTriton-based SDPA implementation
+- Corrected logsumexp scaling in AOTriton-based SDPA implementation.

- Added stream graph capture status validation in memory copy synchronization functions
+- Added stream graph capture status validation in memory copy synchronization functions.

 Key features and enhancements for PyTorch 2.8 with ROCm 7.1
 ================================================================================
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -145,6 +145,7 @@ article_pages = [
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]},
@@ -249,3 +250,6 @@ html_context = {
    "granularity_type" : [('Coarse-grained', 'coarse-grained'), ('Fine-grained', 'fine-grained')],
    "scope_type" : [('Device', 'device'), ('System', 'system')]
 }
+
+# Disable figure and table numbering
+numfig = False
--- a/docs/data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+++ b/docs/data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
@@ -0,0 +1,91 @@
+docker:
+  pull_tag: rocm/pytorch-xdit:v25.12
+  docker_hub_url: https://hub.docker.com/r/rocm/pytorch-xdit
+  ROCm: 7.10.0
+  whats_new:
+      - "Adds T2V and TI2V support for Wan models."
+      - "Adds support for SD-3.5 T2I model."
+  components:
+    TheRock: 
+      version: 3e3f834
+      url: https://github.com/ROCm/TheRock
+    rccl:
+      version: d23d18f
+      url: https://github.com/ROCm/rccl
+    composable_kernel:
+      version: 2570462
+      url: https://github.com/ROCm/composable_kernel
+    rocm-libraries:
+      version: 0588f07
+      url: https://github.com/ROCm/rocm-libraries
+    rocm-systems:
+      version: 473025a
+      url: https://github.com/ROCm/rocm-systems
+    torch:
+      version: 73adac
+      url: https://github.com/pytorch/pytorch
+    torchvision:
+      version: f5c6c2e
+      url: https://github.com/pytorch/vision
+    triton:
+      version: 7416ffc
+      url: https://github.com/triton-lang/triton
+    accelerate:
+      version: 34c1779
+      url: https://github.com/huggingface/accelerate
+    aiter:
+      version: de14bec
+      url: https://github.com/ROCm/aiter
+    diffusers:
+      version: 40528e9
+      url: https://github.com/huggingface/diffusers
+    xfuser:
+      version: ccba9d5
+      url: https://github.com/xdit-project/xDiT
+    yunchang:
+      version: 2c9b712
+      url: https://github.com/feifeibear/long-context-attention
+  supported_models:
+    - group: Hunyuan Video
+      js_tag: hunyuan
+      models:
+        - model: Hunyuan Video
+          model_repo: tencent/HunyuanVideo
+          revision: refs/pr/18
+          url: https://huggingface.co/tencent/HunyuanVideo
+          github: https://github.com/Tencent-Hunyuan/HunyuanVideo
+          mad_tag: pyt_xdit_hunyuanvideo
+          js_tag: hunyuan_tag
+    - group: Wan-AI
+      js_tag: wan
+      models:
+        - model: Wan2.1
+          model_repo: Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
+          url: https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
+          github: https://github.com/Wan-Video/Wan2.1
+          mad_tag: pyt_xdit_wan_2_1
+          js_tag: wan_21_tag
+        - model: Wan2.2
+          model_repo: Wan-AI/Wan2.2-I2V-A14B-Diffusers
+          url: https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers
+          github: https://github.com/Wan-Video/Wan2.2
+          mad_tag: pyt_xdit_wan_2_2
+          js_tag: wan_22_tag
+    - group: FLUX
+      js_tag: flux
+      models:
+        - model: FLUX.1
+          model_repo: black-forest-labs/FLUX.1-dev
+          url: https://huggingface.co/black-forest-labs/FLUX.1-dev
+          github: https://github.com/black-forest-labs/flux
+          mad_tag: pyt_xdit_flux
+          js_tag: flux_1_tag
+    - group: StableDiffusion
+      js_tag: stablediffusion
+      models:
+        - model: stable-diffusion-3.5-large
+          model_repo: stabilityai/stable-diffusion-3.5-large
+          url: https://huggingface.co/stabilityai/stable-diffusion-3.5-large
+          github: https://github.com/Stability-AI/sd3.5
+          mad_tag: pyt_xdit_sd_3_5
+          js_tag: stable_diffusion_3_5_large_tag
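Review note: the RST page added later in this change loops over `docker.supported_models`, uses each model's `js_tag` as a container class, and appends `--revision` to the download command only when a `revision` key is present. The following is a minimal sketch of a consistency check a reviewer could run against that structure. It is not part of the change: `DATA` is a hand-copied subset of the new YAML file, and `REQUIRED_MODEL_KEYS`, `iter_models`, `find_problems`, and `download_command` are hypothetical names introduced here for illustration. The assumption that model `js_tag` values must be unique is inferred from how the template uses them, not stated in the source.

```python
from collections import Counter

# Hand-copied subset of xdit-inference-models.yaml (illustrative, not the full file).
DATA = {
    "docker": {
        "pull_tag": "rocm/pytorch-xdit:v25.12",
        "supported_models": [
            {
                "group": "Hunyuan Video",
                "js_tag": "hunyuan",
                "models": [
                    {
                        "model": "Hunyuan Video",
                        "model_repo": "tencent/HunyuanVideo",
                        "revision": "refs/pr/18",
                        "mad_tag": "pyt_xdit_hunyuanvideo",
                        "js_tag": "hunyuan_tag",
                    },
                ],
            },
            {
                "group": "Wan-AI",
                "js_tag": "wan",
                "models": [
                    {
                        "model": "Wan2.1",
                        "model_repo": "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers",
                        "mad_tag": "pyt_xdit_wan_2_1",
                        "js_tag": "wan_21_tag",
                    },
                    {
                        "model": "Wan2.2",
                        "model_repo": "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
                        "mad_tag": "pyt_xdit_wan_2_2",
                        "js_tag": "wan_22_tag",
                    },
                ],
            },
        ],
    }
}

# Assumed minimum set of per-model keys; the real template also reads url and github.
REQUIRED_MODEL_KEYS = {"model", "model_repo", "mad_tag", "js_tag"}


def iter_models(data):
    """Yield (group, model) pairs in file order, like the nested template loops."""
    for group in data["docker"]["supported_models"]:
        for model in group["models"]:
            yield group, model


def find_problems(data):
    """Return human-readable problems: missing keys and duplicate model js_tags."""
    problems = []
    tag_counts = Counter()
    for _, model in iter_models(data):
        tag_counts[model.get("js_tag")] += 1
        missing = REQUIRED_MODEL_KEYS - model.keys()
        if missing:
            problems.append(f"{model.get('model', '?')}: missing {sorted(missing)}")
    for tag, count in tag_counts.items():
        if count > 1:
            problems.append(f"duplicate js_tag: {tag}")
    return problems


def download_command(model):
    """Build the download command, adding --revision only when one is set."""
    parts = ["huggingface-cli", "download", model["model_repo"]]
    if model.get("revision"):
        parts += ["--revision", model["revision"]]
    return " ".join(parts)


if __name__ == "__main__":
    for problem in find_problems(DATA):
        print("PROBLEM:", problem)
    for _, model in iter_models(DATA):
        print(download_command(model))
```

Loading the real file with a YAML parser instead of the inline dictionary would let the same functions run over every model entry.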
--- a/docs/how-to/rocm-for-ai/inference/index.rst
+++ b/docs/how-to/rocm-for-ai/inference/index.rst
@@ -27,3 +27,5 @@ training, fine-tuning, and inference. It leverages popular machine learning fram
 - :doc:`SGLang inference performance testing <benchmark-docker/sglang>`

 - :doc:`Deploying your model <deploy-your-model>`
+
+- :doc:`xDiT diffusion inference <xdit-diffusion-inference>`
--- a/docs/how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst
+++ b/docs/how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst
@@ -0,0 +1,389 @@
+.. meta::
+   :description: Learn to validate diffusion model video generation on MI300X, MI350X, and MI355X accelerators using
+                 prebuilt and optimized Docker images.
+   :keywords: xDiT, diffusion, video, video generation, image, image generation, validate, benchmark
+
+************************
+xDiT diffusion inference
+************************
+
+.. _xdit-video-diffusion:
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   The `rocm/pytorch-xdit <{{ docker.docker_hub_url }}>`_ Docker image offers a prebuilt, optimized environment based on `xDiT <https://github.com/xdit-project/xDiT>`_ for
+   benchmarking diffusion model video and image generation on gfx942 and gfx950 series (AMD Instinct™ MI300X, MI325X, MI350X, and MI355X) GPUs.
+   The image runs ROCm **{{docker.ROCm}}** (preview) based on `TheRock <https://github.com/ROCm/TheRock>`_
+   and includes the following components:
+
+   .. dropdown:: Software components
+
+      .. list-table::
+         :header-rows: 1
+
+         * - Software component
+           - Version
+
+         {% for component_name, component_data in docker.components.items() %}
+         * - `{{ component_name }} <{{ component_data.url }}>`_
+           - {{ component_data.version }}
+         {% endfor %}
+
+Follow this guide to pull the required image, spin up a container, download the model, and run a benchmark.
+For preview and development releases, see `amdsiloai/pytorch-xdit <https://hub.docker.com/r/amdsiloai/pytorch-xdit>`_.
+
+What's new
+==========
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   {% for item in docker.whats_new %}
+   * {{ item }}
+   {% endfor %}
+
+.. _xdit-video-diffusion-supported-models:
+
+Supported models
+================
+
+The following models are supported for inference performance benchmarking.
+Some instructions, commands, and recommendations in this documentation might
+vary by model -- select one to get started.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   .. raw:: html
+
+      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
+          <div class="row gx-0">
+              <div class="col-2 me-1 px-2 model-param-head">Model</div>
+              <div class="row col-10 pe-0">
+        {% for model_group in docker.supported_models %}
+               <div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.js_tag }}" tabindex="0">{{ model_group.group }}</div>
+        {% endfor %}
+              </div>
+          </div>
+
+          <div class="row gx-0 pt-1">
+              <div class="col-2 me-1 px-2 model-param-head">Variant</div>
+              <div class="row col-10 pe-0">
+        {% for model_group in docker.supported_models %}
+            {% set models = model_group.models %}
+            {% for model in models %}
+                {% if models|length % 3 == 0 %}
+                <div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.js_tag }}" data-param-group="{{ model_group.js_tag }}" tabindex="0">{{ model.model }}</div>
+                {% else %}
+                <div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.js_tag }}" data-param-group="{{ model_group.js_tag }}" tabindex="0">{{ model.model }}</div>
+                {% endif %}
+            {% endfor %}
+        {% endfor %}
+              </div>
+          </div>
+      </div>
+
+   {% for model_group in docker.supported_models %}
+       {% for model in model_group.models %}
+
+   .. container:: model-doc {{ model.js_tag }}
+
+      .. note::
+
+         To learn more about your specific model, see the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_
+         or visit the `GitHub page <{{ model.github }}>`__. Note that some models require access authorization before use via an
+         external license agreement through a third party.
+
+       {% endfor %}
+   {% endfor %}
+
+System validation
+=================
+
+Before running AI workloads, it's important to validate that your AMD hardware is configured
+correctly and performing optimally.
+
+If you have already validated your system settings, including aspects like NUMA auto-balancing, you
+can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
+optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
+before starting.
+
+To test for optimal performance, consult the recommended :ref:`System health benchmarks
+<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
+system's configuration.
+
+Pull the Docker image
+=====================
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   For this tutorial, it's recommended to use the latest ``{{ docker.pull_tag }}`` Docker image.
+   Pull the image using the following command:
+
+   .. code-block:: shell
+
+      docker pull {{ docker.pull_tag }}
+
+Validate and benchmark
+======================
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   Once the image has been downloaded, you can follow these steps to
+   run benchmarks and generate outputs.
+
+   {% for model_group in docker.supported_models %}
+     {% for model in model_group.models %}
+
+   .. container:: model-doc {{model.js_tag}}
+
+      The following commands are written for {{ model.model }}.
+      See :ref:`xdit-video-diffusion-supported-models` to switch to another available model.
+
+     {% endfor %}
+   {% endfor %}
+
+Choose your setup method
+------------------------
+
+You can either use an existing Hugging Face cache or download the model fresh inside the container.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   {% for model_group in docker.supported_models %}
+     {% for model in model_group.models %}
+   .. container:: model-doc {{model.js_tag}}
+
+      .. tab-set::
+
+         .. tab-item:: Option 1: Use existing Hugging Face cache
+
+            If you already have models downloaded on your host system, you can mount your existing cache.
+
+            1. Set your Hugging Face cache location.
+
+               .. code-block:: shell
+
+                  export HF_HOME=/your/hf_cache/location
+
+            2. Download the model (if not already cached).
+
+               .. code-block:: shell
+
+                  huggingface-cli download {{ model.model_repo }} {% if model.revision %} --revision {{ model.revision }} {% endif %}
+
+            3. Launch the container with mounted cache.
+
+               .. code-block:: shell
+
+                  docker run \
+                      -it --rm \
+                      --cap-add=SYS_PTRACE \
+                      --security-opt seccomp=unconfined \
+                      --user root \
+                      --device=/dev/kfd \
+                      --device=/dev/dri \
+                      --group-add video \
+                      --ipc=host \
+                      --network host \
+                      --privileged \
+                      --shm-size 128G \
+                      --name pytorch-xdit \
+                      -e HSA_NO_SCRATCH_RECLAIM=1 \
+                      -e OMP_NUM_THREADS=16 \
+                      -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+                      -e HF_HOME=/app/huggingface_models \
+                      -v $HF_HOME:/app/huggingface_models \
+                      {{ docker.pull_tag }}
+
+         .. tab-item:: Option 2: Download inside container
+
+            Use this option if you prefer to keep the container self-contained or don't have an existing cache.
+
+            1. Launch the container.
+
+               .. code-block:: shell
+
+                  docker run \
+                      -it --rm \
+                      --cap-add=SYS_PTRACE \
+                      --security-opt seccomp=unconfined \
+                      --user root \
+                      --device=/dev/kfd \
+                      --device=/dev/dri \
+                      --group-add video \
+                      --ipc=host \
+                      --network host \
+                      --privileged \
+                      --shm-size 128G \
+                      --name pytorch-xdit \
+                      -e HSA_NO_SCRATCH_RECLAIM=1 \
+                      -e OMP_NUM_THREADS=16 \
+                      -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+                      {{ docker.pull_tag }}
+
+            2. Inside the container, set the Hugging Face cache location and download the model.
+
+               .. code-block:: shell
+
+                  export HF_HOME=/app/huggingface_models
+                  huggingface-cli download {{ model.model_repo }} {% if model.revision %} --revision {{ model.revision }} {% endif %}
+
+               .. warning::
+
+                  Models will be downloaded to the container's filesystem and will be lost when the container is removed unless you persist the data with a volume.
+     {% endfor %}
+   {% endfor %}
+
+Run inference
+=============
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/xdit-inference-models.yaml
+
+   {% set docker = data.docker %}
+
+   {% for model_group in docker.supported_models %}
+     {% for model in model_group.models %}
+
+   .. container:: model-doc {{ model.js_tag }}
+
+      .. tab-set::
+
+         .. tab-item:: MAD-integrated benchmarking
+
+            1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
+               directory and install the required packages on the host machine.
+
+               .. code-block:: shell
+
+                  git clone https://github.com/ROCm/MAD
+                  cd MAD
+                  pip install -r requirements.txt
+
+            2. On the host machine, use this command to run the performance benchmark test on
+               the `{{model.model}} <{{ model.url }}>`_ model using one node.
+
+               .. code-block:: shell
+
+                  export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
+                  madengine run \
+                      --tags {{model.mad_tag}} \
+                      --keep-model-dir \
+                      --live-output
+
+            MAD launches a Docker container with the name
+            ``container_ci-{{model.mad_tag}}``. The throughput and serving reports of the
+            model are collected in the following paths: ``{{ model.mad_tag }}_throughput.csv``
+            and ``{{ model.mad_tag }}_serving.csv``.
+
+         .. tab-item:: Standalone benchmarking
+
+            To run the benchmarks for {{ model.model }}, use the following command:
+
+            .. code-block:: shell
+            {% if model.model == "Hunyuan Video" %}
+               cd /app/Hunyuanvideo
+               mkdir results
+
+               torchrun --nproc_per_node=8 run.py \
+                  --model {{ model.model_repo }} \
+                  --prompt "In the large cage, two puppies were wagging their tails at each other." \
+                  --height 720 --width 1280 --num_frames 129 \
+                  --num_inference_steps 50 --warmup_steps 1 --n_repeats 1 \
+                  --ulysses_degree 8 \
+                  --enable_tiling --enable_slicing \
+                  --use_torch_compile \
+                  --bench_output results
+
+            {% endif %}
+            {% if model.model == "Wan2.1" %}
+               cd Wan
+               mkdir results
+
+               torchrun --nproc_per_node=8 /app/Wan/run.py \
+                  --task i2v \
+                  --height 720 \
+                  --width 1280 \
+                  --model {{ model.model_repo }} \
+                  --img_file_path /app/Wan/i2v_input.JPG \
+                  --ulysses_degree 8 \
+                  --seed 42 \
+                  --num_frames 81 \
+                  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." \
+                  --num_repetitions 1 \
+                  --num_inference_steps 40 \
+                  --use_torch_compile
+
+            {% endif %}
+            {% if model.model == "Wan2.2" %}
+               cd Wan
+               mkdir results
+
+               torchrun --nproc_per_node=8 /app/Wan/run.py \
+                  --task i2v \
+                  --height 720 \
+                  --width 1280 \
+                  --model {{ model.model_repo }} \
+                  --img_file_path /app/Wan/i2v_input.JPG \
+                  --ulysses_degree 8 \
+                  --seed 42 \
+                  --num_frames 81 \
+                  --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside." \
+                  --num_repetitions 1 \
+                  --num_inference_steps 40 \
+                  --use_torch_compile
+
+            {% endif %}
+
+            {% if model.model == "FLUX.1" %}
+               cd Flux
+               mkdir results
+
+               torchrun --nproc_per_node=8 /app/Flux/run.py \
+                  --model {{ model.model_repo }} \
+                  --seed 42 \
+                  --prompt "A small cat" \
+                  --height 1024 \
+                  --width 1024 \
+                  --num_inference_steps 25 \
+                  --max_sequence_length 256 \
+                  --warmup_steps 5 \
+                  --no_use_resolution_binning \
+                  --ulysses_degree 8 \
+                  --use_torch_compile \
+                  --num_repetitions 50
+
+            {% endif %}
+
+            {% if model.model == "stable-diffusion-3.5-large" %}
+               cd StableDiffusion3.5
+               mkdir results
+
+               torchrun --nproc_per_node=8 /app/StableDiffusion3.5/run.py \
+                  --model {{ model.model_repo }} \
+                  --num_inference_steps 28 \
+                  --prompt "A capybara holding a sign that reads Hello World" \
+                  --use_torch_compile \
+                  --pipefusion_parallel_degree 4 \
+                  --use_cfg_parallel \
+                  --num_repetitions 50 \
+                  --dtype torch.float16 \
+                  --output_path results
+
+            {% endif %}
+
+            The generated output is stored under the ``results`` directory. For the actual benchmark step runtimes, see {% if model.model == "Hunyuan Video" %}stdout{% elif model.model in ["Wan2.1", "Wan2.2"] %}``results/outputs/rank0_*.json``{% elif model.model == "FLUX.1" %}``results/timing.json``{% elif model.model == "stable-diffusion-3.5-large"%}``benchmark_results.csv``{% endif %}.
+
+            {% if model.model == "FLUX.1" %}You may also use ``run_usp.py``, which implements USP without modifying the default diffusers pipeline.{% endif %}
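+
+            {% if model.model in ["Wan2.1", "Wan2.2", "FLUX.1"] %}
+            To skim the JSON timing files without opening them by hand, a generic pretty-printer is
+            usually enough. A minimal sketch, assuming ``python3`` is on the path inside the container
+            and that you run it from the same working directory as the benchmark command:
+
+            .. code-block:: shell
+
+               # Pretty-print every JSON file found under results/ (assumed location, see above).
+               # Nothing about the structure of the files is assumed.
+               find results -name '*.json' -exec python3 -m json.tool {} \;
+
+            {% endif %}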
+
+      {% endfor %}
+    {% endfor %}
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -117,6 +117,8 @@ subtrees:
            title: SGLang inference performance testing
          - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
            title: SGLang distributed inference with Mooncake
+          - file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst
+            title: xDiT diffusion inference
          - file: how-to/rocm-for-ai/inference/deploy-your-model.rst
            title: Deploy your model

--- a/docs/sphinx/requirements.in
+++ b/docs/sphinx/requirements.in
@@ -1,4 +1,4 @@
-rocm-docs-core==1.29.0
+rocm-docs-core==1.30.1
 sphinx-reredirects
 sphinx-sitemap
 sphinxcontrib.datatemplates==0.11.0
--- a/docs/sphinx/requirements.txt
+++ b/docs/sphinx/requirements.txt
@@ -187,7 +187,7 @@ requests==2.32.5
    # via
    #   pygithub
    #   sphinx
-rocm-docs-core==1.29.0
+rocm-docs-core==1.30.1
    # via -r requirements.in
 rpds-py==0.29.0
    # via
Author	SHA1	Message	Date
Kristoffer	442e9310f2	Merge branch 'develop' into docs/xdit-diffusion-v25-12	2025-12-16 12:56:10 +01:00
Kristoffer	f6f5b8ba47	Spelling, added 'js'	2025-12-15 11:58:20 +01:00
Kristoffer	a16538df17	Simplify yaml file and cleanup main rst page.	2025-12-15 11:53:46 +01:00
Kristoffer	f69c8974f2	-Diffusers suffix	2025-12-15 08:26:54 +01:00
Kristoffer	e537a31000	Command fixes	2025-12-12 15:53:51 +01:00
Kristoffer	d690a3afd5	Add hyperlinks to components	2025-12-08 16:09:57 +01:00
Istvan Kiss	18515bcc59	JAX key features and enhancements (#5708) (#645) Co-authored-by: Pratik Basyal <prbasyal@amd.com>	2025-12-04 15:03:39 +01:00
Pratik Basyal	e8fdc34b71	711 hipBLASLT performance decline known issue added (#5730) * hipBLASLT performance decline known issue added * Update RELEASE.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * GitHub Issue added * Ram's feedback incorporated * GitHub Issue added * Update RELEASE.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-12-03 08:50:25 -05:00
Kristoffer	40446d143f	Docs for v25.12	2025-12-01 13:10:00 +01:00
Pratik Basyal	b4031ef23c	7.1.1 known issues post GA (#5721) * rocblas known issues added * Minor change * Update RELEASE.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Resolved * Update RELEASE.md Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>	2025-11-28 16:34:47 -05:00
dependabot[bot]	d0bd4e6f03	Bump rocm-docs-core from 1.29.0 to 1.30.1 in /docs/sphinx (#5712) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.29.0 to 1.30.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.29.0...v1.30.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-version: 1.30.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-11-28 08:18:23 -05:00
Jan Stephan	0056b9453e	Remove continuous numbering of tables and figures Signed-off-by: Jan Stephan <jan.stephan@amd.com>	2025-11-28 10:29:01 +01:00
Pratik Basyal	3d1ad79766	Merged cell removed for coloring issue (#5713)	2025-11-27 19:52:36 -05:00
Kristoffer	065fd5c40b	Make Software Components section use dropdown.	2025-11-27 13:15:09 +01:00
Kristoffer	9b44bd87e2	Add aiter rounding mode in v25-11 'what's new'.	2025-11-27 12:49:36 +01:00
Pratik Basyal	8683bed11b	Known issue from 7.1.0 removed (#5702)	2025-11-26 12:27:22 -05:00
Pratik Basyal	847cd7c423	Link and PyTorch version updated (#5700)	2025-11-26 11:52:47 -05:00
Kristoffer	c00802a460	Bump Rocm version, add spellcheck	2025-11-25 12:53:56 +01:00
Kristoffer	849c3c2e3d	Image specific info	2025-11-21 18:27:13 +01:00
Kristoffer	b98fd42bd0	First commit.	2025-11-21 17:33:22 +01:00
peterjunpark	1c253505b6	Apply suggestions from code review	2025-11-18 12:14:09 -05:00
Kristoffer	4bfe13edef	Add MAD-integrated benchmarking.	2025-11-18 16:44:40 +01:00
Kristoffer	072e2d90db	Update dockerhub link from siloai to rocm.	2025-11-14 14:45:55 +02:00
yugang-amd	68fcc294b1	Merge branch 'develop' into docs/xdit-diffusion	2025-11-05 10:38:26 -05:00
yugang-amd	8d6b954e0e	Merge branch 'develop' into docs/xdit-diffusion	2025-11-04 12:28:26 -05:00
Kristoffer	2a90a355f0	Spelling mistakes.	2025-11-04 16:35:46 +01:00
Kristoffer	9c254eb2ac	Suggested changes.	2025-11-04 15:26:40 +01:00
Kristoffer	f19730d4f0	Change repetitions for flux.	2025-10-31 14:42:44 +01:00
yugang-amd	d835d42be1	Merge branch 'develop' into docs/xdit-diffusion	2025-10-31 09:06:35 -04:00
Kristoffer	bd8ac6bc5e	git rm xdit-video-diffusion.rst	2025-10-31 11:43:49 +01:00
Kristoffer	7c0d74355e	Update Flux instructions. Change image tag. Describe as diffusion inference instead of specifically video.	2025-10-31 11:34:02 +01:00
Kristoffer	bd90667c20	Update commands and add FLUX instructions.	2025-10-30 17:21:23 +01:00
Kristoffer	0b4124cdd6	Update to use latest v25.10 image instead of v25.9	2025-10-30 15:19:11 +01:00
yugang-amd	53f4748d0f	Merge branch 'develop' into docs/xdit-diffusion	2025-10-29 13:40:00 -04:00
Kristoffer	f160e4934a	Change TheRock ROCm version.	2025-10-29 10:51:40 +01:00
Kristoffer	913e84fd98	Add sw component versions/commits.	2025-10-23 11:23:07 +02:00
Kristoffer	8d6d00854c	Add System Validation section.	2025-10-23 10:28:10 +02:00
Peter Park	c5a1f783e9	Update .wordlist.txt	2025-10-22 14:38:00 -04:00
Peter Park	1d07995cf5	Update template formatting and fix sphinx warnings	2025-10-22 14:35:56 -04:00
Kristoffer	6db347becd	Merge branch 'develop' into docs/xdit-diffusion	2025-10-16 17:40:47 +02:00
Kristoffer	ef75212807	Add xdit-diffusion ROCm docs page.	2025-10-16 17:27:00 +02:00