mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-09 22:58:17 -05:00
Compare commits
8 Commits
docs/7.1.1...update_jax
| Author | SHA1 | Date |
|---|---|---|
| | f37007bf42 | |
| | e8fdc34b71 | |
| | b4031ef23c | |
| | d0bd4e6f03 | |
| | 0056b9453e | |
| | 3d1ad79766 | |
| | 8683bed11b | |
| | 847cd7c423 | |
@@ -36,6 +36,7 @@ Andrej
Arb
Autocast
autograd
Backported
BARs
BatchNorm
BLAS
@@ -202,9 +203,11 @@ GenAI
GenZ
GitHub
Gitpod
hardcoded
HBM
HCA
HGX
HLO
HIPCC
hipDataType
HIPExtension
@@ -329,6 +332,7 @@ MoEs
Mooncake
Mpops
Multicore
multihost
Multithreaded
mx
MXFP
@@ -1020,6 +1024,7 @@ uncacheable
uncorrectable
underoptimized
unhandled
unfused
uninstallation
unmapped
unsqueeze
@@ -39,7 +39,11 @@ for a complete overview of this release.
- VMs were incorrectly reporting `AMDSMI_STATUS_API_FAILED` when unable to get the power cap within `amdsmi_get_power_info`.
  - The API now returns `N/A` or `UINT_MAX` for values that can't be retrieved, instead of failing.
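The changed return behavior can be sketched as follows. This is a hypothetical helper, not part of the amdsmi API; the field names in the example payload are made up, and only the sentinel values named above (`N/A` and `UINT_MAX`) are taken from the changelog:

```python
# Hypothetical sketch (not the amdsmi API): normalizing the sentinel values
# the fixed API now returns ("N/A" or UINT_MAX) instead of raising an error.
UINT_MAX = 2**32 - 1

def normalize_power_field(value):
    """Map unretrievable sentinel values to None; pass real readings through."""
    if value == "N/A" or value == UINT_MAX:
        return None
    return value

# Example payload shaped like a power-info query result (values are made up).
power_info = {"power_cap": UINT_MAX, "current_socket_power": 187, "gpu_voltage": "N/A"}
cleaned = {k: normalize_power_field(v) for k, v in power_info.items()}
# cleaned == {"power_cap": None, "current_socket_power": 187, "gpu_voltage": None}
```

A caller can then treat `None` uniformly instead of branching on two different sentinel encodings.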
- Fixed output for `amd-smi xgmi -l --json`.

```{note}
See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md#amd_smi_lib-for-rocm-711) for details, examples, and in-depth descriptions.
```

### **Composable Kernel** (1.1.0)
38
RELEASE.md
@@ -100,12 +100,13 @@ firmware, AMD GPU drivers, and the ROCm user space software.
01.25.16.03<br>
01.25.15.04
</td>
<td rowspan="2" style="vertical-align: middle;">
<td>
30.20.1<br>
30.20.0<br>
30.10.2<br>
30.10.1<br>
30.10</td>
30.10
</td>
<td rowspan="3" style="vertical-align: middle;">8.6.0.K</td>
</tr>
<tr>
@@ -114,6 +115,13 @@ firmware, AMD GPU drivers, and the ROCm user space software.
01.25.16.03<br>
01.25.15.04
</td>
<td>
30.20.1<br>
30.20.0<br>
30.10.2<br>
30.10.1<br>
30.10
</td>
</tr>
<tr>
<td>MI325X<a href="#footnote1"><sup>[1]</sup></a></td>
@@ -674,7 +682,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
- Fixed output for `amd-smi xgmi -l --json`.

```{note}
See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md#amd_smi_lib-for-rocm-710) for details, examples, and in-depth descriptions.
See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md#amd_smi_lib-for-rocm-711) for details, examples, and in-depth descriptions.
```

### **Composable Kernel** (1.1.0)
@@ -831,7 +839,7 @@ issues related to individual components, review the [Detailed component changes]
### RCCL performance degradation on AMD Instinct MI300X GPU with AMD Pollara AI NIC

If you’re using RCCL on AMD Instinct MI300X GPUs with AMD Pollara AI NIC, you might observe performance degradation for specific collectives and message sizes. The affected collectives are `Scatter`, `AllToAll`, and `AlltoAllv`. It's recommended to avoid using RCCL packaged with ROCm 7.1.1. As a workaround, use the {fab}`github`[RCCL `develop` branch](https://github.com/ROCm/rccl/tree/develop), which contains the fix and will be included in a future ROCm release.
If you’re using RCCL on AMD Instinct MI300X GPUs with AMD Pollara AI NIC, you might observe performance degradation for specific collectives and message sizes. The affected collectives are `Scatter`, `AllToAll`, and `AlltoAllv`. It's recommended to avoid using RCCL packaged with ROCm 7.1.1. As a workaround, use the {fab}`github`[RCCL `develop` branch](https://github.com/ROCm/rccl/tree/develop), which contains the fix and will be included in a future ROCm release. See [GitHub issue #5717](https://github.com/ROCm/ROCm/issues/5717).
### Segmentation fault in training models using TensorFlow 2.20.0 Docker images

@@ -839,7 +847,7 @@ Training models `tf2_tfm_resnet50_fp16_train` and `tf2_tfm_resnet50_fp32_train`
might fail with a segmentation fault when run on the TensorFlow 2.20.0 Docker
image with ROCm 7.1.1. As a workaround, use the TensorFlow 2.19.x Docker image for
training the models in ROCm 7.1.1. This issue will be fixed in a future ROCm
release.
release. See [GitHub issue #5718](https://github.com/ROCm/ROCm/issues/5718).
### AMD SMI CLI triggers repeated kernel errors on GPUs with partitioning support

@@ -858,27 +866,19 @@ amdgpu 0000:15:00.0: amdgpu: renderD153 partition 1 not valid!
These repeated kernel logs can clutter the system logs and may cause
unnecessary concern about GPU health. However, this is a non-functional issue
and does not affect AMD SMI functionality or GPU performance. This issue will
be fixed in a future ROCm release.
be fixed in a future ROCm release. See [GitHub issue #5720](https://github.com/ROCm/ROCm/issues/5720).

### Excessive bad page logs in AMD GPU Driver (amdgpu)

Due to partial data corruption of Electrically Erasable Programmable Read-Only Memory (EEPROM) and limited error handling in the AMD GPU Driver (amdgpu), excessive log output might result when querying the reliability, availability, and serviceability (RAS) bad pages. This issue will be fixed in a future AMD GPU Driver (amdgpu) and ROCm release.
Due to partial data corruption in the Electrically Erasable Programmable Read-Only Memory (EEPROM) and limited error handling in the AMD GPU Driver (amdgpu), excessive log output might occur when querying the reliability, availability, and serviceability (RAS) bad pages. This issue will be fixed in a future AMD GPU Driver (amdgpu) and ROCm release. See [GitHub issue #5719](https://github.com/ROCm/ROCm/issues/5719).
### OpenBLAS runtime dependency for hipblaslt-test and hipblaslt-bench
### Incorrect results in gemm_ex operations for rocBLAS and hipBLAS

Running `hipblaslt-test` or `hipblaslt-bench` without installing the OpenBLAS development package results in the following error:
```
libopenblas.so.0: cannot open shared object file: No such file or directory
```
As a workaround, first install `libopenblas-dev` or `libopenblas-devel`, depending on the package manager used. The issue will be fixed in a future ROCm release. See [GitHub issue #5639](https://github.com/ROCm/ROCm/issues/5639).
Some `gemm_ex` operations with 8-bit input data types (`int8`, `float8`, `bfloat8`) for specific matrix dimensions (K = 1 and number of workgroups > 1) might yield incorrect results. The issue results from incorrect tail-loop code that fails to account for the workgroup index when calculating the valid element size. The issue will be fixed in a future ROCm release. See [GitHub issue #5722](https://github.com/ROCm/ROCm/issues/5722).
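For context, the affected K = 1 shape is easy to sanity-check against a plain reference, because each output element is a single 8-bit product. The following NumPy sketch is illustrative only; it is not the rocBLAS/hipBLAS API, and the helper name and shapes are assumptions:

```python
import numpy as np

def gemm_ex_reference(A, B, alpha=1, beta=0, C=None):
    """Reference GEMM with 32-bit accumulation, usable as a check for 8-bit results."""
    acc = A.astype(np.int32) @ B.astype(np.int32)
    if C is None:
        C = np.zeros(acc.shape, dtype=np.int32)
    return alpha * acc + beta * C

# Affected shape: K = 1, so each output element is a single product.
m, n, k = 4, 3, 1
rng = np.random.default_rng(0)
A = rng.integers(-128, 127, size=(m, k), dtype=np.int8)
B = rng.integers(-128, 127, size=(k, n), dtype=np.int8)

ref = gemm_ex_reference(A, B)
# With K = 1, C[i, j] must equal A[i, 0] * B[0, j] exactly.
expected = A.astype(np.int32) * B.astype(np.int32)  # (m,1) * (1,n) broadcasts to (m,n)
assert np.array_equal(ref, expected)
```

Comparing a library result for such a shape against this kind of exact reference is one way to detect whether a given build is affected.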
### Reduced precision in gemm_ex operations for rocBLAS and hipBLAS
### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs

Some `gemm_ex` operations with `half` or `f32_r` data types might yield 16-bit precision results instead of the expected 32-bit precision when matrix dimensions are m=1 or n=1. The issue results from an optimization that allows the `_ex` APIs to use lower-precision multiplies, which affects high-precision matrix operations performed in PyTorch with rocBLAS and hipBLAS. The issue will be fixed in a future ROCm release. See [GitHub issue #5640](https://github.com/ROCm/ROCm/issues/5640).
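The precision loss can be illustrated with plain NumPy. This is an emulation of the effect, not the rocBLAS/hipBLAS code path: with m = 1 the GEMM row degenerates to a dot product, and rounding every partial sum back to float16 (instead of accumulating in float32) is what produces reduced-precision results:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4096).astype(np.float16)
y = rng.standard_normal(4096).astype(np.float16)

# Expected behavior: accumulate the dot product in 32-bit precision.
acc32 = float(np.dot(x.astype(np.float32), y.astype(np.float32)))

# Emulated reduced-precision behavior: round every partial sum to float16.
acc16 = np.float16(0.0)
for a, b in zip(x, y):
    acc16 = np.float16(acc16 + np.float16(a) * np.float16(b))

# The float16-accumulated result typically drifts from the float32 one.
err = abs(acc32 - float(acc16))
```

The longer the reduction dimension, the larger the accumulated rounding drift tends to be, which is why this matters for the m=1 / n=1 shapes described above.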
### RCCL profiler plugin failure with AllToAll operations

The RCCL profiler plugin `librccl-profiler.so` might fail with a segmentation fault during `AllToAll` collective operations due to improperly assigned point-to-point task function pointers. This leads to invalid memory access and prevents profiling of `AllToAll` performance. Other operations, like `AllReduce`, are unaffected. It's recommended to avoid using the RCCL profiler plugin with `AllToAll` operations until the fix is available. This issue is resolved in the {fab}`github`[RCCL `develop` branch](https://github.com/ROCm/rccl/tree/develop) and will be part of a future ROCm release. See [GitHub issue #5653](https://github.com/ROCm/ROCm/issues/5653).
If you’re using hipBLASLt on AMD Instinct MI325X GPUs for large FP8 GEMM operations (such as 9728x8192x65536), you might observe a noticeable performance variation. The issue is currently under investigation and will be fixed in a future ROCm release. See [GitHub issue #5734](https://github.com/ROCm/ROCm/issues/5734).

## ROCm resolved issues
@@ -30,7 +30,7 @@ ROCm Version,7.1.1,7.1.0,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6
,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908
,,,,,,,,,,,,,,,,,,,,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,,,
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8, 2.7","2.8, 2.7, 2.6","2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8","2.8, 2.7, 2.6","2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.7.1,0.7.1,0.6.0,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
:doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A,N/A
@@ -54,7 +54,7 @@ compatibility and system requirements.
,gfx908,gfx908,gfx908
,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix:,,
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8, 2.7","2.8, 2.7, 2.6","2.6, 2.5, 2.4, 2.3"
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9, 2.8","2.8, 2.7, 2.6","2.6, 2.5, 2.4, 2.3"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.18.1, 2.17.1, 2.16.2"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.7.1,0.7.1,0.4.35
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,2.4.0
@@ -269,6 +269,33 @@ For a complete and up-to-date list of JAX public modules (for example, ``jax.num
JAX API modules are maintained by the JAX project and are subject to change.
Refer to the official JAX documentation for the most up-to-date information.

Key features and enhancements for ROCm 7.1
===============================================================================

- Enabled compilation of multihost HLO runner Python bindings.

- Backported multihost HLO runner bindings and some related changes to
  :code:`FunctionalHloRunner`.

- Added :code:`requirements_lock_3_12` to enable building for Python 3.12.

- Removed the hardcoded NHWC convolution layout for ``fp16`` precision to address performance drops for ``fp16`` precision on gfx12xx GPUs.

- ROCprofiler-SDK integration:

  - Integrated ROCprofiler-SDK (v3) into XLA to improve profiling of GPU events,
    supporting both time-based and step-based profiling.

  - Added unit tests for :code:`rocm_collector` and :code:`rocm_tracer`.

- Added the previously unsupported Triton conversion from ``f8E4M3FNUZ`` to ``fp16`` with a
  rounding mode.

- Introduced :code:`CudnnFusedConvDecomposer` to revert fused convolutions
  when :code:`ConvAlgorithmPicker` fails to find a fused algorithm, and removed
  unfused fallback paths from :code:`RocmFusedConvRunner`.

Key features and enhancements for ROCm 7.0
===============================================================================
@@ -401,25 +401,25 @@ with ROCm.
Key features and enhancements for PyTorch 2.9 with ROCm 7.1.1
================================================================================
- Scaled Dot Product Attention (SDPA) upgraded to use AOTriton version 0.11b
- Scaled Dot Product Attention (SDPA) upgraded to use AOTriton version 0.11b.

- Default hipBLASLt support enabled for gfx908 architecture on ROCm 6.3 and later
- Default hipBLASLt support enabled for gfx908 architecture on ROCm 6.3 and later.

- MIOpen now supports channels last memory format for 3D convolutions and batch normalization
- MIOpen now supports channels last memory format for 3D convolutions and batch normalization.

- NHWC convolution operations in MIOpen optimized by eliminating unnecessary transpose operations
- NHWC convolution operations in MIOpen optimized by eliminating unnecessary transpose operations.

- Improved tensor.item() performance by removing redundant synchronization
- Improved tensor.item() performance by removing redundant synchronization.

- Enhanced performance for element-wise operations and reduction kernels
- Enhanced performance for element-wise operations and reduction kernels.

- Added support for grouped GEMM operations through fbgemm_gpu generative AI components
- Added support for grouped GEMM operations through fbgemm_gpu generative AI components.

- Resolved device error in Inductor when using CUDA graph trees with HIP
- Resolved device error in Inductor when using CUDA graph trees with HIP.

- Corrected logsumexp scaling in AOTriton-based SDPA implementation
- Corrected logsumexp scaling in AOTriton-based SDPA implementation.

- Added stream graph capture status validation in memory copy synchronization functions
- Added stream graph capture status validation in memory copy synchronization functions.

Key features and enhancements for PyTorch 2.8 with ROCm 7.1
================================================================================
@@ -249,3 +249,6 @@ html_context = {
"granularity_type" : [('Coarse-grained', 'coarse-grained'), ('Fine-grained', 'fine-grained')],
"scope_type" : [('Device', 'device'), ('System', 'system')]
}

# Disable figure and table numbering
numfig = False
@@ -1,4 +1,4 @@
rocm-docs-core==1.29.0
rocm-docs-core==1.30.1
sphinx-reredirects
sphinx-sitemap
sphinxcontrib.datatemplates==0.11.0
@@ -187,7 +187,7 @@ requests==2.32.5
# via
#   pygithub
#   sphinx
rocm-docs-core==1.29.0
rocm-docs-core==1.30.1
# via -r requirements.in
rpds-py==0.29.0
# via