Compare commits: docs_fix_a...update-HIP (43 commits)

Commits: acb54bdcdd, 298ca30757, 72970cda1f, 8edd5e550e, b606b1577d, 596e606edc, cf241d7269,
6afbb33144, 22e71b4dce, aeb073716b, 6a4247156f, 511b4b3a9a, cacd5a7845, 42da33d8b6, 33ca42f743,
449c0e00b3, 6c426ff9fa, 43399d4eed, 555f4d43ca, 8db294f215, 309f3a19e6, 8070f30a46, b5955a8d46,
5c25c3e797, 6bbc7429df, 7e8947fdb4, 31d24eb5cc, 9f6757b71d, 7b57247b9a, 1f25c77654, 3d3d3cb1da,
0f16b8eb29, 66cac5301f, 33a6c37f44, 3490079e2e, 9f3a1de117, f9abf88965, 0915fb17e8, 480c23a83e,
0d3eb1d774, 52c44cccca, e868fb6c19, 7a258cdba9
@@ -189,6 +189,7 @@ Jupyter
KFD
KFDTest
KiB
KMD
KV
KVM
Keras
@@ -591,6 +592,7 @@ hpp
hsa
hsakmt
hyperparameter
hyperparameters
iDRAC
ib_core
inband
RELEASE.md (88 changed lines)
@@ -232,7 +232,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/AMDMIGraphX/en/docs-6.3.0/index.html">MIGraphX</a></td>
<td>2.11.0</td>
<td>2.10.0 ⇒ <a href="#migraphx-2-11-0">2.11.0</a></td>
<td><a href="https://github.com/ROCm/AMDMIGraphX"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
@@ -395,7 +395,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
<th rowspan="7">System management</th>
<td><a href="https://rocm.docs.amd.com/projects/amdsmi/en/docs-6.3.0/index.html">AMD SMI</a></td>
<td>24.6.3 ⇒ <a href="#amd-smi-24-7-1">24.7.1</a></td>
<td><a href="https://github.com/ROCm/rocm-cmake"><i class="fab fa-github fa-lg"></i></a></td>
<td><a href="https://github.com/ROCm/amdsmi"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rdc/en/docs-6.3.0/index.html">ROCm Data Center Tool</a></td>
@@ -659,7 +659,6 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/6.3.x/CHANG
- `hipDrvGraphAddMemFreeNode` creates a memory free node and adds it to a graph.
- `hipDrvGraphExecMemcpyNodeSetParams` sets the parameters for a memcpy node in the given graphExec.
- `hipDrvGraphExecMemsetNodeSetParams` sets the parameters for a memset node in the given graphExec.
- `hipExtHostAlloc` preserves the functionality of `hipHostMalloc`.

#### Changed

@@ -673,6 +672,11 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/6.3.x/CHANG
* Optimized multi-threaded dispatches to improve performance.
* Limited the software batch size to control the number of command submissions so the runtime can handle them efficiently.
* Optimized HSA callback performance when a large number of events are recorded by multiple threads and submitted to multiple GPUs.
* Improved HIP graph execution performance:
  - Added an optimized multistream path in graph execution. It uses a fixed number of async streams in the execution.
  - Optimized the launch latency so that command creation and execution are done at the same time.
  - Optimized the scheduling to use fewer barriers and waiting signals when the same queue can be detected.
  - The new path is controlled by a new environment variable, with options either to use the original path or to force the number of asynchronous queues used for execution.

#### Resolved issues

@@ -684,12 +688,6 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/6.3.x/CHANG
preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled
unconditionally in the next ROCm release.

#### Upcoming changes

* Deprecated HIP APIs:
  - `hipHostMalloc`, to be replaced by `hipExtHostAlloc`.
  - `hipHostFree`, to be replaced by `hipExtHostFree`.

### **hipBLAS** (2.3.0)

#### Added
@@ -897,6 +895,78 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/6.3.x/CHANG
srcLane, width)` function when one of the parameters to the function is undefined along some path
to the function. See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub.

### **MIGraphX** (2.11.0)

#### Added

* Initial code to run on Windows
* Support for `FP8` and `INT4`
* Support for the Log2 internal operator
* Support for the GCC 14 compiler
* The `BitwiseAnd`, `Scan`, `SoftmaxCrossEntropyLoss`, `GridSample`, and `NegativeLogLikelihoodLoss` ONNX operators
* The `MatMulNBits`, `QuantizeLinear`/`DequantizeLinear`, `GroupQueryAttention`, `SkipSimplifiedLayerNormalization`, and `SimplifiedLayerNormalization` Microsoft Contrib operators
* Dynamic batch parameter support for the `OneHot` operator
* Split-K as an optional performance improvement
* Scripts to validate ONNX models from the ONNX Model Zoo
* GPU Pooling Kernel
* `--mlir` flag for the migraphx-driver program to offload the entire module to MLIR
* Fusing split-reduce with MLIR
* Multiple outputs for the MLIR + Pointwise fusions
* Pointwise fusions with MLIR across reshape operations
* `MIGRAPHX_MLIR_DUMP` environment variable to dump MLIR modules to MXRs
* The `3` option for `MIGRAPHX_TRACE_BENCHMARKING` to print the MLIR program for improved debug output
* `MIGRAPHX_ENABLE_HIPBLASLT_GEMM` environment variable to call hipBLASLt libraries
* `MIGRAPHX_VERIFY_DUMP_DIFF` to improve the debugging of accuracy issues
* `reduce_any` and `reduce_all` options for the `Reduce` operation via Torch MIGraphX
* Examples for RNNT and ControlNet

#### Changed

* Switched to MLIR's 3D Convolution operator.
* MLIR is now used for Attention operations by default on gfx942 and newer ASICs.
* Names and locations for VRM-specific libraries have changed.
* Use random mode for benchmarking GEMMs and convolutions.
* The Python version is now printed with an actual version number.

#### Removed

* Disabled requirements for MIOpen and rocBLAS when running on Windows.
* Removed inaccurate warning messages when using exhaustive-tune.
* Removed the hard-coded path in `MIGRAPHX_CXX_COMPILER`, allowing the compiler to be installed in different locations.

#### Optimized

* Improved:
  * Infrastructure code to enable better kernel fusions with all supported data types
  * Subsequent model compile time by creating a cache for already performant kernels
  * Use of Attention fusion with models
  * Performance of the Softmax JIT kernel and of the Pooling operator
  * Tuning operations through a new 50ms delay before running the next kernel
  * Performance of several convolution-based models through an optimized NHWC layout
  * Performance for the `FP8` datatype
  * GPU utilization
  * Verification tools
  * Debug prints
  * Documentation, including gpu-driver utility documentation
  * Summary section of the `migraphx-driver perf` command
* Reduced model compilation time
* Reordered some compiler passes to allow for more fusions
* Preloaded tiles into LDS to improve the performance of pointwise transposes
* Exposed the `external_data_path` property in `onnx_options` to set the path from `onnxruntime`

#### Resolved issues

* Fixed a bug with gfx1030 that overwrote `dpp_reduce`.
* Fixed a bug in 1-arg dynamic reshape that created a failure.
* Fixed a bug with `dot_broadcast` and `inner_broadcast` that caused compile failures.
* Fixed a bug where some configs were failing when using exhaustive-tune.
* Fixed the ROCm Install Guide URL.
* Fixed an issue while building a whl package due to an apostrophe.
* Fixed the BERT Squad example requirements file to support different versions of Python.
* Fixed a bug that stopped the Vicuna model from compiling.
* Fixed failures with the verify option of migraphx-driver that would cause the application to exit early.

### **MIOpen** (3.3.0)

#### Added
@@ -1,118 +1,118 @@
ROCm Version,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,
,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2"
,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5"
,"RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2"
,"RHEL 8.10","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8"
,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4"
,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9
,Oracle Linux 8.10 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,,,
,.. _architecture-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3
,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2
,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA
,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3
,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2
,.. _gpu-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>`,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100
,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030
,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_
,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a
,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908
,,,,,,,,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`PyTorch <rocm-install-on-linux:install/3rd-party/pytorch-install>`,"2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`TensorFlow <rocm-install-on-linux:install/3rd-party/tensorflow-install>`,"2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
:doc:`JAX <rocm-install-on-linux:install/3rd-party/jax-install>`,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
,,,,,,,,,,
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,
`UCC <https://github.com/ROCm/ucc>`_,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0
`UCX <https://github.com/ROCm/ucx>`_,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1
,,,,,,,,,,
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,
Thrust,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
CUB,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
,,,,,,,,,,
KFD & USER SPACE [#kfd_support-past-60]_,.. _kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,
Tested user space versions,"6.3.x, 6.2.x, 6.1.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x"
,,,,,,,,,,
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0
:doc:`MIGraphX <amdmigraphx:index>`,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0
:doc:`MIOpen <miopen:index>`,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`MIVisionX <mivisionx:index>`,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0
:doc:`rocAL <rocal:index>`,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`rocDecode <rocdecode:index>`,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A
:doc:`rocJPEG <rocjpeg:index>`,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`rocPyDecode <rocpydecode:index>`,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A
:doc:`RPP <rpp:index>`,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0
,,,,,,,,,,
COMMUNICATION,.. _commlibs-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`RCCL <rccl:index>`,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3
,,,,,,,,,,
MATH LIBS,.. _mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,
`half <https://github.com/ROCm/half>`_ ,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0
:doc:`hipBLAS <hipblas:index>`,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0
:doc:`hipBLASLt <hipblaslt:index>`,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0
:doc:`hipFFT <hipfft:index>`,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13
:doc:`hipfort <hipfort:index>`,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0
:doc:`hipRAND <hiprand:index>`,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16
:doc:`hipSOLVER <hipsolver:index>`,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0
:doc:`hipSPARSE <hipsparse:index>`,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
:doc:`hipSPARSELt <hipsparselt:index>`,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0
:doc:`rocALUTION <rocalution:index>`,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3
:doc:`rocBLAS <rocblas:index>`,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0
:doc:`rocFFT <rocfft:index>`,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23
:doc:`rocRAND <rocrand:index>`,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17
:doc:`rocSOLVER <rocsolver:index>`,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0
:doc:`rocSPARSE <rocsparse:index>`,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2
:doc:`rocWMMA <rocwmma:index>`,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0
:doc:`Tensile <tensile:index>`,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0
,,,,,,,,,,
PRIMITIVES,.. _primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`hipCUB <hipcub:index>`,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`hipTensor <hiptensor:index>`,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0
:doc:`rocPRIM <rocprim:index>`,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`rocThrust <rocthrust:index>`,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
,,,,,,,,,,
SUPPORT LIBS,,,,,,,,,,
`hipother <https://github.com/ROCm/hipother>`_,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`rocm-core <https://github.com/ROCm/rocm-core>`_,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245
,,,,,,,,,,
SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`AMD SMI <amdsmi:index>`,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2
:doc:`ROCm Data Center Tool <rdc:index>`,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0
:doc:`rocminfo <rocminfo:index>`,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`ROCm SMI <rocm_smi_lib:index>`,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0
:doc:`ROCm Validation Suite <rocmvalidationsuite:index>`,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000
,,,,,,,,,,
PERFORMANCE TOOLS,,,,,,,,,,
:doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>`,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A
:doc:`ROCm Systems Profiler <rocprofiler-systems:index>`,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A
:doc:`ROCProfiler <rocprofiler:index>`,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000
:doc:`ROCprofiler-SDK <rocprofiler-sdk:index>`,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A
:doc:`ROCTracer <roctracer:index>`,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000
,,,,,,,,,,
DEVELOPMENT TOOLS,,,,,,,,,,
:doc:`HIPIFY <hipify:index>`,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`ROCm CMake <rocmcmakebuildtools:index>`,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0
:doc:`ROCdbgapi <rocdbgapi:index>`,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0
:doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0
`rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,N/A,N/A
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3
,,,,,,,,,,
COMPILERS,.. _compilers-support-compatibility-matrix-past-60:,,,,,,,,,
`clang-ocl <https://github.com/ROCm/clang-ocl>`_,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0
:doc:`hipCC <hipcc:index>`,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
`Flang <https://github.com/ROCm/flang>`_,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`llvm-project <llvm-project:index>`,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
`OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
,,,,,,,,,,
RUNTIMES,.. _runtime-support-compatibility-matrix-past-60:,,,,,,,,,
:doc:`AMD CLR <hip:understand/amd_clr>`,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
:doc:`HIP <hip:index>`,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`OpenCL Runtime <https://github.com/ROCm/clr/tree/develop/opencl>`_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0
:doc:`ROCr Runtime <rocr-runtime:index>`,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0

ROCm Version,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.5, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,,
,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2"
,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5"
,"RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.4 [#red-hat94-past-60]_, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2"
,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8"
,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4"
,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9
,Oracle Linux 8.10 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,Oracle Linux 8.9 [#oracle89-past-60]_,,,
,.. _architecture-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3
,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2
,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA
,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3
,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2
,.. _gpu-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>`,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100
,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030
,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_
,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a
,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908
,,,,,,,,,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`PyTorch <rocm-install-on-linux:install/3rd-party/pytorch-install>`,"2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`TensorFlow <rocm-install-on-linux:install/3rd-party/tensorflow-install>`,"2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
:doc:`JAX <rocm-install-on-linux:install/3rd-party/jax-install>`,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
,,,,,,,,,,,
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,
`UCC <https://github.com/ROCm/ucc>`_,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0
`UCX <https://github.com/ROCm/ucx>`_,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1
,,,,,,,,,,,
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,
Thrust,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
CUB,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
,,,,,,,,,,,
KMD & USER SPACE [#kfd_support-past-60]_,.. _kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,
Tested user space versions,"6.3.x, 6.2.x, 6.1.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x"
,,,,,,,,,,,
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0
:doc:`MIGraphX <amdmigraphx:index>`,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0
:doc:`MIOpen <miopen:index>`,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`MIVisionX <mivisionx:index>`,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0
:doc:`rocAL <rocal:index>`,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`rocDecode <rocdecode:index>`,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A
:doc:`rocJPEG <rocjpeg:index>`,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`rocPyDecode <rocpydecode:index>`,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`RPP <rpp:index>`,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0
,,,,,,,,,,,
COMMUNICATION,.. _commlibs-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`RCCL <rccl:index>`,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3
,,,,,,,,,,,
MATH LIBS,.. _mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,,
`half <https://github.com/ROCm/half>`_ ,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0
:doc:`hipBLAS <hipblas:index>`,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0
:doc:`hipBLASLt <hipblaslt:index>`,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0
:doc:`hipFFT <hipfft:index>`,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13
:doc:`hipfort <hipfort:index>`,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0
:doc:`hipRAND <hiprand:index>`,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16
:doc:`hipSOLVER <hipsolver:index>`,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0
:doc:`hipSPARSE <hipsparse:index>`,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
:doc:`hipSPARSELt <hipsparselt:index>`,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0
:doc:`rocALUTION <rocalution:index>`,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3
:doc:`rocBLAS <rocblas:index>`,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0
:doc:`rocFFT <rocfft:index>`,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23
:doc:`rocRAND <rocrand:index>`,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17
:doc:`rocSOLVER <rocsolver:index>`,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0
:doc:`rocSPARSE <rocsparse:index>`,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2
:doc:`rocWMMA <rocwmma:index>`,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0
:doc:`Tensile <tensile:index>`,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0
,,,,,,,,,,,
PRIMITIVES,.. _primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`hipCUB <hipcub:index>`,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`hipTensor <hiptensor:index>`,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0
:doc:`rocPRIM <rocprim:index>`,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`rocThrust <rocthrust:index>`,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
,,,,,,,,,,,
SUPPORT LIBS,,,,,,,,,,,
`hipother <https://github.com/ROCm/hipother>`_,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`rocm-core <https://github.com/ROCm/rocm-core>`_,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245
,,,,,,,,,,,
SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`AMD SMI <amdsmi:index>`,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2
:doc:`ROCm Data Center Tool <rdc:index>`,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0
:doc:`rocminfo <rocminfo:index>`,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`ROCm SMI <rocm_smi_lib:index>`,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0
:doc:`ROCm Validation Suite <rocmvalidationsuite:index>`,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60105,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000
,,,,,,,,,,,
PERFORMANCE TOOLS,,,,,,,,,,,
:doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>`,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCm Systems Profiler <rocprofiler-systems:index>`,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCProfiler <rocprofiler:index>`,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60105,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000
:doc:`ROCprofiler-SDK <rocprofiler-sdk:index>`,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCTracer <roctracer:index>`,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60105,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000
,,,,,,,,,,,
DEVELOPMENT TOOLS,,,,,,,,,,,
:doc:`HIPIFY <hipify:index>`,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`ROCm CMake <rocmcmakebuildtools:index>`,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0
:doc:`ROCdbgapi <rocdbgapi:index>`,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0
:doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0
`rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,0.3.0,N/A,N/A
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3
,,,,,,,,,,,
COMPILERS,.. _compilers-support-compatibility-matrix-past-60:,,,,,,,,,,
`clang-ocl <https://github.com/ROCm/clang-ocl>`_,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0
:doc:`hipCC <hipcc:index>`,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
`Flang <https://github.com/ROCm/flang>`_,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`llvm-project <llvm-project:index>`,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
`OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
,,,,,,,,,,,
RUNTIMES,.. _runtime-support-compatibility-matrix-past-60:,,,,,,,,,,
:doc:`AMD CLR <hip:understand/amd_clr>`,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
:doc:`HIP <hip:index>`,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`OpenCL Runtime <https://github.com/ROCm/clr/tree/develop/opencl>`_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0
:doc:`ROCr Runtime <rocr-runtime:index>`,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0
@@ -61,7 +61,7 @@ compatibility and system requirements.
Thrust,2.3.2,2.2.0,2.1.0
CUB,2.3.2,2.2.0,2.1.0
,,,
KFD & USER SPACE [#kfd_support]_,.. _kfd-userspace-support-compatibility-matrix:,,
KMD & USER SPACE [#kfd_support]_,.. _kfd-userspace-support-compatibility-matrix:,,
Tested user space versions,"6.3.x, 6.2.x, 6.1.x","6.3.x, 6.2.x, 6.1.x, 6.0.x","6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x"
,,,
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix:,,
@@ -146,11 +146,11 @@ compatibility and system requirements.

.. rubric:: Footnotes

.. [#red-hat94] RHEL 9.4 is supported only on AMD Instinct MI300A.
.. [#red-hat94] **For ROCm 6.1** - RHEL 9.4 is supported only on AMD Instinct MI300A.
.. [#oracle89] Oracle Linux is supported only on AMD Instinct MI300X.
.. [#mi300_624] **For ROCm 6.2.4** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
.. [#mi300_610] **For ROCm 6.1.0** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.4.
.. [#kfd_support] ROCm provides forward and backward compatibility between the Kernel Fusion Driver (KFD) and its user space software for +/- 2 releases. These are the compatibility combinations that are currently supported.
.. [#kfd_support] ROCm provides forward and backward compatibility between the AMD Kernel-mode GPU Driver (KMD) and its user space software for +/- 2 releases. These are the compatibility combinations that are currently supported.
.. [#ROCT-rocr] As of ROCm 6.3.0, the ROCT Thunk Interface is now included as part of the ROCr runtime package.

.. _OS-kernel-versions:
@@ -172,9 +172,8 @@ Use this lookup table to confirm which operating system and kernel versions are
`Ubuntu <https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle>`_, 22.04.5, "5.15 GA, 6.8 HWE"
, 22.04.4, "5.15 GA, 6.5 HWE"
, 22.04.3, "5.15 GA, 6.2 HWE"
, 22.04.2, "5.15 GA, 5.19 HWE"
,,
`Ubuntu <https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle>`_, 20.04.06, "5.15 HWE"
`Ubuntu <https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle>`_, 20.04.6, "5.15 HWE"
, 20.04.5, "5.15 HWE"
,,
`Red Hat Enterprise Linux (RHEL) <https://access.redhat.com/articles/3078#RHEL9>`_, 9.5, 5.14.0
@@ -194,7 +193,6 @@ Use this lookup table to confirm which operating system and kernel versions are
,,
`Oracle Linux <https://blogs.oracle.com/scoter/post/oracle-linux-and-unbreakable-enterprise-kernel-uek-releases>`_, 8.10, 5.15.0
,8.9, 5.15.0
`Azure Linux <https://github.com/microsoft/azurelinux/releases>`_, 3.0, 6.6.60

..
Footnotes and ref anchors in below historical tables should be appended with "-past-60", to differentiate from the
@@ -222,6 +220,7 @@ Expand for full historical view of:

.. rubric:: Footnotes

.. [#red-hat94-past-60] **For ROCm 6.1** - RHEL 9.4 is supported only on AMD Instinct MI300A.
.. [#oracle89-past-60] Oracle Linux is supported only on AMD Instinct MI300X.
.. [#mi300_624-past-60] **For ROCm 6.2.4** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
.. [#mi300_622-past-60] **For ROCm 6.2.2** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
@@ -232,5 +231,5 @@ Expand for full historical view of:
.. [#mi300_610-past-60] **For ROCm 6.1.0** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.4.
.. [#mi300_602-past-60] **For ROCm 6.0.2** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.3.
.. [#mi300_600-past-60] **For ROCm 6.0.0** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.3.
.. [#kfd_support-past-60] ROCm provides forward and backward compatibility between the Kernel Fusion Driver (KFD) and its user space software for +/- 2 releases. These are the compatibility combinations that are currently supported.
.. [#kfd_support-past-60] ROCm provides forward and backward compatibility between the AMD Kernel-mode GPU Driver (KMD) and its user space software for +/- 2 releases. These are the compatibility combinations that are currently supported.
.. [#ROCT-rocr-past-60] As of ROCm 6.3.0, the ROCT Thunk Interface is now included as part of the ROCr runtime package.
@@ -43,6 +43,7 @@ article_pages = [
    {"file": "how-to/rocm-for-ai/index", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/install", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/train-a-model", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/accelerate-training", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/deploy-your-model", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/hugging-face-models", "os": ["linux"]},
    {"file": "how-to/rocm-for-hpc/index", "os": ["linux"]},
BIN docs/data/how-to/rocm-for-ai/2-node-training-master.png (new file, 139 KiB)
BIN docs/data/how-to/rocm-for-ai/2-node-training-worker.png (new file, 30 KiB)
BIN docs/data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png (new file, 36 KiB)
BIN (unnamed image, 35 KiB)
BIN (unnamed image, 242 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png (new file, 155 KiB)
BIN docs/data/how-to/rocm-for-ai/rccl-tests-8-gpu.png (new file, 242 KiB)
@@ -16,6 +16,8 @@ In this guide, you'll learn about:

- :doc:`Installing ROCm and machine learning frameworks <install>`

- :doc:`Scaling model training <scale-model-training>`

- :doc:`Training a model <train-a-model>`

- :doc:`Running models from Hugging Face <hugging-face-models>`
docs/how-to/rocm-for-ai/scale-model-training.rst (new file, 135 lines)
@@ -0,0 +1,135 @@
.. meta::
   :description: How to scale and accelerate model training
   :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial

**********************
Scaling model training
**********************

To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, no single GPU or accelerator can
store and process all the model parameters required for training. PyTorch addresses this computational
constraint through its distributed training frameworks.
.. _rocm-for-ai-pytorch-distributed:

PyTorch distributed
===================

Features in ``torch.distributed`` are categorized into three main components:

- `Distributed data-parallel training
  <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)

- `RPC-Based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)

- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_

In this topic, the focus is on the distributed data-parallelism strategy as it’s the most popular. To get started with DDP,
you need to first understand how to coordinate the model and its training data across multiple accelerators or GPUs.

The DDP workflow on multiple accelerators or GPUs is as follows:

#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
   the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.

#. Copy the model to every device so each can process its local batches independently.

#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
   model for that local batch. This happens in parallel on multiple devices.

#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
   weights are then redistributed to each device.

In DDP training, each process or worker owns a replica of the model and processes a batch of data, and then the reducer uses
``allreduce`` to sum up gradients over different workers. A minimal sketch of this workflow appears below.
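The following is a minimal, hypothetical DDP sketch of the four steps above. The model, data, and
hyperparameters are illustrative placeholders, not part of any referenced tutorial; launch it with
``torchrun --nproc_per_node=8 train_ddp.py`` (or however many GPUs are available).

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker.
       # On ROCm, the "nccl" backend is backed by RCCL.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # Step 2: every rank holds a full replica of the model.
       model = torch.nn.Linear(1024, 1024).cuda(local_rank)
       ddp_model = DDP(model, device_ids=[local_rank])
       optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

       # Step 1: each rank draws its own local batch (placeholder random data).
       inputs = torch.randn(4, 1024, device=local_rank)
       labels = torch.randn(4, 1024, device=local_rank)

       # Step 3: forward and backward passes run in parallel on every device.
       loss = torch.nn.functional.mse_loss(ddp_model(inputs), labels)
       # Step 4: DDP allreduces the local gradients during backward().
       loss.backward()
       optimizer.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()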
See the following developer blogs for more in-depth explanations and examples.

* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_

* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
  <https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_
.. _rocm-for-ai-pytorch-fsdp:

PyTorch FSDP
------------

As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, DDP model weights and optimizer states
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
model parameters, optimizer states, and gradients across DDP ranks.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes
training some very large models feasible by allowing larger models or batch sizes to fit on-device. However, this
comes with the cost of increased communication volume. The communication overhead is reduced by internal optimizations
like overlapping communication and computation.

For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.

For detailed training steps, see `PyTorch FSDP examples
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_. A short sketch of the wrapping step follows.
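The sketch below shows, under the same placeholder assumptions as the DDP example, how swapping the
``DDP`` wrapper for ``FullyShardedDataParallel`` changes what each rank stores. It is a minimal
illustration, not a tuned configuration.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   dist.init_process_group(backend="nccl")
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   model = torch.nn.Sequential(
       torch.nn.Linear(1024, 4096),
       torch.nn.ReLU(),
       torch.nn.Linear(4096, 1024),
   ).cuda(local_rank)

   # Unlike DDP, FSDP shards parameters, gradients, and optimizer state across
   # ranks, gathering full parameters only around each submodule's compute.
   fsdp_model = FSDP(model)
   optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

   loss = fsdp_model(torch.randn(4, 1024, device=local_rank)).sum()
   loss.backward()
   optimizer.step()
   dist.destroy_process_group()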
.. _rocm-for-ai-deepspeed:

DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under
the training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example of
training with DeepSpeed on an AMD accelerator or GPU. A bare-bones setup is sketched below.
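The following is a bare-bones, hypothetical DeepSpeed setup using ZeRO stage 2. The config values and
the toy model are illustrative only and are not taken from the blog post above; run it with the
``deepspeed`` launcher so the distributed environment is set up for you.

.. code-block:: python

   import deepspeed
   import torch

   # Illustrative config: shard optimizer state and gradients (ZeRO stage 2)
   # and train in bfloat16.
   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "zero_optimization": {"stage": 2},
       "bf16": {"enabled": True},
   }

   model = torch.nn.Linear(1024, 1024)

   # deepspeed.initialize wraps the model in an engine that handles distributed
   # setup, ZeRO partitioning, and mixed precision.
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )

   batch = torch.randn(4, 1024, device=model_engine.device, dtype=torch.bfloat16)
   loss = model_engine(batch).sum()
   model_engine.backward(loss)  # DeepSpeed manages loss scaling and gradient allreduce
   model_engine.step()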
.. _rocm-for-ai-automatic-mixed-precision:

Automatic mixed precision (AMP)
-------------------------------

As models increase in size, so do the time and memory needed to train them, and so does their cost. Any measure we
can take to reduce training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator. The core training-loop pattern is sketched below.
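This is the standard PyTorch AMP training-loop pattern, shown here with a placeholder model and data:
``autocast`` runs eligible ops in lower precision, while ``GradScaler`` scales the loss so that small
fp16 gradients do not underflow. On ROCm, AMD GPUs are exposed through the ``cuda`` device type.

.. code-block:: python

   import torch

   model = torch.nn.Linear(1024, 1024).to("cuda")
   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
   scaler = torch.cuda.amp.GradScaler()

   inputs = torch.randn(8, 1024, device="cuda")
   targets = torch.randn(8, 1024, device="cuda")

   optimizer.zero_grad()
   with torch.autocast(device_type="cuda", dtype=torch.float16):
       loss = torch.nn.functional.mse_loss(model(inputs), targets)

   scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
   scaler.step(optimizer)         # unscales gradients; skips the step on inf/nan
   scaler.update()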
.. _rocm-for-ai-fine-tune:

Fine-tuning your model
======================

ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
example, LoRA, QLoRA, PEFT, and FSDP.

Learn more about challenges and solutions for model fine-tuning in :doc:`../llm-fine-tuning-optimization/index`.

The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning>`_
@@ -1,140 +1,503 @@
|
||||
.. meta::
|
||||
:description: How to use ROCm for AI
|
||||
:keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial
|
||||
:description: How to train a model using ROCm Megatron-LM
|
||||
:keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch
|
||||
|
||||
****************
|
||||
Training a model
|
||||
****************
|
||||
**************************************
|
||||
Training a model with ROCm Megatron-LM
|
||||
**************************************
|
||||
|
||||
The following is a brief overview of popular component paths per AI development use-case, such as training, LLMs,
|
||||
and inferencing.
|
||||
.. _amd-megatron-lm:
|
||||
|
||||
Accelerating model training
|
||||
===========================
|
||||
The ROCm Megatron-LM framework is a specialized fork of the robust Megatron-LM, designed to
|
||||
enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X
|
||||
accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
|
||||
workloads. It is purpose-built to :ref:`support models <amd-megatron-lm-model-support>`
|
||||
like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater
|
||||
efficiency. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
|
||||
|
||||
To train a large model like GPT2 or Llama 2 70B, a single accelerator or GPU cannot store all the model parameters
|
||||
required for training. What if you could convert the single-GPU training code to run on multiple accelerators or GPUs?
|
||||
PyTorch offers distributed training solutions to facilitate this.
|
||||
For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential
|
||||
components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the
|
||||
following software to accelerate training workloads:
|
||||
|
||||
.. _rocm-for-ai-pytorch-distributed:
|
||||
+--------------------------+--------------------------------+
|
||||
| Software component | Version |
|
||||
+==========================+================================+
|
||||
| ROCm | 6.1 |
|
||||
+--------------------------+--------------------------------+
|
||||
| PyTorch | 2.4.0 |
|
||||
+--------------------------+--------------------------------+
|
||||
| PyTorch Lightning | 2.4.0 |
|
||||
+--------------------------+--------------------------------+
|
||||
| Megatron Core | 0.9.0 |
|
||||
+--------------------------+--------------------------------+
|
||||
| Transformer Engine | 1.5.0 |
|
||||
+--------------------------+--------------------------------+
|
||||
| Flash Attention | v2.6 |
|
||||
+--------------------------+--------------------------------+
|
||||
| Transformers | 4.44.0 |
|
||||
+--------------------------+--------------------------------+
|
||||
|
||||
PyTorch distributed
|
||||
-------------------
|
||||
Supported features and models
|
||||
=============================
|
||||
|
||||
As of PyTorch 1.6.0, features in ``torch.distributed`` are categorized into three main components:
|
||||
Megatron-LM provides the following key features to train large language models efficiently:
|
||||
|
||||
- `Distributed data-parallel training
|
||||
<https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_ (DDP)
|
||||
- Transformer Engine (TE)
|
||||
|
||||
- `RPC-Based distributed training <https://pytorch.org/docs/stable/rpc.html>`_ (RPC)
|
||||
- APEX
|
||||
|
||||
- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_
|
||||
- GEMM tuning
|
||||
|
||||
In this guide, the focus is on the distributed data-parallelism strategy as it’s the most popular. To get started with DDP,
|
||||
let’s first understand how to coordinate the model and its training data across multiple accelerators or GPUs.
|
||||
- Torch.compile
|
||||
|
||||
The DDP workflow on multiple accelerators or GPUs is as follows:
|
||||
- 3D parallelism: TP + SP + CP
|
||||
|
||||
#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
|
||||
the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.
|
||||
- Distributed optimizer
|
||||
|
||||
#. Copy the model to every device so each device can process its local batches independently.
|
||||
- Flash Attention (FA) 2
|
||||
|
||||
#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the
|
||||
model for that local batch. This happens in parallel on multiple devices.
|
||||
- Fused kernels
|
||||
|
||||
#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated
|
||||
weights are then redistributed to each device.
|
||||
- Pre-training
|
||||
|
||||
In DDP training, each process or worker owns a replica of the model and processes a batch of data, then the reducer uses
|
||||
``allreduce`` to sum up gradients over different workers.
|
||||
.. _amd-megatron-lm-model-support:
|
||||
|
||||
See the following developer blogs for more in-depth explanations and examples.
|
||||
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
|
||||
|
||||
* `Multi GPU training with DDP — PyTorch Tutorials <https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html>`_
|
||||
* Llama 2 7B
|
||||
|
||||
* `Building a decoder transformer model on AMD GPUs — ROCm Blogs
|
||||
<https://rocm.blogs.amd.com/artificial-intelligence/decoder-transformer/README.html#distributed-training-on-multiple-gpus>`_
|
||||
* Llama 2 70B
|
||||
|
||||
.. _rocm-for-ai-pytorch-fsdp:
|
||||
* Llama 3 8B
|
||||
|
||||
PyTorch FSDP
|
||||
------------
|
||||
* Llama 3 70B
|
||||
|
||||
As noted in :ref:`PyTorch distributed <rocm-for-ai-pytorch-distributed>`, in DDP model weights and optimizer states
|
||||
are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards
|
||||
model parameters, optimizer states, and gradients across DDP ranks.
|
||||
* Llama 3.1 8B
|
||||
|
||||
When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes
|
||||
the training of some very large models feasible by allowing larger models or batch sizes to fit on-device. However, this
|
||||
comes with the cost of increased communication volume. The communication overhead is reduced by internal optimizations
|
||||
like overlapping communication and computation.
|
||||
* Llama 3.1 70B
|
||||
|
||||
For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel
|
||||
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#how-fsdp-works>`_.
|
||||
Prerequisite system validation steps
|
||||
====================================
|
||||
|
||||
For detailed training steps, refer to the `PyTorch FSDP examples
|
||||
<https://github.com/pytorch/examples/tree/main/distributed/FSDP>`_.
.. _rocm-for-ai-deepspeed:

DeepSpeed
---------

`DeepSpeed <https://deepspeed.ai>`_ offers system innovations that make large-scale deep learning training effective,
efficient, and easy to use. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, and so on fall under
the training pillar.

See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example of
training with DeepSpeed on an AMD accelerator or GPU.
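
For orientation, the following is a minimal sketch of a DeepSpeed setup with a ZeRO stage 2
configuration. The model and configuration values are illustrative placeholders only; see the
DeepSpeed documentation for the full configuration schema.

.. code-block:: python

   # Minimal DeepSpeed ZeRO sketch -- launch with: deepspeed ds_example.py
   # Model and configuration values are placeholders for illustration.
   import torch
   import torch.nn as nn
   import deepspeed

   ds_config = {
       "train_batch_size": 32,
       "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
       "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
       "bf16": {"enabled": True},
   }

   model = nn.Linear(128, 1)  # placeholder model
   engine, optimizer, _, _ = deepspeed.initialize(
       model=model, model_parameters=model.parameters(), config=ds_config
   )

   x = torch.randn(4, 128).to(engine.device)
   y = torch.randn(4, 1).to(engine.device)
   loss = nn.functional.mse_loss(engine(x), y)
   engine.backward(loss)  # the engine handles loss scaling internally
   engine.step()
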
.. _rocm-for-ai-automatic-mixed-precision:

Automatic mixed precision (AMP)
-------------------------------

As models increase in size, the time and memory needed to train them (that is, their cost) also increase. Any measure
you can take to reduce training time and memory usage through `automatic mixed precision
<https://pytorch.org/docs/stable/amp.html>`_ (AMP) is highly beneficial for most use cases.

See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD accelerator.
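
A typical AMP training step combines ``autocast`` for the forward pass with a gradient scaler that
guards against FP16 underflow. The following is a minimal sketch with a placeholder model; on ROCm
builds of PyTorch, AMD GPUs are exposed through the same ``cuda`` device type.

.. code-block:: python

   # Minimal AMP sketch: mixed-precision forward pass plus gradient scaling.
   # The model and data are placeholders for illustration only.
   import torch
   import torch.nn as nn

   model = nn.Linear(128, 1).cuda()
   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
   scaler = torch.cuda.amp.GradScaler()

   x = torch.randn(32, 128, device="cuda")
   y = torch.randn(32, 1, device="cuda")

   for _ in range(10):
       optimizer.zero_grad()
       with torch.autocast(device_type="cuda", dtype=torch.float16):
           loss = nn.functional.mse_loss(model(x), y)
       scaler.scale(loss).backward()  # scale up so small gradients stay representable
       scaler.step(optimizer)         # unscales first; skips the step on inf/NaN
       scaler.update()
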
.. _rocm-for-ai-fine-tune:

Fine-tuning your model
======================

ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
example, LoRA, QLoRA, PEFT, and FSDP.

Learn more about challenges and solutions for model fine-tuning in :doc:`../llm-fine-tuning-optimization/index`.
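
As a quick illustration of one of these techniques, the sketch below attaches a LoRA adapter to a
Hugging Face model with the ``peft`` library. The base model, target modules, and hyperparameters
are examples only, not prescribed values.

.. code-block:: python

   # Minimal LoRA sketch using Hugging Face peft; values are illustrative.
   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

   lora_config = LoraConfig(
       r=8,                                  # rank of the low-rank update matrices
       lora_alpha=16,                        # scaling applied to the update
       target_modules=["q_proj", "v_proj"],  # attention projections to adapt
       lora_dropout=0.05,
       task_type="CAUSAL_LM",
   )

   model = get_peft_model(model, lora_config)
   model.print_trainable_parameters()  # only the small adapter matrices are trainable
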
The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.

* Fine-tuning Llama2 with LoRA

  * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-lora/README.html>`_

* Fine-tuning Llama2 with QLoRA

  * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html>`_

* Fine-tuning a BERT-based LLM for a text classification task using JAX

  * `LLM distributed supervised fine-tuning with JAX — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_

* Fine-tuning StarCoder using PEFT

  * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs — ROCm Blogs
    <https://rocm.blogs.amd.com/artificial-intelligence/starcoder-fine-tune/README.html>`_

* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes``

  * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover
    single/multi-node GPUs <https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/finetuning>`_

The ROCm Megatron-LM Docker environment supports the following key features and optimizations:

- Torch.compile
- 3D parallelism: TP + SP + CP
- Distributed optimizer
- Flash Attention (FA) 2
- Fused kernels
- Pre-training

.. _amd-megatron-lm-model-support:

The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.

* Llama 2 7B
* Llama 2 70B
* Llama 3 8B
* Llama 3 70B
* Llama 3.1 8B
* Llama 3.1 70B

Prerequisite system validation steps
====================================

Complete the following system validation and optimization steps to set up your system before starting training.

Disable NUMA auto-balancing
---------------------------

Generally, application performance can benefit from disabling NUMA auto-balancing. However,
it might be detrimental to performance with certain types of workloads.

Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform
Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or
the output is ``1``, run the following command to disable NUMA auto-balancing.

.. code-block:: shell

   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

See :ref:`mi300x-disable-numa` for more information.

Hardware verification with ROCm
-------------------------------

Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
You can restore this setting to its default value with the ``rocm-smi -r`` command.

Run the command:

.. code-block:: shell

   rocm-smi --setperfdeterminism 1900

See :ref:`mi300x-hardware-verification-with-rocm` for more information.

RCCL Bandwidth Test
-------------------

ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
routines for GPUs. See the :doc:`RCCL documentation <rccl:index>` for more information. Before starting
pre-training, running an RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
for efficient distributed training.

Running the RCCL bandwidth test helps verify that:

- The GPUs can communicate across nodes or within a single node.
- The interconnect (such as InfiniBand, Ethernet, or Infinity Fabric) is functioning as expected and
  provides adequate bandwidth for communication.
- No hardware setup or cabling issues could affect the communication between GPUs.

Tuning and optimizing hyperparameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In distributed training, specific hyperparameters related to distributed communication can be tuned based on
the results of the RCCL bandwidth test. These variables are already set in the Docker image:

.. code-block:: shell

   # force all RCCL streams to be high priority
   export TORCH_NCCL_HIGH_PRIORITY=1

   # specify which RDMA interfaces to use for communication
   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7

   # define the Global ID index used in RoCE mode
   export NCCL_IB_GID_INDEX=3

   # avoid data corruption/mismatch issue that existed in past releases
   export RCCL_MSCCL_ENABLE=0
Running the RCCL Bandwidth Test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It's recommended that you run the RCCL bandwidth test before launching training to confirm that system
performance is sufficient. The RCCL tests are not included in the AMD Megatron-LM Docker
image; follow the instructions in `<https://github.com/ROCm/rccl-tests>`__ to get started.
See :ref:`mi300x-rccl` for more information.

Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:

.. code-block:: shell

   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
   :width: 800

Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node is
recommended. So, a run on 8 GPUs looks something like:

.. code-block:: shell

   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
   :width: 800

Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
PyTorch and TensorFlow.

Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed.

.. code-block:: shell

   /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8,tw015:8 \
       --mca pml ucx \
       --mca btl ^openib \
       -x NCCL_SOCKET_IFNAME=ens50f0np0 \
       -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \
       -x NCCL_IB_GID_INDEX=3 \
       -x NCCL_MIN_NCHANNELS=40 \
       -x NCCL_DEBUG=version \
       $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1

.. image:: ../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png
   :width: 800
.. _mi300x-amd-megatron-lm-training:

Start training on MI300X accelerators
=====================================

The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1.

Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
image.

.. _amd-megatron-lm-requirements:

Download the Docker image and required packages
-----------------------------------------------

1. Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull rocm/megatron-lm:24.12-dev

2. Launch the Docker container.

   .. code-block:: shell

      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $CACHE_DIR:/root/.cache --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash

3. Clone the ROCm Megatron-LM repository to a local directory and install the required packages on the host machine.

   .. code-block:: shell

      git clone https://github.com/ROCm/Megatron-LM
      cd Megatron-LM

   .. note::

      This release is validated with ``ROCm/Megatron-LM`` commit `bb93ccb <https://github.com/ROCm/Megatron-LM/tree/bb93ccbfeae6363c67b361a97a27c74ab86e7e92>`_.
      Checking out this specific commit is recommended for a stable and reproducible environment.

      .. code-block:: shell

         git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92
Prepare training datasets
-------------------------

If you already have the preprocessed data, you can skip this section.

Use the following command to process datasets. This example uses GPT data; you can change the merge table, append an
end-of-document token, control sentence splitting, and set the tokenizer type as needed.

.. code-block:: shell

   python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

In this case, the automatically generated output files are named ``my-gpt2_text_document.bin`` and
``my-gpt2_text_document.idx``.

.. image:: ../../data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png
   :width: 800
.. _amd-megatron-lm-environment-setup:

Environment setup
-----------------

In the ``examples/llama`` directory of Megatron-LM, if you're working with Llama 2 7B or Llama 2 70B, use the
``train_llama2.sh`` configuration script. Likewise, if you're working with Llama 3 or Llama 3.1, then use
``train_llama3.sh`` and update the configuration script accordingly.

Network interface
^^^^^^^^^^^^^^^^^

To avoid connectivity issues, ensure the correct network interface is set in your training scripts.

1. Run the following command to find the active network interface on your system.

   .. code-block:: shell

      ip a

2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system's network interface. For
   example:

   .. code-block:: shell

      export NCCL_SOCKET_IFNAME=ens50f0np0
      export GLOO_SOCKET_IFNAME=ens50f0np0
Dataset options
^^^^^^^^^^^^^^^

You can use either mock data or real data for training.

* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset.

  .. code-block:: shell

     DATA_DIR="/root/.cache/data" # Change to where your dataset is stored
     DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence

  .. code-block:: shell

     --data-path $DATA_PATH

  Ensure that the files are accessible inside the Docker container.

* Mock data can be useful for testing and validation. If you're using mock data, replace ``--data-path $DATA_PATH`` with the ``--mock-data`` option.

  .. code-block:: shell

     --mock-data
Tokenizer
^^^^^^^^^

Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama
models, this typically involves sub-word tokenization, where words are broken down into smaller units based on
a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a
fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to
handle a variety of input sequences, including unseen words or domain-specific terms.

To train any of the Llama 2 models that this Docker image supports, use the ``Llama2Tokenizer``.

To train any of the Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``.
Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable.

For example, if you're using the Llama 3.1 8B model:

.. code-block:: shell

   TOKENIZER_MODEL=meta-llama/Llama-3.1-8B
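
To sanity-check a tokenizer outside of training, you can load it directly with Hugging Face
``transformers``. This is an optional illustrative check; access to the gated ``meta-llama``
checkpoints requires an approved Hugging Face account and token.

.. code-block:: python

   # Quick tokenizer round-trip check; requires access to the gated model.
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
   ids = tokenizer("ROCm scales LLM training across MI300X accelerators.")["input_ids"]
   print(ids)                    # sub-word token IDs
   print(tokenizer.decode(ids))  # round-trips back to the original text
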
Run benchmark tests
-------------------

.. note::

   If you're running **multi-node training**, update the following environment variables. They can
   also be passed as command line arguments.

   * Change ``localhost`` to the master node's hostname:

     .. code-block:: shell

        MASTER_ADDR="${MASTER_ADDR:-localhost}"

   * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``):

     .. code-block:: shell

        NNODES="${NNODES:-1}"

   * Set the rank of each node (0 for master, 1 for the first worker node, and so on):

     .. code-block:: shell

        NODE_RANK="${NODE_RANK:-0}"

* Use this command to run a performance benchmark test of any of the Llama 2 models that this Docker image supports (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).

  .. code-block:: shell

     {variables} bash examples/llama/train_llama2.sh

* Use this command to run a performance benchmark test of any of the Llama 3 and Llama 3.1 models that this Docker image supports (see :ref:`variables <amd-megatron-lm-benchmark-test-vars>`).

  .. code-block:: shell

     {variables} bash examples/llama/train_llama3.sh

.. _amd-megatron-lm-benchmark-test-vars:

The benchmark tests support the same set of variables:
+----------------------+-----------------------------+--------------------------------------------+
| Name                 | Options                     | Description                                |
+======================+=============================+============================================+
| ``TEE_OUTPUT``       | 0 or 1                      | 0: disable training log                    |
|                      |                             |                                            |
|                      |                             | 1: enable training log                     |
+----------------------+-----------------------------+--------------------------------------------+
| ``MBS``              |                             | Micro batch size                           |
+----------------------+-----------------------------+--------------------------------------------+
| ``BS``               |                             | Batch size                                 |
+----------------------+-----------------------------+--------------------------------------------+
| ``TP``               | 1, 2, 4, 8                  | Tensor parallelism degree                  |
+----------------------+-----------------------------+--------------------------------------------+
| ``TE_FP8``           | 0 or 1                      | 1: FP8 datatype                            |
|                      |                             |                                            |
|                      |                             | 0: BF16 datatype                           |
+----------------------+-----------------------------+--------------------------------------------+
| ``NO_TORCH_COMPILE`` | 0 or 1                      | 1: enable torch.compile                    |
|                      |                             |                                            |
|                      |                             | 0: disable torch.compile (default)         |
+----------------------+-----------------------------+--------------------------------------------+
| ``SEQ_LENGTH``       |                             | Input sequence length                      |
+----------------------+-----------------------------+--------------------------------------------+
| ``GEMM_TUNING``      | 0 or 1                      | 1: enable GEMM tuning                      |
|                      |                             |                                            |
|                      |                             | 0: disable GEMM tuning                     |
+----------------------+-----------------------------+--------------------------------------------+
| ``USE_FLASH_ATTN``   | 0 or 1                      | 0: disable Flash Attention                 |
|                      |                             |                                            |
|                      |                             | 1: enable Flash Attention                  |
+----------------------+-----------------------------+--------------------------------------------+
| ``ENABLE_PROFILING`` | 0 or 1                      | 0: disable torch profiling                 |
|                      |                             |                                            |
|                      |                             | 1: enable torch profiling                  |
+----------------------+-----------------------------+--------------------------------------------+
| ``MODEL_SIZE``       |                             | The size of the model: 7B, 70B, and so on  |
+----------------------+-----------------------------+--------------------------------------------+
| ``TOTAL_ITERS``      |                             | Total number of iterations                 |
+----------------------+-----------------------------+--------------------------------------------+
| ``transformer-impl`` | transformer_engine or local | Transformer implementation;                |
|                      |                             | transformer_engine by default              |
+----------------------+-----------------------------+--------------------------------------------+
Benchmarking examples
^^^^^^^^^^^^^^^^^^^^^

.. tab-set::

   .. tab-item:: Single node training
      :sync: single

      Use this command to run training with the Llama 2 7B model on a single node. You can specify MBS, BS, TP,
      datatype, and so on.

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 \
         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup>`.

      See the sample output:

      .. image:: ../../data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png
         :width: 800

   .. tab-item:: Multi node training
      :sync: multi

      Launch the Docker container on each node.

      In this example, run training with the Llama 2 7B model on 2 nodes with specific MBS, BS, TP, datatype, and
      so on.

      On the master node:

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 \
         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      On the worker node:

      .. code-block:: bash

         TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 \
         SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh

      You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script <amd-megatron-lm-environment-setup>`.

      Sample output for 2-node training:

      Master node:

      .. image:: ../../data/how-to/rocm-for-ai/2-node-training-master.png
         :width: 800

      Worker node:

      .. image:: ../../data/how-to/rocm-for-ai/2-node-training-worker.png
         :width: 800
@@ -115,6 +115,12 @@ Ubuntu versions.

       for non-destructive testing or for ocean acoustics.

   * - Molecular dynamics
     - `Amber <https://github.com/amd/InfinityHub-CI/tree/main/amber>`_
     - Amber is a suite of biomolecular simulation programs. It is a set of molecular mechanical force fields for
       simulating biomolecules. Amber is also a package of molecular simulation
       programs which includes source code and demos.

   * -
     - `GROMACS with HIP (AMD implementation) <https://github.com/amd/InfinityHub-CI/tree/main/gromacs>`_
     - GROMACS is a versatile package to perform molecular dynamics, i.e.
       simulate the Newtonian equations of motion for systems with hundreds

@@ -129,6 +135,13 @@ Ubuntu versions.

       Parallel Simulator.

   * - Computational fluid dynamics
     - `Ansys Fluent <https://github.com/amd/InfinityHub-CI/tree/main/ansys-fluent>`_
     - Ansys Fluent is an advanced computational fluid dynamics (CFD) tool for
       simulating and analyzing fluid flow, heat transfer, and related phenomena in complex systems.
       It offers a range of powerful features for detailed and accurate modeling of various physical
       processes, including turbulence, chemical reactions, and multiphase flows.

   * -
     - `NEKO <https://github.com/amd/InfinityHub-CI/tree/main/neko>`_
     - Neko is a portable framework for high-order spectral element flow simulations.
       Written in modern Fortran, Neko adopts an object-oriented approach, allowing

@@ -141,6 +154,26 @@ Ubuntu versions.

     - nekRS is an open-source Navier Stokes solver based on the spectral element
       method targeting classical processors and accelerators like GPUs.

   * -
     - `OpenFOAM <https://github.com/amd/InfinityHub-CI/tree/main/openfoam>`_
     - OpenFOAM is a free, open-source computational fluid dynamics (CFD)
       tool developed primarily by OpenCFD Ltd. It has a large user
       base across most areas of engineering and science, from both commercial and
       academic organizations. OpenFOAM has extensive features to solve
       anything from complex fluid flows involving chemical reactions, turbulence, and
       heat transfer, to acoustics, solid mechanics, and electromagnetics.

   * -
     - `PeleC <https://github.com/amd/InfinityHub-CI/tree/main/pelec>`_
     - PeleC is an adaptive mesh refinement (AMR) solver for compressible reacting flows.

   * -
     - `Simcenter Star-CCM+ <https://github.com/amd/InfinityHub-CI/tree/main/siemens-star-ccm>`_
     - Simcenter Star-CCM+ is a comprehensive computational fluid dynamics (CFD) and multiphysics
       simulation tool developed by Siemens Digital Industries Software. It is designed to
       help engineers and researchers analyze and optimize the performance of products and
       systems across various industries.

   * - Computational chemistry
     - `QUDA <https://github.com/amd/InfinityHub-CI/tree/main/quda>`_
     - Library designed for efficient lattice QCD computations on

@@ -170,12 +203,30 @@ Ubuntu versions.

       developing atmosphere, ocean, and other earth-system simulation components
       for use in climate, regional climate, and weather studies.

   * - Energy, Oil, and Gas
     - `DevitoPRO <https://github.com/amd/InfinityHub-CI/tree/main/devitopro>`_
     - DevitoPRO is an advanced extension of the open-source Devito platform with added
       features tailored for high-demand production workflows. It supports
       high-performance computing (HPC) needs, especially in seismic imaging and inversion.
       It is used to perform optimized finite difference (FD) computations
       from high-level symbolic problem definitions. DevitoPro performs automated
       code generation and just-in-time (JIT) compilation based on symbolic equations
       defined in SymPy to create and execute highly optimized finite difference stencil
       kernels on multiple computer platforms.

   * -
     - `ECHELON <https://github.com/amd/InfinityHub-CI/tree/main/srt-echelon>`_
     - ECHELON by Stone Ridge Technology is a reservoir simulation tool. With
       fast processing, it retains precise accuracy and preserves legacy simulator results.
       Faster reservoir simulation enables reservoir engineers to produce many realizations,
       address larger models, and use advanced physics. It opens new workflows based on
       ensemble methodologies for history matching and forecasting that yield
       increased accuracy and more predictive results.

   * - Benchmark
     - `rocHPL <https://github.com/amd/InfinityHub-CI/tree/main/rochpl>`_
     - HPL, or High-Performance Linpack, is a benchmark which solves a uniformly
       random system of linear equations and reports floating-point execution rate.
       This documentation supports the implementation of the HPL benchmark on
       top of AMD's ROCm platform.
     - HPL, or High-Performance Linpack, is a benchmark which solves a uniformly
       random system of linear equations and reports floating-point execution rate.

   * -
     - `rocHPL-MxP <https://github.com/amd/InfinityHub-CI/tree/main/hpl-mxp>`_

@@ -216,6 +267,14 @@ Ubuntu versions.

       range of hardware platforms via use of an in-built domain specific language derived
       from the Mako templating engine.

   * -
     - `PETSc <https://github.com/amd/InfinityHub-CI/tree/main/petsc>`_
     - Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures
       and routines for the scalable (parallel) solution of scientific applications modeled by partial
       differential equations. It supports MPI, GPUs through CUDA, HIP, and OpenCL,
       as well as hybrid MPI-GPU parallelism. It also supports the NEC-SX Tsubasa Vector Engine.
       PETSc also includes the Toolkit for Advanced Optimization (TAO) library.

   * -
     - `RAJA <https://github.com/amd/InfinityHub-CI/tree/main/raja>`_
     - RAJA is a library of C++ software abstractions, primarily developed at Lawrence
@@ -537,6 +537,8 @@ installation was successful, refer to the

:doc:`rocm-install-on-linux:install/post-install`.
Should verification fail, consult :doc:`/how-to/system-debugging`.

.. _mi300x-hardware-verification-with-rocm:

Hardware verification with ROCm
-------------------------------
@@ -2062,11 +2062,10 @@ collectives.

Multi-node FSDP and RCCL settings
---------------------------------

It's recommended to use high-priority HIP streams with RCCL.

The simplest way to enable this is by using the nightly PyTorch wheels, as the required changes from
`PR #122830 <https://github.com/pytorch/pytorch/pull/122830>`_ were not included in the PyTorch 2.3
release but are available in the nightly builds.
When using PyTorch's FSDP (Fully Sharded Data Parallel) feature, the HIP
streams used by RCCL and HIP streams used for compute kernels do not
always overlap well. As a workaround, it's recommended to use
high-priority HIP streams with RCCL.

To configure high-priority streams:
@@ -37,13 +37,12 @@ ROCm documentation is organized into the following categories:

:::{grid-item-card} How to
:class-body: rocm-card-banner rocm-hue-12

* [Programming guide](./how-to/hip_programming_guide.rst)
* [Use ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [Use ROCm for HPC](./how-to/rocm-for-hpc/index.rst)
* [Fine-tune LLMs and inference optimization](./how-to/llm-fine-tuning-optimization/index.rst)
* [System optimization](./how-to/system-optimization/index.rst)
* [AMD Instinct MI300X performance validation and tuning](./how-to/tuning-guides/mi300x/index.rst)
* [GPU cluster networking](https://rocm.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html)
* [GPU cluster networking](https://dcgpu.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html)
* [System debugging](./how-to/system-debugging.md)
* [Use MPI](./how-to/gpu-enabled-mpi.rst)
* [Use advanced compiler features](./conceptual/compiler-topics.md)
@@ -40,6 +40,8 @@ subtrees:

  title: Installation
- file: how-to/rocm-for-ai/train-a-model.rst
  title: Train a model
- file: how-to/rocm-for-ai/scale-model-training.rst
  title: Scale model training
- file: how-to/rocm-for-ai/hugging-face-models.rst
  title: Run models from Hugging Face
- file: how-to/rocm-for-ai/deploy-your-model.rst
@@ -92,7 +94,7 @@ subtrees:
  title: System tuning
- file: how-to/tuning-guides/mi300x/workload.rst
  title: Workload tuning
- url: https://rocm.docs.amd.com/projects/gpu-cluster-networking/en/${branch}/index.html
- url: https://dcgpu.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html
  title: GPU cluster networking
- file: how-to/gpu-enabled-mpi.rst
  title: Use MPI
@@ -1,3 +1,3 @@

rocm-docs-core==1.11.0
rocm-docs-core==1.12.1
sphinx-reredirects
sphinx-sitemap
@@ -16,17 +16,17 @@ beautifulsoup4==4.12.3

 # via pydata-sphinx-theme
breathe==4.35.0
 # via rocm-docs-core
certifi==2024.8.30
certifi==2024.12.14
 # via requests
cffi==1.17.1
 # via
 #   cryptography
 #   pynacl
charset-normalizer==3.4.0
charset-normalizer==3.4.1
 # via requests
click==8.1.7
click==8.1.8
 # via sphinx-external-toc
cryptography==43.0.3
cryptography==44.0.0
 # via pyjwt
deprecated==1.2.15
 # via pygithub
@@ -36,17 +36,17 @@ docutils==0.21.2
 #   myst-parser
 #   pydata-sphinx-theme
 #   sphinx
fastjsonschema==2.20.0
fastjsonschema==2.21.1
 # via rocm-docs-core
gitdb==4.0.11
gitdb==4.0.12
 # via gitpython
gitpython==3.1.43
gitpython==3.1.44
 # via rocm-docs-core
idna==3.10
 # via requests
imagesize==1.4.1
 # via sphinx
jinja2==3.1.4
jinja2==3.1.5
 # via
 #   myst-parser
 #   sphinx
@@ -66,7 +66,7 @@ packaging==24.2
 # via sphinx
pycparser==2.22
 # via cffi
pydata-sphinx-theme==0.16.0
pydata-sphinx-theme==0.16.1
 # via
 #   rocm-docs-core
 #   sphinx-book-theme
@@ -77,7 +77,7 @@ pygments==2.18.0
 #   accessible-pygments
 #   pydata-sphinx-theme
 #   sphinx
pyjwt[crypto]==2.10.0
pyjwt[crypto]==2.10.1
 # via pygithub
pynacl==1.5.0
 # via pygithub
@@ -90,9 +90,9 @@ requests==2.32.3
 # via
 #   pygithub
 #   sphinx
rocm-docs-core==1.11.0
rocm-docs-core==1.12.1
 # via -r requirements.in
smmap==5.0.1
smmap==5.0.2
 # via gitdb
snowballstemmer==2.2.0
 # via sphinx
@@ -137,13 +137,13 @@ sphinxcontrib-qthelp==2.0.0
 # via sphinx
sphinxcontrib-serializinghtml==2.0.0
 # via sphinx
tomli==2.1.0
tomli==2.2.1
 # via sphinx
typing-extensions==4.12.2
 # via
 #   pydata-sphinx-theme
 #   pygithub
urllib3==2.2.3
urllib3==2.3.0
 # via
 #   pygithub
 #   requests
||||