Merge branch 'docs/6.0.2' into roc-6.0.x

Update doc requirements to 6.0.2 point fix
update
2026-01-11 15:47:59 -05:00 · 2024-03-21 14:09:48 -06:00 · 2024-03-21 09:56:19 -06:00 · 2024-03-21 09:35:35 -06:00 · 2024-03-20 21:07:38 -04:00 · 2024-03-20 15:35:57 -06:00
108 changed files with 3290 additions and 2186 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1 +1,5 @@
 * @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus @LisaDelaney
+# Documentation files
+docs/* @ROCm/rocm-documentation
+*.md @ROCm/rocm-documentation
+*.rst @ROCm/rocm-documentation
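
A note on how the new CODEOWNERS rules resolve, as a minimal sketch: GitHub documents CODEOWNERS patterns as gitignore-style, with the last matching rule taking precedence, and `docs/*` covering only files directly inside `docs/`. The Python below is a simplified model of that behaviour, not GitHub's implementation. The helper names (`pattern_matches`, `owner_for`) and the `"default owners"` label standing in for the catch-all owner list are assumptions made for illustration.

```python
import fnmatch

# Rules in file order, mirroring the hunk above. "default owners" abbreviates
# the original catch-all list of individual maintainers (illustrative label).
RULES = [
    ("*", "default owners"),
    ("docs/*", "@ROCm/rocm-documentation"),
    ("*.md", "@ROCm/rocm-documentation"),
    ("*.rst", "@ROCm/rocm-documentation"),
]


def pattern_matches(pattern, path):
    """Simplified matcher covering only the pattern shapes used above."""
    if "/" not in pattern:
        # A pattern without a slash is compared with the file name at any depth.
        basename = path.rsplit("/", 1)[-1]
        return fnmatch.fnmatchcase(basename, pattern)
    # A pattern with a slash is matched segment by segment from the repository
    # root, so "*" never crosses a directory boundary.
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pattern_parts)
    )


def owner_for(path):
    """Return the owner from the last rule that matches the path."""
    owner = None
    for pattern, candidate in RULES:
        if pattern_matches(pattern, path):
            owner = candidate
    return owner


for sample in ("docs/conf.py", "docs/sphinx/requirements.txt", "README.md"):
    print(sample, "->", owner_for(sample))
```

Under this model, a nested non-Markdown file such as `docs/sphinx/requirements.txt` still falls through to the catch-all owners; a pattern like `docs/` or `docs/**` would be needed to cover the whole tree.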
--- a/.github/workflows/linting.yml
+++ b/.github/workflows/linting.yml
@@ -17,4 +17,4 @@ on:
 jobs:
  call-workflow-passing-data:
    name: Documentation
-    uses: RadeonOpenCompute/rocm-docs-core/.github/workflows/linting.yml@develop
+    uses: ROCm/rocm-docs-core/.github/workflows/linting.yml@develop
--- a/.markdownlint-cli2.yaml
+++ b/.markdownlint-cli2.yaml
@@ -13,6 +13,5 @@ config:
  MD051: false
 ignores:
  - CHANGELOG.md
-  - docs/CHANGELOG.md
  - "{,docs/}{RELEASE,release}.md"
  - tools/autotag/templates/**/*.md
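
For readers unfamiliar with the brace syntax in the remaining `ignores` entry, the sketch below expands `{,docs/}{RELEASE,release}.md` into the concrete paths it stands for. It assumes the conventional brace-expansion semantics used by common glob libraries. The `expand_braces` helper is hypothetical, handles only non-nested groups, and is not part of markdownlint-cli2.

```python
import re


def expand_braces(pattern):
    """Expand non-nested {a,b} groups into every concrete alternative."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    results = []
    for option in match.group(1).split(","):
        # Recurse so that any later brace groups are expanded as well.
        results.extend(expand_braces(head + option + tail))
    return results


print(expand_braces("{,docs/}{RELEASE,release}.md"))
```

Because the first group contains an empty alternative, the pattern covers the release notes both at the repository root and under `docs/`, in either capitalization.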
--- a/.readthedocs.yaml
+++ b/.readthedocs.yaml
@@ -3,19 +3,16 @@

 version: 2

-build:
-   os: ubuntu-22.04
-   tools:
-      python: "3.10"
-   apt_packages:
-     - "doxygen"
-     - "graphviz" # For dot graphs in doxygen
+sphinx:
+   configuration: docs/conf.py
+
+formats: [htmlzip, pdf]

 python:
   install:
   - requirements: docs/sphinx/requirements.txt

-sphinx:
-   configuration: docs/conf.py
-
-formats: []
+build:
+   os: ubuntu-20.04
+   tools:
+      python: "3.8"
--- a/.wordlist.txt
+++ b/.wordlist.txt
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,49 @@ This page contains the release notes for AMD ROCm Software.

 -------------------

+## ROCm 6.0.2
+
+The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings. 
+
+### Library changes in ROCm 6.0.2
+
+| Library | Version |
+|---------|---------|
+| AMDMIGraphX |  ⇒ [2.8](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.0.2) |
+| hipBLAS |  ⇒ [2.0.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.0.2) |
+| hipBLASLt |  ⇒ [0.6.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.0.2) |
+| hipCUB |  ⇒ [3.0.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.0.2) |
+| hipFFT |  ⇒ [1.0.13](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.0.2) |
+| hipRAND |  ⇒ [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.0.2) |
+| hipSOLVER |  ⇒ [2.0.0](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.0.2) |
+| hipSPARSE |  ⇒ [3.0.0](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.0.2) |
+| hipSPARSELt |  ⇒ [0.1.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.0.2) |
+| hipTensor |  ⇒ [1.1.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.0.2) |
+| MIOpen |  ⇒ [2.19.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.0.2) |
+| rccl |  ⇒ [2.15.5](https://github.com/ROCm/rccl/releases/tag/rocm-6.0.2) |
+| rocALUTION |  ⇒ [3.0.3](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.0.2) |
+| rocBLAS |  ⇒ [4.0.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.0.2) |
+| rocFFT |  ⇒ [1.0.25](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.0.2) |
+| rocm-cmake |  ⇒ [0.11.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.0.2) |
+| rocPRIM |  ⇒ [3.0.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.0.2) |
+| rocRAND |  ⇒ [3.0.0](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.0.2) |
+| rocSOLVER |  ⇒ [3.24.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.0.2) |
+| rocSPARSE |  ⇒ [3.0.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.0.2) |
+| rocThrust |  ⇒ [3.0.0](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.0.2) |
+| rocWMMA |  ⇒ [1.3.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.0.2) |
+| Tensile |  ⇒ [4.39.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.0.2) |
+
+#### hipFFT 1.0.13
+
+hipFFT 1.0.13 for ROCm 6.0.2
+
+##### Changes
+
+* Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files
+ over (this should help simplify downstream builds and packaging)
+
+-------------------
+
 ## ROCm 6.0.0
 <!-- markdownlint-disable first-line-h1 -->
 <!-- markdownlint-disable no-duplicate-header -->
@@ -31,6 +74,9 @@ MI300 series. Future releases will further enable and optimize this new platfo
  tutorials on the [AMD ROCm Docs](https://rocm.docs.amd.com) site.
 * Consolidated developer resources and training on the new AMD ROCm Developer Hub.

+The following sections provide a release overview for ROCm 6.0. For additional details, refer to
+the [Changelog](https://rocm.docs.amd.com/en/docs-6.0.0/about/CHANGELOG.html).
+
 ### OS and GPU support changes

 AMD Instinct™ MI300A and MI300X Accelerator support has been enabled for limited operating
@@ -57,6 +103,8 @@ We've added a new ROCm meta package for easy installation of all ROCm core packa
 libraries. For example, the following command will install the full ROCm package: `apt-get install rocm`
 (Ubuntu), or `yum install rocm` (RHEL).

+   > To use ROCm on Radeon GPUs, refer to [Install Radeon software for Linux with ROCm](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-radeon.html).
+
 ### Filesystem Hierarchy Standard

 ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We've removed
@@ -74,11 +122,11 @@ the backward compatibility support for old file locations.
 ### Documentation

 CMake support has been added for documentation in the
-[ROCm repository](https://github.com/RadeonOpenCompute/ROCm).
+[ROCm repository](https://github.com/ROCm/ROCm).

 ### AMD Instinct™ MI50 end-of-support notice

-AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enters
+AMD Instinct MI50, Radeon™ PRO VII, and Radeon™ VII products (collectively gfx906 GPUs) enter
 maintenance mode in ROCm 6.0.

 As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 was the
@@ -106,29 +154,29 @@ final release for gfx906 GPUs in a fully supported state.

 | Library | Version |
 |---------|---------|
-| AMDMIGraphX |  ⇒ [2.8](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/releases/tag/rocm-6.0.0) |
+| AMDMIGraphX |  ⇒ [2.8](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.0.0) |
 | HIP | [6.0.0](https://github.com/ROCm/HIP/releases/tag/rocm-6.0.0) |
-| hipBLAS |  ⇒ [2.0.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-6.0.0) |
-| hipCUB |  ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-6.0.0) |
-| hipFFT |  ⇒ [1.0.13](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-6.0.0) |
-| hipSOLVER |  ⇒ [2.0.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-6.0.0) |
-| hipSPARSE |  ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-6.0.0) |
-| hipTensor |  ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipTensor/releases/tag/rocm-6.0.0) |
-| MIOpen |  ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-6.0.0) |
-| rccl |  ⇒ [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-6.0.0) |
-| rocALUTION |  ⇒ [3.0.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-6.0.0) |
-| rocBLAS |  ⇒ [4.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-6.0.0) |
-| rocFFT |  ⇒ [1.0.25](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-6.0.0) |
+| hipBLAS |  ⇒ [2.0.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.0.0) |
+| hipCUB |  ⇒ [3.0.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.0.0) |
+| hipFFT |  ⇒ [1.0.13](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.0.0) |
+| hipSOLVER |  ⇒ [2.0.0](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.0.0) |
+| hipSPARSE |  ⇒ [3.0.0](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.0.0) |
+| hipTensor |  ⇒ [1.1.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.0.0) |
+| MIOpen |  ⇒ [2.19.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.0.0) |
+| rccl |  ⇒ [2.15.5](https://github.com/ROCm/rccl/releases/tag/rocm-6.0.0) |
+| rocALUTION |  ⇒ [3.0.3](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.0.0) |
+| rocBLAS |  ⇒ [4.0.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.0.0) |
+| rocFFT |  ⇒ [1.0.25](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.0.0) |
 | ROCgdb | [13.2](https://github.com/ROCm/ROCgdb/releases/tag/rocm-6.0.0) |
-| rocm-cmake |  ⇒ [0.11.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-6.0.0) |
-| rocPRIM |  ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-6.0.0) |
+| rocm-cmake |  ⇒ [0.11.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.0.0) |
+| rocPRIM |  ⇒ [3.0.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.0.0) |
 | rocprofiler | [2.0.0](https://github.com/ROCm/rocprofiler/releases/tag/rocm-6.0.0) |
-| rocRAND |  ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-6.0.0) |
-| rocSOLVER |  ⇒ [3.24.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-6.0.0) |
-| rocSPARSE |  ⇒ [3.0.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-6.0.0) |
-| rocThrust |  ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-6.0.0) |
-| rocWMMA |  ⇒ [1.3.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-6.0.0) |
-| Tensile |  ⇒ [4.39.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-6.0.0) |
+| rocRAND |  ⇒ [2.10.17](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.0.0) |
+| rocSOLVER |  ⇒ [3.24.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.0.0) |
+| rocSPARSE |  ⇒ [3.0.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.0.0) |
+| rocThrust |  ⇒ [3.0.0](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.0.0) |
+| rocWMMA |  ⇒ [1.3.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.0.0) |
+| Tensile |  ⇒ [4.39.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.0.0) |

 #### AMDMIGraphX 2.8

@@ -280,10 +328,7 @@ HIP only supports LUID on Windows OS.
 * HIP complex vector type multiplication and division operations. On AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, `hipFloatComplex` and `hipDoubleComplex` are defined as complex data types: `typedef float2 hipFloatComplex; typedef double2 hipDoubleComplex;` Any application that uses complex multiplication and division operations needs to replace '*' and '/' operators with the following:
  * `hipCmulf()` and `hipCdivf()` for `hipFloatComplex`
  * `hipCmul()` and `hipCdiv()` for `hipDoubleComplex`
-
-  :::{note}
-  These complex operations are equivalent to corresponding types/functions on NVIDIA platform.
-  :::
+Note: These complex operations are equivalent to corresponding types/functions on NVIDIA platform.

 ##### Removals

@@ -299,11 +344,7 @@ HIP only supports LUID on Windows OS.
      * `HIP_ROCclr`
    * NVIDIA platform
      * `HIP_PLATFORM_NVCC`
-<<<<<<< HEAD
-* File directories in the clr repository are removed, for more details see https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/include/hip/hcc_detail and https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/include/hip/nvcc_detail
-=======
 * The `hcc_detail` and `nvcc_detail` directories in the clr repository are removed.
->>>>>>> ebfec1b7 (remove nvcc (#3313))
 * Deprecated gcnArch is removed from hip device struct `hipDeviceProp_t`.
 * Deprecated `enum hipMemoryType memoryType;` is removed from HIP struct `hipPointerAttribute_t` union.

@@ -328,6 +369,25 @@ hipBLAS 2.0.0 for ROCm 6.0.0
 * `hipblasXtrmm` (calculates B <- alpha * op(A) * B) has been replaced with `hipblasXtrmm` (calculates
  C <- alpha * op(A) * B)

+#### hipBLASLt 0.6.0
+
+hipBLASLt 0.6.0 for ROCm 6.0.0
+
+##### Additions
+
+* Added `UserArguments` for `GroupedGemm`
+* Support for datatype: FP16 in with FP32 out
+* New samples
+* Support for datatype: `Int8` in `Int32` out
+* Support for gfx94x platform
+* Support for FP8/BF8 datatype (only for gfx94x platform)
+* Support Scalar A,B,C,D for FP8/BF8 datatype
+
+##### Changes
+
+* Replaced `hipblasDatatype_t` with `hipDataType`
+* Replaced `hipblasLtComputeType_t` with `hipblasComputeType_t`
+* Deprecated `HIPBLASLT_MATMUL_DESC_D_SCALE_VECTOR_POINTER`

 #### hipCUB 3.0.0

@@ -426,15 +486,41 @@ MIOpen 2.19.0 for ROCm 6.0.0

 ##### Fixes

-* 3-D convolution host API bug
-* `[HOTFIX][MI200][FP16]` has been disabled for `ConvHipImplicitGemmBwdXdlops` when FP16_ALT is
+* 3D convolution host API bug
+* `[HOTFIX][MI200][FP16]` has been disabled for `ConvHipImplicitGemmBwdXdlops` when `FP16_ALT` is
  required

-####	MIVisionX
+#### MIVisionX 2.5.0

-* Added Comprehensive CTests to aid developers
-* Introduced Doxygen support for complete API documentation
-* Simplified dependencies for rocAL
+##### Additions
+
+* CTest: Tests for install verification
+* Hardware support updates
+* Doxygen support for API documentation
+
+##### Optimizations
+
+* CMakeList Cleanup
+* Readme
+
+##### Changes
+
+* rocAL: PyBind Link to prebuilt library
+  * PyBind11
+  * RapidJSON
+* Setup Updates
+* RPP - Use package install
+* Dockerfiles: Updates & bugfix
+* CuPy - No longer installed with setup.py
+
+##### Fixes
+
+* rocAL bug fix and updates
+
+##### Known issues
+
+* OpenCV 4.X support for some applications is missing
+* MIVisionX package install requires manual prerequisites installation

 #### OpenMP

@@ -504,7 +590,7 @@ RCCL 2.15.5 for ROCm 6.0.0
 ##### Removals

 * Removed TransferBench from tools as it exists in standalone repo:
-  [https://github.com/ROCmSoftwarePlatform/TransferBench](https://github.com/ROCmSoftwarePlatform/TransferBench)
+  [https://github.com/ROCm/TransferBench](https://github.com/ROCm/TransferBench)

 #### rocALUTION 3.0.3

@@ -512,7 +598,7 @@ rocALUTION 3.0.3 for ROCm 6.0.0

 ##### Additions

-* Support for 64bit integer vectors
+* Support for 64-bit integer vectors
 * Inclusive and exclusive sum functionality for vector classes
 * Transpose functionality for `GlobalMatrix` and `LocalMatrix`
 * `TripleMatrixProduct` functionality for `LocalMatrix`
@@ -551,8 +637,8 @@ rocBLAS 4.0.0 for ROCm 6.0.0
 ##### Additions

 * Beta API `rocblas_gemm_batched_ex3` and `rocblas_gemm_strided_batched_ex3`
-* Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and
-  gemv_strided_batched
+* Input/output type `f16_r`/`bf16_r` and execution type `f32_r` support for Level 2 `gemv_batched` and
+  `gemv_strided_batched`
 * Use of `rocblas_status_excluded_from_build` when calling functions that require Tensile (when using
  rocBLAS built without Tensile)
 * System for asynchronous kernel launches that set a `rocblas_status` failure based on a
@@ -572,7 +658,7 @@ rocBLAS 4.0.0 for ROCm 6.0.0
 * `rocblas_gemm_ext2` API function
 * In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and
  out-of-place trmm
-* int8x4 support is removed (int8 support is unchanged)
+* INT8x4 support is removed (INT8 support is unchanged)
 * `#define __STDC_WANT_IEC_60559_TYPES_EXT__` is removed from `rocblas-types.h` (if you want
  ISO/IEC TS 18661-3:2015 functionality, you must define `__STDC_WANT_IEC_60559_TYPES_EXT__`
  before including `float.h`, `math.h`, and `rocblas.h`)
@@ -608,7 +694,7 @@ rocFFT 1.0.25 for ROCm 6.0.0

  These interfaces are still experimental and subject to change. Your feedback is appreciated.
  You can raise questions and concerns by opening issues in the
-  [rocFFT issue tracker](https://github.com/ROCmSoftwarePlatform/rocFFT/issues).
+  [rocFFT issue tracker](https://github.com/ROCm/rocFFT/issues).

  Note that multi-device FFTs currently have several limitations (we plan to address these in future
  releases):
@@ -629,7 +715,6 @@ rocFFT 1.0.25 for ROCm 6.0.0
 * Built kernels in a solution map to the library kernel cache
 * Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real
  inverse (complex-to-real) input, as this allows for faster performance
-
 * `rocfft-rider` and `dyna-rocfft-rider` have been renamed to `rocfft-bench` and `dyna-rocfft-bench`;
  these are controlled by the `BUILD_CLIENTS_BENCH` CMake option
  * Links for the former file names are installed, and the former `BUILD_CLIENTS_RIDER` CMake option
@@ -801,7 +886,7 @@ rocSPARSE 3.0.2 for ROCm 6.0.0

 * `rocsparse_inverse_permutation`
 * Mixed-precisions for SpVV
-* Uniform int8 precision for gather and scatter
+* Uniform INT8 precision for gather and scatter

 #### rocThrust 3.0.0

@@ -836,7 +921,7 @@ rocWMMA 1.3.0 for ROCm 6.0.0
 ##### Additions

 * Support for gfx942
-* Support for f8, bf8, and xfloat32 data types
+* Support for F8, BF8, and xfloat32 data types
 * support for `HIP_NO_HALF`, `__ HIP_NO_HALF_CONVERSIONS__`, and
    `__ HIP_NO_HALF_OPERATORS__` (e.g., PyTorch environment)

@@ -857,7 +942,7 @@ Tensile 4.39.0 for ROCm 6.0.0

 ##### Additions

-* Added `aquavanjaram` support: gfx942, fp8/bf8 datatype, xf32 datatype, and
+* Added Aqua Vanjaram support: gfx942, FP8/BF8 datatype, xf32 datatype, and
  stochastic rounding for various datatypes
 * Added and updated tuning scripts
 * Added `DirectToLds` support for larger data types with 32-bit global load (old parameter `DirectToLds`
@@ -871,7 +956,7 @@ Tensile 4.39.0 for ROCm 6.0.0
 ##### Optimizations

 * Enabled `InitAccVgprOpt` for `MatrixInstruction` cases
-* Implemented local read related parameter calculations with `DirectToVgpr`
+* Implemented local read-related parameter calculations with `DirectToVgpr`
 * Enabled dedicated vgpr allocation for local read + pack
 * Optimized code initialization
 * Optimized sgpr allocation
@@ -900,7 +985,7 @@ Tensile 4.39.0 for ROCm 6.0.0

 ##### Fixes

-* Predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
+* Predicate ordering for FP16alt impl round near zero mode to unbreak distance modes
 * Boundary check for mirror dims and re-enable disabled mirror dims test cases
 * Merge error affecting i8 with WMMA
 * Mismatch issue with DTLds + TSGR + TailLoop
@@ -996,8 +1081,8 @@ hipBLAS 2.0.0 for ROCm 6.0.0

 ##### Added

- added option to define HIPBLAS_USE_HIP_BFLOAT16 to switch API to use hip_bfloat16 type
- added hipblasGemmExWithFlags API
+- added option to define `HIPBLAS_USE_HIP_BFLOAT16` to switch API to use `hip_bfloat16` type
+- added `hipblasGemmExWithFlags` API

 ##### Deprecated

@@ -1068,7 +1153,7 @@ hipSPARSE 3.0.0 for ROCm 6.0.0

 ##### Changed

- Changed hipsparseSpSV_solve() API function to match cusparse API
+- Changed hipsparseSpSV_solve() API function to match cuSPARSE API
 - Changed generic API functions to use const descriptors
 - Documentation improved

@@ -1138,9 +1223,12 @@ rocBLAS 4.0.0 for ROCm 6.0.0
 ##### Added

 - Addition of beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
- Added input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
- Added rocblas_status_excluded_from_build to be used when calling functions which require Tensile when using rocBLAS built without Tensile
- Added system for async kernel launches setting a failure rocblas_status based on hipPeekAtLastError discrepancy
+- Added input/output type `f16_r`/`bf16_r` and execution type `f32_r` support for Level 2
+  `gemv_batched` and `gemv_strided_batched`
+- Added `rocblas_status_excluded_from_build` to be used when calling functions which require Tensile
+  when using rocBLAS built without Tensile
+- Added system for async kernel launches setting a failure `rocblas_status` based on
+  `hipPeekAtLastError` discrepancy

 ##### Optimized

@@ -1154,7 +1242,7 @@ rocBLAS 4.0.0 for ROCm 6.0.0

 - rocblas_gemm_ext2 API function is removed
 - in-place trmm API from Legacy BLAS is removed. It is replaced by an API that supports both in-place and out-of-place trmm
- int8x4 support is removed. int8 support is unchanged
+- INT8x4 support is removed. INT8 support is unchanged
 - The #define STDC_WANT_IEC_60559_TYPES_EXT has been removed from rocblas-types.h. Users who want ISO/IEC TS 18661-3:2015 functionality must define STDC_WANT_IEC_60559_TYPES_EXT before including float.h, math.h, and rocblas.h
 - The default build removes device code for gfx803 architecture from the fat binary

@@ -1179,7 +1267,7 @@ rocFFT 1.0.25 for ROCm 6.0.0

  `rocfft_field` is a new type that can be added to a plan description, to describe layout of FFT input or output.  `rocfft_field_add_brick` can be called one or more times to describe a brick decomposition of an FFT field, where each brick can be assigned a different device.

-  These interfaces are still experimental and subject to change.  We are interested to hear feedback on them.  Questions and concerns may be raised by opening issues on the [rocFFT issue tracker](https://github.com/ROCmSoftwarePlatform/rocFFT/issues).
+  These interfaces are still experimental and subject to change.  We are interested to hear feedback on them.  Questions and concerns may be raised by opening issues on the [rocFFT issue tracker](https://github.com/ROCm/rocFFT/issues).

  Note that at this time, multi-device FFTs have several limitations:

@@ -1220,8 +1308,8 @@ rocm-cmake 0.11.0 for ROCm 6.0.0

 ##### Fixed

- ROCMClangTidy: Fixed extra make flags passed for clang tidy.
- ROCMTest: Fixed issues when using module in a subdirectory.
+- ROCMClangTidy: Fixed extra make flags passed for Clang-Tidy
+- ROCMTest: Fixed issues when using module in a subdirectory

 #### rocPRIM 3.0.0

@@ -1262,7 +1350,7 @@ rocSPARSE 3.0.2 for ROCm 6.0.0

 - Added rocsparse_inverse_permutation
 - Added mixed precisions for SpVV
- Added uniform int8 precision for Gather and Scatter
+- Added uniform INT8 precision for Gather and Scatter

 ##### Optimized

@@ -1310,7 +1398,7 @@ rocThrust 3.0.0 for ROCm 6.0.0

 ##### Removed

- Removed cub symlink from the root of the repository.
+- Removed CUB symlink from the root of the repository.
 - Removed support for deprecated macros (THRUST_DEVICE_BACKEND and THRUST_HOST_BACKEND).

 ##### Fixed
@@ -1328,13 +1416,13 @@ rocWMMA 1.3.0 for ROCm 6.0.0
 ##### Added

 - Added support for gfx940, gfx941 and gfx942 targets
- Added support for f8, bf8 and xfloat32 datatypes
+- Added support for F8, BF8 and xfloat32 datatypes
 - Added support for HIP_NO_HALF, __ HIP_NO_HALF_CONVERSIONS__ and __ HIP_NO_HALF_OPERATORS__ (e.g. pytorch environment)

 ##### Changed

- rocWMMA with hipRTC now supports bfloat16_t datatype
- gfx11 wmma now uses lane swap instead of broadcast for layout adjustment
+- rocWMMA with hipRTC now supports `bfloat16_t` datatype
+- gfx11 WMMA now uses lane swap instead of broadcast for layout adjustment
 - Updated samples GEMM parameter validation on host arch

 ##### Fixed
@@ -1348,7 +1436,7 @@ Tensile 4.39.0 for ROCm 6.0.0

 ##### Added

- Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
+- Added aquavanjaram support: gfx940/gfx941/gfx942, FP8/BF8 datatype, xf32 datatype, and stochastic rounding for various datatypes
 - Added/updated tuning scripts
 - Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
 - Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
@@ -1387,9 +1475,9 @@ Tensile 4.39.0 for ROCm 6.0.0

 ##### Fixed

- Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
+- Fixed predicate ordering for FP16alt impl round near zero mode to unbreak distance modes
 - Fixed boundary check for mirror dims and re-enable disabled mirror dims test cases
- Fixed merge error affecting i8 with wmma
+- Fixed merge error affecting i8 with WMMA
 - Fixed mismatch issue with DTLds + TSGR + TailLoop
 - Fixed a bug with InitAccVgprOpt + GSU&gt;1 and a mismatch issue with PGR=0
 - Fixed override for unloaded solutions when lazy loading
@@ -1514,7 +1602,7 @@ API, splitting rocRAND and hipRAND into separate packages, and reorganizing our

 #### AMD Instinct™ MI50 end-of-support notice

-AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter
+AMD Instinct MI50, Radeon PRO VII, and Radeon VII products (collectively gfx906 GPUs) will enter
 maintenance mode starting Q3 2023.

 As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the
@@ -1592,7 +1680,7 @@ the GPU in heterogeneous applications. Ideally, developers should treat heteroge
 OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.

 Refer to the documentation on LLVM ASan with the GPU at
-[LLVM AddressSanitizer User Guide](../conceptual/using-gpu-sanitizer.md).
+{doc}`LLVM AddressSanitizer User Guide<rocm:conceptual/using-gpu-sanitizer>`.

 :::{note}
 The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
@@ -1764,7 +1852,7 @@ MIGraphX 2.7 for ROCm 5.7.0

 ##### Removed

- Removed int8x4 rocBlas calls due to deprecation
+- Removed INT8x4 rocBlas calls due to deprecation
 - removed std::reduce usage since not all OS&#39; support it

 #### hipBLAS 1.1.0
@@ -1793,9 +1881,9 @@ hipSPARSE 2.3.8 for ROCm 5.7.0

 ##### Improved

- Fix compilation failures when using cusparse 12.1.0 backend
- Fix compilation failures when using cusparse 12.0.0 backend
- Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
+- Fix compilation failures when using cuSPARSE 12.1.0 backend
+- Fix compilation failures when using cuSPARSE 12.0.0 backend
+- Fix compilation failures when using cuSPARSE 10.1 (non-update versions) as backend
 - Minor improvements

 #### rocALUTION 2.1.11
@@ -2037,7 +2125,7 @@ few examples include:
 ### OS and GPU support changes

 * SLES15 SP5 support was added this release. SLES15 SP3 support was dropped.
-* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs)
+* AMD Instinct MI50, Radeon PRO VII, and Radeon VII products (collectively referred to as gfx906 GPUs)
  will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA
  release date.
  * No new features and performance optimizations will be supported for the gfx906 GPUs beyond
@@ -2068,7 +2156,7 @@ few examples include:
 #### Fixes

 * Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in
-  [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198).
+  [Issue 2198](https://github.com/ROCm/ROCm/issues/2198).

 ### HIP 5.6 (for ROCm 5.6)

@@ -2079,7 +2167,7 @@ few examples include:

 #### Additions

-* Added hipRTC support for amd_hip_fp16
+* Added hipRTC support for `amd_hip_fp16`
 * Added hipStreamGetDevice implementation to get the device associated with the stream
 * Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
 * hipArrayGetInfo for getting information about the specified array
@@ -2295,7 +2383,7 @@ rocBLAS 3.0.0 for ROCm 5.6.0

 ##### Added

- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
+- Added BF16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.

 ##### Deprecated

@@ -2397,7 +2485,7 @@ rocThrust 2.18.0 for ROCm 5.6.0

 ##### Changed

- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
+- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/ROCm/rocm-docs-core).

 #### rocWMMA 1.1.0

@@ -2416,7 +2504,7 @@ rocWMMA 1.1.0 for ROCm 5.6.0
 ##### Changed

 - Default to GPU rocBLAS validation against rocWMMA
- Re-enabled int8 gemm tests on gfx9
+- Re-enabled INT8 gemm tests on gfx9
 - Upgraded to C++17
 - Restructured unit test folder for consistency
 - Consolidated rocWMMA samples common code
@@ -2430,7 +2518,7 @@ Tensile 4.37.0 for ROCm 5.6.0
 - Added user driven tuning API
 - Added decision tree fallback feature
 - Added SingleBuffer + AtomicAdd option for GlobalSplitU
- DirectToVgpr support for fp16 and Int8 with TN orientation
+- DirectToVgpr support for FP16 and Int8 with TN orientation
 - Added new test cases for various functions
 - Added SingleBuffer algorithm for ZGEMM/CGEMM
 - Added joblib for parallel map calls
@@ -2495,7 +2583,7 @@ Windows are described in detail in our
 page and differs from the Linux feature set. Visit
 [Quick Start](https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html#)
 page to get started. Known issues are tracked on
-[GitHub](https://github.com/RadeonOpenCompute/ROCm/issues?q=is%3Aopen+label%3A5.5.1+label%3A%22Verified+Issue%22+label%3AWindows).
+[GitHub](https://github.com/ROCm/ROCm/issues?q=is%3Aopen+label%3A5.5.1+label%3A%22Verified+Issue%22+label%3AWindows).

 #### HIP API change

@@ -2561,8 +2649,8 @@ The following hipcc changes are implemented in this release:
  `hipcc` package for installing `hipcc` binaries in future ROCm releases.

 * In a future ROCm release, the following samples will be removed from the `hip-tests` project.
-  * `hipBusbandWidth` at <https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/samples/1_Utils/shipBusBandwidth>
-  * `hipCommander` at <https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/samples/1_Utils/hipCommander>
+  * `hipBusBandwidth` at <https://github.com/ROCm/hip-tests/tree/752f3295baddeb122aa8dc18d254cb9df6fd619d/samples/1_Utils/hipBusBandwidth>
+  * `hipCommander` at <https://github.com/ROCm/hip-tests/tree/752f3295baddeb122aa8dc18d254cb9df6fd619d/samples/1_Utils/hipCommander>

  Note that the samples will continue to be available in previous release branches.
 * Removal of gcnarch from hipDeviceProp_t structure
@@ -2680,7 +2768,7 @@ The new HIP graph management APIs are as follows:
 This release consists of the following OpenMP enhancements:

 * Additional support for OMPT functions `get_device_time` and `get_record_type`
-* Added support for min/max fast fp atomics on AMD GPUs
+* Added support for min/max fast FP atomics on AMD GPUs
 * Fixed the use of the abs function in C device regions

 ### Deprecations and warnings
@@ -2751,7 +2839,6 @@ included symbolic-link and wrapper header files in its old location for backward
 :::{note}
 ROCm will continue supporting backward compatibility until the next major release.
 :::
-
 ##### Wrapper header files

 Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
@@ -2891,9 +2978,9 @@ MIGraphX 2.5 for ROCm 5.5.0
 ##### Added

 - Y-Model feature to store tuning information with the optimized model
- Added Python 3.10 bindings 
+- Added Python 3.10 bindings
 - Accuracy checker tool based on ONNX Runtime
- ONNX Operators parse_split, and Trilu 
+- ONNX Operators parse_split, and Trilu
 - Build support for ROCm MLIR
 - Added migraphx-driver flag to print optimizations in python (--python)
 - Added JIT implementation of the Gather and Pad operator which results in better handling of larger tensor sizes.
@@ -2907,9 +2994,9 @@ MIGraphX 2.5 for ROCm 5.5.0

 ##### Fixed

- Improved parsing Tensorflow Protobuf files 
+- Improved parsing Tensorflow Protobuf files
 - Resolved various accuracy issues with some onnx models
- Resolved a gcc-12 issue with mivisionx
+- Resolved a gcc-12 issue with MIVisionX
 - Improved support for larger sized models and batches
 - Use --offload-arch instead of --cuda-gpu-arch for the HIP compiler
 - Changes inside JIT to use float accumulator for large reduce ops of half type to avoid overflow.
@@ -3008,7 +3095,7 @@ hipSPARSE 2.3.5 for ROCm 5.5.0
 ##### Improved

 - Fixed an issue, where the rocm folder was not removed on upgrade of meta packages
- Fixed a compilation issue with cusparse backend
+- Fixed a compilation issue with cuSPARSE backend
 - Added more detailed messages on unit test failures due to missing input data
 - Improved documentation
 - Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis)
@@ -3055,7 +3142,7 @@ RCCL 2.15.5 for ROCm 5.5.0

 ##### Removed

- Removed TransferBench from tools.  Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
+- Removed TransferBench from tools.  Exists in standalone repo: https://github.com/ROCm/TransferBench

 #### rocALUTION 2.1.8

@@ -3758,7 +3845,7 @@ For more information, refer to the HIP API Guide.

 With ROCm v5.4, a separate GitHub project is created at

-<https://github.com/ROCm-Developer-Tools/hip-tests>
+<https://github.com/ROCm/hip-tests>

 This contains HIP catch2 tests and samples, and new tests will continue to be developed.

@@ -3966,7 +4053,7 @@ hipBLAS 0.53.0 for ROCm 5.4.0

 ##### Added

- Allow for selection of int8 datatype
+- Allow for selection of INT8 datatype
 - Added support for hipblasXgels and hipblasXgelsStridedBatched operations (with s,d,c,z precisions),
  only supported with rocBLAS backend
 - Added support for hipblasXgelsBatched operations (with s,d,c,z precisions)
@@ -4020,7 +4107,7 @@ hipSPARSE 2.3.3 for ROCm 5.4.0

 ##### Changed

- HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cusparse
+- HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cuSPARSE

 #### rccl 2.13.4

@@ -4078,7 +4165,7 @@ rocBLAS 2.46.0 for ROCm 5.4.0

 - Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values.  With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer.   This improves consistency with legacy BLAS behaviour.
 - Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm
- Allow hipBLAS to select int8 datatype
+- Allow hipBLAS to select INT8 datatype
 - Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace

 ##### Fixed
@@ -4515,7 +4602,7 @@ The compiler fix consists of the following patches:
 * The application of this attribute was refined such that it was not added to a specific compiler built-in
  where the compiler knows that inactive lanes do not impact program execution. For more
  information, see
-  <https://github.com/RadeonOpenCompute/llvm-project/commit/accf36c58409268ca1f216cdf5ad812ba97ceccd>.
+  <https://github.com/ROCm/llvm-project/commit/accf36c58409268ca1f216cdf5ad812ba97ceccd>.

 ### Known issues

@@ -4732,7 +4819,8 @@ rocBLAS 2.45.0 for ROCm 5.3.0

 ##### Removed

- install.sh options  --hip-clang , --no-hip-clang, --merge-files, --no-merge-files are removed.
+- `install.sh` options `--hip-clang`, `--no-hip-clang`, `--merge-files`, and `--no-merge-files` were
+  removed

 #### rocFFT 1.0.18

@@ -4956,7 +5044,7 @@ debugger deployment and management tools
 No notable changes in this release for deployment and management tools.

 For release information for older ROCm releases, refer to
-<https://github.com/RadeonOpenCompute/ROCm/blob/master/CHANGELOG.md>
+<https://github.com/ROCm/ROCm/blob/master/CHANGELOG.md>

 ### Library changes in ROCm 5.2.3

@@ -5053,7 +5141,7 @@ and can grow until the available free memory on the device is consumed.
 The test codes at the following link show how to implement applications using malloc and free
 functions in device kernels:

-<https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp>
+<https://github.com/ROCm/HIP/blob/d6224a55390bf2d8fd0180c21bae44f0d718d1eb/tests/src/deviceLib/hipDeviceMalloc.cpp>

 ##### New HIP APIs in this release

@@ -5336,7 +5424,7 @@ illustrate example usages of the C++ API. GEMM matrix multiplication is used as
 given the heavy precedent for the library. However, the usage portfolio is growing significantly and
 demonstrates different ways rocWMMA may be consumed.

-For more information, refer to [Communication Libraries](../reference/library-index.md)
+For more information, refer to [Communication libraries](../reference/api-libraries.md).

 #### OpenMP enhancements in this release

@@ -5571,7 +5659,7 @@ ambiguous kernel execution.
  `noundef` attribute, if it finds that argument is tagged with shuffle attribute. Refer to
  <https://reviews.llvm.org/D125378> for more information.

-* Introduce clang builtin for `__shfl` to identify it and skip adding `noundef` attribute.
+* Introduce Clang builtin for `__shfl` to identify it and skip adding `noundef` attribute.

 * Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header
  need to insert freezes on the relevant inputs.
@@ -5811,25 +5899,25 @@ rocWMMA 0.7 for ROCm 5.2.0
 - Added GEMM sample
 - Added DLRM sample
 - Added SGEMV sample
- Added unit tests for cooperative wmma load and stores
+- Added unit tests for cooperative WMMA load and stores
 - Added unit tests for IOBarrier.h
- Added wmma load/ store  tests for different matrix types (A, B and Accumulator)
+- Added WMMA load/store tests for different matrix types (A, B and Accumulator)
 - Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest
 - Added block sizes 4, 8 to test MmaSynMultiLdsTest
- Added support for wmma load / store layouts with block dimension greater than 64
- Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types
+- Added support for WMMA load/store layouts with block dimension greater than 64
+- Added IOShape structure to define the attributes of mapping and layouts for all WMMA matrix types
 - Added CI testing for rocWMMA

 ##### Changed

- Renamed wmma to rocwmma in cmake, header files and documentation
+- Renamed WMMA to rocWMMA in cmake, header files and documentation
 - Renamed library files
 - Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset)
 - Opaque load/store continue to use incremental offsets as they fill the entire block
 - Cooperative load/store use cumulative offsets as they fill only small portions for the entire block
 - Increased Max split counts to 64 for cooperative load/store
- Moved all the wmma definitions, API headers to rocwmma namespace
- Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator)
+- Moved all the WMMA definitions, API headers to `rocwmma` namespace
+- Modified WMMA fill unit tests to validate all matrix types (A, B, Accumulator)

 #### Tensile 4.33.0

@@ -5996,7 +6084,7 @@ new inferior.

 #### MIOpen support for RDNA GPUs

-This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and
+This release includes support for AMD Radeon PRO W6800, in addition to other bug fixes and
 performance improvements as listed below:

 * MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498)
@@ -6046,7 +6134,7 @@ in this release.

 For more information, refer to the following websites:

-* <https://github.com/RadeonOpenCompute/criu/blob/amdgpu_plugin-03252022/Documentation/amdgpu_plugin.txt>
+* <https://github.com/ROCm/criu/blob/amdgpu_plugin-03252022/Documentation/amdgpu_plugin.txt>

 * <https://criu.org/Main_Page>

@@ -6103,9 +6191,7 @@ debug information.
 **Issue:** Random memory access fault issues are observed while running Math libraries unit tests.
 This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.

-:::{note}
-The faults only occur in the SRIOV environment.
-:::
+Note, the faults only occur in the SRIOV environment.

 **Workaround:** Use SDMA to update the page table. The Guest set up steps are as follows:

@@ -6126,7 +6212,7 @@ Where expectation is 0.
 #### CU masking causes application to freeze

 Using CU Masking results in an application freeze or runs exceptionally slowly. This issue is noticed
-only in the GFX10 suite of products. Note that this issue is observed only in GFX10 suite of products.
+only in the GFX10 suite of products.

 This issue is under active investigation at this time.

@@ -6144,7 +6230,7 @@ As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-
 Workloads that use the cooperative groups function to ensure all waves can be resident at the same
 time may fail to restore correctly. This issue is under investigation and will be fixed in a future release.

-#### Radeon Pro V620 and W6800 workstation GPUs
+#### Radeon PRO V620 and W6800 workstation GPUs

 ##### No support for ROCDebugger on SRIOV

@@ -6211,7 +6297,7 @@ hipCUB 2.11.0 for ROCm 5.1.0
 ##### Added

 - Device segmented sort
- Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB
+- Warp merge sort, WarpMask and thread sort from CUB 1.15.0 supported in hipCUB
 - Device three way partition

 ##### Changed
@@ -6376,7 +6462,7 @@ rocFFT 1.0.16 for ROCm 5.1.0

 ##### Removed

- The hipFFT API (header) has been removed from after a long deprecation period.  Please use the [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package/repository to obtain the hipFFT API.
+- The hipFFT API (header) has been removed after a long deprecation period.  Please use the [hipFFT](https://github.com/ROCm/hipFFT) package/repository to obtain the hipFFT API.

 #### rocPRIM 2.10.13

@@ -6414,7 +6500,7 @@ rocRAND 2.10.13 for ROCm 5.1.0

 ##### Changed

- [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND.git) split into a separate package
+- [hipRAND](https://github.com/ROCm/hipRAND.git) split into a separate package
 - Header file installation location changed to match other libraries.
  - Using the `rocrand.h` header file should now use `#include <rocrand/rocrand.h>`, rather than `#include <rocrand.h>`
 - rocRAND still includes hipRAND using a submodule
@@ -6531,7 +6617,7 @@ This fix may lead to breakage in some OpenMP offload use cases, which use print
 and result in an abort in device code. The issue will be fixed in a future release.
 :::

-The compatibility matrix in the [Deep-learning guide](../how-to/deep-learning-rocm.md) is updated for
+The compatibility matrix in the {doc}`Deep-learning guide<rocm:how-to/deep-learning-rocm>` is updated for
 ROCm v5.0.2.

 ### Library changes in ROCm 5.0.2
@@ -6644,7 +6730,7 @@ Refer to the HIP API documentation for more details on managed memory APIs.

 For the application, see

-<https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-4.5.x/tests/src/runtimeApi/memory/hipMallocManaged.cpp>
+<https://github.com/ROCm/HIP/blob/rocm-4.5.x/tests/src/runtimeApi/memory/hipMallocManaged.cpp>

 #### New environment variable

@@ -6927,7 +7013,7 @@ follows:

 5. Users can find the values of the collected counters in the output file generated in step 2.

-#### Radeon Pro V620 and W6800 workstation GPUs
+#### Radeon PRO V620 and W6800 workstation GPUs

 ##### No support for SMI and ROCDebugger on SRIOV

@@ -6991,15 +7077,13 @@ In this release, arithmetic operators of HIP complex and vector types are deprec
  `std::complex` types.

 * As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native
-  clang vector type associated with the data member of HIP vector types.
+  Clang vector type associated with the data member of HIP vector types.

 During the deprecation, two macros `_HIP_ENABLE_COMPLEX_OPERATORS` and
 `_HIP_ENABLE_VECTOR_OPERATORS` are provided to allow users to conditionally enable arithmetic
 operators of HIP complex or vector types.

-:::{note}
-The two macros are mutually exclusive and, by default, set to Off.
-:::
+Note, the two macros are mutually exclusive and, by default, set to Off.

 The arithmetic operators of HIP complex and vector types will be removed in a future release.

@@ -7062,7 +7146,7 @@ hipCUB 2.10.13 for ROCm 5.0.0

 ##### Fixed

- Added missing includes to hipcub.hpp
+- Added missing includes to `hipcub.hpp`

 ##### Added

--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -8,7 +8,7 @@

 AMD values and encourages contributions to our code and documentation. If you want to contribute
 to our ROCm repositories, first review the following guidance. For documentation-specific information,
-see [Contributing to ROCm docs](https://rocm.docs.amd.com/en/latest/contribute/contribute-docs.html).
+see [Contributing to ROCm docs](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).

 ROCm is a software stack made up of a collection of drivers, development tools, and APIs that enable
 GPU programming from low-level kernel to end-user applications. Because some of our components
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance
 artificial intelligence (AI), scientific computing, and computer aided design (CAD).

 ROCm is powered by AMD’s
-[Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm-Developer-Tools/HIP),
+[Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm/HIP),
 an open-source software C++ GPU programming environment and its corresponding runtime. HIP
 allows ROCm developers to create portable applications on different platforms by deploying code on a
 range of platforms, from dedicated gaming GPUs to exascale HPC clusters.
@@ -21,10 +21,11 @@ source software compilers, debuggers, and libraries. ROCm is fully integrated in

 ## ROCm documentation

-This repository contains the manifest file for ROCm releases, changelogs, and release information.
+This repository contains the [manifest file](https://gerrit.googlesource.com/git-repo/+/HEAD/docs/manifest-format.md)
+for ROCm releases, changelogs, and release information.

 The `default.xml` file contains information for all repositories and the associated commit used to build
-the current ROCm release; `default.xml` uses the Manifest Format repository.
+the current ROCm release; `default.xml` uses the [Manifest Format repository](https://gerrit.googlesource.com/git-repo/).

 Source code for our documentation is located in the `/docs` folder of most ROCm repositories. The
 `develop` branch of our repositories contains content for the next ROCm release.
@@ -34,7 +35,7 @@ The ROCm documentation homepage is [rocm.docs.amd.com](https://rocm.docs.amd.com
 ### Building our documentation

 For a quick-start build, use the following code. For more options and detail, refer to
-[Building documentation](./contribute/building.md).
+[Building documentation](./docs/contribute/building.md).

 ```bash
 cd docs
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,248 +1,54 @@
-# Release notes for AMD ROCm™ 6.0
-
-ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library
-support, and improved developer experience. This includes initial enablement of the AMD Instinct™
-MI300 series. Future releases will further enable and optimize this new platform. Key features include:
-
-* Improved performance in areas like lower precision math and attention layers.
-* New hipSPARSELt library accelerates AI workloads via AMD's sparse matrix core technique.
-* Upstream support is now available for popular AI frameworks like TensorFlow, JAX, and PyTorch.
-* New support for libraries, such as DeepSpeed, ONNX-RT, and CuPy.
-* Prepackaged HPC and AI containers on AMD Infinity Hub, with improved documentation and
-  tutorials on the [AMD ROCm Docs](https://rocm.docs.amd.com) site.
-* Consolidated developer resources and training on the new
-  [AMD ROCm Developer Hub](https://www.amd.com/en/developer/resources/rocm-hub.html).
-
-The following section provide a release overview for ROCm 6.0. For additional details, you can refer to
-the [Changelog](https://rocm.docs.amd.com/en/develop/about/CHANGELOG.html). We list known
-issues on [GitHub](https://github.com/ROCm/ROCm/issues).
-
-## OS and GPU support changes
-
-ROCm 6.0 enables the use of MI300A and MI300X Accelerators with a limited operating systems
-support. Future releases will add additional OS's to match our general offering.
-
-| Operating Systems | MI300A | MI300X |
-|:---:|:---:|:---:|
-| Ubuntu 22.04.3 | Supported | Supported |
-| RHEL 8.9 | Supported |  |
-| SLES15 SP5 | Supported |  |
-
-For older generations of supported Instinct products we've added the following operating systems:
-
-* RHEL 9.3
-* RHEL 8.9
-
-Note: For ROCm 6.2 and beyond, we've planned for end-of-support (EoS) for the following operating
-systems:
-
-* Ubuntu 20.04.5
-* SLES 15 SP4
-* RHEL/CentOS 7.9
-
-## New ROCm meta package
-
-We've added a new ROCm meta package for easy installation of all ROCm core packages, tools, and
-libraries. For example, the following command will install the full ROCm package: `apt-get install rocm`
-(Ubuntu), or `yum install rocm` (RHEL).
-
-## Filesystem Hierarchy Standard
-
-ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We've removed
-the backward compatibility support for old file locations.
-
-## Compiler location change
-
-* The installation path of LLVM has been changed from `/opt/rocm-<rel>/llvm` to
-  `/opt/rocm-<rel>/lib/llvm`. For backward compatibility, a symbolic link is provided to the old
-  location and will be removed in a future release.
-* The installation path of the device library bitcode has changed from `/opt/rocm-<rel>/amdgcn` to
-  `/opt/rocm-<rel>/lib/llvm/lib/clang/<ver>/lib/amdgcn`. For backward compatibility, a symbolic link
-  is provided and will be removed in a future release.
-
-## Documentation
-
-CMake support has been added for documentation in the
-[ROCm repository](https://github.com/RadeonOpenCompute/ROCm).
-
-## AMD Instinct™ MI50 end-of-support notice
-
-AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enters
-maintenance mode in ROCm 6.0.
-
-As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 was the
-final release for gfx906 GPUs in a fully supported state.
-
-  * Henceforth, no new features and performance optimizations will be supported for the gfx906 GPUs.
-  * Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2
-    2024 (end of maintenance \[EOM] will be aligned with the closest ROCm release).
-  * Bug fixes will be made up to the next ROCm point release.
-  * Bug fixes will not be backported to older ROCm releases for gfx906.
-  * Distribution and operating system updates will continue per the ROCm release cadence for gfx906
-    GPUs until EOM.
-
-## ROCm projects
-
-The following sections contains project-specific release notes for ROCm 6.0. For additional details, you
-can refer to the [Changelog](https://rocm.docs.amd.com/en/develop/about/CHANGELOG.html).
-
-### AMD SMI
-
-* **Integrated the E-SMI (EPYC-SMI) library**.
-    You can now query CPU-related information directly through AMD SMI. Metrics include power,
-    energy, performance, and other system details.
-
-* **Added support for gfx942 metrics**.
-    You can now query MI300 device metrics to get real-time information. Metrics include power,
-    temperature, energy, and performance.
-
-### HIP
-
-* **New features to improve resource interoperability**.
-  * For external resource interoperability, we've added new structs and enums.
-  * We've added new members to HIP struct `hipDeviceProp_t` for surfaces, textures, and device
-    identifiers.
-
-* **Changes impacting backward compatibility**.
-    There are several changes impacting backward compatibility: we changed some struct members and
-    some enum values, and removed some deprecated flags. For additional information, please refer to
-    the Changelog.
-
-### hipCUB
-
-* **Additional CUB API support**.
-    The hipCUB backend is updated to CUB and Thrust 2.1.
-
-### HIPIFY
-
-* **Enhanced CUDA2HIP document generation**.
-    API versions are now listed in the CUDA2HIP documentation. To see if the application binary
-    interface (ABI) has changed, refer to the
-    [*C* column](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Runtime_API_functions_supported_by_HIP.html)
-    in our API documentation.
-
-* **Hipified rocSPARSE**.
-    We've implemented support for the direct hipification of additional cuSPARSE APIs into rocSPARSE
-    APIs under the `--roc` option. This covers a major milestone in the roadmap towards complete
-    cuSPARSE-to-rocSPARSE hipification.
-
-### hipRAND
-
-* **Official release**.
-    hipRAND is now a *standalone project*--it's no longer available as a submodule for rocRAND.
-
-### hipTensor
-
-* **Added architecture support**.
-    We've added contraction support for gfx942 architectures, and f32 and f64 data
-    types.
-
-* **Upgraded testing infrastructure**.
-    hipTensor will now support dynamic parameter configuration with input YAML config.
-
-### MIGraphX
-
-* **Added TorchMIGraphX**.
-    We introduced a Dynamo backend for Torch, which allows PyTorch to use MIGraphX directly
-    without first requiring a model to be converted to the ONNX model format. With a single line of
-    code, PyTorch users can utilize the performance and quantization benefits provided by MIGraphX.
-
-* **Boosted overall performance with rocMLIR**.
-    We've integrated the rocMLIR library for ROCm-supported RDNA and CDNA GPUs. This
-    technology provides MLIR-based convolution and GEMM kernel generation.
-
-* **Added INT8 support across the MIGraphX portfolio**.
-    We now support the INT8 data type. MIGraphX can perform the quantization or ingest
-    prequantized models. INT8 support extends to the MIGraphX execution provider for ONNX Runtime.
-
-### ROCgdb
-
-* **Added support for additional GPU architectures**.
-  * Navi 3 series: gfx1100, gfx1101, and gfx1102.
-  * MI300 series: gfx942.
-
-### rocm-smi-lib
-
-* **Improved accessibility to GPU partition nodes**.
-    You can now view, set, and reset the compute and memory partitions. You'll also get notifications of
-    a GPU busy state, which helps you avoid partition set or reset failure.
-
-* **Upgraded GPU metrics version 1.4**.
-    The upgraded GPU metrics binary has an improved metric version format with a content version
-    appended to it. You can read each metric within the binary without the full `rsmi_gpu_metric_t` data
-    structure.
-
-* **Updated GPU index sorting**.
-    We made GPU index sorting consistent with other ROCm software tools by optimizing it to use
-    `Bus:Device.Function` (BDF) instead of the card number.
-
-### ROCm Compiler
-
-* **Added kernel argument optimization on gfx942**.
-    With the new feature, you can preload kernel arguments into Scalar General-Purpose Registers
-    (SGPRs) rather than pass them in memory. This feature is enabled with a compiler option, which also
-    controls the number of arguments to pass in SGPRs. For more information, see:
-    [https://llvm.org/docs/AMDGPUUsage.html#preloaded-kernel-arguments](https://llvm.org/docs/AMDGPUUsage.html#preloaded-kernel-arguments)
-
-* **Improved register allocation at -O0**.
-    We've improved the register allocator used at -O0 to avoid compiler crashes (when the signature is
-    'ran out of registers during register allocation').
-
-* **Improved generation of debug information**.
-    We've improved compile time when generating debug information for certain corner cases. We've
-    also improved the compiler to eliminate compiler crashes when generating debug information.
-
-### ROCmValidationSuite
-
-* **Added GPU and operating system support**.
-    We added support for MI300X GPU in GPU Stress Test (GST).
-
-### Roc Profiler
-
-* **Added option to specify desired Roc Profiler version**.
-    You can now use rocProfV1 or rocProfV2 by specifying your desired version, as the legacy rocProf
-    (`rocprofv1`) provides the option to use the latest version (`rocprofv2`).
-
-* **Automated the ISA dumping process by Advance Thread Tracer**.
-    Advance Thread Tracer (ATT) no longer depends on user-supplied Instruction Set Architecture (ISA)
-    and compilation process (using ``hipcc --save-temps``) to dump ISA from the running kernels.
-
-* **Added ATT support for parallel kernels**.
-    The automatic ISA dumping process also helps ATT successfully parse multiple kernels running in
-    parallel, and provide cycle-accurate occupancy information for multiple kernels at the same time.
-
-### ROCr
-
-* **Support for SDMA link aggregation**.
-    If multiple XGMI links are available when making SDMA copies between GPUs, the copy is
-    distributed over multiple links to increase peak bandwidth.
-
-### rocThrust
-
-* **Added Thrust 2.1 API support**.
-    rocThrust backend is updated to Thrust and CUB 2.1.
-
-### rocWMMA
-
-* **Added new architecture support**.
-    We added support for gfx942 architectures.
-
-* **Added data type support**.
-    We added support for f8, bf8, xf32 data types on supporting architectures, and for bf16 in the HIP RTC
-    environment.
-
-* **Added support for the PyTorch kernel plugin**.
-    We added awareness of `__HIP_NO_HALF_CONVERSIONS__` to support PyTorch users.
-
-### TransferBench (beta)
-
-* **Improved ordering control**.
-    You can now set the thread block size (`BLOCK_SIZE`) and the thread block order (`BLOCK_ORDER`)
-    in which thread blocks from different transfers are run when using a single stream.
-
-* **Added comprehensive reports**.
-    We modified individual transfers to report X Compute Clusters (XCC) ID when `SHOW_ITERATIONS`
-    is set to 1.
-
-* **Improved accuracy in result validation**.
-    You can now validate results for each iteration instead of just once for all iterations.
+# Release notes
+<!-- Disable lints since this is an auto-generated file.    -->
+<!-- markdownlint-disable blanks-around-headers             -->
+<!-- markdownlint-disable no-duplicate-header               -->
+<!-- markdownlint-disable no-blanks-blockquote              -->
+<!-- markdownlint-disable ul-indent                         -->
+<!-- markdownlint-disable no-trailing-spaces                -->
+
+<!-- spellcheck-disable -->
+
+This page contains the release notes for AMD ROCm Software.
+
+-------------------
+
+## ROCm 6.0.2
+
+The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings. 
+
+### Library changes in ROCm 6.0.2
+
+| Library | Version |
+|---------|---------|
+| AMDMIGraphX |  ⇒ [2.8](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.0.2) |
+| hipBLAS |  ⇒ [2.0.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.0.2) |
+| hipBLASLt |  ⇒ [0.6.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.0.2) |
+| hipCUB |  ⇒ [3.0.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.0.2) |
+| hipFFT |  ⇒ [1.0.13](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.0.2) |
+| hipRAND |  ⇒ [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.0.2) |
+| hipSOLVER |  ⇒ [2.0.0](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.0.2) |
+| hipSPARSE |  ⇒ [3.0.0](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.0.2) |
+| hipSPARSELt |  ⇒ [0.1.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.0.2) |
+| hipTensor |  ⇒ [1.1.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.0.2) |
+| MIOpen |  ⇒ [2.19.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.0.2) |
+| rccl |  ⇒ [2.15.5](https://github.com/ROCm/rccl/releases/tag/rocm-6.0.2) |
+| rocALUTION |  ⇒ [3.0.3](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.0.2) |
+| rocBLAS |  ⇒ [4.0.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.0.2) |
+| rocFFT |  ⇒ [1.0.25](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.0.2) |
+| rocm-cmake |  ⇒ [0.11.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.0.2) |
+| rocPRIM |  ⇒ [3.0.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.0.2) |
+| rocRAND |  ⇒ [3.0.0](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.0.2) |
+| rocSOLVER |  ⇒ [3.24.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.0.2) |
+| rocSPARSE |  ⇒ [3.0.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.0.2) |
+| rocThrust |  ⇒ [3.0.0](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.0.2) |
+| rocWMMA |  ⇒ [1.3.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.0.2) |
+| Tensile |  ⇒ [4.39.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.0.2) |
+
+#### hipFFT 1.0.13
+
+hipFFT 1.0.13 for ROCm 6.0.2
+
+##### Changes
+
+* Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files
+ over (this should help simplify downstream builds and packaging)
--- a/cmake/Modules/Dependencies.cmake
+++ b/cmake/Modules/Dependencies.cmake
@@ -35,7 +35,7 @@ if(BUILD_DOCS)
      CACHE STRING "rocm-cmake tag to download")
    FetchContent_Declare(
      rocm-cmake
-      GIT_REPOSITORY https://github.com/RadeonOpenCompute/rocm-cmake.git
+      GIT_REPOSITORY https://github.com/ROCm/rocm-cmake.git
      GIT_TAG        ${rocm_cmake_tag}
      SOURCE_SUBDIR "DISABLE ADDING TO BUILD" # We don't really want to consume the build and test targets of ROCm CMake.
    )
--- a/default.xml
+++ b/default.xml
@@ -1,13 +1,8 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <manifest>
    <remote name="rocm-org" fetch="https://github.com/ROCm/" />
-    <remote name="roc-github" fetch="https://github.com/RadeonOpenCompute/" />
-    <remote name="rocm-devtools" fetch="https://github.com/ROCm-Developer-Tools/" />
-    <remote name="rocm-swplat" fetch="https://github.com/ROCmSoftwarePlatform/" />
-    <remote name="gpuopen-libs" fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
-    <remote name="gpuopen-tools" fetch="https://github.com/GPUOpen-Tools/" />
    <remote name="KhronosGroup" fetch="https://github.com/KhronosGroup/" />
-    <default revision="refs/tags/rocm-6.0.0"
+    <default revision="refs/tags/rocm-6.0.2"
     remote="rocm-org"
     sync-c="true"
     sync-j="4" />
--- a/docs/about/compatibility/data-type-support.rst
+++ b/docs/about/compatibility/data-type-support.rst
@@ -0,0 +1,564 @@
+.. meta::
+  :description: Supported data types in ROCm
+  :keywords: int8, float8, float8 (E4M3), float8 (E5M2), bfloat8, float16, half, bfloat16, tensorfloat32, float, float32, float64, double, AMD, ROCm, AMDGPU
+
+.. _rocm-supported-data-types:
+
+*************************************************************
+ROCm data type specifications
+*************************************************************
+
+Integral types
+==========================================
+
+The signed and unsigned integral types that are supported by ROCm™ are listed in the following table,
+together with their corresponding HIP type and a short description.
+
+
+.. list-table::
+    :header-rows: 1
+    :widths: 15,35,50
+
+    *
+      - Type name
+      - HIP type
+      - Description
+    *
+      - int8
+      - ``int8_t``, ``uint8_t``
+      - A signed or unsigned 8-bit integer
+    *
+      - int16
+      - ``int16_t``, ``uint16_t``
+      - A signed or unsigned 16-bit integer
+    *
+      - int32
+      - ``int32_t``, ``uint32_t``
+      - A signed or unsigned 32-bit integer
+    *
+      - int64
+      - ``int64_t``, ``uint64_t``
+      - A signed or unsigned 64-bit integer
+
+Floating-point types
+==========================================
+
+The floating-point types that are supported by ROCm are listed in the following table, together with
+their corresponding HIP type and a short description.
+
+.. image:: ../../data/about/compatibility/floating-point-data-types.png
+    :alt: Supported floating-point types
+
+.. list-table::
+    :header-rows: 1
+    :widths: 15,15,70
+
+    *
+      - Type name
+      - HIP type
+      - Description
+    *
+      - float8 (E4M3)
+      - ``-``
+      - An 8-bit floating-point number that mostly follows IEEE-754 conventions and the **S1E4M3** bit layout, as described in `8-bit Numerical Formats for Deep Neural Networks <https://arxiv.org/abs/2206.02915>`_, with expanded range and with no infinity or signed zero. NaN is represented as negative zero.
+    *
+      - float8 (E5M2)
+      - ``-``
+      - An 8-bit floating-point number that mostly follows IEEE-754 conventions and the **S1E5M2** bit layout, as described in `8-bit Numerical Formats for Deep Neural Networks <https://arxiv.org/abs/2206.02915>`_, with expanded range and with no infinity or signed zero. NaN is represented as negative zero.
+    *
+      - float16
+      - ``half``
+      - A 16-bit floating-point number that conforms to the IEEE 754-2008 half-precision storage format.
+    *
+      - bfloat16
+      - ``bfloat16``
+      - A shortened 16-bit version of the IEEE 754 single-precision storage format.
+    *
+      - tensorfloat32
+      - ``-``
+      - A floating-point number that occupies 32 bits or less of storage, providing improved range compared to half (16-bit) format, at (potentially) greater throughput than single-precision (32-bit) formats.
+    *
+      - float32
+      - ``float``
+      - A 32-bit floating-point number that conforms to the IEEE 754 single-precision storage format.
+    *
+      - float64
+      - ``double``
+      - A 64-bit floating-point number that conforms to the IEEE 754 double-precision storage format.
+
+.. note::
+
+  * The float8 and tensorfloat32 types are internal types used in calculations in Matrix Cores and can be stored in any type of the same size.
+  * The encodings for FP8 (E5M2) and FP8 (E4M3) that are natively supported by MI300 differ from the FP8 (E5M2) and FP8 (E4M3) encodings used in H100 (`FP8 Formats for Deep Learning <https://arxiv.org/abs/2209.05433>`_).
+  * In some AMD documents and articles, float8 (E5M2) is referred to as bfloat8.
+
+ROCm support icons
+==========================================
+
+In the following sections, we use icons to represent the level of support. These icons, described in the
+following table, are also used on the library data type support pages.
+
+.. list-table::
+    :header-rows: 1
+
+    *
+      - Icon
+      - Definition
+    *
+      - ❌
+      - Not supported
+
+    *
+      - ⚠️
+      - Partial support
+
+    *
+      - ✅
+      - Full support
+
+.. note::
+
+  * Full support means that the type is supported natively or with hardware emulation.
+  * Native support means that the operations for that type are implemented in hardware. Types that are not natively supported are emulated with the available hardware. The performance of non-natively supported types can differ from the full instruction throughput rate. For example, 16-bit integer operations can be performed on the 32-bit integer ALUs at full rate; however, 64-bit integer operations might need several instructions on the 32-bit integer ALUs.
+  * Any type can be emulated by software, but this page does not cover such cases.
+
+Hardware type support
+==========================================
+
+AMD GPU hardware support for data types is listed in the following tables.
+
+Compute units support
+-------------------------------------------------------------------------------
+
+The following table lists data type support for compute units.
+
+.. tab-set::
+
+  .. tab-item:: Integral types
+    :sync: integral-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Type name
+        - int8
+        - int16
+        - int32
+        - int64
+      *
+        - MI100
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+      *
+        - MI200 series
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+      *
+        - MI300 series
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+
+  .. tab-item:: Floating-point types
+    :sync: floating-point-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Type name
+        - float8 (E4M3)
+        - float8 (E5M2)
+        - float16
+        - bfloat16
+        - tensorfloat32
+        - float32
+        - float64
+      *
+        - MI100
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+        - ❌
+        - ✅
+        - ✅
+      *
+        - MI200 series
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+        - ❌
+        - ✅
+        - ✅
+      *
+        - MI300 series
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+        - ❌
+        - ✅
+        - ✅
+
+Matrix core support
+-------------------------------------------------------------------------------
+
+The following table lists data type support for AMD GPU matrix cores.
+
+.. tab-set::
+
+  .. tab-item:: Integral types
+    :sync: integral-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Type name
+        - int8
+        - int16
+        - int32
+        - int64
+      *
+        - MI100
+        - ✅
+        - ❌
+        - ❌
+        - ❌
+      *
+        - MI200 series
+        - ✅
+        - ❌
+        - ❌
+        - ❌
+      *
+        - MI300 series
+        - ✅
+        - ❌
+        - ❌
+        - ❌
+
+  .. tab-item:: Floating-point types
+    :sync: floating-point-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Type name
+        - float8 (E4M3)
+        - float8 (E5M2)
+        - float16
+        - bfloat16
+        - tensorfloat32
+        - float32
+        - float64
+      *
+        - MI100
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+        - ❌
+        - ✅
+        - ❌
+      *
+        - MI200 series
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+        - ❌
+        - ✅
+        - ✅
+      *
+        - MI300 series
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+        - ✅
+
+Atomic operations support
+-------------------------------------------------------------------------------
+
+The following table lists data type support for atomic operations.
+
+.. tab-set::
+
+  .. tab-item:: Integral types
+    :sync: integral-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Type name
+        - int8
+        - int16
+        - int32
+        - int64
+      *
+        - MI100
+        - ❌
+        - ❌
+        - ✅
+        - ❌
+      *
+        - MI200 series
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+      *
+        - MI300 series
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+
+  .. tab-item:: Floating-point types
+    :sync: floating-point-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Type name
+        - float8 (E4M3)
+        - float8 (E5M2)
+        - float16
+        - bfloat16
+        - tensorfloat32
+        - float32
+        - float64
+      *
+        - MI100
+        - ❌
+        - ❌
+        - ✅
+        - ❌
+        - ❌
+        - ✅
+        - ❌
+      *
+        - MI200 series
+        - ❌
+        - ❌
+        - ✅
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+      *
+        - MI300 series
+        - ❌
+        - ❌
+        - ✅
+        - ❌
+        - ❌
+        - ✅
+        - ✅
+
+.. note::
+
+  For cases that are not natively supported, you can emulate atomic operations using software.
+  Software-emulated atomic operations have a high negative performance impact when they frequently
+  access the same memory address.
+
+Data type support in ROCm libraries
+==========================================
+
+ROCm library support for int8, float8 (E4M3), float8 (E5M2), int16, float16, bfloat16, int32,
+tensorfloat32, float32, int64, and float64 is listed in the following tables.
+
+Libraries input/output type support
+-------------------------------------------------------------------------------
+
+The following tables list ROCm library support for specific input and output data types. For a detailed
+description, refer to the corresponding library data type support page.
+
+.. tab-set::
+
+  .. tab-item:: Integral types
+    :sync: integral-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Library input/output data type name
+        - int8
+        - int16
+        - int32
+        - int64
+      *
+        - hipSPARSELt (:doc:`details<hipsparselt:reference/data-type-support>`)
+        - ✅/✅
+        - ❌/❌
+        - ❌/❌
+        - ❌/❌
+      *
+        - rocRAND (:doc:`details<rocrand:data-type-support>`)
+        - -/✅
+        - -/✅
+        - -/✅
+        - -/✅
+      *
+        - hipRAND (:doc:`details<hiprand:data-type-support>`)
+        - -/✅
+        - -/✅
+        - -/✅
+        - -/✅
+      *
+        - rocPRIM (:doc:`details<rocprim:data-type-support>`)
+        - ✅/✅
+        - ✅/✅
+        - ✅/✅
+        - ✅/✅
+      *
+        - hipCUB (:doc:`details<hipcub:data-type-support>`)
+        - ✅/✅
+        - ✅/✅
+        - ✅/✅
+        - ✅/✅
+      *
+        - rocThrust (:doc:`details<rocthrust:data-type-support>`)
+        - ✅/✅
+        - ✅/✅
+        - ✅/✅
+        - ✅/✅
+
+  .. tab-item:: Floating-point types
+    :sync: floating-point-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Library input/output data type name
+        - float8 (E4M3)
+        - float8 (E5M2)
+        - float16
+        - bfloat16
+        - tensorfloat32
+        - float32
+        - float64
+      *
+        - hipSPARSELt (:doc:`details<hipsparselt:reference/data-type-support>`)
+        - ❌/❌
+        - ❌/❌
+        - ✅/✅
+        - ✅/✅
+        - ❌/❌
+        - ❌/❌
+        - ❌/❌
+      *
+        - rocRAND (:doc:`details<rocrand:data-type-support>`)
+        - -/❌
+        - -/❌
+        - -/✅
+        - -/❌
+        - -/❌
+        - -/✅
+        - -/✅
+      *
+        - hipRAND (:doc:`details<hiprand:data-type-support>`)
+        - -/❌
+        - -/❌
+        - -/✅
+        - -/❌
+        - -/❌
+        - -/✅
+        - -/✅
+      *
+        - rocPRIM (:doc:`details<rocprim:data-type-support>`)
+        - ❌/❌
+        - ❌/❌
+        - ✅/✅
+        - ✅/✅
+        - ❌/❌
+        - ✅/✅
+        - ✅/✅
+      *
+        - hipCUB (:doc:`details<hipcub:data-type-support>`)
+        - ❌/❌
+        - ❌/❌
+        - ✅/✅
+        - ✅/✅
+        - ❌/❌
+        - ✅/✅
+        - ✅/✅
+      *
+        - rocThrust (:doc:`details<rocthrust:data-type-support>`)
+        - ❌/❌
+        - ❌/❌
+        - ⚠️/⚠️
+        - ⚠️/⚠️
+        - ❌/❌
+        - ✅/✅
+        - ✅/✅
+
+
+Libraries internal calculations type support
+-------------------------------------------------------------------------------
+
+The following tables list ROCm library support for specific internal data types. For a detailed
+description, refer to the corresponding library data type support page.
+
+.. tab-set::
+
+  .. tab-item:: Integral types
+    :sync: integral-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Library internal data type name
+        - int8
+        - int16
+        - int32
+        - int64
+      *
+        - hipSPARSELt (:doc:`details<hipsparselt:reference/data-type-support>`)
+        - ❌
+        - ❌
+        - ✅
+        - ❌
+
+
+  .. tab-item:: Floating-point types
+    :sync: floating-point-type
+
+    .. list-table::
+      :header-rows: 1
+
+      *
+        - Library internal data type name
+        - float8 (E4M3)
+        - float8 (E5M2)
+        - float16
+        - bfloat16
+        - tensorfloat32
+        - float32
+        - float64
+      *
+        - hipSPARSELt (:doc:`details<hipsparselt:reference/data-type-support>`)
+        - ❌
+        - ❌
+        - ❌
+        - ❌
+        - ❌
+        - ✅
+        - ❌
--- a/docs/about/compatibility/openmp.md
+++ b/docs/about/compatibility/openmp.md
@@ -78,7 +78,7 @@ Obtain the value of `gpu-arch` by running the following command:
 [//]: # (dated link below, needs updating)

 See the complete list of compiler command-line references
-[here](https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/clang/docs/CommandGuide/clang.rst).
+[here](https://github.com/ROCm/llvm-project/blob/amd-stg-open/clang/docs/CommandGuide/clang.rst).

 ### Using `rocprof` with OpenMP

@@ -413,7 +413,7 @@ void  main() {
 ```

 See the complete sample code for heap buffer overflow
-[here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/heap_buffer_overflow/openmp/vecadd-HBO.cpp).
+[here](https://github.com/ROCm/aomp/blob/aomp-dev/examples/tools/asan/heap_buffer_overflow/openmp/vecadd-HBO.cpp).

 * Global buffer overflow

@@ -438,7 +438,7 @@ for(int i=0; i<N; i++){
 ```

 See the complete sample code for global buffer overflow
-[here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/global_buffer_overflow/openmp/vecadd-GBO.cpp).
+[here](https://github.com/ROCm/aomp/blob/aomp-dev/examples/tools/asan/global_buffer_overflow/openmp/vecadd-GBO.cpp).

 ### Clang compiler option for kernel optimization

--- a/docs/about/license.md
+++ b/docs/about/license.md
@@ -1,13 +1,142 @@
-# License
+<head>
+  <meta charset="UTF-8">
+  <meta name="description" content="ROCm licensing terms">
+  <meta name="keywords" content="license, licensing terms">
+</head>

-:::{note}
-This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that
-primarily contains documentation. For other licensing information, refer to the
-[Licensing Terms page](./licensing).
-:::
+# ROCm license

 ```{include} ../../LICENSE
 ```

-```{include} ./licensing.md
+:::{note}
+The preceding license applies to the [ROCm repository](https://github.com/ROCm/ROCm), which
+primarily contains documentation. For licenses related to other ROCm components, refer to the
+following section.
+:::
+
+## ROCm component licenses
+
+ROCm is released by Advanced Micro Devices, Inc. and is licensed per component separately.
+The following table is a list of ROCm components with links to their respective license
+terms. These components may include third-party components subject to
+additional licenses. Please review individual repositories for more information.
+
+<!-- spellcheck-disable -->
+| Component | License |
+|:---------------------|:-------------------------|
+| [AMDMIGraphX](https://github.com/ROCm/AMDMIGraphX/) | [MIT](https://github.com/ROCm/AMDMIGraphX/blob/develop/LICENSE) |
+| [HIPCC](https://github.com/ROCm/HIPCC/) | [MIT](https://github.com/ROCm/HIPCC/blob/develop/LICENSE.txt) |
+| [HIPIFY](https://github.com/ROCm/HIPIFY/) | [MIT](https://github.com/ROCm/HIPIFY/blob/amd-staging/LICENSE.txt) |
+| [HIP](https://github.com/ROCm/HIP/) | [MIT](https://github.com/ROCm/HIP/blob/develop/LICENSE.txt) |
+| [MIOpenGEMM](https://github.com/ROCm/MIOpenGEMM/) | [MIT](https://github.com/ROCm/MIOpenGEMM/blob/master/LICENSE.txt) |
+| [MIOpen](https://github.com/ROCm/MIOpen/) | [MIT](https://github.com/ROCm/MIOpen/blob/master/LICENSE.txt) |
+| [MIVisionX](https://github.com/ROCm/MIVisionX/) | [MIT](https://github.com/ROCm/MIVisionX/blob/master/LICENSE.txt) |
+| [RCP](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | [MIT](https://github.com/GPUOpen-Tools/radeon_compute_profiler/blob/master/LICENSE) |
+| [ROCK-Kernel-Driver](https://github.com/ROCm/ROCK-Kernel-Driver/) | [GPL 2.0 WITH Linux-syscall-note](https://github.com/ROCm/ROCK-Kernel-Driver/blob/master/COPYING) |
+| [ROCR-Runtime](https://github.com/ROCm/ROCR-Runtime/) | [The University of Illinois/NCSA](https://github.com/ROCm/ROCR-Runtime/blob/master/LICENSE.txt) |
+| [ROCT-Thunk-Interface](https://github.com/ROCm/ROCT-Thunk-Interface/) | [MIT](https://github.com/ROCm/ROCT-Thunk-Interface/blob/master/LICENSE.md) |
+| [ROCclr](https://github.com/ROCm/ROCclr/) | [MIT](https://github.com/ROCm/ROCclr/blob/develop/LICENSE.txt) |
+| [ROCdbgapi](https://github.com/ROCm/ROCdbgapi/) | [MIT](https://github.com/ROCm/ROCdbgapi/blob/amd-master/LICENSE.txt) |
+| [ROCgdb](https://github.com/ROCm/ROCgdb/) | [GNU General Public License v2.0](https://github.com/ROCm/ROCgdb/blob/amd-master/COPYING) |
+| [ROCm-CompilerSupport](https://github.com/ROCm/ROCm-CompilerSupport/) | [The University of Illinois/NCSA](https://github.com/ROCm/ROCm-CompilerSupport/blob/amd-stg-open/LICENSE.txt) |
+| [ROCm-Device-Libs](https://github.com/ROCm/ROCm-Device-Libs/) | [The University of Illinois/NCSA](https://github.com/ROCm/ROCm-Device-Libs/blob/amd-stg-open/LICENSE.TXT) |
+| [ROCm-OpenCL-Runtime/api/opencl/khronos/icd](https://github.com/KhronosGroup/OpenCL-ICD-Loader/) | [Apache 2.0](https://github.com/KhronosGroup/OpenCL-ICD-Loader/blob/main/LICENSE) |
+| [ROCm-OpenCL-Runtime](https://github.com/ROCm/ROCm-OpenCL-Runtime/) | [MIT](https://github.com/ROCm/ROCm-OpenCL-Runtime/blob/develop/LICENSE.txt) |
+| [ROCmValidationSuite](https://github.com/ROCm/ROCmValidationSuite/) | [MIT](https://github.com/ROCm/ROCmValidationSuite/blob/master/LICENSE) |
+| [Tensile](https://github.com/ROCm/Tensile/) | [MIT](https://github.com/ROCm/Tensile/blob/develop/LICENSE.md) |
+| [aomp-extras](https://github.com/ROCm/aomp-extras/) | [MIT](https://github.com/ROCm/aomp-extras/blob/aomp-dev/LICENSE) |
+| [aomp](https://github.com/ROCm/aomp/) | [Apache 2.0](https://github.com/ROCm/aomp/blob/aomp-dev/LICENSE) |
+| [atmi](https://github.com/ROCm/atmi/) | [MIT](https://github.com/ROCm/atmi/blob/master/LICENSE.txt) |
+| [clang-ocl](https://github.com/ROCm/clang-ocl/) | [MIT](https://github.com/ROCm/clang-ocl/blob/master/LICENSE) |
+| [flang](https://github.com/ROCm/flang/) | [Apache 2.0](https://github.com/ROCm/flang/blob/master/LICENSE.txt) |
+| [half](https://github.com/ROCm/half/) | [MIT](https://github.com/ROCm/half/blob/master/LICENSE.txt) |
+| [hipBLAS](https://github.com/ROCm/hipBLAS/) | [MIT](https://github.com/ROCm/hipBLAS/blob/develop/LICENSE.md) |
+| [hipCUB](https://github.com/ROCm/hipCUB/) | [Custom](https://github.com/ROCm/hipCUB/blob/develop/LICENSE.txt) |
+| [hipFFT](https://github.com/ROCm/hipFFT/) | [MIT](https://github.com/ROCm/hipFFT/blob/develop/LICENSE.md) |
+| [hipSOLVER](https://github.com/ROCm/hipSOLVER/) | [MIT](https://github.com/ROCm/hipSOLVER/blob/develop/LICENSE.md) |
+| [hipSPARSELt](https://github.com/ROCm/hipSPARSELt/) | [MIT](https://github.com/ROCm/hipSPARSELt/blob/develop/LICENSE.md) |
+| [hipSPARSE](https://github.com/ROCm/hipSPARSE/) | [MIT](https://github.com/ROCm/hipSPARSE/blob/develop/LICENSE.md) |
+| [hipTensor](https://github.com/ROCm/hipTensor) | [MIT](https://github.com/ROCm/hipTensor/blob/develop/LICENSE) |
+| [hipamd](https://github.com/ROCm/hipamd/) | [MIT](https://github.com/ROCm/hipamd/blob/develop/LICENSE.txt) |
+| [hipfort](https://github.com/ROCm/hipfort/) | [MIT](https://github.com/ROCm/hipfort/blob/master/LICENSE) |
+| [llvm-project](https://github.com/ROCm/llvm-project/) | [Apache](https://github.com/ROCm/llvm-project/blob/main/LICENSE.TXT) |
+| [rccl](https://github.com/ROCm/rccl/) | [Custom](https://github.com/ROCm/rccl/blob/develop/LICENSE.txt) |
+| [rdc](https://github.com/ROCm/rdc/) | [MIT](https://github.com/ROCm/rdc/blob/master/LICENSE) |
+| [rocALUTION](https://github.com/ROCm/rocALUTION/) | [MIT](https://github.com/ROCm/rocALUTION/blob/develop/LICENSE.md) |
+| [rocBLAS](https://github.com/ROCm/rocBLAS/) | [MIT](https://github.com/ROCm/rocBLAS/blob/develop/LICENSE.md) |
+| [rocFFT](https://github.com/ROCm/rocFFT/) | [MIT](https://github.com/ROCm/rocFFT/blob/develop/LICENSE.md) |
+| [rocPRIM](https://github.com/ROCm/rocPRIM/) | [MIT](https://github.com/ROCm/rocPRIM/blob/develop/LICENSE.txt) |
+| [rocRAND](https://github.com/ROCm/rocRAND/) | [MIT](https://github.com/ROCm/rocRAND/blob/develop/LICENSE.txt) |
+| [rocSOLVER](https://github.com/ROCm/rocSOLVER/) | [BSD-2-Clause](https://github.com/ROCm/rocSOLVER/blob/develop/LICENSE.md) |
+| [rocSPARSE](https://github.com/ROCm/rocSPARSE/) | [MIT](https://github.com/ROCm/rocSPARSE/blob/develop/LICENSE.md) |
+| [rocThrust](https://github.com/ROCm/rocThrust/) | [Apache 2.0](https://github.com/ROCm/rocThrust/blob/develop/LICENSE) |
+| [rocWMMA](https://github.com/ROCm/rocWMMA/) | [MIT](https://github.com/ROCm/rocWMMA/blob/develop/LICENSE.md) |
+| [rocm-cmake](https://github.com/ROCm/rocm-cmake/) | [MIT](https://github.com/ROCm/rocm-cmake/blob/develop/LICENSE) |
+| [rocm_bandwidth_test](https://github.com/ROCm/rocm_bandwidth_test/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocm_bandwidth_test/blob/master/LICENSE.txt) |
+| [rocm_smi_lib](https://github.com/ROCm/rocm_smi_lib/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocm_smi_lib/blob/master/License.txt) |
+| [rocminfo](https://github.com/ROCm/rocminfo/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocminfo/blob/master/License.txt) |
+| [rocprofiler](https://github.com/ROCm/rocprofiler/) | [MIT](https://github.com/ROCm/rocprofiler/blob/amd-master/LICENSE) |
+| [rocr_debug_agent](https://github.com/ROCm/rocr_debug_agent/) | [The University of Illinois/NCSA](https://github.com/ROCm/rocr_debug_agent/blob/master/LICENSE.txt) |
+| [roctracer](https://github.com/ROCm/roctracer/) | [MIT](https://github.com/ROCm/roctracer/blob/amd-master/LICENSE) |
+| rocm-llvm-alt | [AMD Proprietary License](https://www.amd.com/en/support/amd-software-eula) |
+
+Open source ROCm components are released through public GitHub
+repositories, packages on https://repo.radeon.com, and other distribution channels.
+Proprietary products are only available on https://repo.radeon.com. Currently, only
+one ROCm component, rocm-llvm-alt, is governed by a proprietary license.
+Proprietary components are organized in a proprietary subdirectory in the package
+repositories to distinguish them from open source packages.
+
+```{note}
+The following additional terms and conditions apply to your use of ROCm technical documentation.
 ```
+
+©2023 Advanced Micro Devices, Inc. All rights reserved.
+
+The information presented in this document is for informational purposes only
+and may contain technical inaccuracies, omissions, and typographical errors. The
+information contained herein is subject to change and may be rendered inaccurate
+for many reasons, including but not limited to product and roadmap changes,
+component and motherboard version changes, new model and/or product releases,
+product differences between differing manufacturers, software changes, BIOS
+flashes, firmware upgrades, or the like. Any computer system has risks of
+security vulnerabilities that cannot be completely prevented or mitigated. AMD
+assumes no obligation to update or otherwise correct or revise this information.
+However, AMD reserves the right to revise this information and to make changes
+from time to time to the content hereof without obligation of AMD to notify any
+person of such revisions or changes.
+
+THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES
+WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
+INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD
+SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT,
+MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
+LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER
+CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN,
+EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+AMD, the AMD Arrow logo, ROCm, and combinations thereof are trademarks of
+Advanced Micro Devices, Inc. Other product names used in this publication are
+for identification purposes only and may be trademarks of their respective
+companies.
+
+### Package licensing
+
+:::{attention}
+AQL Profiler and AOCC CPU optimization are both provided in binary form, each
+subject to the license agreement enclosed in the directory for the binary and is
+available here: `/opt/rocm/share/doc/rocm-llvm-alt/EULA`. By using, installing,
+copying or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to
+the terms and conditions of this license agreement. If you do not agree to the
+terms of this agreement, do not install, copy or use the AQL Profiler and/or the
+AOCC CPU Optimizations.
+:::
+
+For the rest of the ROCm packages, you can find the licensing information at the
+following location: `/opt/rocm/share/doc/<component-name>/`
+
+For example, you can fetch the licensing information of the `amd_comgr`
+component (Code Object Manager) from the `amd_comgr` folder. A file named
+`LICENSE.txt` contains the license details at:
+`/opt/rocm-5.4.3/share/doc/amd_comgr/LICENSE.txt`
--- a/docs/about/licensing.md
+++ b/docs/about/licensing.md
@@ -1,133 +0,0 @@
-<head>
-  <meta charset="UTF-8">
-  <meta name="description" content="ROCm licensing terms">
-  <meta name="keywords" content="license, licensing terms">
-</head>
-
-# ROCm licensing terms
-
-ROCm™ is released by Advanced Micro Devices, Inc. and is licensed per component separately.
-The following table is a list of ROCm components with links to their respective license
-terms. These components may include third party components subject to
-additional licenses. Please review individual repositories for more information.
-
-The table shows ROCm components, the name of license, and link to the license terms.
-The table is ordered to follow the ROCm manifest file.
-
-<!-- spellcheck-disable -->
-| Component | License |
-|:---------------------|:-------------------------|
-| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/) | [MIT](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/LICENSE) |
-| [HIPCC](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) | [MIT](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) |
-| [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/) | [MIT](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/LICENSE.txt) |
-| [HIP](https://github.com/ROCm-Developer-Tools/HIP/) | [MIT](https://github.com/ROCm-Developer-Tools/HIP/blob/develop/LICENSE.txt) |
-| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/) | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/blob/master/LICENSE.txt) |
-| [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen/) | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/LICENSE.txt) |
-| [MIVisionX](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/) | [MIT](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/LICENSE.txt) |
-| [RCP](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | [MIT](https://github.com/GPUOpen-Tools/radeon_compute_profiler/blob/master/LICENSE) |
-| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/) | [GPL 2.0 WITH Linux-syscall-note](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/COPYING) |
-| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/LICENSE.txt) |
-| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | [MIT](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/LICENSE.md) |
-| [ROCclr](https://github.com/ROCm-Developer-Tools/ROCclr/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCclr/blob/develop/LICENSE.txt) |
-| [ROCdbgapi](https://github.com/ROCm-Developer-Tools/ROCdbgapi/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCdbgapi/blob/amd-master/LICENSE.txt) |
-| [ROCgdb](https://github.com/ROCm-Developer-Tools/ROCgdb/) | [GNU General Public License v2.0](https://github.com/ROCm-Developer-Tools/ROCgdb/blob/amd-master/COPYING) |
-| [ROCm-CompilerSupport](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/blob/amd-stg-open/LICENSE.txt) |
-| [ROCm-Device-Libs](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/LICENSE.TXT) |
-| [ROCm-OpenCL-Runtime/api/opencl/khronos/icd](https://github.com/KhronosGroup/OpenCL-ICD-Loader/) | [Apache 2.0](https://github.com/KhronosGroup/OpenCL-ICD-Loader/blob/main/LICENSE) |
-| [ROCm-OpenCL-Runtime](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/) | [MIT](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/develop/LICENSE.txt) |
-| [ROCmValidationSuite](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/LICENSE) |
-| [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile/) | [MIT](https://github.com/ROCmSoftwarePlatform/Tensile/blob/develop/LICENSE.md) |
-| [aomp-extras](https://github.com/ROCm-Developer-Tools/aomp-extras/) | [MIT](https://github.com/ROCm-Developer-Tools/aomp-extras/blob/aomp-dev/LICENSE) |
-| [aomp](https://github.com/ROCm-Developer-Tools/aomp/) | [Apache 2.0](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/LICENSE) |
-| [atmi](https://github.com/RadeonOpenCompute/atmi/) | [MIT](https://github.com/RadeonOpenCompute/atmi/blob/master/LICENSE.txt) |
-| [clang-ocl](https://github.com/RadeonOpenCompute/clang-ocl/) | [MIT](https://github.com/RadeonOpenCompute/clang-ocl/blob/master/LICENSE) |
-| [flang](https://github.com/ROCm-Developer-Tools/flang/) | [Apache 2.0](https://github.com/ROCm-Developer-Tools/flang/blob/master/LICENSE.txt) |
-| [half](https://github.com/ROCmSoftwarePlatform/half/) | [MIT](https://github.com/ROCmSoftwarePlatform/half/blob/master/LICENSE.txt) |
-| [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/LICENSE.md) |
-| [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB/) | [Custom](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/LICENSE.txt) |
-| [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/LICENSE.md) |
-| [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/LICENSE.md) |
-| [hipSPARSELt](https://github.com/ROCmSoftwarePlatform/hipSPARSELt/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSELt/blob/develop/LICENSE.md) |
-| [hipSPARSE](https://github.com/ROCmSoftwarePlatform/hipSPARSE/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSE/blob/develop/LICENSE.md) |
-| [hipTensor](https://github.com/ROCmSoftwarePlatform/hipTensor) | [MIT](https://github.com/ROCmSoftwarePlatform/hipTensor/blob/develop/LICENSE) |
-| [hipamd](https://github.com/ROCm-Developer-Tools/hipamd/) | [MIT](https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/LICENSE.txt) |
-| [hipfort](https://github.com/ROCmSoftwarePlatform/hipfort/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipfort/blob/master/LICENSE) |
-| [llvm-project](https://github.com/ROCm-Developer-Tools/llvm-project/) | [Apache](https://github.com/ROCm-Developer-Tools/llvm-project/blob/main/LICENSE.TXT) |
-| [rccl](https://github.com/ROCmSoftwarePlatform/rccl/) | [Custom](https://github.com/ROCmSoftwarePlatform/rccl/blob/develop/LICENSE.txt) |
-| [rdc](https://github.com/RadeonOpenCompute/rdc/) | [MIT](https://github.com/RadeonOpenCompute/rdc/blob/master/LICENSE) |
-| [rocALUTION](https://github.com/ROCmSoftwarePlatform/rocALUTION/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/LICENSE.md) |
-| [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/LICENSE.md) |
-| [rocFFT](https://github.com/ROCmSoftwarePlatform/rocFFT/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/LICENSE.md) |
-| [rocPRIM](https://github.com/ROCmSoftwarePlatform/rocPRIM/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/LICENSE.txt) |
-| [rocRAND](https://github.com/ROCmSoftwarePlatform/rocRAND/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/LICENSE.txt) |
-| [rocSOLVER](https://github.com/ROCmSoftwarePlatform/rocSOLVER/) | [BSD-2-Clause](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/LICENSE.md) |
-| [rocSPARSE](https://github.com/ROCmSoftwarePlatform/rocSPARSE/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/develop/LICENSE.md) |
-| [rocThrust](https://github.com/ROCmSoftwarePlatform/rocThrust/) | [Apache 2.0](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/LICENSE) |
-| [rocWMMA](https://github.com/ROCmSoftwarePlatform/rocWMMA/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/LICENSE.md) |
-| [rocm-cmake](https://github.com/RadeonOpenCompute/rocm-cmake/) | [MIT](https://github.com/RadeonOpenCompute/rocm-cmake/blob/develop/LICENSE) |
-| [rocm_bandwidth_test](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/blob/master/LICENSE.txt) |
-| [rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/License.txt) |
-| [rocminfo](https://github.com/RadeonOpenCompute/rocminfo/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocminfo/blob/master/License.txt) |
-| [rocprofiler](https://github.com/ROCm-Developer-Tools/rocprofiler/) | [MIT](https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/LICENSE) |
-| [rocr_debug_agent](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/) | [The University of Illinois/NCSA](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/blob/master/LICENSE.txt) |
-| [roctracer](https://github.com/ROCm-Developer-Tools/roctracer/) | [MIT](https://github.com/ROCm-Developer-Tools/roctracer/blob/amd-master/LICENSE) |
-| rocm-llvm-alt | [AMD Proprietary License](https://www.amd.com/en/support/amd-software-eula) |
-
-Open sourced ROCm components are released via public GitHub
-repositories, packages on https://repo.radeon.com and other distribution channels.
-Proprietary products are only available on https://repo.radeon.com. Currently, only
-one component of ROCm, rocm-llvm-alt, is governed by a proprietary license.
-Proprietary components are organized in a proprietary subdirectory in the package
-repositories to distinguish from open sourced packages.
-
-The additional terms and conditions below apply to your use of ROCm technical
-documentation.
-
-©2023 Advanced Micro Devices, Inc. All rights reserved.
-
-The information presented in this document is for informational purposes only
-and may contain technical inaccuracies, omissions, and typographical errors. The
-information contained herein is subject to change and may be rendered inaccurate
-for many reasons, including but not limited to product and roadmap changes,
-component and motherboard version changes, new model and/or product releases,
-product differences between differing manufacturers, software changes, BIOS
-flashes, firmware upgrades, or the like. Any computer system has risks of
-security vulnerabilities that cannot be completely prevented or mitigated. AMD
-assumes no obligation to update or otherwise correct or revise this information.
-However, AMD reserves the right to revise this information and to make changes
-from time to time to the content hereof without obligation of AMD to notify any
-person of such revisions or changes.
-
-THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES
-WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
-INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD
-SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT,
-MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
-LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER
-CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN,
-EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
-
-AMD, the AMD Arrow logo, ROCm, and combinations thereof are trademarks of
-Advanced Micro Devices, Inc. Other product names used in this publication are
-for identification purposes only and may be trademarks of their respective
-companies.
-
-## Package licensing
-
-:::{attention}
-AQL Profiler and AOCC CPU optimization are both provided in binary form, each
-subject to the license agreement enclosed in the directory for the binary and is
-available here: `/opt/rocm/share/doc/rocm-llvm-alt/EULA`. By using, installing,
-copying or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to
-the terms and conditions of this license agreement. If you do not agree to the
-terms of this agreement, do not install, copy or use the AQL Profiler and/or the
-AOCC CPU Optimizations.
-:::
-
-For the rest of the ROCm packages, you can find the licensing information at the
-following location: `/opt/rocm/share/doc/<component-name>/`
-
-For example, you can fetch the licensing information of the `_amd_comgr_`
-component (Code Object Manager) from the `amd_comgr` folder. A file named
-`LICENSE.txt` contains the license details at:
-`/opt/rocm-5.4.3/share/doc/amd_comgr/LICENSE.txt`
--- a/docs/conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
+++ b/docs/conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
@@ -63,15 +63,14 @@ There are also a number of papers which talk about these new capabilities:

  * `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
  * `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
-  * `Intel PCIe Generation 3 Hotchips Paper <https://www.hotchips.org/wp-content/uploads/hc_archives/hc21/1_sun/HC21.23.1.SystemInterconnectTutorial-Epub/HC21.23.131.Ajanovic-Intel-PCIeGen3.pdf>`_
  * `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
-
-Other I/O devices with PCIe atomics support
-
-  * `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
-  * `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
  * `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
-  * `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
+
+Other I/O devices with PCIe atomics support:
+
+  * Mellanox ConnectX-5 InfiniBand Card
+  * Cray Aries Interconnect
+  * Xilinx 7 Series Devices

 Future bus technology with richer I/O atomics operation Support

@@ -80,8 +79,8 @@ Future bus technology with richer I/O atomics operation Support
 New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs
 with PCIe Generation 3.0 support.

-  * `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
-  * `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
+  * Mellanox Bluefield SOC
+  * Cavium Thunder X2

 In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU
 originates two writes to two different targets:
--- a/docs/conceptual/ai-migraphx-optimization.md
+++ b/docs/conceptual/ai-migraphx-optimization.md
@@ -55,15 +55,15 @@ The header files and libraries are installed under `/opt/rocm-\<version\>`, wher

 There are two ways to build the MIGraphX sources.

-* [Use the ROCm build tool](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-the-rocm-build-tool-rbuild) - This approach uses [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install the prerequisites and build the libraries with just one command.
+* [Use the ROCm build tool](https://github.com/ROCm/AMDMIGraphX#use-the-rocm-build-tool-rbuild) - This approach uses [rbuild](https://github.com/ROCm/rbuild) to install the prerequisites and build the libraries with just one command.

  or

-* [Use CMake](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-cmake-to-build-migraphx) - This approach uses a script to install the prerequisites, then uses CMake to build the source.
+* [Use CMake](https://github.com/ROCm/AMDMIGraphX#use-cmake-to-build-migraphx) - This approach uses a script to install the prerequisites, then uses CMake to build the source.

 For detailed steps on building from source and installing dependencies, refer to the following `README` file:

-[https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#building-from-source](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#building-from-source)
+[https://github.com/ROCm/AMDMIGraphX#building-from-source](https://github.com/ROCm/AMDMIGraphX#building-from-source)

 ### Option 3: use docker

@@ -72,7 +72,7 @@ To use Docker, follow these steps:
 1. The easiest way to set up the development environment is to use Docker. To build Docker from scratch, first clone the MIGraphX repository by running:

    ```bash
-    git clone --recursive https://github.com/ROCmSoftwarePlatform/AMDMIGraphX
+    git clone --recursive https://github.com/ROCm/AMDMIGraphX
    ```

 2. The repository contains a Dockerfile from which you can build a Docker image as:
--- a/docs/conceptual/ai-pytorch-inception.md
+++ b/docs/conceptual/ai-pytorch-inception.md
@@ -22,6 +22,7 @@ Training occurs in multiple phases for every batch of training data. the followi
 :::{table} Types of Training Phases
 :name: training-phases
 :widths: auto
+
 | Types of Phases   |     |
 | ----------------- | --- |
 | Forward Pass      | The input features are fed into the model, whose parameters may be randomly initialized. Activations (outputs) of each layer are retained during this pass to help in the loss gradient computation during the backward pass. |
@@ -35,6 +36,7 @@ Training is different from inference, particularly from the hardware perspective
 :::{table} Training vs. Inference
 :name: training-inference
 :widths: auto
+
 | Training | Inference |
 | ----------- | ----------- |
 | Training is measured in hours/days. | The inference is measured in minutes. |
@@ -679,7 +681,7 @@ The dataset has 60,000 images you will use to train the network and 10,000 to ev

 Access the source code from the following repository:

-[https://github.com/ROCmSoftwarePlatform/tensorflow_fashionmnist/blob/main/fashion_mnist.py](https://github.com/ROCmSoftwarePlatform/tensorflow_fashionmnist/blob/main/fashion_mnist.py)
+[https://github.com/ROCm/tensorflow_fashionmnist/blob/main/fashion_mnist.py](https://github.com/ROCm/tensorflow_fashionmnist/blob/main/fashion_mnist.py)

 To understand the code step by step, follow these steps:

@@ -876,7 +878,7 @@ To understand the code step by step, follow these steps:
        thisplot[true_label].set_color('blue')
        ```

-    9. With the model trained, you can use it to make predictions about some images. Review the 0-th image predictions and the prediction array. Correct prediction labels are blue, and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label.
+    9. With the model trained, you can use it to make predictions about some images. Review the 0<sup>th</sup> image predictions and the prediction array. Correct prediction labels are blue, and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label.

        ```py
        i = 0
--- a/docs/conceptual/cmake-packages.rst
+++ b/docs/conceptual/cmake-packages.rst
@@ -198,14 +198,13 @@ all the flags necessary for device compilation.

  Compiling for the GPU device requires at least C++11.

-This project can then be configured with the following CMake commands.
+This project can then be configured with the following CMake commands:

-  Windows: ``cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe``
-
-  Linux: ``cmake -D CMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++``
+*  Windows: ``cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe``
+*  Linux: ``cmake -D CMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++``

 Which use the device compiler provided from the binary packages of
-`ROCm HIP SDK <https://www.amd.com/en/developer/rocm-hub.html>`_ and
+`ROCm HIP SDK <https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html>`_ and
 `repo.radeon.com <https://repo.radeon.com>`_ respectively.

 When using the ``CXX`` language support to compile HIP device code, selecting the
--- a/docs/conceptual/compiler-disambiguation.md
+++ b/docs/conceptual/compiler-disambiguation.md
@@ -13,9 +13,9 @@ disambiguates compiler naming used throughout the documentation.

 | Term | Description |
 | - | - |
-| `amdclang++` | Clang/LLVM-based compiler that is part of `rocm-llvm` package. The source code is available at <a href="https://github.com/RadeonOpenCompute/llvm-project" target="_blank">https://github.com/RadeonOpenCompute/llvm-project</a>. |
+| `amdclang++` | Clang/LLVM-based compiler that is part of `rocm-llvm` package. The source code is available at <a href="https://github.com/ROCm/llvm-project" target="_blank">https://github.com/ROCm/llvm-project</a>. |
 | AOCC | Closed-source clang-based compiler that includes additional CPU optimizations. Offered as part of ROCm via the `rocm-llvm-alt` package. For details, see <a href="https://developer.amd.com/amd-aocc/" target="_blank">https://developer.amd.com/amd-aocc/</a>. |
 | HIP-Clang | Informal term for the `amdclang++` compiler |
-| HIPIFY | Tools including `hipify-clang` and `hipify-perl`, used to automatically translate CUDA source code into portable HIP C++. The source code is available at <a href="https://github.com/ROCm-Developer-Tools/HIPIFY" target="_blank">https://github.com/ROCm-Developer-Tools/HIPIFY</a> |
-| `hipcc` | HIP compiler driver. A utility that invokes `clang` or `nvcc` depending on the target and passes the appropriate include and library options for the target compiler and HIP infrastructure. The source code is available at <a href="https://github.com/ROCm-Developer-Tools/HIPCC" target="_blank">https://github.com/ROCm-Developer-Tools/HIPCC</a>. |
+| HIPIFY | Tools including `hipify-clang` and `hipify-perl`, used to automatically translate CUDA source code into portable HIP C++. The source code is available at <a href="https://github.com/ROCm/HIPIFY" target="_blank">https://github.com/ROCm/HIPIFY</a> |
+| `hipcc` | HIP compiler driver. A utility that invokes `clang` or `nvcc` depending on the target and passes the appropriate include and library options for the target compiler and HIP infrastructure. The source code is available at <a href="https://github.com/ROCm/HIPCC" target="_blank">https://github.com/ROCm/HIPCC</a>. |
 | ROCmCC | Clang/LLVM-based compiler. ROCmCC in itself is not a binary but refers to the overall compiler. |
--- a/docs/conceptual/gpu-arch.md
+++ b/docs/conceptual/gpu-arch.md
@@ -5,29 +5,43 @@
  MI100, AMD Instinct">
 </head>

+(gpu-arch-documentation)=
+
 # GPU architecture documentation

 :::::{grid} 1 1 2 2
 :gutter: 1

+:::{grid-item-card}
+**AMD Instinct MI300 series**
+
+Review hardware aspects of the AMD Instinct™ MI300 series of GPU accelerators and the CDNA™ 3
+architecture.
+
+* [AMD Instinct™ MI300 microarchitecture](./gpu-arch/mi300.md)
+* [AMD Instinct MI300/CDNA3 ISA](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf)
+* [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf)
+* [Performance counters](./gpu-arch/mi300-mi200-performance-counters.rst)
+:::
+
 :::{grid-item-card}
 **AMD Instinct MI200 series**

-Review hardware aspects of the AMD Instinct™ MI200 series of GPU
-accelerators and the CDNA™ 2 architecture.
+Review hardware aspects of the AMD Instinct™ MI200 series of GPU accelerators and the CDNA™ 2
+architecture.

 * [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md)
 * [AMD Instinct MI200/CDNA2 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
 * [White paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
-* [Performance counters](./gpu-arch/mi200-performance-counters.md)
+* [Performance counters](./gpu-arch/mi300-mi200-performance-counters.rst)

 :::

 :::{grid-item-card}
 **AMD Instinct MI100**

-Review hardware aspects of the AMD Instinct™ MI100
-accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.
+Review hardware aspects of the AMD Instinct™ MI100 series of GPU accelerators and the CDNA™ 1
+architecture.

 * [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md)
 * [AMD Instinct MI100/CDNA1 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
--- a/docs/conceptual/gpu-arch/mi200-performance-counters.md
+++ b/docs/conceptual/gpu-arch/mi200-performance-counters.md
@@ -1,578 +0,0 @@
-<head>
-  <meta charset="UTF-8">
-  <meta name="description" content="MI200 performance counters and metrics">
-  <meta name="keywords" content="MI200, performance counters, counters, GRBM counters, GRBM,
-  CPF counters, CPF, CPC counters, CPC, command processor counters, SPI counters, SPI, AMD, ROCm">
-</head>
-
-# MI200 performance counters and metrics
-<!-- markdownlint-disable no-duplicate-header -->
-
-This document lists and describes the hardware performance counters and derived metrics available on the AMD Instinct™ MI200 GPU. All the hardware basic counters and derived metrics are accessible via {doc}`ROCProfiler tool <rocprofiler:rocprofv1>`.
-
-## MI200 performance counters list
-
-See the category-wise listing of MI200 performance counters in the following tables.
-
-:::{note}
-Preliminary validation of all MI200 performance counters is in progress. Those with “*” appended to the names require further evaluation.
-:::
-
-### Graphics Register Bus Management (GRBM) counters
-
-| Hardware Counter   | Unit   | Definition                                                                |
-|:--------------------|:--------|:--------------------------------------------------------------------------|
-| `GRBM_COUNT`       | Cycles | Number of free-running GPU cycles                                         |
-| `GRBM_GUI_ACTIVE`  | Cycles | Number of GPU active cycles                                               |
-| `GRBM_CP_BUSY`     | Cycles | Number of cycles any of the Command Processor (CP) blocks are busy                  |
-| `GRBM_SPI_BUSY`    | Cycles | Number of cycles any of the Shader Processor Input (SPI) are busy in the shader engine(s) |
-| `GRBM_TA_BUSY`     | Cycles | Number of cycles any of the Texture Addressing Unit (TA) are busy in the shader engine(s) |
-| `GRBM_TC_BUSY`     | Cycles | Number of cycles any of the Texture Cache Blocks (TCP/TCI/TCA/TCC) are busy               |
-| `GRBM_CPC_BUSY`    | Cycles | Number of cycles the Command Processor - Compute (CPC) is busy                            |
-| `GRBM_CPF_BUSY`    | Cycles | Number of cycles the Command Processor - Fetcher (CPF) is busy                            |
-| `GRBM_UTCL2_BUSY`  | Cycles | Number of cycles the Unified Translation Cache - Level 2 (UTCL2) block is busy            |
-| `GRBM_EA_BUSY`     | Cycles | Number of cycles the Efficiency Arbiter (EA) block is busy                                |
-
-### Command Processor (CP) counters
-
-The CP counters are further classified into CP-Fetcher (CPF) and CP-Compute (CPC).
-
-#### CPF counters
-
-| Hardware Counter                     | Unit   | Definition                  |
-|:--------------------------------------|:--------|:-------------------------------------------------------------|
-| `CPF_CMP_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the Compute UTCL1s is stalled waiting on translation |
-| `CPF_CPF_STAT_BUSY`                  | Cycles | Number of cycles CPF is busy                                                   |
-| `CPF_CPF_STAT_IDLE*`               | Cycles | Number of cycles CPF is idle                                                   |
-| `CPF_CPF_STAT_STALL`                 | Cycles | Number of cycles CPF is stalled                                                  |
-| `CPF_CPF_TCIU_BUSY`                  | Cycles | Number of cycles CPF Texture Cache Interface Unit (TCIU) interface is busy                                    |
-| `CPF_CPF_TCIU_IDLE`                  | Cycles | Number of cycles CPF TCIU interface is idle                                    |
-| `CPF_CPF_TCIU_STALL*`              | Cycles | Number of cycles CPF TCIU interface is stalled waiting on free tags        |
-
-#### CPC counters
-
-| Hardware Counter                 | Unit   | Definition                                          |
-|:---------------------------------|:-------|:---------------------------------------------------|
-| `CPC_ME1_BUSY_FOR_PACKET_DECODE` | Cycles | Number of cycles CPC Micro Engine (ME1) is busy decoding packets                       |
-| `CPC_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the UTCL1s is stalled waiting on translation |
-| `CPC_CPC_STAT_BUSY`              | Cycles | Number of cycles CPC is busy                                            |
-| `CPC_CPC_STAT_IDLE`              | Cycles | Number of cycles CPC is idle                                            |
-| `CPC_CPC_STAT_STALL`             | Cycles | Number of cycles CPC is stalled                                         |
-| `CPC_CPC_TCIU_BUSY`              | Cycles | Number of cycles CPC TCIU interface is busy                             |
-| `CPC_CPC_TCIU_IDLE`              | Cycles | Number of cycles CPC TCIU interface is idle                             |
-| `CPC_CPC_UTCL2IU_BUSY`           | Cycles | Number of cycles CPC UTCL2 interface is busy                            |
-| `CPC_CPC_UTCL2IU_IDLE`           | Cycles | Number of cycles CPC UTCL2 interface is idle                            |
-| `CPC_CPC_UTCL2IU_STALL`          | Cycles | Number of cycles CPC UTCL2 interface is stalled                 |
-| `CPC_ME1_DC0_SPI_BUSY`           | Cycles | Number of cycles CPC ME1 Processor is busy                              |
-
-### Shader Processor Input (SPI) counters
-
-| Hardware Counter             | Unit        | Definition                                                   |
-|:----------------------------|:-----------|:-----------------------------------------------------------|
-| `SPI_CSN_BUSY`                 | Cycles      | Number of cycles with outstanding waves                      |
-| `SPI_CSN_WINDOW_VALID`         | Cycles      | Number of cycles enabled by `perfcounter_start` event               |
-| `SPI_CSN_NUM_THREADGROUPS`     | Workgroups  | Number of dispatched workgroups                        |
-| `SPI_CSN_WAVE`                 | Wavefronts  | Number of dispatched wavefronts                        |
-| `SPI_RA_REQ_NO_ALLOC`          | Cycles      | Number of Arb cycles with requests but no allocation |
-|`SPI_RA_REQ_NO_ALLOC_CSN`       | Cycles      | Number of Arb cycles with Compute Shader, n-th pipe (CSn) requests but no CSn allocation |
-| `SPI_RA_RES_STALL_CSN`         | Cycles      | Number of Arb stall cycles due to shortage of CSn pipeline slots |
-| `SPI_RA_TMP_STALL_CSN*`      | Cycles      | Number of stall cycles due to shortage of temp space |
-| `SPI_RA_WAVE_SIMD_FULL_CSN`    | SIMD-cycles | Accumulated number of Single Instruction Multiple Data (SIMD) units per cycle affected by shortage of wave slots for CSn wave dispatch   |
-| `SPI_RA_VGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of VGPR slots for CSn wave dispatch  |
-| `SPI_RA_SGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of SGPR slots for CSn wave dispatch    |
-| `SPI_RA_LDS_CU_FULL_CSN`       | CUs         | Number of Compute Units (CUs) affected by shortage of LDS space for CSn wave dispatch   |
-| `SPI_RA_BAR_CU_FULL_CSN*`    | CUs         | Number of CUs with CSn waves waiting at a BARRIER   |
-| `SPI_RA_BULKY_CU_FULL_CSN*`  | CUs         | Number of CUs with CSn waves waiting for BULKY resource     |
-| `SPI_RA_TGLIM_CU_FULL_CSN*`  | Cycles      | Number of CSn wave stall cycles due to restriction of `tg_limit` for thread group size    |
-| `SPI_RA_WVLIM_STALL_CSN*`  | Cycles      | Number of cycles CSn is stalled due to WAVE_LIMIT            |
-| `SPI_VWC_CSC_WR`               | Qcycles      | Number of quad-cycles taken to initialize Vector General Purpose Registers (VGPRs) when launching waves |
-| `SPI_SWC_CSC_WR`               | Qcycles      | Number of quad-cycles taken to initialize Scalar General Purpose Registers (SGPRs) when launching waves |
-
-### Compute Unit (CU) counters
-
-The CU counters are further classified into instruction mix, Matrix Fused Multiply Add (MFMA) operation counters, level counters, wavefront counters, wavefront cycle counters and Local Data Share (LDS) counters.
-
-#### Instruction mix
-
-| Hardware Counter        | Unit   | Definition                                                               |
-|:-----------------------|:-----|:-----------------------------------------------------------------------|
-| `SQ_INSTS`                | Instr | Number of instructions issued.                                              |
-| `SQ_INSTS_VALU`           | Instr | Number of Vector Arithmetic Logic Unit (VALU) instructions including MFMA issued.                         |
-| `SQ_INSTS_VALU_ADD_F16`   | Instr | Number of VALU Half Precision Floating Point (F16) ADD/SUB instructions issued.                            |
-| `SQ_INSTS_VALU_MUL_F16`   | Instr | Number of VALU F16 Multiply instructions issued.                   |
-| `SQ_INSTS_VALU_FMA_F16`   | Instr | Number of VALU F16 Fused Multiply Add (FMA)/ Multiply Add (MAD) instructions issued.                   |
-| `SQ_INSTS_VALU_TRANS_F16` | Instr | Number of VALU F16 Transcendental instructions issued.                   |
-| `SQ_INSTS_VALU_ADD_F32`   | Instr | Number of VALU Full Precision Floating Point (F32) ADD/SUB instructions issued.                 |
-| `SQ_INSTS_VALU_MUL_F32`   | Instr | Number of VALU F32 Multiply instructions issued.                    |
-| `SQ_INSTS_VALU_FMA_F32`   | Instr | Number of VALU F32 FMA/MAD instructions issued.                   |
-| `SQ_INSTS_VALU_TRANS_F32` | Instr | Number of VALU F32 Transcendental instructions issued.                    |
-| `SQ_INSTS_VALU_ADD_F64`   | Instr | Number of VALU F64 ADD/SUB instructions issued.                |
-| `SQ_INSTS_VALU_MUL_F64`   | Instr | Number of VALU F64 Multiply instructions issued.                    |
-| `SQ_INSTS_VALU_FMA_F64`   | Instr | Number of VALU F64 FMA/MAD instructions issued.                   |
-| `SQ_INSTS_VALU_TRANS_F64` | Instr | Number of VALU F64 Transcendental instructions issued.                 |
-| `SQ_INSTS_VALU_INT32`     | Instr | Number of VALU 32-bit integer instructions (signed or unsigned) issued.        |
-| `SQ_INSTS_VALU_INT64`     | Instr | Number of VALU 64-bit integer instructions (signed or unsigned) issued.       |
-| `SQ_INSTS_VALU_CVT`       | Instr | Number of VALU Conversion instructions issued.                   |
-| `SQ_INSTS_VALU_MFMA_I8`   | Instr | Number of 8-bit Integer MFMA instructions issued.               |
-| `SQ_INSTS_VALU_MFMA_F16`  | Instr | Number of F16 MFMA instructions issued.                                   |
-| `SQ_INSTS_VALU_MFMA_BF16` | Instr | Number of Brain Floating Point - 16 (BF16) MFMA instructions issued.                                  |
-| `SQ_INSTS_VALU_MFMA_F32`  | Instr | Number of F32 MFMA instructions issued.                                    |
-| `SQ_INSTS_VALU_MFMA_F64`  | Instr | Number of F64 MFMA instructions issued.                               |
-| `SQ_INSTS_MFMA`           | Instr | Number of MFMA instructions issued.                                  |
-| `SQ_INSTS_VMEM_WR`        | Instr | Number of Vector Memory (VMEM) Write instructions (including FLAT) issued.                                  |
-| `SQ_INSTS_VMEM_RD`        | Instr | Number of VMEM Read instructions (including FLAT) issued.  |
-| `SQ_INSTS_VMEM`           | Instr | Number of VMEM instructions issued, including both FLAT and Buffer instructions. |
-| `SQ_INSTS_SALU`           | Instr | Number of SALU instructions issued.                                        |
-| `SQ_INSTS_SMEM`           | Instr | Number of Scalar Memory (SMEM) instructions issued.                                       |
-| `SQ_INSTS_SMEM_NORM`      | Instr | Number of SMEM instructions normalized to match `smem_level` issued. |
-| `SQ_INSTS_FLAT`           | Instr | Number of FLAT instructions issued.                                     |
-| `SQ_INSTS_FLAT_LDS_ONLY`  | Instr | Number of FLAT instructions that read/write only from/to LDS issued. Works only if `EARLY_TA_DONE` is enabled.       |
-| `SQ_INSTS_LDS`            | Instr | Number of Local Data Share (LDS) instructions issued (including FLAT).                                         |
-| `SQ_INSTS_GDS`            | Instr | Number of Global Data Share (GDS) instructions issued.                                         |
-| `SQ_INSTS_EXP_GDS`        | Instr | Number of EXP and GDS instructions excluding skipped export instructions issued.  |
-| `SQ_INSTS_BRANCH`         | Instr | Number of Branch instructions issued.                                     |
-| `SQ_INSTS_SENDMSG`        | Instr | Number of `SENDMSG` instructions including `s_endpgm` issued.                 |
-| `SQ_INSTS_VSKIPPED*`    | Instr | Number of vector instructions skipped.                                 |
-
-#### MFMA operation counters
-
-| Hardware Counter             | Unit  | Definition                                      |
-|:----------------------------|:-----|:----------------------------------------------|
-| `SQ_INSTS_VALU_MFMA_MOPS_I8`   | IOP   | Number of 8-bit integer MFMA ops in the unit of 512 |
-| `SQ_INSTS_VALU_MFMA_MOPS_F16`  | FLOP  | Number of F16 floating MFMA ops in the unit of 512  |
-| `SQ_INSTS_VALU_MFMA_MOPS_BF16` | FLOP  | Number of BF16 floating MFMA ops in the unit of 512 |
-| `SQ_INSTS_VALU_MFMA_MOPS_F32`  | FLOP  | Number of F32 floating MFMA ops in the unit of 512  |
-| `SQ_INSTS_VALU_MFMA_MOPS_F64`  | FLOP  | Number of F64 floating MFMA ops in the unit of 512  |
-
-#### Level counters
-
-:::{note}
-All level counters must be followed by `SQ_ACCUM_PREV_HIRES` counter to measure average latency.
-:::
-
-| Hardware Counter    | Unit  | Definition                             |
-|:-------------------|:-----|:-------------------------------------|
-| `SQ_ACCUM_PREV`       | Count | Accumulated counter sample value where accumulation takes place once every four cycles. |
-| `SQ_ACCUM_PREV_HIRES` | Count | Accumulated counter sample value where accumulation takes place once every cycle. |
-| `SQ_LEVEL_WAVES`      | Waves | Number of inflight waves. To calculate the wave latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_WAVES`.           |
-| `SQ_INST_LEVEL_VMEM` | Instr | Number of inflight VMEM (including FLAT) instructions. To calculate the VMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_VMEM`.   |
-| `SQ_INST_LEVEL_SMEM` | Instr | Number of inflight SMEM instructions. To calculate the SMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_SMEM_NORM`.    |
-| `SQ_INST_LEVEL_LDS`  | Instr | Number of inflight LDS (including FLAT) instructions. To calculate the LDS latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_LDS`.  |
-| `SQ_IFETCH_LEVEL`     | Instr | Number of inflight instruction fetch requests from the cache. To calculate the instruction fetch latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_IFETCH`. |
-
-#### Wavefront counters
-
-| Hardware Counter     | Unit  | Definition                                                        |
-|:--------------------|:-----|:----------------------------------------------------------------|
-| `SQ_WAVES`             | Waves | Number of wavefronts dispatched to Sequencers (SQs), including both new and restored wavefronts  |
-| `SQ_WAVES_SAVED*`    | Waves | Number of context-saved waves                  |
-| `SQ_WAVES_RESTORED*` | Waves | Number of context-restored waves sent to SQs                  |
-| `SQ_WAVES_EQ_64`       | Waves | Number of wavefronts with exactly 64 active threads sent to SQs    |
-| `SQ_WAVES_LT_64`       | Waves | Number of wavefronts with fewer than 64 active threads sent to SQs  |
-| `SQ_WAVES_LT_48`       | Waves | Number of wavefronts with fewer than 48 active threads sent to SQs  |
-| `SQ_WAVES_LT_32`       | Waves | Number of wavefronts with fewer than 32 active threads sent to SQs  |
-| `SQ_WAVES_LT_16`       | Waves | Number of wavefronts with fewer than 16 active threads sent to SQs  |
-
-#### Wavefront cycle counters
-
-| Hardware Counter         | Unit    | Definition                                                            |
-|:------------------------|:-------|:--------------------------------------------------------------------|
-| `SQ_CYCLES`                | Cycles  | Clock cycles.  |
-| `SQ_BUSY_CYCLES`           | Cycles  | Number of cycles while SQ reports it to be busy.                       |
-| `SQ_BUSY_CU_CYCLES`        | Qcycles | Number of quad-cycles each CU is busy.                                  |
-| `SQ_VALU_MFMA_BUSY_CYCLES` | Cycles  | Number of cycles the MFMA ALU is busy.                                 |
-| `SQ_WAVE_CYCLES`           | Qcycles | Number of quad-cycles spent by waves in the CUs.                       |
-| `SQ_WAIT_ANY`              | Qcycles | Number of quad-cycles spent waiting for anything.                    |
-| `SQ_WAIT_INST_ANY`         | Qcycles | Number of quad-cycles spent waiting for any instruction to be issued.         |
-| `SQ_ACTIVE_INST_ANY`       | Qcycles | Number of quad-cycles spent by each wave to work on an instruction.   |
-| `SQ_ACTIVE_INST_VMEM`      | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VMEM instruction.  |
-| `SQ_ACTIVE_INST_LDS`       | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an LDS instruction. |
-| `SQ_ACTIVE_INST_VALU`      | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VALU instruction.  |
-| `SQ_ACTIVE_INST_SCA`       | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a SALU or SMEM instruction.  |
-| `SQ_ACTIVE_INST_EXP_GDS`   | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an EXPORT or GDS instruction.  |
-| `SQ_ACTIVE_INST_MISC`      | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a BRANCH or `SENDMSG` instruction.  |
-| `SQ_ACTIVE_INST_FLAT`      | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a FLAT instruction.  |
-| `SQ_INST_CYCLES_VMEM_WR`   | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Write instructions.  |
-| `SQ_INST_CYCLES_VMEM_RD`   | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Read instructions.  |
-| `SQ_INST_CYCLES_SMEM`      | Qcycles | Number of quad-cycles spent to execute scalar memory reads.          |
-| `SQ_INST_CYCLES_SALU`      | Qcycles  | Number of quad-cycles spent to execute non-memory read scalar operations.    |
-| `SQ_THREAD_CYCLES_VALU`    | Cycles  | Number of thread-cycles spent to execute VALU operations. This is similar to `INST_CYCLES_VALU` but multiplied by the number of active threads.            |
-| `SQ_WAIT_INST_LDS` | Qcycles | Number of quad-cycles spent waiting for LDS instruction to be issued.  |
-
-#### LDS counters
-
-| Hardware Counter           | Unit   | Definition                                                |
-|:--------------------------|:------|:--------------------------------------------------------|
-| `SQ_LDS_ATOMIC_RETURN`       | Cycles | Number of atomic return cycles in LDS                   |
-| `SQ_LDS_BANK_CONFLICT`       | Cycles | Number of cycles LDS is stalled by bank conflicts     |
-| `SQ_LDS_ADDR_CONFLICT*`    | Cycles | Number of cycles LDS is stalled by address conflicts     |
-| `SQ_LDS_UNALIGNED_STALL*` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
-| `SQ_LDS_MEM_VIOLATIONS*`   | Count  | Number of threads that have a memory violation in the LDS  |
-| `SQ_LDS_IDX_ACTIVE` | Cycles | Number of cycles LDS is used for indexed operations  |
-
-#### Miscellaneous counters
-
-| Hardware Counter           | Unit   | Definition                                                |
-|:--------------------------|:------|:--------------------------------------------------------|
-| `SQ_IFETCH`        | Count   | Number of instruction fetch requests from `L1I` cache, in 32-byte width  |
-| `SQ_ITEMS`         | Threads | Number of valid items per wave                                  |
-
-### L1I and sL1D cache counters
-
-| Hardware Counter             | Unit   | Definition                                                        |
-|:----------------------------|:------|:----------------------------------------------------------------|
-| `SQC_ICACHE_REQ`               | Req    | Number of `L1I` cache requests                                      |
-| `SQC_ICACHE_HITS`              | Count  | Number of `L1I` cache hits                                   |
-| `SQC_ICACHE_MISSES`            | Count  | Number of non-duplicate `L1I` cache misses including uncached requests                   |
-| `SQC_ICACHE_MISSES_DUPLICATE`  | Count  | Number of duplicate `L1I` cache misses whose previous lookup miss on the same cache line is not fulfilled yet |
-| `SQC_DCACHE_REQ`               | Req    | Number of `sL1D` cache requests                                  |
-| `SQC_DCACHE_INPUT_VALID_READYB` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
-| `SQC_DCACHE_HITS`              | Count  | Number of `sL1D` cache hits                                |
-| `SQC_DCACHE_MISSES`            | Count  | Number of non-duplicate `sL1D` cache misses including uncached requests                        |
-| `SQC_DCACHE_MISSES_DUPLICATE`  | Count  | Number of duplicate `sL1D` cache misses                            |
-| `SQC_DCACHE_REQ_READ_1`        | Req    | Number of constant cache read requests in a single DW  |
-| `SQC_DCACHE_REQ_READ_2`        | Req    | Number of constant cache read requests in two DW      |
-| `SQC_DCACHE_REQ_READ_4`        | Req    | Number of constant cache read requests in four DW  |
-| `SQC_DCACHE_REQ_READ_8`        | Req    | Number of constant cache read requests in eight DW     |
-| `SQC_DCACHE_REQ_READ_16`       | Req    | Number of constant cache read requests in 16 DW      |
-| `SQC_DCACHE_ATOMIC*`         | Req    | Number of atomic requests                 |
-| `SQC_TC_REQ`                   | Req    | Number of TC requests that were issued by instruction and constant caches  |
-| `SQC_TC_INST_REQ`              | Req    | Number of instruction requests to the L2 cache            |
-| `SQC_TC_DATA_READ_REQ`         | Req    | Number of data read requests to the L2 cache                   |
-| `SQC_TC_DATA_WRITE_REQ*`     | Req    | Number of data write requests to the L2 cache                    |
-| `SQC_TC_DATA_ATOMIC_REQ*`    | Req    | Number of data atomic requests to the L2 cache              |
-| `SQC_TC_STALL*`              | Cycles | Number of cycles while the valid requests to the L2 cache are stalled |
-
-### Vector L1 cache subsystem
-
-The vector L1 cache subsystem counters are further classified into Texture Addressing Unit (TA), Texture Data Unit (TD), vector L1D cache or Texture Cache per Pipe (TCP), and Texture Cache Arbiter (TCA) counters.
-
-#### TA counters
-
-| Hardware Counter                 | Unit   | Definition                                        |
-|:--------------------------------|:------|:------------------------------------------------|
-| `TA_TA_BUSY[n]`                       | Cycles | TA busy cycles. Value range for n: [0-15]. |
-| `TA_TOTAL_WAVEFRONTS[n]`              | Instr  | Number of wavefronts processed by TA. Value range for n: [0-15].       |
-| `TA_BUFFER_WAVEFRONTS[n]`             | Instr  | Number of buffer wavefronts processed by TA. Value range for n: [0-15].       |
-| `TA_BUFFER_READ_WAVEFRONTS[n]`        | Instr  | Number of buffer read wavefronts processed by TA. Value range for n: [0-15].  |
-| `TA_BUFFER_WRITE_WAVEFRONTS[n]`       | Instr  | Number of buffer write wavefronts processed by TA. Value range for n: [0-15]. |
-| `TA_BUFFER_ATOMIC_WAVEFRONTS[n]`   | Instr  | Number of buffer atomic wavefronts processed by TA. Value range for n: [0-15]. |
-| `TA_BUFFER_TOTAL_CYCLES[n]`           | Cycles | Number of buffer cycles (including read and write) issued to TC. Value range for n: [0-15].  |
-| `TA_BUFFER_COALESCED_READ_CYCLES[n]`  | Cycles | Number of coalesced buffer read cycles issued to TC. Value range for n: [0-15].         |
-| `TA_BUFFER_COALESCED_WRITE_CYCLES[n]` | Cycles | Number of coalesced buffer write cycles issued to TC. Value range for n: [0-15].         |
-| `TA_ADDR_STALLED_BY_TC_CYCLES[n]`     | Cycles | Number of cycles TA address path is stalled by TC. Value range for n: [0-15]. |
-| `TA_DATA_STALLED_BY_TC_CYCLES[n]`            | Cycles | Number of cycles TA data path is stalled by TC. Value range for n: [0-15].       |
-| `TA_ADDR_STALLED_BY_TD_CYCLES[n]`  | Cycles | Number of cycles TA address path is stalled by TD. Value range for n: [0-15].     |
-| `TA_FLAT_WAVEFRONTS[n]`               | Instr  | Number of flat opcode wavefronts processed by TA. Value range for n: [0-15].            |
-| `TA_FLAT_READ_WAVEFRONTS[n]`          | Instr  | Number of flat opcode read wavefronts processed by TA. Value range for n: [0-15].        |
-| `TA_FLAT_WRITE_WAVEFRONTS[n]`         | Instr  | Number of flat opcode write wavefronts processed by TA. Value range for n: [0-15].      |
-| `TA_FLAT_ATOMIC_WAVEFRONTS[n]`        | Instr  | Number of flat opcode atomic wavefronts processed by TA. Value range for n: [0-15].      |
-
-#### TD counters
-
-| Hardware Counter         | Unit  | Definition                                           |
-|:------------------------|:-----|:---------------------------------------------------|
-| `TD_TD_BUSY[n]`               | Cycles | TD busy cycles while it is processing or waiting for data. Value range for n: [0-15].                            |
-| `TD_TC_STALL[n]`              | Cycles | Number of cycles TD is stalled waiting for TC data. Value range for n: [0-15].   |
-| `TD_SPI_STALL[n]`          | Cycles | Number of cycles TD is stalled by SPI. Value range for n: [0-15].      |
-| `TD_LOAD_WAVEFRONT[n]`        | Instr | Number of wavefront instructions (read/write/atomic). Value range for n: [0-15]. |
-| `TD_STORE_WAVEFRONT[n]`       | Instr | Number of write wavefront instructions. Value range for n: [0-15]. |
-| `TD_ATOMIC_WAVEFRONT[n]`      | Instr | Number of atomic wavefront instructions. Value range for n: [0-15]. |
-| `TD_COALESCABLE_WAVEFRONT[n]` | Instr | Number of coalescable wavefronts according to TA. Value range for n: [0-15].     |
-
-#### TCP counters
-
-| Hardware Counter                    | Unit   | Definition                                                  |
-|:-----------------------------------|:------|:----------------------------------------------------------|
-| `TCP_GATE_EN1[n]`                        | Cycles | Number of cycles vL1D interface clocks are turned on. Value range for n: [0-15].    |
-| `TCP_GATE_EN2[n]`                        | Cycles | Number of cycles vL1D core clocks are turned on. Value range for n: [0-15].  |
-| `TCP_TD_TCP_STALL_CYCLES[n]`             | Cycles | Number of cycles TD stalls vL1D. Value range for n: [0-15].                           |
-| `TCP_TCR_TCP_STALL_CYCLES[n]`            | Cycles | Number of cycles TCR stalls vL1D. Value range for n: [0-15].                           |
-| `TCP_READ_TAGCONFLICT_STALL_CYCLES[n]`   | Cycles | Number of cycles tagram conflict stalls on a read. Value range for n: [0-15].          |
-| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]`  | Cycles | Number of cycles tagram conflict stalls on a write. Value range for n: [0-15].         |
-| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on an atomic. Value range for n: [0-15].       |
-| `TCP_PENDING_STALL_CYCLES[n]`            | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 Cache. Value range for n: [0-15]. |
-| `TCP_TCP_TA_DATA_STALL_CYCLES` | Cycles | Number of cycles TCP stalls TA data interface. |
-| `TCP_TA_TCP_STATE_READ[n]`               | Req    | Number of state reads. Value range for n: [0-15].    |
-| `TCP_VOLATILE[n]`                     | Req    | Number of L1 volatile pixels/buffers from TA. Value range for n: [0-15].  |
-| `TCP_TOTAL_ACCESSES[n]`                  | Req    | Number of vL1D accesses. Equals `TCP_PERF_SEL_TOTAL_READ` + `TCP_PERF_SEL_TOTAL_NONREAD`. Value range for n: [0-15].                    |
-| `TCP_TOTAL_READ[n]`                      | Req    | Number of vL1D read accesses. Equals `TCP_PERF_SEL_TOTAL_HIT_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_EVICT_READ`. Value range for n: [0-15].    |
-| `TCP_TOTAL_WRITE[n]`                     | Req    | Number of vL1D write accesses. Equals `TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE` + `TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE`. Value range for n: [0-15].     |
-| `TCP_TOTAL_ATOMIC_WITH_RET[n]`           | Req    | Number of vL1D atomic requests with return. Value range for n: [0-15].       |
-| `TCP_TOTAL_ATOMIC_WITHOUT_RET[n]`        | Req    | Number of vL1D atomic requests without return. Value range for n: [0-15].        |
-| `TCP_TOTAL_WRITEBACK_INVALIDATES[n]`     | Count  | Total number of vL1D writebacks and invalidates. Equals `TCP_PERF_SEL_TOTAL_WBINVL1` + `TCP_PERF_SEL_TOTAL_WBINVL1_VOL` + `TCP_PERF_SEL_CP_TCP_INVALIDATE` + `TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL`. Value range for n: [0-15].       |
-| `TCP_UTCL1_REQUEST[n]`                   | Req    | Number of address translation requests to UTCL1. Value range for n: [0-15].            |
-| `TCP_UTCL1_TRANSLATION_HIT[n]`           | Req    | Number of UTCL1 translation hits. Value range for n: [0-15].     |
-| `TCP_UTCL1_TRANSLATION_MISS[n]`          | Req    | Number of UTCL1 translation misses. Value range for n: [0-15].    |
-| `TCP_UTCL1_PERMISSION_MISS[n]`          | Req    | Number of UTCL1 permission misses. Value range for n: [0-15].       |
-| `TCP_TOTAL_CACHE_ACCESSES[n]`            | Req    | Number of vL1D cache accesses including hits and misses. Value range for n: [0-15].     |
-| `TCP_TCP_LATENCY[n]`                     | Cycles | Accumulated wave access latency to vL1D over all wavefronts. Value range for n: [0-15]. |
-| `TCP_TCC_READ_REQ_LATENCY[n]`            | Cycles | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return. Value range for n: [0-15]. |
-| `TCP_TCC_WRITE_REQ_LATENCY[n]`           | Cycles | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return. Value range for n: [0-15]. |
-| `TCP_TCC_READ_REQ[n]`                    | Req    | Number of read requests to L2 cache. Value range for n: [0-15].      |
-| `TCP_TCC_WRITE_REQ[n]`                   | Req    | Number of write requests to L2 cache. Value range for n: [0-15].                   |
-| `TCP_TCC_ATOMIC_WITH_RET_REQ[n]`         | Req    | Number of atomic requests to L2 cache with return. Value range for n: [0-15].       |
-| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]`      | Req    | Number of atomic requests to L2 cache without return. Value range for n: [0-15].    |
-| `TCP_TCC_NC_READ_REQ[n]`                 | Req    | Number of NC read requests to L2 cache. Value range for n: [0-15].       |
-| `TCP_TCC_UC_READ_REQ[n]`                 | Req    | Number of UC read requests to L2 cache. Value range for n: [0-15].          |
-| `TCP_TCC_CC_READ_REQ[n]`                 | Req    | Number of CC read requests to L2 cache. Value range for n: [0-15].     |
-| `TCP_TCC_RW_READ_REQ[n]`                 | Req    | Number of RW read requests to L2 cache. Value range for n: [0-15].       |
-| `TCP_TCC_NC_WRITE_REQ[n]`                | Req    | Number of NC write requests to L2 cache. Value range for n: [0-15].         |
-| `TCP_TCC_UC_WRITE_REQ[n]`                | Req    | Number of UC write requests to L2 cache. Value range for n: [0-15].         |
-| `TCP_TCC_CC_WRITE_REQ[n]`                | Req    | Number of CC write requests to L2 cache. Value range for n: [0-15].         |
-| `TCP_TCC_RW_WRITE_REQ[n]`                | Req    | Number of RW write requests to L2 cache. Value range for n: [0-15].         |
-| `TCP_TCC_NC_ATOMIC_REQ[n]`               | Req    | Number of NC atomic requests to L2 cache. Value range for n: [0-15].        |
-| `TCP_TCC_UC_ATOMIC_REQ[n]`               | Req    | Number of UC atomic requests to L2 cache. Value range for n: [0-15].      |
-| `TCP_TCC_CC_ATOMIC_REQ[n]`               | Req    | Number of CC atomic requests to L2 cache. Value range for n: [0-15].      |
-| `TCP_TCC_RW_ATOMIC_REQ[n]`               | Req    | Number of RW atomic requests to L2 cache. Value range for n: [0-15].       |
-
-#### TCA counters
-
-| Hardware Counter | Unit   | Definition                                  |
-|:----------------|:------|:------------------------------------------|
-| `TCA_CYCLE[n]`        | Cycles | Number of TCA cycles. Value range for n: [0-31].                               |
-| `TCA_BUSY[n]`         | Cycles | Number of cycles TCA has a pending request. Value range for n: [0-31]. |
-
-### L2 cache access counters
-
-The L2 cache is also known as the Texture Cache per Channel (TCC).
-
-| Hardware Counter                 | Unit   | Definition                                                     |
-|:--------------------------------|:------|:-------------------------------------------------------------|
-| `TCC_CYCLE[n]`                        | Cycles | Number of L2 cache free-running clocks. Value range for n: [0-31].               |
-| `TCC_BUSY[n]`                         | Cycles | Number of L2 cache busy cycles. Value range for n: [0-31].                                        |
-| `TCC_REQ[n]`                          |Req     | Number of L2 cache requests of all types. This is measured at the tag block. This may be more than the number of requests arriving at the TCC, but it is a good indication of the total amount of work that needs to be performed. Value range for n: [0-31].      |
-| `TCC_STREAMING_REQ[n]`             |Req     | Number of L2 cache streaming requests. This is measured at the tag block. Value range for n: [0-31]. |
-| `TCC_NC_REQ[n]`                       |Req     | Number of NC requests. This is measured at the tag block. Value range for n: [0-31].   |
-| `TCC_UC_REQ[n]`                       |Req     | Number of UC requests. This is measured at the tag block. Value range for n: [0-31].   |
-| `TCC_CC_REQ[n]`                       |Req     | Number of CC requests. This is measured at the tag block. Value range for n: [0-31].   |
-| `TCC_RW_REQ[n]`                       |Req     | Number of RW requests. This is measured at the tag block. Value range for n: [0-31].   |
-| `TCC_PROBE[n]`                        |Req     | Number of probe requests. Value range for n: [0-31].  |
-| `TCC_PROBE_ALL[n]`                 |Req     | Number of external probe requests with `EA_TCC_preq_all` == 1. Value range for n: [0-31].    |
-| `TCC_READ[n]`                     |Req     | Number of L2 cache read requests. This includes compressed reads but not metadata reads. Value range for n: [0-31].   |
-| `TCC_WRITE[n]`                    |Req     | Number of L2 cache write requests. Value range for n: [0-31].     |
-| `TCC_ATOMIC[n]`                   |Req     | Number of L2 cache atomic requests of all types. Value range for n: [0-31]. |
-| `TCC_HIT[n]`                          |Req     | Number of L2 cache hits. Value range for n: [0-31].      |
-| `TCC_MISS[n]`                         |Req     | Number of L2 cache misses. Value range for n: [0-31].        |
-| `TCC_WRITEBACK[n]`                    |Req     | Number of lines written back to the main memory, including writebacks of dirty lines and uncached write/atomic requests. Value range for n: [0-31]. |
-| `TCC_EA_WRREQ[n]`                     |Req     | Number of 32-byte and 64-byte transactions going over the `TC_EA_wrreq` interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. Value range for n: [0-31].   |
-| `TCC_EA_WRREQ_64B[n]`                 |Req     | Total number of 64-byte transactions (write or `CMPSWAP`) going over the `TC_EA_wrreq` interface. Value range for n: [0-31].  |
-| `TCC_EA_WR_UNCACHED_32B[n]`           |Req     | Number of 32-byte write/atomic going over the `TC_EA_wrreq` interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. Value range for n: [0-31].|
-| `TCC_EA_WRREQ_STALL[n]`               | Cycles | Number of cycles a write request is stalled. Value range for n: [0-31].                 |
-| `TCC_EA_WRREQ_IO_CREDIT_STALL[n]`  | Cycles | Number of cycles an EA write request is stalled due to the interface running out of IO credits. Value range for n: [0-31].  |
-| `TCC_EA_WRREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of GMI credits. Value range for n: [0-31].  |
-| `TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]`   | Cycles | Number of cycles an EA write request is stalled due to the interface running out of DRAM credits. Value range for n: [0-31]. |
-| `TCC_TOO_MANY_EA_WRREQS_STALL[n]`  | Cycles | Number of cycles the L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests. Value range for n: [0-31]. |
-| `TCC_EA_WRREQ_LEVEL[n]`               | Req    | The accumulated number of EA write requests in flight. This is primarily intended to measure average EA write latency. Average write latency = `TCC_PERF_SEL_EA_WRREQ_LEVEL`/`TCC_PERF_SEL_EA_WRREQ`. Value range for n: [0-31].  |
-| `TCC_EA_ATOMIC[n]`                    | Req    | Number of 32-byte or 64-byte atomic requests going over the `TC_EA_wrreq` interface. Value range for n: [0-31].           |
-| `TCC_EA_ATOMIC_LEVEL[n]`              | Req    | The accumulated number of EA atomic requests in flight. This is primarily intended to measure average EA atomic latency. Average atomic latency = `TCC_PERF_SEL_EA_WRREQ_ATOMIC_LEVEL`/`TCC_PERF_SEL_EA_WRREQ_ATOMIC`. Value range for n: [0-31].  |
-| `TCC_EA_RDREQ[n]`                     | Req    | Number of 32-byte or 64-byte read requests to EA. Value range for n: [0-31].   |
-| `TCC_EA_RDREQ_32B[n]`                 | Req    | Number of 32-byte read requests to EA. Value range for n: [0-31].  |
-| `TCC_EA_RD_UNCACHED_32B[n]`           | Req    | Number of 32-byte EA reads due to uncached traffic. A 64-byte request is counted as 2. Value range for n: [0-31]. |
-| `TCC_EA_RDREQ_IO_CREDIT_STALL[n]`  | Cycles | Number of cycles there is a stall due to the read request interface running out of IO credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
-| `TCC_EA_RDREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of GMI credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
-| `TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]`   | Cycles | Number of cycles there is a stall due to the read request interface running out of DRAM credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
-| `TCC_EA_RDREQ_LEVEL[n]`               | Req    | The accumulated number of EA read requests in flight. This is primarily intended to measure average EA read latency. Average read latency = `TCC_PERF_SEL_EA_RDREQ_LEVEL`/`TCC_PERF_SEL_EA_RDREQ`. Value range for n: [0-31].    |
-| `TCC_EA_RDREQ_DRAM[n]`                | Req    | Number of 32-byte or 64-byte EA read requests to High Bandwidth Memory (HBM). Value range for n: [0-31].   |
-| `TCC_EA_WRREQ_DRAM[n]`                | Req    | Number of 32-byte or 64-byte EA write requests to HBM. Value range for n: [0-31].  |
-| `TCC_TAG_STALL[n]`                    | Cycles | Number of cycles the normal request pipeline in the tag is stalled for any reason. Normally, stalls of this nature are measured at exactly one point in the pipeline. For this counter, however, probes can stall the pipeline at a variety of places, and no single point can reasonably measure the total stalls accurately. Value range for n: [0-31]. |
-| `TCC_NORMAL_WRITEBACK[n]`             | Req    | Number of writebacks due to requests that are not writeback requests. Value range for n: [0-31].    |
-| `TCC_ALL_TC_OP_WB_WRITEBACK[n]`    | Req    | Number of writebacks due to all `TC_OP` writeback requests. Value range for n: [0-31].       |
-| `TCC_NORMAL_EVICT[n]`                 | Req    | Number of evictions due to requests that are not invalidate or probe requests. Value range for n: [0-31].        |
-| `TCC_ALL_TC_OP_INV_EVICT[n]`       | Req    | Number of evictions due to all `TC_OP` invalidate requests. Value range for n: [0-31].           |
-
-## MI200 derived metrics list
-
-| Derived Metric   | Description                                                                            |
-|:----------------|:-------------------------------------------------------------------------------------|
-| `ALUStalledByLDS` | Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready. Reduce this by reducing the LDS bank conflicts or the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
-| `FetchSize` | Total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
-| `FlatLDSInsts`     | Average number of FLAT instructions that read from or write to LDS, executed per work item (affected by flow control). |
-| `FlatVMemInsts`    | Average number of FLAT instructions that read from or write to the video memory, executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
-| `GDSInsts` | Average number of GDS read/write instructions executed per work item (affected by flow control). |
-| `GPUBusy` | Percentage of time GPU is busy. |
-| `L2CacheHit`       | Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
-| `LDSBankConflict`  | Percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
-| `LDSInsts`         | Average number of LDS read/write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS. |
-| `MemUnitBusy` | Percentage of GPU time the memory unit is active. The result includes the stall time (`MemUnitStalled`). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
-| `MemUnitStalled`   | Percentage of GPU time the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
-| `MemWrites32B`     | Total number of effective 32B write transactions to the memory.                      |
-| `SALUBusy`         | Percentage of GPU time scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
-| `SALUInsts` | Average number of scalar ALU instructions executed per work item (affected by flow control). |
-| `SFetchInsts` | Average number of scalar fetch instructions from the video memory executed per work item (affected by flow control). |
-| `TA_ADDR_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA address path is stalled by TC, over all TA instances. |
-| `TA_ADDR_STALLED_BY_TD_CYCLES_sum` | Total number of cycles TA address path is stalled by TD, over all TA instances. |
-| `TA_BUFFER_WAVEFRONTS_sum` | Total number of buffer wavefronts processed by all TA instances. |
-| `TA_BUFFER_READ_WAVEFRONTS_sum` | Total number of buffer read wavefronts processed by all TA instances. |
-| `TA_BUFFER_WRITE_WAVEFRONTS_sum` | Total number of buffer write wavefronts processed by all TA instances. |
-| `TA_BUFFER_ATOMIC_WAVEFRONTS_sum` | Total number of buffer atomic wavefronts processed by all TA instances. |
-| `TA_BUFFER_TOTAL_CYCLES_sum` | Total number of buffer cycles (including read and write) issued to TC by all TA instances. |
-| `TA_BUFFER_COALESCED_READ_CYCLES_sum` | Total number of coalesced buffer read cycles issued to TC by all TA instances. |
-| `TA_BUFFER_COALESCED_WRITE_CYCLES_sum` | Total number of coalesced buffer write cycles issued to TC by all TA instances. |
-| `TA_BUSY_avr` | Average number of busy cycles over all TA instances. |
-| `TA_BUSY_max` | Maximum number of TA busy cycles over all TA instances. |
-| `TA_BUSY_min` | Minimum number of TA busy cycles over all TA instances. |
-| `TA_DATA_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA data path is stalled by TC, over all TA instances. |
-| `TA_FLAT_READ_WAVEFRONTS_sum` | Sum of flat opcode reads processed by all TA instances. |
-| `TA_FLAT_WRITE_WAVEFRONTS_sum` | Sum of flat opcode writes processed by all TA instances. |
-| `TA_FLAT_WAVEFRONTS_sum` | Total number of flat opcode wavefronts processed by all TA instances. |
-| `TA_FLAT_READ_WAVEFRONTS_sum` | Total number of flat opcode read wavefronts processed by all TA instances. |
-| `TA_FLAT_ATOMIC_WAVEFRONTS_sum` | Total number of flat opcode atomic wavefronts processed by all TA instances. |
-| `TA_TA_BUSY_sum` | Total number of TA busy cycles over all TA instances. |
-| `TA_TOTAL_WAVEFRONTS_sum` | Total number of wavefronts processed by all TA instances. |
-| `TCA_BUSY_sum` | Total number of cycles TCA has a pending request, over all TCA instances. |
-| `TCA_CYCLE_sum` | Total number of cycles over all TCA instances. |
-| `TCC_ALL_TC_OP_WB_WRITEBACK_sum` | Total number of writebacks due to all `TC_OP` writeback requests, over all TCC instances. |
-| `TCC_ALL_TC_OP_INV_EVICT_sum` | Total number of evictions due to all `TC_OP` invalidate requests, over all TCC instances. |
-| `TCC_ATOMIC_sum` | Total number of L2 cache atomic requests of all types, over all TCC instances. |
-| `TCC_BUSY_avr` | Average number of L2 cache busy cycles, over all TCC instances. |
-| `TCC_BUSY_sum` | Total number of L2 cache busy cycles, over all TCC instances. |
-| `TCC_CC_REQ_sum` | Total number of CC requests over all TCC instances. |
-| `TCC_CYCLE_sum` | Total number of L2 cache free running clocks, over all TCC instances. |
-| `TCC_EA_WRREQ_sum` | Total number of 32-byte and 64-byte transactions going over the TC_EA_wrreq interface, over all TCC instances. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. |
-| `TCC_EA_WRREQ_64B_sum` | Total number of 64-byte transactions (write or `CMPSWAP`) going over the TC_EA_wrreq interface, over all TCC instances. |
-| `TCC_EA_WR_UNCACHED_32B_sum` | Total Number of 32-byte write/atomic going over the TC_EA_wrreq interface due to uncached traffic, over all TCC instances. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
-| `TCC_EA_WRREQ_STALL_sum` | Total Number of cycles a write request is stalled, over all instances. |
-| `TCC_EA_WRREQ_IO_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of IO credits, over all instances. |
-| `TCC_EA_WRREQ_GMI_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of GMI credits, over all instances. |
-| `TCC_EA_WRREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of DRAM credits, over all instances. |
-| `TCC_EA_WRREQ_LEVEL_sum` | Total number of EA write requests in flight over all TCC instances. |
-| `TCC_EA_RDREQ_LEVEL_sum` | Total number of EA read requests in flight over all TCC instances. |
-| `TCC_EA_ATOMIC_sum` | Total Number of 32-byte or 64-byte atomic requests going over the TC_EA_wrreq interface, over all TCC instances. |
-| `TCC_EA_ATOMIC_LEVEL_sum` | Total number of EA atomic requests in flight, over all TCC instances. |
-| `TCC_EA_RDREQ_sum` | Total number of 32-byte or 64-byte read requests to EA, over all TCC instances. |
-| `TCC_EA_RDREQ_32B_sum` | Total number of 32-byte read requests to EA, over all TCC instances. |
-| `TCC_EA_RD_UNCACHED_32B_sum` | Total number of 32-byte EA reads due to uncached traffic, over all TCC instances. |
-| `TCC_EA_RDREQ_IO_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of IO credits, over all TCC instances. |
-| `TCC_EA_RDREQ_GMI_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of GMI credits, over all TCC instances. |
-| `TCC_EA_RDREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of DRAM credits, over all TCC instances. |
-| `TCC_EA_RDREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA read requests to HBM, over all TCC instances. |
-| `TCC_EA_WRREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA write requests to HBM, over all TCC instances. |
-| `TCC_HIT_sum` | Total number of L2 cache hits over all TCC instances. |
-| `TCC_MISS_sum` | Total number of L2 cache misses over all TCC instances. |
-| `TCC_NC_REQ_sum` | Total number of NC requests over all TCC instances. |
-| `TCC_NORMAL_WRITEBACK_sum` | Total number of writebacks due to requests that are not writeback requests, over all TCC instances. |
-| `TCC_NORMAL_EVICT_sum` | Total number of evictions due to requests that are not invalidate or probe requests, over all TCC instances. |
-| `TCC_PROBE_sum` | Total number of probe requests over all TCC instances. |
-| `TCC_PROBE_ALL_sum` | Total number of external probe requests with EA_TCC_preq_all== 1, over all TCC instances. |
-| `TCC_READ_sum` | Total number of L2 cache read requests (including compressed reads but not metadata reads) over all TCC instances. |
-| `TCC_REQ_sum` | Total number of all types of L2 cache requests over all TCC instances. |
-| `TCC_RW_REQ_sum` | Total number of RW requests over all TCC instances. |
-| `TCC_STREAMING_REQ_sum` | Total number of L2 cache streaming requests over all TCC instances. |
-| `TCC_TAG_STALL_sum` | Total number of cycles the normal request pipeline in the tag is stalled for any reason, over all TCC instances. |
-| `TCC_TOO_MANY_EA_WRREQS_STALL_sum` | Total number of cycles L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests, over all TCC instances. |
-| `TCC_UC_REQ_sum` | Total number of UC requests over all TCC instances. |
-| `TCC_WRITE_sum` | Total number of L2 cache write requests over all TCC instances. |
-| `TCC_WRITEBACK_sum` | Total number of lines written back to the main memory including writebacks of dirty lines and uncached write/atomic requests, over all TCC instances. |
-| `TCC_WRREQ_STALL_max` | Maximum number of cycles a write request is stalled, over all TCC instances. |
-| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on an atomic, over all TCP instances. |
-| `TCP_GATE_EN1_sum` | Total number of cycles vL1D interface clocks are turned on, over all TCP instances. |
-| `TCP_GATE_EN2_sum` | Total number of cycles vL1D core clocks are turned on, over all TCP instances. |
-| `TCP_PENDING_STALL_CYCLES_sum` | Total number of cycles vL1D cache is stalled due to data pending from L2 Cache, over all TCP instances. |
-| `TCP_READ_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a read, over all TCP instances. |
-| `TCP_TA_TCP_STATE_READ_sum` | Total number of state reads by all TCP instances. |
-| `TCP_TCC_ATOMIC_WITH_RET_REQ_sum` | Total number of atomic requests to L2 cache with return, over all TCP instances. |
-| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum` | Total number of atomic requests to L2 cache without return, over all TCP instances. |
-| `TCP_TCC_CC_READ_REQ_sum` | Total number of CC read requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_CC_WRITE_REQ_sum` | Total number of CC write requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_CC_ATOMIC_REQ_sum` | Total number of CC atomic requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_NC_READ_REQ_sum` | Total number of NC read requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_NC_WRITE_REQ_sum` | Total number of NC write requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_NC_ATOMIC_REQ_sum` | Total number of NC atomic requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_READ_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return for all TCP instances. |
-| `TCP_TCC_READ_REQ_sum` | Total number of read requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_RW_READ_REQ_sum` | Total number of RW read requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_RW_WRITE_REQ_sum` | Total number of RW write requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_RW_ATOMIC_REQ_sum` | Total number of RW atomic requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_UC_READ_REQ_sum` | Total number of UC read requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_UC_WRITE_REQ_sum` | Total number of UC write requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_UC_ATOMIC_REQ_sum` | Total number of UC atomic requests to L2 cache, over all TCP instances. |
-| `TCP_TCC_WRITE_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return for all TCP instances. |
-| `TCP_TCC_WRITE_REQ_sum` | Total number of write requests to L2 cache, over all TCP instances. |
-| `TCP_TCP_LATENCY_sum` | Total wave access latency to vL1D over all wavefronts for all TCP instances. |
-| `TCP_TCR_TCP_STALL_CYCLES_sum` | Total number of cycles TCR stalls vL1D, over all TCP instances. |
-| `TCP_TD_TCP_STALL_CYCLES_sum` | Total number of cycles TD stalls vL1D, over all TCP instances. |
-| `TCP_TOTAL_ACCESSES_sum` | Total number of vL1D accesses, over all TCP instances. |
-| `TCP_TOTAL_READ_sum` | Total number of vL1D read accesses, over all TCP instances. |
-| `TCP_TOTAL_WRITE_sum` | Total number of vL1D write accesses, over all TCP instances. |
-| `TCP_TOTAL_ATOMIC_WITH_RET_sum` | Total number of vL1D atomic requests with return, over all TCP instances. |
-| `TCP_TOTAL_ATOMIC_WITHOUT_RET_sum` | Total number of vL1D atomic requests without return, over all TCP instances. |
-| `TCP_TOTAL_CACHE_ACCESSES_sum` | Total number of vL1D cache accesses (including hits and misses) by all TCP instances. |
-| `TCP_TOTAL_WRITEBACK_INVALIDATES_sum` | Total number of vL1D writebacks and invalidates, over all TCP instances. |
-| `TCP_UTCL1_PERMISSION_MISS_sum` | Total number of UTCL1 permission misses by all TCP instances. |
-| `TCP_UTCL1_REQUEST_sum` | Total number of address translation requests to UTCL1 by all TCP instances. |
-| `TCP_UTCL1_TRANSLATION_MISS_sum` | Total number of UTCL1 translation misses by all TCP instances. |
-| `TCP_UTCL1_TRANSLATION_HIT_sum` | Total number of UTCL1 translation hits by all TCP instances. |
-| `TCP_VOLATILE_sum` | Total number of L1 volatile pixels/buffers from TA, over all TCP instances. |
-| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a write, over all TCP instances. |
-| `TD_ATOMIC_WAVEFRONT_sum` | Total number of atomic wavefront instructions, over all TD instances. |
-| `TD_COALESCABLE_WAVEFRONT_sum` | Total number of coalescable wavefronts according to TA, over all TD instances. |
-| `TD_LOAD_WAVEFRONT_sum` | Total number of wavefront instructions (read/write/atomic), over all TD instances. |
-| `TD_SPI_STALL_sum` | Total number of cycles TD is stalled by SPI, over all TD instances. |
-| `TD_STORE_WAVEFRONT_sum` | Total number of write wavefront instructions, over all TD instances. |
-| `TD_TC_STALL_sum` | Total number of cycles TD is stalled waiting for TC data, over all TD instances. |
-| `TD_TD_BUSY_sum` | Total number of TD busy cycles while it is processing or waiting for data, over all TD instances. |
-| `VALUBusy`         | Percentage of GPU time vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
-| `VALUInsts` | Average number of vector ALU instructions executed per work item (affected by flow control). |
-| `VALUUtilization`  | Percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
-| `VFetchInsts`      | Average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.               |
-| `VWriteInsts`      | Average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.                 |
-| `Wavefronts` | Total wavefronts. |
-| `WRITE_REQ_32B` | Total number of 32-byte effective memory writes. |
-| `WriteSize` | Total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
-| `WriteUnitStalled` | Percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad).      |
-
-## Abbreviations
-
-| Abbreviation | Meaning                                                                           |
-|:------------|:--------------------------------------------------------------------------------|
-| `ALU`          | Arithmetic Logic Unit                                                             |
-| `Arb`          | Arbiter                                                                           |
-| `BF16`         | Brain Floating Point - 16 bits                                                    |
-| `CC`           | Coherently Cached                                                                 |
-| `CP`           | Command Processor                                                                 |
-| `CPC`          | Command Processor - Compute                                                       |
-| `CPF`          | Command Processor - Fetcher                                                       |
-| `CS`           | Compute Shader                                                                    |
-| `CSC`          | Compute Shader Controller                                                         |
-| `CSn`          | Compute Shader, the n-th pipe                                                     |
-| `CU`           | Compute Unit                                                                      |
-| `DW`           | 32-bit Data Word, DWORD                                                           |
-| `EA`           | Efficiency Arbiter                                                                |
-| `F16`          | Half Precision Floating Point                                                     |
-| `F32`          | Full Precision Floating Point                                                     |
-| `FLAT`         | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>.   Global Memory<br>.   Scratch ("private")<br>.   LDS ("shared")<br>.   Invalid - MEM_VIOL TrapStatus |
-| `FMA`          | Fused Multiply Add                                                                |
-| `GDS`          | Global Data Share                                                                 |
-| `GRBM`         | Graphics Register Bus Manager                                                     |
-| `HBM`          | High Bandwidth Memory                                                             |
-| `Instr`        | Instructions                                                                      |
-| `IOP`          | Integer Operation                                                                 |
-| `L2`           | Level-2 Cache                                                                     |
-| `LDS`          | Local Data Share                                                                  |
-| `ME1`          | Micro Engine, running packet processing firmware on CPC                           |
-| `MFMA`         | Matrix Fused Multiply Add                                                         |
-| `NC`           | Noncoherently Cached                                                              |
-| `RW`           | Coherently Cached with Write                                                      |
-| `SALU`         | Scalar ALU                                                                        |
-| `SGPR`         | Scalar General Purpose Register                                                   |
-| `SIMD`         | Single Instruction Multiple Data                                                  |
-| `sL1D`         | Scalar Level-1 Data Cache                                                         |
-| `SMEM`         | Scalar Memory                                                                     |
-| `SPI`          | Shader Processor Input                                                            |
-| `SQ`           | Sequencer                                                                         |
-| `TA`           | Texture Addressing Unit                                                           |
-| `TC`           | Texture Cache                                                                     |
-| `TCA`          | Texture Cache Arbiter                                                             |
-| `TCC`          | Texture Cache per Channel, known as L2 Cache                                      |
-| `TCIU`         | Texture Cache Interface Unit (interface between CP and the memory system) |
-| `TCP`          | Texture Cache per Pipe, known as vector L1 Cache                                  |
-| `TCR`          | Texture Cache Router                                                              |
-| `TD`           | Texture Data Unit                                                                 |
-| `UC`           | Uncached                                                                          |
-| `UTCL1`        | Unified Translation Cache - Level 1                                               |
-| `UTCL2`        | Unified Translation Cache - Level 2                                               |
-| `VALU`         | Vector ALU                                                                        |
-| `VGPR`         | Vector General Purpose Register                                                   |
-| `vL1D`         | Vector Level -1 Data Cache                                                        |
-| `VMEM`         | Vector Memory                                                                     |
--- a/docs/conceptual/gpu-arch/mi250.md
+++ b/docs/conceptual/gpu-arch/mi250.md
@@ -33,8 +33,8 @@ Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further
 subdivided into four SIMD units that process SIMD instructions of 16 data
 elements per instruction (for the FP64 data type). This enables the CU to
 process 64 work items (a so-called “wavefront”) at a peak clock frequency of 1.7
-GHz. Therefore, the theoretical maximum FP64 peak performance per GCD is 22.6
-TFLOPS for vector instructions. This equates to 45.3 TFLOPS for vector instructions for both GCDs together. The MI250 compute units also provide specialized
+GHz. Therefore, the theoretical maximum FP64 peak performance per GCD is 45.3
+TFLOPS for vector instructions. The MI250 compute units also provide specialized
 execution units (also called matrix cores), which are geared toward executing
 matrix operations like matrix-matrix multiplications. For FP64, the peak
 performance of these units amounts to 90.5 TFLOPS.
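
The peak figures in this hunk can be checked from the CU count, SIMD width, and clock quoted in its context lines. The following is a minimal sketch, not an official derivation: it assumes each FP64 lane retires one fused multiply-add (counted as 2 FLOPs) per cycle, and the constant and function names are illustrative only. Under that assumption the product comes to about 22.6 TFLOPS for one GCD and about 45.3 TFLOPS for two GCDs, which is how the two numbers in the changed lines relate to each other.

```python
# Back-of-the-envelope check of the FP64 vector peak quoted in the hunk above.
ACTIVE_CUS_PER_GCD = 104       # from the text
SIMDS_PER_CU = 4               # from the text
FP64_LANES_PER_SIMD = 16       # from the text (16 data elements per instruction)
PEAK_CLOCK_GHZ = 1.7           # from the text
FLOPS_PER_LANE_PER_CYCLE = 2   # ASSUMPTION: one FMA per lane per cycle = 2 FLOPs


def peak_fp64_vector_tflops(gcds: int = 1) -> float:
    """Return the theoretical FP64 vector peak in TFLOPS for `gcds` GCDs."""
    lanes = ACTIVE_CUS_PER_GCD * SIMDS_PER_CU * FP64_LANES_PER_SIMD
    gflops = lanes * FLOPS_PER_LANE_PER_CYCLE * PEAK_CLOCK_GHZ
    return gcds * gflops / 1000.0


if __name__ == "__main__":
    # 4 SIMDs x 16 lanes = 64 work items per CU, matching the wavefront size.
    print(SIMDS_PER_CU * FP64_LANES_PER_SIMD)
    print(round(peak_fp64_vector_tflops(1), 1))  # one GCD
    print(round(peak_fp64_vector_tflops(2), 1))  # both GCDs of an MI250
```

Keeping the FLOPs-per-lane factor as a named constant makes the assumption easy to swap out if a different instruction mix is being modelled.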
--- a/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst
+++ b/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst
@@ -0,0 +1,758 @@
+.. meta::
+  :description: MI300 and MI200 series performance counters and metrics
+  :keywords: MI300, MI200, performance counters, command processor counters
+
+***************************************************************************************************
+MI300 and MI200 series performance counters and metrics
+***************************************************************************************************
+
+This document lists and describes the hardware performance counters and derived metrics available
+for the AMD Instinct™ MI300 and MI200 GPUs. You can also access this information using the
+:doc:`ROCProfiler tool <rocprofiler:rocprofv1>`.
+
+MI300 and MI200 series performance counters
+===============================================================
+
+MI300 and MI200 series performance counters include the following categories:
+
+* :ref:`command-processor-counters`
+* :ref:`graphics-register-bus-manager-counters`
+* :ref:`spi-counters`
+* :ref:`compute-unit-counters`
+* :ref:`l1i-and-sl1d-cache-counters`
+* :ref:`vector-l1-cache-subsystem-counters`
+* :ref:`l2-cache-access-counters`
+
+The following sections provide additional details for each category.
+
+.. note::
+
+  Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with
+  an asterisk (*) require further evaluation.
+
+.. _command-processor-counters:
+
+Command processor counters
+---------------------------------------------------------------------------------------------------------------
+
+Command processor counters are further classified into command processor-fetcher and command
+processor-compute.
+
+Command processor-fetcher counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``CPF_CMP_UTCL1_STALL_ON_TRANSLATION``", "Cycles", "Number of cycles one of the compute unified translation caches (L1) is stalled waiting on translation"
+  "``CPF_CPF_STAT_BUSY``", "Cycles", "Number of cycles command processor-fetcher is busy"
+  "``CPF_CPF_STAT_IDLE``", "Cycles", "Number of cycles command processor-fetcher is idle"
+  "``CPF_CPF_STAT_STALL``", "Cycles", "Number of cycles command processor-fetcher is stalled"
+  "``CPF_CPF_TCIU_BUSY``", "Cycles", "Number of cycles command processor-fetcher texture cache interface unit interface is busy"
+  "``CPF_CPF_TCIU_IDLE``", "Cycles", "Number of cycles command processor-fetcher texture cache interface unit interface is idle"
+  "``CPF_CPF_TCIU_STALL``", "Cycles", "Number of cycles command processor-fetcher texture cache interface unit interface is stalled waiting on free tags"
+
+The texture cache interface unit is the interface between the command processor and the memory
+system.
+
+Command processor-compute counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``CPC_ME1_BUSY_FOR_PACKET_DECODE``", "Cycles", "Number of cycles command processor-compute micro engine is busy decoding packets"
+  "``CPC_UTCL1_STALL_ON_TRANSLATION``", "Cycles", "Number of cycles one of the unified translation caches (L1) is stalled waiting on translation"
+  "``CPC_CPC_STAT_BUSY``", "Cycles", "Number of cycles command processor-compute is busy"
+  "``CPC_CPC_STAT_IDLE``", "Cycles", "Number of cycles command processor-compute is idle"
+  "``CPC_CPC_STAT_STALL``", "Cycles", "Number of cycles command processor-compute is stalled"
+  "``CPC_CPC_TCIU_BUSY``", "Cycles", "Number of cycles command processor-compute texture cache interface unit interface is busy"
+  "``CPC_CPC_TCIU_IDLE``", "Cycles", "Number of cycles command processor-compute texture cache interface unit interface is idle"
+  "``CPC_CPC_UTCL2IU_BUSY``", "Cycles", "Number of cycles command processor-compute unified translation cache (L2) interface is busy"
+  "``CPC_CPC_UTCL2IU_IDLE``", "Cycles", "Number of cycles command processor-compute unified translation cache (L2) interface is idle"
+  "``CPC_CPC_UTCL2IU_STALL``", "Cycles", "Number of cycles command processor-compute unified translation cache (L2) interface is stalled"
+  "``CPC_ME1_DC0_SPI_BUSY``", "Cycles", "Number of cycles command processor-compute micro engine processor is busy"
+
+The micro engine runs packet-processing firmware on the command processor-compute.
+
+.. _graphics-register-bus-manager-counters:
+
+Graphics register bus manager counters
+---------------------------------------------------------------------------------------------------------------
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``GRBM_COUNT``", "Cycles", "Number of free-running GPU cycles"
+  "``GRBM_GUI_ACTIVE``", "Cycles", "Number of GPU active cycles"
+  "``GRBM_CP_BUSY``", "Cycles", "Number of cycles any of the command processor blocks are busy"
+  "``GRBM_SPI_BUSY``", "Cycles", "Number of cycles any of the shader processor inputs are busy in the shader engines"
+  "``GRBM_TA_BUSY``", "Cycles", "Number of cycles any of the texture addressing units are busy in the shader engines"
+  "``GRBM_TC_BUSY``", "Cycles", "Number of cycles any of the texture cache blocks are busy"
+  "``GRBM_CPC_BUSY``", "Cycles", "Number of cycles the command processor-compute is busy"
+  "``GRBM_CPF_BUSY``", "Cycles", "Number of cycles the command processor-fetcher is busy"
+  "``GRBM_UTCL2_BUSY``", "Cycles", "Number of cycles the unified translation cache (Level 2 [L2]) block is busy"
+  "``GRBM_EA_BUSY``", "Cycles", "Number of cycles the efficiency arbiter block is busy"
+
+Texture cache blocks include:
+
+* Texture cache arbiter
+* Texture cache per pipe, also known as vector Level 1 (L1) cache
+* Texture cache per channel, also known as L2 cache
+* Texture cache interface
+
+.. _spi-counters:
+
+Shader processor input counters
+---------------------------------------------------------------------------------------------------------------
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SPI_CSN_BUSY``", "Cycles", "Number of cycles with outstanding waves"
+  "``SPI_CSN_WINDOW_VALID``", "Cycles", "Number of cycles enabled by ``perfcounter_start`` event"
+  "``SPI_CSN_NUM_THREADGROUPS``", "Workgroups", "Number of dispatched workgroups"
+  "``SPI_CSN_WAVE``", "Wavefronts", "Number of dispatched wavefronts"
+  "``SPI_RA_REQ_NO_ALLOC``", "Cycles", "Number of arbiter cycles with requests but no allocation"
+  "``SPI_RA_REQ_NO_ALLOC_CSN``", "Cycles", "Number of arbiter cycles with compute shader (n\ :sup:`th` pipe) requests but no compute shader (n\ :sup:`th` pipe) allocation"
+  "``SPI_RA_RES_STALL_CSN``", "Cycles", "Number of arbiter stall cycles due to shortage of compute shader (n\ :sup:`th` pipe) pipeline slots"
+  "``SPI_RA_TMP_STALL_CSN``", "Cycles", "Number of stall cycles due to shortage of temp space"
+  "``SPI_RA_WAVE_SIMD_FULL_CSN``", "SIMD-cycles", "Accumulated number of single instruction, multiple data (SIMD) units per cycle affected by shortage of wave slots for compute shader (n\ :sup:`th` pipe) wave dispatch"
+  "``SPI_RA_VGPR_SIMD_FULL_CSN``", "SIMD-cycles", "Accumulated number of SIMDs per cycle affected by shortage of vector general-purpose register (VGPR) slots for compute shader (n\ :sup:`th` pipe) wave dispatch"
+  "``SPI_RA_SGPR_SIMD_FULL_CSN``", "SIMD-cycles", "Accumulated number of SIMDs per cycle affected by shortage of scalar general-purpose register (SGPR) slots for compute shader (n\ :sup:`th` pipe) wave dispatch"
+  "``SPI_RA_LDS_CU_FULL_CSN``", "CU", "Number of compute units affected by shortage of local data share (LDS) space for compute shader (n\ :sup:`th` pipe) wave dispatch"
+  "``SPI_RA_BAR_CU_FULL_CSN``", "CU", "Number of compute units with compute shader (n\ :sup:`th` pipe) waves waiting at a BARRIER"
+  "``SPI_RA_BULKY_CU_FULL_CSN``", "CU", "Number of compute units with compute shader (n\ :sup:`th` pipe) waves waiting for BULKY resource"
+  "``SPI_RA_TGLIM_CU_FULL_CSN``", "Cycles", "Number of compute shader (n\ :sup:`th` pipe) wave stall cycles due to restriction of ``tg_limit`` for thread group size"
+  "``SPI_RA_WVLIM_STALL_CSN``", "Cycles", "Number of cycles compute shader (n\ :sup:`th` pipe) is stalled due to ``WAVE_LIMIT``"
+  "``SPI_VWC_CSC_WR``", "Qcycles", "Number of quad-cycles taken to initialize VGPRs when launching waves"
+  "``SPI_SWC_CSC_WR``", "Qcycles", "Number of quad-cycles taken to initialize SGPRs when launching waves"
+
+.. _compute-unit-counters:
+
+Compute unit counters
+---------------------------------------------------------------------------------------------------------------
+
+The compute unit counters are further classified into instruction mix, matrix fused multiply-add (FMA)
+operation counters, level counters, wavefront counters, wavefront cycle counters, and LDS counters.
+
+Instruction mix
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_INSTS``", "Instr", "Number of instructions issued"
+  "``SQ_INSTS_VALU``", "Instr", "Number of vector arithmetic logic unit (VALU) instructions including matrix FMA issued"
+  "``SQ_INSTS_VALU_ADD_F16``", "Instr", "Number of VALU half-precision floating-point (F16) ``ADD`` or ``SUB`` instructions issued"
+  "``SQ_INSTS_VALU_MUL_F16``", "Instr", "Number of VALU F16 Multiply instructions issued"
+  "``SQ_INSTS_VALU_FMA_F16``", "Instr", "Number of VALU F16 FMA or multiply-add instructions issued"
+  "``SQ_INSTS_VALU_TRANS_F16``", "Instr", "Number of VALU F16 Transcendental instructions issued"
+  "``SQ_INSTS_VALU_ADD_F32``", "Instr", "Number of VALU full-precision floating-point (F32) ``ADD`` or ``SUB`` instructions issued"
+  "``SQ_INSTS_VALU_MUL_F32``", "Instr", "Number of VALU F32 Multiply instructions issued"
+  "``SQ_INSTS_VALU_FMA_F32``", "Instr", "Number of VALU F32 FMA or multiply-add instructions issued"
+  "``SQ_INSTS_VALU_TRANS_F32``", "Instr", "Number of VALU F32 Transcendental instructions issued"
+  "``SQ_INSTS_VALU_ADD_F64``", "Instr", "Number of VALU F64 ``ADD`` or ``SUB`` instructions issued"
+  "``SQ_INSTS_VALU_MUL_F64``", "Instr", "Number of VALU F64 Multiply instructions issued"
+  "``SQ_INSTS_VALU_FMA_F64``", "Instr", "Number of VALU F64 FMA or multiply-add instructions issued"
+  "``SQ_INSTS_VALU_TRANS_F64``", "Instr", "Number of VALU F64 Transcendental instructions issued"
+  "``SQ_INSTS_VALU_INT32``", "Instr", "Number of VALU 32-bit integer instructions (signed or unsigned) issued"
+  "``SQ_INSTS_VALU_INT64``", "Instr", "Number of VALU 64-bit integer instructions (signed or unsigned) issued"
+  "``SQ_INSTS_VALU_CVT``", "Instr", "Number of VALU Conversion instructions issued"
+  "``SQ_INSTS_VALU_MFMA_I8``", "Instr", "Number of 8-bit Integer matrix FMA instructions issued"
+  "``SQ_INSTS_VALU_MFMA_F16``", "Instr", "Number of F16 matrix FMA instructions issued"
+  "``SQ_INSTS_VALU_MFMA_F32``", "Instr", "Number of F32 matrix FMA instructions issued"
+  "``SQ_INSTS_VALU_MFMA_F64``", "Instr", "Number of F64 matrix FMA instructions issued"
+  "``SQ_INSTS_MFMA``", "Instr", "Number of matrix FMA instructions issued"
+  "``SQ_INSTS_VMEM_WR``", "Instr", "Number of vector memory write instructions (including flat) issued"
+  "``SQ_INSTS_VMEM_RD``", "Instr", "Number of vector memory read instructions (including flat) issued"
+  "``SQ_INSTS_VMEM``", "Instr", "Number of vector memory instructions issued, including both flat and buffer instructions"
+  "``SQ_INSTS_SALU``", "Instr", "Number of scalar arithmetic logic unit (SALU) instructions issued"
+  "``SQ_INSTS_SMEM``", "Instr", "Number of scalar memory instructions issued"
+  "``SQ_INSTS_SMEM_NORM``", "Instr", "Number of scalar memory instructions normalized to match ``smem_level`` issued"
+  "``SQ_INSTS_FLAT``", "Instr", "Number of flat instructions issued"
+  "``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled."
+  "``SQ_INSTS_LDS``", "Instr", "Number of LDS instructions issued **(MI200: includes flat; MI300: does not include flat)**"
+  "``SQ_INSTS_GDS``", "Instr", "Number of global data share instructions issued"
+  "``SQ_INSTS_EXP_GDS``", "Instr", "Number of EXP and global data share instructions excluding skipped export instructions issued"
+  "``SQ_INSTS_BRANCH``", "Instr", "Number of branch instructions issued"
+  "``SQ_INSTS_SENDMSG``", "Instr", "Number of ``SENDMSG`` instructions including ``s_endpgm`` issued"
+  "``SQ_INSTS_VSKIPPED``", "Instr", "Number of vector instructions skipped"
+
+Flat instructions allow read, write, and atomic access to a generic memory address pointer that can
+resolve to any of the following physical memories:
+
+* Global Memory
+* Scratch ("private")
+* LDS ("shared")
+* Invalid - ``MEM_VIOL`` TrapStatus
+
+Matrix fused multiply-add operation counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_INSTS_VALU_MFMA_MOPS_I8``", "IOP", "Number of 8-bit integer matrix FMA ops in units of 512"
+  "``SQ_INSTS_VALU_MFMA_MOPS_F16``", "FLOP", "Number of F16 floating-point matrix FMA ops in units of 512"
+  "``SQ_INSTS_VALU_MFMA_MOPS_BF16``", "FLOP", "Number of BF16 floating-point matrix FMA ops in units of 512"
+  "``SQ_INSTS_VALU_MFMA_MOPS_F32``", "FLOP", "Number of F32 floating-point matrix FMA ops in units of 512"
+  "``SQ_INSTS_VALU_MFMA_MOPS_F64``", "FLOP", "Number of F64 floating-point matrix FMA ops in units of 512"
+
+Level counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. note::
+
+  All level counters must be followed by the ``SQ_ACCUM_PREV_HIRES`` counter to measure average latency.
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_ACCUM_PREV``", "Count", "Accumulated counter sample value where accumulation takes place once every four cycles"
+  "``SQ_ACCUM_PREV_HIRES``", "Count", "Accumulated counter sample value where accumulation takes place once every cycle"
+  "``SQ_LEVEL_WAVES``", "Waves", "Number of inflight waves"
+  "``SQ_INST_LEVEL_VMEM``", "Instr", "Number of inflight vector memory (including flat) instructions"
+  "``SQ_INST_LEVEL_SMEM``", "Instr", "Number of inflight scalar memory instructions"
+  "``SQ_INST_LEVEL_LDS``", "Instr", "Number of inflight LDS (including flat) instructions"
+  "``SQ_IFETCH_LEVEL``", "Instr", "Number of inflight instruction fetch requests from the cache"
+
+Use the following formulae to calculate latencies:
+
+* Vector memory latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_INSTS_VMEM``
+* Wave latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_WAVES``
+* LDS latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_INSTS_LDS``
+* Scalar memory latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_INSTS_SMEM_NORM``
+* Instruction fetch latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_IFETCH``
+
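+Each formula above divides an accumulated level sample by an event count. The following is a minimal
+sketch of that arithmetic, assuming ``SQ_ACCUM_PREV_HIRES`` was collected directly after the matching
+level counter. The ``average_latency`` helper and the sample values are hypothetical, and returning
+zero when no events occurred is a choice made for this sketch.
+

```python
def average_latency(accum_prev_hires, event_count):
    """Average latency: accumulated level samples divided by the event count."""
    if event_count == 0:
        return 0.0  # no events were counted, so report zero instead of dividing
    return accum_prev_hires / event_count


# Hypothetical run: SQ_ACCUM_PREV_HIRES placed right after SQ_INST_LEVEL_VMEM.
vmem_run = {"SQ_ACCUM_PREV_HIRES": 120_000, "SQ_INSTS_VMEM": 400}
vmem_latency = average_latency(
    vmem_run["SQ_ACCUM_PREV_HIRES"], vmem_run["SQ_INSTS_VMEM"]
)
print(vmem_latency)  # 300.0
```

+
+The same helper applies to the other formulae by swapping in the matching divisor counter.
+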
+Wavefront counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_WAVES``", "Waves", "Number of wavefronts dispatched to sequencers, including both new and restored wavefronts"
+  "``SQ_WAVES_SAVED``", "Waves", "Number of context-saved waves"
+  "``SQ_WAVES_RESTORED``", "Waves", "Number of context-restored waves sent to sequencers"
+  "``SQ_WAVES_EQ_64``", "Waves", "Number of wavefronts with exactly 64 active threads sent to sequencers"
+  "``SQ_WAVES_LT_64``", "Waves", "Number of wavefronts with fewer than 64 active threads sent to sequencers"
+  "``SQ_WAVES_LT_48``", "Waves", "Number of wavefronts with fewer than 48 active threads sent to sequencers"
+  "``SQ_WAVES_LT_32``", "Waves", "Number of wavefronts with fewer than 32 active threads sent to sequencers"
+  "``SQ_WAVES_LT_16``", "Waves", "Number of wavefronts with fewer than 16 active threads sent to sequencers"
+
+Wavefront cycle counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_CYCLES``", "Cycles", "Clock cycles"
+  "``SQ_BUSY_CYCLES``", "Cycles", "Number of cycles the sequencer reports being busy"
+  "``SQ_BUSY_CU_CYCLES``", "Qcycles", "Number of quad-cycles each compute unit is busy"
+  "``SQ_VALU_MFMA_BUSY_CYCLES``", "Cycles", "Number of cycles the matrix FMA arithmetic logic unit (ALU) is busy"
+  "``SQ_WAVE_CYCLES``", "Qcycles", "Number of quad-cycles spent by waves in the compute units"
+  "``SQ_WAIT_ANY``", "Qcycles", "Number of quad-cycles spent waiting for anything"
+  "``SQ_WAIT_INST_ANY``", "Qcycles", "Number of quad-cycles spent waiting for any instruction to be issued"
+  "``SQ_ACTIVE_INST_ANY``", "Qcycles", "Number of quad-cycles spent by each wave to work on an instruction"
+  "``SQ_ACTIVE_INST_VMEM``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a vector memory instruction"
+  "``SQ_ACTIVE_INST_LDS``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on an LDS instruction"
+  "``SQ_ACTIVE_INST_VALU``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a VALU instruction"
+  "``SQ_ACTIVE_INST_SCA``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a SALU or scalar memory instruction"
+  "``SQ_ACTIVE_INST_EXP_GDS``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on an ``EXPORT`` or ``GDS`` instruction"
+  "``SQ_ACTIVE_INST_MISC``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a ``BRANCH`` or ``SENDMSG`` instruction"
+  "``SQ_ACTIVE_INST_FLAT``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a flat instruction"
+  "``SQ_INST_CYCLES_VMEM_WR``", "Qcycles", "Number of quad-cycles spent to send address and command data for vector memory write instructions"
+  "``SQ_INST_CYCLES_VMEM_RD``", "Qcycles", "Number of quad-cycles spent to send address and command data for vector memory read instructions"
+  "``SQ_INST_CYCLES_SMEM``", "Qcycles", "Number of quad-cycles spent to execute scalar memory reads"
+  "``SQ_INST_CYCLES_SALU``", "Qcycles", "Number of quad-cycles spent to execute non-memory read scalar operations"
+  "``SQ_THREAD_CYCLES_VALU``", "Qcycles", "Number of quad-cycles spent to execute VALU operations on active threads"
+  "``SQ_WAIT_INST_LDS``", "Qcycles", "Number of quad-cycles spent waiting for LDS instruction to be issued"
+
+``SQ_THREAD_CYCLES_VALU`` is similar to ``INST_CYCLES_VALU``, but it's multiplied by the number of
+active threads.
+
+LDS counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_LDS_ATOMIC_RETURN``", "Cycles", "Number of atomic return cycles in LDS"
+  "``SQ_LDS_BANK_CONFLICT``", "Cycles", "Number of cycles LDS is stalled by bank conflicts"
+  "``SQ_LDS_ADDR_CONFLICT``", "Cycles", "Number of cycles LDS is stalled by address conflicts"
+  "``SQ_LDS_UNALIGNED_STALL``", "Cycles", "Number of cycles LDS is stalled processing flat unaligned load or store operations"
+  "``SQ_LDS_MEM_VIOLATIONS``", "Count", "Number of threads that have a memory violation in the LDS"
+  "``SQ_LDS_IDX_ACTIVE``", "Cycles", "Number of cycles LDS is used for indexed operations"
+
+Miscellaneous counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQ_IFETCH``", "Count", "Number of instruction fetch requests from L1i, in 32-byte width"
+  "``SQ_ITEMS``", "Threads", "Number of valid items per wave"
+
+.. _l1i-and-sl1d-cache-counters:
+
+L1 instruction cache (L1i) and scalar L1 data cache (L1d) counters
+---------------------------------------------------------------------------------------------------------------
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition"
+
+  "``SQC_ICACHE_REQ``", "Req", "Number of L1 instruction (L1i) cache requests"
+  "``SQC_ICACHE_HITS``", "Count", "Number of L1i cache hits"
+  "``SQC_ICACHE_MISSES``", "Count", "Number of non-duplicate L1i cache misses including uncached requests"
+  "``SQC_ICACHE_MISSES_DUPLICATE``", "Count", "Number of duplicate L1i cache misses whose previous lookup miss on the same cache line is not fulfilled yet"
+  "``SQC_DCACHE_REQ``", "Req", "Number of scalar L1d requests"
+  "``SQC_DCACHE_INPUT_VALID_READYB``", "Cycles", "Number of cycles while sequencer input is valid but scalar L1d is not ready"
+  "``SQC_DCACHE_HITS``", "Count", "Number of scalar L1d hits"
+  "``SQC_DCACHE_MISSES``", "Count", "Number of non-duplicate scalar L1d misses including uncached requests"
+  "``SQC_DCACHE_MISSES_DUPLICATE``", "Count", "Number of duplicate scalar L1d misses"
+  "``SQC_DCACHE_REQ_READ_1``", "Req", "Number of constant cache read requests in a single 32-bit data word"
+  "``SQC_DCACHE_REQ_READ_2``", "Req", "Number of constant cache read requests in two 32-bit data words"
+  "``SQC_DCACHE_REQ_READ_4``", "Req", "Number of constant cache read requests in four 32-bit data words"
+  "``SQC_DCACHE_REQ_READ_8``", "Req", "Number of constant cache read requests in eight 32-bit data words"
+  "``SQC_DCACHE_REQ_READ_16``", "Req", "Number of constant cache read requests in 16 32-bit data words"
+  "``SQC_DCACHE_ATOMIC``", "Req", "Number of atomic requests"
+  "``SQC_TC_REQ``", "Req", "Number of texture cache requests that were issued by instruction and constant caches"
+  "``SQC_TC_INST_REQ``", "Req", "Number of instruction requests to the L2 cache"
+  "``SQC_TC_DATA_READ_REQ``", "Req", "Number of data read requests to the L2 cache"
+  "``SQC_TC_DATA_WRITE_REQ``", "Req", "Number of data write requests to the L2 cache"
+  "``SQC_TC_DATA_ATOMIC_REQ``", "Req", "Number of data atomic requests to the L2 cache"
+  "``SQC_TC_STALL``", "Cycles", "Number of cycles while the valid requests to the L2 cache are stalled"
+
+.. _vector-l1-cache-subsystem-counters:
+
+Vector L1 cache subsystem counters
+---------------------------------------------------------------------------------------------------------------
+
+The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data
+unit, vector L1d or texture cache per pipe, and texture cache arbiter counters.
+
+Texture addressing unit counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``"
+
+  "``TA_TA_BUSY[n]``", "Cycles", "Texture addressing unit busy cycles", "0-15"
+  "``TA_TOTAL_WAVEFRONTS[n]``", "Instr", "Number of wavefronts processed by texture addressing unit", "0-15"
+  "``TA_BUFFER_WAVEFRONTS[n]``", "Instr", "Number of buffer wavefronts processed by texture addressing unit", "0-15"
+  "``TA_BUFFER_READ_WAVEFRONTS[n]``", "Instr", "Number of buffer read wavefronts processed by texture addressing unit", "0-15"
+  "``TA_BUFFER_WRITE_WAVEFRONTS[n]``", "Instr", "Number of buffer write wavefronts processed by texture addressing unit", "0-15"
+  "``TA_BUFFER_ATOMIC_WAVEFRONTS[n]``", "Instr", "Number of buffer atomic wavefronts processed by texture addressing unit", "0-15"
+  "``TA_BUFFER_TOTAL_CYCLES[n]``", "Cycles", "Number of buffer cycles (including read and write) issued to texture cache", "0-15"
+  "``TA_BUFFER_COALESCED_READ_CYCLES[n]``", "Cycles", "Number of coalesced buffer read cycles issued to texture cache", "0-15"
+  "``TA_BUFFER_COALESCED_WRITE_CYCLES[n]``", "Cycles", "Number of coalesced buffer write cycles issued to texture cache", "0-15"
+  "``TA_ADDR_STALLED_BY_TC_CYCLES[n]``", "Cycles", "Number of cycles texture addressing unit address path is stalled by texture cache", "0-15"
+  "``TA_DATA_STALLED_BY_TC_CYCLES[n]``", "Cycles", "Number of cycles texture addressing unit data path is stalled by texture cache", "0-15"
+  "``TA_ADDR_STALLED_BY_TD_CYCLES[n]``", "Cycles", "Number of cycles texture addressing unit address path is stalled by texture data unit", "0-15"
+  "``TA_FLAT_WAVEFRONTS[n]``", "Instr", "Number of flat opcode wavefronts processed by texture addressing unit", "0-15"
+  "``TA_FLAT_READ_WAVEFRONTS[n]``", "Instr", "Number of flat opcode read wavefronts processed by texture addressing unit", "0-15"
+  "``TA_FLAT_WRITE_WAVEFRONTS[n]``", "Instr", "Number of flat opcode write wavefronts processed by texture addressing unit", "0-15"
+  "``TA_FLAT_ATOMIC_WAVEFRONTS[n]``", "Instr", "Number of flat opcode atomic wavefronts processed by texture addressing unit", "0-15"
+
+Texture data unit counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``"
+
+  "``TD_TD_BUSY[n]``", "Cycles", "Texture data unit busy cycles while it is processing or waiting for data", "0-15"
+  "``TD_TC_STALL[n]``", "Cycles", "Number of cycles texture data unit is stalled waiting for texture cache data", "0-15"
+  "``TD_SPI_STALL[n]``", "Cycles", "Number of cycles texture data unit is stalled by shader processor input", "0-15"
+  "``TD_LOAD_WAVEFRONT[n]``", "Instr", "Number of wavefront instructions (read, write, atomic)", "0-15"
+  "``TD_STORE_WAVEFRONT[n]``", "Instr", "Number of write wavefront instructions", "0-15"
+  "``TD_ATOMIC_WAVEFRONT[n]``", "Instr", "Number of atomic wavefront instructions", "0-15"
+  "``TD_COALESCABLE_WAVEFRONT[n]``", "Instr", "Number of coalescable wavefronts according to texture addressing unit", "0-15"
+
+Texture cache per pipe counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``"
+
+  "``TCP_GATE_EN1[n]``", "Cycles", "Number of cycles vector L1d interface clocks are turned on", "0-15"
+  "``TCP_GATE_EN2[n]``", "Cycles", "Number of cycles vector L1d core clocks are turned on", "0-15"
+  "``TCP_TD_TCP_STALL_CYCLES[n]``", "Cycles", "Number of cycles texture data unit stalls vector L1d", "0-15"
+  "``TCP_TCR_TCP_STALL_CYCLES[n]``", "Cycles", "Number of cycles texture cache router stalls vector L1d", "0-15"
+  "``TCP_READ_TAGCONFLICT_STALL_CYCLES[n]``", "Cycles", "Number of cycles tag RAM conflict stalls on a read", "0-15"
+  "``TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]``", "Cycles", "Number of cycles tag RAM conflict stalls on a write", "0-15"
+  "``TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]``", "Cycles", "Number of cycles tag RAM conflict stalls on an atomic", "0-15"
+  "``TCP_PENDING_STALL_CYCLES[n]``", "Cycles", "Number of cycles vector L1d is stalled due to data pending from L2 cache", "0-15"
+  "``TCP_TCP_TA_DATA_STALL_CYCLES``", "Cycles", "Number of cycles texture cache per pipe stalls texture addressing unit data interface", "NA"
+  "``TCP_TA_TCP_STATE_READ[n]``", "Req", "Number of state reads", "0-15"
+  "``TCP_VOLATILE[n]``", "Req", "Number of L1 volatile pixels or buffers from texture addressing unit", "0-15"
+  "``TCP_TOTAL_ACCESSES[n]``", "Req", "Number of vector L1d accesses. Equals ``TCP_PERF_SEL_TOTAL_READ`` + ``TCP_PERF_SEL_TOTAL_NONREAD``", "0-15"
+  "``TCP_TOTAL_READ[n]``", "Req", "Number of vector L1d read accesses", "0-15"
+  "``TCP_TOTAL_WRITE[n]``", "Req", "Number of vector L1d write accesses", "0-15"
+  "``TCP_TOTAL_ATOMIC_WITH_RET[n]``", "Req", "Number of vector L1d atomic requests with return", "0-15"
+  "``TCP_TOTAL_ATOMIC_WITHOUT_RET[n]``", "Req", "Number of vector L1d atomic requests without return", "0-15"
+  "``TCP_TOTAL_WRITEBACK_INVALIDATES[n]``", "Count", "Total number of vector L1d writebacks and invalidates", "0-15"
+  "``TCP_UTCL1_REQUEST[n]``", "Req", "Number of address translation requests to unified translation cache (L1)", "0-15"
+  "``TCP_UTCL1_TRANSLATION_HIT[n]``", "Req", "Number of unified translation cache (L1) translation hits", "0-15"
+  "``TCP_UTCL1_TRANSLATION_MISS[n]``", "Req", "Number of unified translation cache (L1) translation misses", "0-15"
+  "``TCP_UTCL1_PERMISSION_MISS[n]``", "Req", "Number of unified translation cache (L1) permission misses", "0-15"
+  "``TCP_TOTAL_CACHE_ACCESSES[n]``", "Req", "Number of vector L1d cache accesses including hits and misses", "0-15"
+  "``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15"
+  "``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15"
+  "``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15"
+  "``TCP_TCC_READ_REQ[n]``", "Req", "Number of read requests to L2 cache", "0-15"
+  "``TCP_TCC_WRITE_REQ[n]``", "Req", "Number of write requests to L2 cache", "0-15"
+  "``TCP_TCC_ATOMIC_WITH_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache with return", "0-15"
+  "``TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache without return", "0-15"
+  "``TCP_TCC_NC_READ_REQ[n]``", "Req", "Number of non-coherently cached read requests to L2 cache", "0-15"
+  "``TCP_TCC_UC_READ_REQ[n]``", "Req", "Number of uncached read requests to L2 cache", "0-15"
+  "``TCP_TCC_CC_READ_REQ[n]``", "Req", "Number of coherently cached read requests to L2 cache", "0-15"
+  "``TCP_TCC_RW_READ_REQ[n]``", "Req", "Number of coherently cached with write read requests to L2 cache", "0-15"
+  "``TCP_TCC_NC_WRITE_REQ[n]``", "Req", "Number of non-coherently cached write requests to L2 cache", "0-15"
+  "``TCP_TCC_UC_WRITE_REQ[n]``", "Req", "Number of uncached write requests to L2 cache", "0-15"
+  "``TCP_TCC_CC_WRITE_REQ[n]``", "Req", "Number of coherently cached write requests to L2 cache", "0-15"
+  "``TCP_TCC_RW_WRITE_REQ[n]``", "Req", "Number of coherently cached with write write requests to L2 cache", "0-15"
+  "``TCP_TCC_NC_ATOMIC_REQ[n]``", "Req", "Number of non-coherently cached atomic requests to L2 cache", "0-15"
+  "``TCP_TCC_UC_ATOMIC_REQ[n]``", "Req", "Number of uncached atomic requests to L2 cache", "0-15"
+  "``TCP_TCC_CC_ATOMIC_REQ[n]``", "Req", "Number of coherently cached atomic requests to L2 cache", "0-15"
+  "``TCP_TCC_RW_ATOMIC_REQ[n]``", "Req", "Number of coherently cached with write atomic requests to L2 cache", "0-15"
+
+Note that:
+
+* ``TCP_TOTAL_READ[n]`` = ``TCP_PERF_SEL_TOTAL_HIT_LRU_READ`` + ``TCP_PERF_SEL_TOTAL_MISS_LRU_READ`` + ``TCP_PERF_SEL_TOTAL_MISS_EVICT_READ``
+* ``TCP_TOTAL_WRITE[n]`` = ``TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE`` + ``TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE``
+* ``TCP_TOTAL_WRITEBACK_INVALIDATES[n]`` = ``TCP_PERF_SEL_TOTAL_WBINVL1`` + ``TCP_PERF_SEL_TOTAL_WBINVL1_VOL`` + ``TCP_PERF_SEL_CP_TCP_INVALIDATE`` + ``TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL``
+
+Texture cache arbiter counters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+  :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``"
+
+  "``TCA_CYCLE[n]``", "Cycles", "Number of texture cache arbiter cycles", "0-31"
+  "``TCA_BUSY[n]``", "Cycles", "Number of cycles texture cache arbiter has a pending request", "0-31"
+
+.. _l2-cache-access-counters:
+
+L2 cache access counters
+---------------------------------------------------------------------------------------------------------------
+
+L2 cache is also known as texture cache per channel.
+
+.. tab-set::
+
+    .. tab-item:: MI300 hardware counter
+
+      .. csv-table::
+        :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``"
+
+        "``TCC_CYCLE[n]``", "Cycles", "Number of L2 cache free-running clocks", "0-31"
+        "``TCC_BUSY[n]``", "Cycles", "Number of L2 cache busy cycles", "0-31"
+        "``TCC_REQ[n]``", "Req", "Number of L2 cache requests of all types (measured at the tag block)", "0-31"
+        "``TCC_STREAMING_REQ[n]``", "Req", "Number of L2 cache streaming requests (measured at the tag block)", "0-31"
+        "``TCC_NC_REQ[n]``", "Req", "Number of non-coherently cached requests (measured at the tag block)", "0-31"
+        "``TCC_UC_REQ[n]``", "Req", "Number of uncached requests (measured at the tag block)", "0-31"
+        "``TCC_CC_REQ[n]``", "Req", "Number of coherently cached requests (measured at the tag block)", "0-31"
+        "``TCC_RW_REQ[n]``", "Req", "Number of coherently cached with write requests (measured at the tag block)", "0-31"
+        "``TCC_PROBE[n]``", "Req", "Number of probe requests", "0-31"
+        "``TCC_PROBE_ALL[n]``", "Req", "Number of external probe requests with ``EA_TCC_preq_all == 1``", "0-31"
+        "``TCC_READ[n]``", "Req", "Number of L2 cache read requests (includes compressed reads but not metadata reads)", "0-31"
+        "``TCC_WRITE[n]``", "Req", "Number of L2 cache write requests", "0-31"
+        "``TCC_ATOMIC[n]``", "Req", "Number of L2 cache atomic requests of all types", "0-31"
+        "``TCC_HIT[n]``", "Req", "Number of L2 cache hits", "0-31"
+        "``TCC_MISS[n]``", "Req", "Number of L2 cache misses", "0-31"
+        "``TCC_WRITEBACK[n]``", "Req", "Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests", "0-31"
+        "``TCC_EA0_WRREQ[n]``", "Req", "Number of 32-byte and 64-byte transactions going over the ``TC_EA_wrreq`` interface (doesn't include probe commands)", "0-31"
+        "``TCC_EA0_WRREQ_64B[n]``", "Req", "Total number of 64-byte transactions (write or ``CMPSWAP``) going over the ``TC_EA_wrreq`` interface", "0-31"
+        "``TCC_EA0_WR_UNCACHED_32B[n]``", "Req", "Number of 32-byte or 64-byte write or atomic requests going over the ``TC_EA_wrreq`` interface due to uncached traffic", "0-31"
+        "``TCC_EA0_WRREQ_STALL[n]``", "Cycles", "Number of cycles a write request is stalled", "0-31"
+        "``TCC_EA0_WRREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits", "0-31"
+        "``TCC_EA0_WRREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits", "0-31"
+        "``TCC_EA0_WRREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits", "0-31"
+        "``TCC_TOO_MANY_EA_WRREQS_STALL[n]``", "Cycles", "Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests", "0-31"
+        "``TCC_EA0_WRREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter write requests in flight", "0-31"
+        "``TCC_EA0_ATOMIC[n]``", "Req", "Number of 32-byte or 64-byte atomic requests going over the ``TC_EA_wrreq`` interface", "0-31"
+        "``TCC_EA0_ATOMIC_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter atomic requests in flight", "0-31"
+        "``TCC_EA0_RDREQ[n]``", "Req", "Number of 32-byte or 64-byte read requests to efficiency arbiter", "0-31"
+        "``TCC_EA0_RDREQ_32B[n]``", "Req", "Number of 32-byte read requests to efficiency arbiter", "0-31"
+        "``TCC_EA0_RD_UNCACHED_32B[n]``", "Req", "Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2", "0-31"
+        "``TCC_EA0_RDREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of IO credits", "0-31"
+        "``TCC_EA0_RDREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of GMI credits", "0-31"
+        "``TCC_EA0_RDREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of DRAM credits", "0-31"
+        "``TCC_EA0_RDREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter read requests in flight", "0-31"
+        "``TCC_EA0_RDREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM)", "0-31"
+        "``TCC_EA0_WRREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter write requests to HBM", "0-31"
+        "``TCC_TAG_STALL[n]``", "Cycles", "Number of cycles the normal request pipeline in the tag is stalled for any reason", "0-31"
+        "``TCC_NORMAL_WRITEBACK[n]``", "Req", "Number of writebacks due to requests that are not writeback requests", "0-31"
+        "``TCC_ALL_TC_OP_WB_WRITEBACK[n]``", "Req", "Number of writebacks due to all ``TC_OP`` writeback requests", "0-31"
+        "``TCC_NORMAL_EVICT[n]``", "Req", "Number of evictions due to requests that are not invalidate or probe requests", "0-31"
+        "``TCC_ALL_TC_OP_INV_EVICT[n]``", "Req", "Number of evictions due to all ``TC_OP`` invalidate requests", "0-31"
+
+    .. tab-item:: MI200 hardware counter
+
+      .. csv-table::
+        :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``"
+
+        "``TCC_CYCLE[n]``", "Cycles", "Number of L2 cache free-running clocks", "0-31"
+        "``TCC_BUSY[n]``", "Cycles", "Number of L2 cache busy cycles", "0-31"
+        "``TCC_REQ[n]``", "Req", "Number of L2 cache requests of all types (measured at the tag block)", "0-31"
+        "``TCC_STREAMING_REQ[n]``", "Req", "Number of L2 cache streaming requests (measured at the tag block)", "0-31"
+        "``TCC_NC_REQ[n]``", "Req", "Number of non-coherently cached requests (measured at the tag block)", "0-31"
+        "``TCC_UC_REQ[n]``", "Req", "Number of uncached requests (measured at the tag block)", "0-31"
+        "``TCC_CC_REQ[n]``", "Req", "Number of coherently cached requests (measured at the tag block)", "0-31"
+        "``TCC_RW_REQ[n]``", "Req", "Number of coherently cached with write requests (measured at the tag block)", "0-31"
+        "``TCC_PROBE[n]``", "Req", "Number of probe requests", "0-31"
+        "``TCC_PROBE_ALL[n]``", "Req", "Number of external probe requests with ``EA_TCC_preq_all == 1``", "0-31"
+        "``TCC_READ[n]``", "Req", "Number of L2 cache read requests (includes compressed reads but not metadata reads)", "0-31"
+        "``TCC_WRITE[n]``", "Req", "Number of L2 cache write requests", "0-31"
+        "``TCC_ATOMIC[n]``", "Req", "Number of L2 cache atomic requests of all types", "0-31"
+        "``TCC_HIT[n]``", "Req", "Number of L2 cache hits", "0-31"
+        "``TCC_MISS[n]``", "Req", "Number of L2 cache misses", "0-31"
+        "``TCC_WRITEBACK[n]``", "Req", "Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests", "0-31"
+        "``TCC_EA_WRREQ[n]``", "Req", "Number of 32-byte and 64-byte transactions going over the ``TC_EA_wrreq`` interface (doesn't include probe commands)", "0-31"
+        "``TCC_EA_WRREQ_64B[n]``", "Req", "Total number of 64-byte transactions (write or ``CMPSWAP``) going over the ``TC_EA_wrreq`` interface", "0-31"
+        "``TCC_EA_WR_UNCACHED_32B[n]``", "Req", "Number of 32-byte write or atomic requests going over the ``TC_EA_wrreq`` interface due to uncached traffic. A 64-byte request is counted as 2", "0-31"
+        "``TCC_EA_WRREQ_STALL[n]``", "Cycles", "Number of cycles a write request is stalled", "0-31"
+        "``TCC_EA_WRREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits", "0-31"
+        "``TCC_EA_WRREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits", "0-31"
+        "``TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits", "0-31"
+        "``TCC_TOO_MANY_EA_WRREQS_STALL[n]``", "Cycles", "Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests", "0-31"
+        "``TCC_EA_WRREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter write requests in flight", "0-31"
+        "``TCC_EA_ATOMIC[n]``", "Req", "Number of 32-byte or 64-byte atomic requests going over the ``TC_EA_wrreq`` interface", "0-31"
+        "``TCC_EA_ATOMIC_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter atomic requests in flight", "0-31"
+        "``TCC_EA_RDREQ[n]``", "Req", "Number of 32-byte or 64-byte read requests to efficiency arbiter", "0-31"
+        "``TCC_EA_RDREQ_32B[n]``", "Req", "Number of 32-byte read requests to efficiency arbiter", "0-31"
+        "``TCC_EA_RD_UNCACHED_32B[n]``", "Req", "Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2", "0-31"
+        "``TCC_EA_RDREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of IO credits", "0-31"
+        "``TCC_EA_RDREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of GMI credits", "0-31"
+        "``TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of DRAM credits", "0-31"
+        "``TCC_EA_RDREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter read requests in flight", "0-31"
+        "``TCC_EA_RDREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM)", "0-31"
+        "``TCC_EA_WRREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter write requests to HBM", "0-31"
+        "``TCC_TAG_STALL[n]``", "Cycles", "Number of cycles the normal request pipeline in the tag is stalled for any reason", "0-31"
+        "``TCC_NORMAL_WRITEBACK[n]``", "Req", "Number of writebacks due to requests that are not writeback requests", "0-31"
+        "``TCC_ALL_TC_OP_WB_WRITEBACK[n]``", "Req", "Number of writebacks due to all ``TC_OP`` writeback requests", "0-31"
+        "``TCC_NORMAL_EVICT[n]``", "Req", "Number of evictions due to requests that are not invalidate or probe requests", "0-31"
+        "``TCC_ALL_TC_OP_INV_EVICT[n]``", "Req", "Number of evictions due to all ``TC_OP`` invalidate requests", "0-31"
+
+Note the following:
+
+* ``TCC_REQ[n]`` may be more than the number of requests arriving at the texture cache per channel,
+  but it's a good indication of the total amount of work that needs to be performed.
+* For ``TCC_EA0_WRREQ[n]``, atomics may travel over the same interface and are generally classified as
+  write requests.
+* CC mtypes can produce uncached requests, and those are included in
+  ``TCC_EA0_WR_UNCACHED_32B[n]``.
+* ``TCC_EA0_WRREQ_LEVEL[n]`` is primarily intended to measure average efficiency arbiter write latency.
+
+  * Average write latency = ``TCC_PERF_SEL_EA0_WRREQ_LEVEL`` divided by ``TCC_PERF_SEL_EA0_WRREQ``
+
+* ``TCC_EA0_ATOMIC_LEVEL[n]`` is primarily intended to measure average efficiency arbiter atomic
+  latency.
+
+  * Average atomic latency = ``TCC_PERF_SEL_EA0_WRREQ_ATOMIC_LEVEL`` divided by ``TCC_PERF_SEL_EA0_WRREQ_ATOMIC``
+
+* ``TCC_EA0_RDREQ_LEVEL[n]`` is primarily intended to measure average efficiency arbiter read latency.
+
+  * Average read latency = ``TCC_PERF_SEL_EA0_RDREQ_LEVEL`` divided by ``TCC_PERF_SEL_EA0_RDREQ``
+
+* Stalls can occur regardless of the need for a read to be performed.
+* Normally, stalls are measured at exactly one point in the pipeline. However, in the case of
+  ``TCC_TAG_STALL[n]``, probes can stall the pipeline at a variety of places, so no single point
+  can accurately measure the total stalls.
+
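+The average latency formulae in the notes above can be sketched as follows. Because the L2 counters are
+reported per channel (``n``), this sketch sums every channel before dividing, which is an assumption
+about how to aggregate and not a documented requirement. The ``ea_average_latency`` helper and the
+sample values are hypothetical.
+

```python
def ea_average_latency(level_per_channel, req_per_channel):
    """Divide the summed level counter by the summed request counter."""
    total_level = sum(level_per_channel)
    total_req = sum(req_per_channel)
    if total_req == 0:
        return 0.0  # no requests were observed on any channel
    return total_level / total_req


# Hypothetical values for two channels of TCC_EA0_WRREQ_LEVEL[n] and TCC_EA0_WRREQ[n].
wrreq_level = [800, 1200]
wrreq = [4, 6]
avg_write_latency = ea_average_latency(wrreq_level, wrreq)
print(avg_write_latency)  # 200.0
```

+
+The same call covers the read and atomic cases when given the corresponding level and request
+counters.
+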
+MI300 and MI200 series derived metrics list
+==============================================================
+
+.. csv-table::
+  :header: "Derived metric", "Definition"
+
+  "``ALUStalledByLDS``", "Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready (value range: 0% (optimal) to 100%)"
+  "``FetchSize``", "Total kilobytes fetched from the video memory; measured with all extra fetches and any cache or memory effects taken into account"
+  "``FlatLDSInsts``", "Average number of flat instructions that read from or write to LDS, run per work item (affected by flow control)"
+  "``FlatVMemInsts``", "Average number of flat instructions that read from or write to the video memory, run per work item (affected by flow control). Includes flat instructions that read from or write to scratch"
+  "``GDSInsts``", "Average number of global data share read or write instructions run per work item (affected by flow control)"
+  "``GPUBusy``", "Percentage of time GPU is busy"
+  "``L2CacheHit``", "Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache (value range: 0% (no hit) to 100% (optimal))"
+  "``LDSBankConflict``", "Percentage of GPU time LDS is stalled by bank conflicts (value range: 0% (optimal) to 100%)"
+  "``LDSInsts``", "Average number of LDS read or write instructions run per work item (affected by flow control). Excludes flat instructions that read from or write to LDS."
+  "``MemUnitBusy``", "Percentage of GPU time the memory unit is active, which is measured with all extra fetches and writes and any cache or memory effects taken into account (value range: 0% to 100% (fetch-bound))"
+  "``MemUnitStalled``", "Percentage of GPU time the memory unit is stalled (value range: 0% (optimal) to 100%)"
+  "``MemWrites32B``", "Total number of effective 32B write transactions to the memory"
+  "``TCA_BUSY_sum``", "Total number of cycles texture cache arbiter has a pending request, over all texture cache arbiter instances"
+  "``TCA_CYCLE_sum``", "Total number of cycles over all texture cache arbiter instances"
+  "``SALUBusy``", "Percentage of GPU time scalar ALU instructions are processed (value range: 0% to 100% (optimal))"
+  "``SALUInsts``", "Average number of scalar ALU instructions run per work item (affected by flow control)"
+  "``SFetchInsts``", "Average number of scalar fetch instructions from the video memory run per work item (affected by flow control)"
+  "``VALUBusy``", "Percentage of GPU time vector ALU instructions are processed (value range: 0% to 100% (optimal))"
+  "``VALUInsts``", "Average number of vector ALU instructions run per work item (affected by flow control)"
+  "``VALUUtilization``", "Percentage of active vector ALU threads in a wave, where a lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64 (value range: 0% to 100% (optimal - no thread divergence))"
+  "``VFetchInsts``", "Average number of vector fetch instructions from the video memory run per work-item (affected by flow control); excludes flat instructions that fetch from video memory"
+  "``VWriteInsts``", "Average number of vector write instructions to the video memory run per work-item (affected by flow control); excludes flat instructions that write to video memory"
+  "``Wavefronts``", "Total wavefronts"
+  "``WRITE_REQ_32B``", "Total number of 32-byte effective memory writes"
+  "``WriteSize``", "Total kilobytes written to the video memory; measured with all extra fetches and any cache or memory effects taken into account"
+  "``WriteUnitStalled``", "Percentage of GPU time the write unit is stalled (value range: 0% (optimal) to 100%)"
+
+You can lower ``ALUStalledByLDS`` by reducing LDS bank conflicts or the number of LDS accesses.
+You can lower ``MemUnitStalled`` by reducing the number or size of fetches and writes.
+``MemUnitBusy`` includes the stall time (``MemUnitStalled``).
+
+Hardware counters by and over all texture addressing unit instances
+---------------------------------------------------------------------------------------------------------------
+
+The following table shows the hardware counters *by* all texture addressing unit instances.
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TA_BUFFER_WAVEFRONTS_sum``", "Total number of buffer wavefronts processed"
+  "``TA_BUFFER_READ_WAVEFRONTS_sum``", "Total number of buffer read wavefronts processed"
+  "``TA_BUFFER_WRITE_WAVEFRONTS_sum``", "Total number of buffer write wavefronts processed"
+  "``TA_BUFFER_ATOMIC_WAVEFRONTS_sum``", "Total number of buffer atomic wavefronts processed"
+  "``TA_BUFFER_TOTAL_CYCLES_sum``", "Total number of buffer cycles (including read and write) issued to texture cache"
+  "``TA_BUFFER_COALESCED_READ_CYCLES_sum``", "Total number of coalesced buffer read cycles issued to texture cache"
+  "``TA_BUFFER_COALESCED_WRITE_CYCLES_sum``", "Total number of coalesced buffer write cycles issued to texture cache"
+  "``TA_FLAT_READ_WAVEFRONTS_sum``", "Sum of flat opcode reads processed"
+  "``TA_FLAT_WRITE_WAVEFRONTS_sum``", "Sum of flat opcode writes processed"
+  "``TA_FLAT_WAVEFRONTS_sum``", "Total number of flat opcode wavefronts processed"
+  "``TA_FLAT_READ_WAVEFRONTS_sum``", "Total number of flat opcode read wavefronts processed"
+  "``TA_FLAT_ATOMIC_WAVEFRONTS_sum``", "Total number of flat opcode atomic wavefronts processed"
+  "``TA_TOTAL_WAVEFRONTS_sum``", "Total number of wavefronts processed"
+
+The following table shows the hardware counters *over* all texture addressing unit instances.
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TA_ADDR_STALLED_BY_TC_CYCLES_sum``", "Total number of cycles texture addressing unit address path is stalled by texture cache"
+  "``TA_ADDR_STALLED_BY_TD_CYCLES_sum``", "Total number of cycles texture addressing unit address path is stalled by texture data unit"
+  "``TA_BUSY_avr``", "Average number of texture addressing unit busy cycles"
+  "``TA_BUSY_max``", "Maximum number of texture addressing unit busy cycles"
+  "``TA_BUSY_min``", "Minimum number of texture addressing unit busy cycles"
+  "``TA_DATA_STALLED_BY_TC_CYCLES_sum``", "Total number of cycles texture addressing unit data path is stalled by texture cache"
+  "``TA_TA_BUSY_sum``", "Total number of texture addressing unit busy cycles"
+
+Hardware counters over all texture cache per channel instances
+---------------------------------------------------------------------------------------------------------------
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TCC_ALL_TC_OP_WB_WRITEBACK_sum``", "Total number of writebacks due to all ``TC_OP`` writeback requests."
+  "``TCC_ALL_TC_OP_INV_EVICT_sum``", "Total number of evictions due to all ``TC_OP`` invalidate requests."
+  "``TCC_ATOMIC_sum``", "Total number of L2 cache atomic requests of all types."
+  "``TCC_BUSY_avr``", "Average number of L2 cache busy cycles."
+  "``TCC_BUSY_sum``", "Total number of L2 cache busy cycles."
+  "``TCC_CC_REQ_sum``", "Total number of coherently cached requests."
+  "``TCC_CYCLE_sum``", "Total number of L2 cache free running clocks."
+  "``TCC_EA0_WRREQ_sum``", "Total number of 32-byte and 64-byte transactions going over the ``TC_EA0_wrreq`` interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands."
+  "``TCC_EA0_WRREQ_64B_sum``", "Total number of 64-byte transactions (write or ``CMPSWAP``) going over the ``TC_EA0_wrreq`` interface."
+  "``TCC_EA0_WR_UNCACHED_32B_sum``", "Total number of 32-byte write or atomic requests going over the ``TC_EA0_wrreq`` interface due to uncached traffic. Note that coherently cached mtypes can produce uncached requests, and those are included in this count. A 64-byte request is counted as 2."
+  "``TCC_EA0_WRREQ_STALL_sum``", "Total number of cycles a write request is stalled, over all instances."
+  "``TCC_EA0_WRREQ_IO_CREDIT_STALL_sum``", "Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of IO credits, over all instances."
+  "``TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum``", "Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits, over all instances."
+  "``TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum``", "Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits, over all instances."
+  "``TCC_EA0_WRREQ_LEVEL_sum``", "Total number of efficiency arbiter write requests in flight."
+  "``TCC_EA0_RDREQ_LEVEL_sum``", "Total number of efficiency arbiter read requests in flight."
+  "``TCC_EA0_ATOMIC_sum``", "Total number of 32-byte or 64-byte atomic requests going over the ``TC_EA0_wrreq`` interface."
+  "``TCC_EA0_ATOMIC_LEVEL_sum``", "Total number of efficiency arbiter atomic requests in flight."
+  "``TCC_EA0_RDREQ_sum``", "Total number of 32-byte or 64-byte read requests to efficiency arbiter."
+  "``TCC_EA0_RDREQ_32B_sum``", "Total number of 32-byte read requests to efficiency arbiter."
+  "``TCC_EA0_RD_UNCACHED_32B_sum``", "Total number of 32-byte efficiency arbiter reads due to uncached traffic."
+  "``TCC_EA0_RDREQ_IO_CREDIT_STALL_sum``", "Total number of cycles there is a stall due to the read request interface running out of IO credits."
+  "``TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum``", "Total number of cycles there is a stall due to the read request interface running out of GMI credits."
+  "``TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum``", "Total number of cycles there is a stall due to the read request interface running out of DRAM credits."
+  "``TCC_EA0_RDREQ_DRAM_sum``", "Total number of 32-byte or 64-byte efficiency arbiter read requests to HBM."
+  "``TCC_EA0_WRREQ_DRAM_sum``", "Total number of 32-byte or 64-byte efficiency arbiter write requests to HBM."
+  "``TCC_HIT_sum``", "Total number of L2 cache hits."
+  "``TCC_MISS_sum``", "Total number of L2 cache misses."
+  "``TCC_NC_REQ_sum``", "Total number of non-coherently cached requests."
+  "``TCC_NORMAL_WRITEBACK_sum``", "Total number of writebacks due to requests that are not writeback requests."
+  "``TCC_NORMAL_EVICT_sum``", "Total number of evictions due to requests that are not invalidate or probe requests."
+  "``TCC_PROBE_sum``", "Total number of probe requests."
+  "``TCC_PROBE_ALL_sum``", "Total number of external probe requests with ``EA0_TCC_preq_all == 1``."
+  "``TCC_READ_sum``", "Total number of L2 cache read requests (including compressed reads but not metadata reads)."
+  "``TCC_REQ_sum``", "Total number of all types of L2 cache requests."
+  "``TCC_RW_REQ_sum``", "Total number of coherently cached with write requests."
+  "``TCC_STREAMING_REQ_sum``", "Total number of L2 cache streaming requests."
+  "``TCC_TAG_STALL_sum``", "Total number of cycles the normal request pipeline in the tag is stalled for any reason."
+  "``TCC_TOO_MANY_EA0_WRREQS_STALL_sum``", "Total number of cycles L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests."
+  "``TCC_UC_REQ_sum``", "Total number of uncached requests."
+  "``TCC_WRITE_sum``", "Total number of L2 cache write requests."
+  "``TCC_WRITEBACK_sum``", "Total number of lines written back to the main memory including writebacks of dirty lines and uncached write or atomic requests."
+  "``TCC_WRREQ_STALL_max``", "Maximum number of cycles a write request is stalled."
+
+Hardware counters by, for, or over all texture cache per pipe instances
+----------------------------------------------------------------------------------------------------------------
+
+The following table shows the hardware counters *by* all texture cache per pipe instances.
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TCP_TA_TCP_STATE_READ_sum``", "Total number of state reads by all texture cache per pipe instances"
+  "``TCP_TOTAL_CACHE_ACCESSES_sum``", "Total number of vector L1d accesses (including hits and misses)"
+  "``TCP_UTCL1_PERMISSION_MISS_sum``", "Total number of unified translation cache (L1) permission misses"
+  "``TCP_UTCL1_REQUEST_sum``", "Total number of address translation requests to unified translation cache (L1)"
+  "``TCP_UTCL1_TRANSLATION_MISS_sum``", "Total number of unified translation cache (L1) translation misses"
+  "``TCP_UTCL1_TRANSLATION_HIT_sum``", "Total number of unified translation cache (L1) translation hits"
+
+The following table shows the hardware counters *for* all texture cache per pipe instances.
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TCP_TCC_READ_REQ_LATENCY_sum``", "Total vector L1d to L2 request latency over all wavefronts for reads and atomics with return"
+  "``TCP_TCC_WRITE_REQ_LATENCY_sum``", "Total vector L1d to L2 request latency over all wavefronts for writes and atomics without return"
+  "``TCP_TCP_LATENCY_sum``", "Total wave access latency to vector L1d over all wavefronts"
+
+The following table shows the hardware counters *over* all texture cache per pipe instances.
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum``", "Total number of cycles tag RAM conflict stalls on an atomic"
+  "``TCP_GATE_EN1_sum``", "Total number of cycles vector L1d interface clocks are turned on"
+  "``TCP_GATE_EN2_sum``", "Total number of cycles vector L1d core clocks are turned on"
+  "``TCP_PENDING_STALL_CYCLES_sum``", "Total number of cycles vector L1d cache is stalled due to data pending from L2 cache"
+  "``TCP_READ_TAGCONFLICT_STALL_CYCLES_sum``", "Total number of cycles tag RAM conflict stalls on a read"
+  "``TCP_TCC_ATOMIC_WITH_RET_REQ_sum``", "Total number of atomic requests to L2 cache with return"
+  "``TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum``", "Total number of atomic requests to L2 cache without return"
+  "``TCP_TCC_CC_READ_REQ_sum``", "Total number of coherently cached read requests to L2 cache"
+  "``TCP_TCC_CC_WRITE_REQ_sum``", "Total number of coherently cached write requests to L2 cache"
+  "``TCP_TCC_CC_ATOMIC_REQ_sum``", "Total number of coherently cached atomic requests to L2 cache"
+  "``TCP_TCC_NC_READ_REQ_sum``", "Total number of non-coherently cached read requests to L2 cache"
+  "``TCP_TCC_NC_WRITE_REQ_sum``", "Total number of non-coherently cached write requests to L2 cache"
+  "``TCP_TCC_NC_ATOMIC_REQ_sum``", "Total number of non-coherently cached atomic requests to L2 cache"
+  "``TCP_TCC_READ_REQ_sum``", "Total number of read requests to L2 cache"
+  "``TCP_TCC_RW_READ_REQ_sum``", "Total number of coherently cached with write read requests to L2 cache"
+  "``TCP_TCC_RW_WRITE_REQ_sum``", "Total number of coherently cached with write write requests to L2 cache"
+  "``TCP_TCC_RW_ATOMIC_REQ_sum``", "Total number of coherently cached with write atomic requests to L2 cache"
+  "``TCP_TCC_UC_READ_REQ_sum``", "Total number of uncached read requests to L2 cache"
+  "``TCP_TCC_UC_WRITE_REQ_sum``", "Total number of uncached write requests to L2 cache"
+  "``TCP_TCC_UC_ATOMIC_REQ_sum``", "Total number of uncached atomic requests to L2 cache"
+  "``TCP_TCC_WRITE_REQ_sum``", "Total number of write requests to L2 cache"
+  "``TCP_TCR_TCP_STALL_CYCLES_sum``", "Total number of cycles texture cache router stalls vector L1d"
+  "``TCP_TD_TCP_STALL_CYCLES_sum``", "Total number of cycles texture data unit stalls vector L1d"
+  "``TCP_TOTAL_ACCESSES_sum``", "Total number of vector L1d accesses"
+  "``TCP_TOTAL_READ_sum``", "Total number of vector L1d read accesses"
+  "``TCP_TOTAL_WRITE_sum``", "Total number of vector L1d write accesses"
+  "``TCP_TOTAL_ATOMIC_WITH_RET_sum``", "Total number of vector L1d atomic requests with return"
+  "``TCP_TOTAL_ATOMIC_WITHOUT_RET_sum``", "Total number of vector L1d atomic requests without return"
+  "``TCP_TOTAL_WRITEBACK_INVALIDATES_sum``", "Total number of vector L1d writebacks and invalidates"
+  "``TCP_VOLATILE_sum``", "Total number of L1 volatile pixels or buffers from texture addressing unit"
+  "``TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum``", "Total number of cycles tag RAM conflict stalls on a write"
+
+Hardware counters over all texture data unit instances
+--------------------------------------------------------------
+
+.. csv-table::
+  :header: "Hardware counter", "Definition"
+
+  "``TD_ATOMIC_WAVEFRONT_sum``", "Total number of atomic wavefront instructions"
+  "``TD_COALESCABLE_WAVEFRONT_sum``", "Total number of coalescable wavefronts according to texture addressing unit"
+  "``TD_LOAD_WAVEFRONT_sum``", "Total number of wavefront instructions (read, write, atomic)"
+  "``TD_SPI_STALL_sum``", "Total number of cycles texture data unit is stalled by shader processor input"
+  "``TD_STORE_WAVEFRONT_sum``", "Total number of write wavefront instructions"
+  "``TD_TC_STALL_sum``", "Total number of cycles texture data unit is stalled waiting for texture cache data"
+  "``TD_TD_BUSY_sum``", "Total number of texture data unit busy cycles while it is processing or waiting for data"
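The counters in the tables above are raw totals, so many of them are easiest to read as ratios. The following is a minimal sketch, assuming the counter values have already been collected into a plain dictionary. The sample numbers and the helper names `ratio`, `l2_hit_rate`, and `tca_busy_percent` are hypothetical, and the formulas are simple ratios chosen for illustration, not the profiler's own derived-metric definitions.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def l2_hit_rate(counters):
    """Percentage of L2 lookups that hit, from TCC_HIT_sum and TCC_MISS_sum.

    Assumption: every L2 lookup is counted as either a hit or a miss.
    """
    hits = counters["TCC_HIT_sum"]
    misses = counters["TCC_MISS_sum"]
    return 100.0 * ratio(hits, hits + misses)


def tca_busy_percent(counters):
    """Percentage of cycles the texture cache arbiter had a pending request."""
    return 100.0 * ratio(counters["TCA_BUSY_sum"], counters["TCA_CYCLE_sum"])


# Hypothetical sample values; real numbers come from a profiler run.
sample = {
    "TCC_HIT_sum": 750_000,
    "TCC_MISS_sum": 250_000,
    "TCA_BUSY_sum": 1_200,
    "TCA_CYCLE_sum": 4_800,
}

print(f"L2 hit rate: {l2_hit_rate(sample):.1f}%")
print(f"TCA busy:    {tca_busy_percent(sample):.1f}%")
```

Guarding the division in `ratio` keeps the helpers usable for kernels that never touch the L2 cache, where both totals can be zero.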
--- a/docs/conceptual/gpu-arch/mi300.md
+++ b/docs/conceptual/gpu-arch/mi300.md
@@ -0,0 +1,122 @@
+# AMD Instinct™ MI300 series microarchitecture
+
+The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3
+architecture, which is designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine
+learning (ML) workloads. The accelerators are well suited for extreme scalability and compute performance, running
+on everything from individual servers to the world’s largest exascale supercomputers.
+
+With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the
+GPU computational elements of the processor along with the lower levels of the cache hierarchy.
+
+The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series.
+
+```{figure} ../../data/conceptual/gpu-arch/image007.png
+---
+name: mi300-xcd
+align: center
+---
+XCD-level system architecture showing 40 Compute Units, each with 32 KB L1 cache, a Unified Compute System with 4 ACE Compute Accelerators, 4 MB of shared L2 cache, and an HWS Hardware Scheduler.
+```
+
+On the XCD, four Asynchronous Compute Engines (ACEs) send compute shader workgroups to the
+Compute Units (CUs). The XCD has 40 CUs: 38 active CUs at the aggregate level and 2 disabled CUs for
+yield management. The CUs all share a 4 MB L2 cache that serves to coalesce all memory traffic for the
+die. With fewer than half of the CUs of the AMD Instinct MI200 series compute die, the AMD CDNA™ 3
+XCD is a smaller building block. However, it uses more advanced packaging, and the processor
+can include 6 or 8 XCDs for up to 304 CUs, roughly 40% more than the MI250X.
+
+The MI300 series integrates up to 8 vertically stacked XCDs, 8 stacks of
+High-Bandwidth Memory 3 (HBM3), and 4 I/O dies (containing system
+infrastructure), using the AMD Infinity Fabric™ technology as the interconnect.
+
+The Matrix Cores inside the CDNA 3 CUs have significant improvements, emphasizing AI and machine
+learning, enhancing throughput of existing data types while adding support for new data types.
+CDNA 2 Matrix Cores support FP16 and BF16, while offering INT8 for inference. Compared to MI250X
+accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a
+performance gain of 6.8 times for INT8. FP8 has a performance gain of 16 times compared to FP32,
+while TF32 has a gain of 4 times compared to FP32.
+
+```{list-table} Peak-performance capabilities of the MI300X for different data types.
+:header-rows: 1
+:name: mi300x-perf-table
+
+*
+  - Computation and Data Type
+  - FLOPS/CLOCK/CU
+  - Peak TFLOPS
+*
+  - Matrix FP64
+  - 256
+  - 163.4
+*
+  - Vector FP64
+  - 128
+  - 81.7
+*
+  - Matrix FP32
+  - 256
+  - 163.4
+*
+  - Vector FP32
+  - 256
+  - 163.4
+*
+  - Matrix TF32
+  - 1024
+  - 653.7
+*
+  - Matrix FP16
+  - 2048
+  - 1307.4
+*
+  - Matrix BF16
+  - 2048
+  - 1307.4
+*
+  - Matrix FP8
+  - 4096
+  - 2614.9
+*
+  - Matrix INT8
+  - 4096
+  - 2614.9
+```
+
+The table above summarizes the aggregated peak performance of the AMD Instinct MI300X Open
+Compute Platform (OCP) Open Accelerator Module (OAM) for different data types and command
+processors. The second column lists the peak performance (number of data elements processed in a
+single instruction) of a single compute unit if a SIMD (or matrix) instruction is submitted in each clock
+cycle. The third column lists the theoretical peak performance of the OAM. The theoretical aggregated
+peak memory bandwidth of the GPU is 5.3 TB per second.
+
+The following image shows the block diagram of the APU (left) and the OAM package (right), both
+connected via the AMD Infinity Fabric™ network on-chip.
+
+```{figure} ../../data/conceptual/gpu-arch/image008.png
+---
+name: mi300-arch
+alt:
+align: center
+---
+MI300 series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs.
+```
+
+## Node-level architecture
+
+```{figure} ../../data/conceptual/gpu-arch/image009.png
+---
+name: mi300-node
+
+align: center
+---
+MI300 series node-level architecture showing 8 fully interconnected MI300X OAMs connected to (optional) PCIe switches via retimers and HGX connectors.
+```
+
+The image above shows the node-level architecture of a system with AMD EPYC processors in a
+dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the
+host system via PCIe Gen 5 x16 links (yellow lines). The GPUs use seven high-bandwidth,
+low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system.
+
+<!---
+We need performance data about the P2P communication here.
+-->
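The peak TFLOPS column in the MI300X table follows from the per-CU column by simple arithmetic: operations per clock per CU, times the number of active CUs, times the clock rate. The sketch below reproduces that calculation. The CU counts (8 XCDs with 38 active CUs each) come from the text of the new page, while the 2.1 GHz peak engine clock is an assumption that the page itself does not state. The function name `peak_tflops` is hypothetical.

```python
XCDS = 8                 # XCDs in an MI300X, from the page text
ACTIVE_CUS_PER_XCD = 38  # active CUs per XCD, from the page text
PEAK_CLOCK_GHZ = 2.1     # ASSUMPTION: peak engine clock, not given on the page

TOTAL_CUS = XCDS * ACTIVE_CUS_PER_XCD

# FLOPS/clock/CU values copied from the table, keyed by data type.
FLOPS_PER_CLOCK_PER_CU = {
    "FP64 (matrix)": 256,
    "FP64 (vector)": 128,
    "FP32 (matrix)": 256,
    "FP32 (vector)": 256,
    "TF32": 1024,
    "FP16": 2048,
    "BF16": 2048,
    "FP8": 4096,
    "INT8": 4096,
}


def peak_tflops(flops_per_clock_per_cu, cus=TOTAL_CUS, clock_ghz=PEAK_CLOCK_GHZ):
    """Theoretical peak in TFLOPS: per-CU rate x CU count x clock (GHz) / 1000."""
    return flops_per_clock_per_cu * cus * clock_ghz / 1000.0


for data_type, per_cu in FLOPS_PER_CLOCK_PER_CU.items():
    print(f"{data_type:<14} {per_cu:>5} {peak_tflops(per_cu):8.1f}")
```

Under these assumptions the computed values agree with the table to one decimal place. That also suggests the table assumes every CU issues an instruction on every cycle at the peak clock, which sustained workloads rarely achieve.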
--- a/docs/conceptual/gpu-memory.md
+++ b/docs/conceptual/gpu-memory.md
@@ -177,8 +177,8 @@ Fine-grained memory implies that up-to-date data may be made visible to others r

 | API                     | Flag                         | Coherence      |
 |-------------------------|------------------------------|----------------|
-| `hipExtMallocWithFlags` | `hipHostMallocDefault`       | Fine-grained   |
-| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Coarse-grained |
+| `hipExtMallocWithFlags` | `hipDeviceMallocDefault`     | Coarse-grained |
+| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Fine-grained   |

 | API                     | `hipMemAdvise` argument      | Coherence      |
 |-------------------------|------------------------------|----------------|
--- a/docs/conceptual/using-gpu-sanitizer.md
+++ b/docs/conceptual/using-gpu-sanitizer.md
@@ -5,12 +5,13 @@
  libraries, instrumented applications, AMD, ROCm">
 </head>

-# Using the AddressSanitizer on a GPU (beta release)
+# Using the LLVM ASan on a GPU (beta release)

 The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
-Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications exactly like pure CPU applications. However, this simplicity has not been achieved yet.
-This document provides documentation on using ROCm ASan.

+Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications exactly like pure CPU applications. However, this simplicity has not been achieved yet.
+
+This document provides documentation on using ROCm ASan.
 For information about LLVM ASan, see the [LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).

 :::{note}
@@ -25,28 +26,17 @@ Recommendations for doing this are:

 * Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as `amdclang++`.
 * Add the following options to the existing compiler and linker options:
-  
  * `-fsanitize=address` - enables instrumentation
-
  * `-shared-libsan` - use shared version of runtime
-
  * `-g` - add debug info for improved reporting
-
 * Explicitly use `xnack+` in the offload architecture option. For example, `--offload-arch=gfx90a:xnack+`
-
 Other architectures are allowed, but their device code will not be instrumented and a warning will be emitted.

-:::{tip}
 It is not an error to compile some files without ASan instrumentation, but doing so reduces the ability of the process to detect addressing errors. However, if the main program "`a.out`" does not directly depend on the ASan runtime (`libclang_rt.asan-x86_64.so`) after the build completes (check by running `ldd` (List Dynamic Dependencies) or `readelf`), the application will immediately report an error at runtime as described in the next section.
-:::
-
-:::{note}
-When compiling OpenMP programs with ASan instrumentation, it is currently necessary to set the environment variable `LIBRARY_PATH` to `/opt/rocm-<version>/lib/llvm/lib/asan:/opt/rocm-<version>/lib/asan`. At runtime, it may be necessary to add `/opt/rocm-<version>/lib/llvm/lib/asan` to `LD_LIBRARY_PATH`.
-:::

 ### About compilation time

-When `-fsanitize=address` is used, the LLVM compiler adds instrumentation code around every memory operation. This added code must be handled by all downstream components of the compiler toolchain and results in increased overall compilation time. This increase is especially evident in the AMDGPU device compiler and has in a few instances raised the compile time to an unacceptable level.
+When `-fsanitize=address` is used, the LLVM compiler adds instrumentation code around every memory operation. This added code must be handled by all of the downstream components of the compiler toolchain and results in increased overall compilation time. This increase is especially evident in the AMDGPU device compiler and has in a few instances raised the compile time to an unacceptable level.

 There are a few options if the compile time becomes unacceptable:

@@ -66,7 +56,7 @@ For a complete ROCm GPU Sanitizer installation, including packages, instrumented
 ## Using AMD-supplied ASan instrumented libraries

 ROCm releases have optional packages that contain additional ASan instrumented builds of the ROCm libraries (usually found in `/opt/rocm-<version>/lib`). The instrumented libraries have identical names to the regular uninstrumented libraries, and are located in `/opt/rocm-<version>/lib/asan`.
-These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with `g++`. The preexisting build options are used but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan` and `-g`.
+These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with `g++`. The preexisting build options are used but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan` and `-g`.

 These additional libraries avoid additional developer effort to locate repositories, identify the correct branch, check out the correct tags, and other efforts needed to build the libraries from the source. And they extend the ability of the process to detect addressing errors into the ROCm libraries themselves.

@@ -96,25 +86,16 @@ If it does not appear, when executed the application will quickly output an ASan

 * Ensure that the application `llvm-symbolizer` can be executed, and that it is located in `/opt/rocm-<version>/llvm/bin`. This executable is not strictly required, but if found is used to translate ("symbolize") a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information).

-There is an environment variable, `ASAN_OPTIONS`, that can be used to adjust the runtime behavior of the ASan runtime itself. There are more than a hundred "flags" that can be adjusted (see an old list at [flags](https://github.com/google/sanitizers/wiki/AddressSanitizerFlags)) but the default settings are correct and should be used in most cases. It must be noted that these options only affect the host ASan runtime. The device runtime only currently supports the default settings for the few relevant options.
+There is an environment variable, `ASAN_OPTIONS`, that can be used to adjust the runtime behavior of the ASan runtime itself. There are more than a hundred "flags" that can be adjusted (see an old list at [flags](https://github.com/google/sanitizers/wiki/AddressSanitizerFlags)) but the default settings are correct and should be used in most cases. Note that these options only affect the host ASan runtime. The device runtime currently supports only the default settings for the few relevant options.

-There are three `ASAN_OPTION` flags of note.
+There are two `ASAN_OPTIONS` flags of particular note.

 * `halt_on_error=0/1 default 1`.

-  This tells the ASan runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option `-fsanitize-recover=address`. Note that the ROCm optional ASan instrumented libraries are not compiled with this option and if an error is detected within one of them, but halt_on_error is set to 0, more undefined behavior will occur.
+This tells the ASan runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option `-fsanitize-recover=address`. Note that the ROCm optional ASan instrumented libraries are not compiled with this option, so if an error is detected within one of them while `halt_on_error` is set to 0, more undefined behavior will occur.

 * `detect_leaks=0/1 default 1`.
-
-  This option directs the ASan runtime to enable the [Leak Sanitizer](https://clang.llvm.org/docs/LeakSanitizer.html) (LSan). For heterogeneous applications, this default results in significant output from the leak sanitizer when the application exits due to allocations made by the language runtime which are not considered to be leaks. This output can be avoided by adding `detect_leaks=0` to the `ASAN_OPTIONS`, or alternatively by producing an LSan suppression file (syntax described [here](https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer)) and activating it with environment variable `LSAN_OPTIONS=suppressions=/path/to/suppression/file`. When using a suppression file, a suppression report is printed by default. The suppression report can be disabled by using the `LSAN_OPTIONS` flag `print_suppressions=0`.
-
-* `quarantine_size_mb=N default 256`
-
-  This option defines the number of megabytes (MB) `N` of memory that the ASan runtime will hold after it is `freed` to detect use-after-free situations. This memory is unavailable for other purposes. The default of 256 MB may be too small to detect some use-after-free situations, especially given that the large size of many GPU memory allocations may push `freed` allocations out of quarantine before the attempted use.
-
-  :::{note}
-  Setting the value of `quarantine_size_mb` larger may enable more problematic uses to be detected, but at the cost of reducing memory available for other purposes.
-  :::
+This option directs the ASan runtime to enable the [Leak Sanitizer](https://clang.llvm.org/docs/LeakSanitizer.html) (LSan). Unfortunately, for heterogeneous applications, this default results in significant output from the leak sanitizer when the application exits, due to allocations made by the language runtime which are not considered to be leaks. This output can be avoided by adding `detect_leaks=0` to the `ASAN_OPTIONS`, or alternatively by producing an LSan suppression file (syntax described [here](https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer)) and activating it with environment variable `LSAN_OPTIONS=suppressions=/path/to/suppression/file`. When using a suppression file, a suppression report is printed by default. The suppression report can be disabled by using the `LSAN_OPTIONS` flag `print_suppressions=0`.

 ## Runtime overhead

@@ -153,7 +134,7 @@ instrumentation.

 ## Runtime reporting

-It is not the intention of this document to provide a detailed explanation of all the types of reports that can be output by the ASan runtime. Instead, the focus is on the differences between the standard reports for CPU issues, and reports for GPU issues.
+It is not the intention of this document to provide a detailed explanation of all of the types of reports that can be output by the ASan runtime. Instead, the focus is on the differences between the standard reports for CPU issues, and reports for GPU issues.

 An invalid address detection report for the CPU always starts with

@@ -200,7 +181,7 @@ or

 currently may include one or two surprising CPU side tracebacks mentioning :`hostcall`". This is due to how `malloc` and `free` are implemented for GPU code and these call stacks can be ignored.

-## Running ASan with `rocgdb`
+### Running with `rocgdb`

 `rocgdb` can be used to further investigate ASan detected errors, with some preparation.

@@ -217,7 +198,7 @@ This is solved by setting environment variable `LD_PRELOAD` to the path to the A
 amdclang++ -print-file-name=libclang_rt.asan-x86_64.so
 ```

-You should also set the environment variable `HIP_ENABLE_DEFERRED_LOADING=0` before debugging HIP applications.
+It is also recommended to set the environment variable `HIP_ENABLE_DEFERRED_LOADING=0` before debugging HIP applications.

 After starting `rocgdb` breakpoints can be set on the ASan runtime error reporting entry points of interest. For example, if an ASan error report includes

@@ -252,176 +233,18 @@ $ rocgdb <path to application>
 (gdb) c
 ```

-## Using ASan with a short HIP application
+### Using ASan with a short HIP application

-Consider the following simple and short demo of using the Address Sanitizer with a HIP application:
+Refer to the following example to use ASan with a short HIP application:

-```C++
+https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md

-#include <cstdlib>
-#include <hip/hip_runtime.h>
+### Known issues with using GPU sanitizer

-__global__ void
-set1(int *p)
-{
-    int i = blockDim.x*blockIdx.x + threadIdx.x;
-    p[i] = 1;
-}
-
-int
-main(int argc, char **argv)
-{
-    int m = std::atoi(argv[1]);
-    int n1 = std::atoi(argv[2]);
-    int n2 = std::atoi(argv[3]);
-    int c = std::atoi(argv[4]);
-    int *dp;
-    hipMalloc(&dp, m*sizeof(int));
-    hipLaunchKernelGGL(set1, dim3(n1), dim3(n2), 0, 0, dp);
-    int *hp = (int*)malloc(c * sizeof(int));
-    hipMemcpy(hp, dp, m*sizeof(int), hipMemcpyDeviceToHost);
-    hipDeviceSynchronize();
-    hipFree(dp);
-    free(hp);
-    std::puts("Done.");
-    return 0;
-}
-```
-
-This application will attempt to access invalid addresses for certain command line arguments. In particular, if `m < n1 * n2` some device threads will attempt to access
-unallocated device memory.
-
-Or, if `c < m`, the `hipMemcpy` function will copy past the end of the `malloc` allocated memory.
-
-**Note**: The `hipcc` compiler is used here for simplicity.
-
-Compiling without XNACK results in a warning.
-
-```bash
-$ hipcc -g --offload-arch=gfx90a:xnack- -fsanitize=address -shared-libsan mini.hip -o mini
-clang++: warning: ignoring` `-fsanitize=address' option for offload arch 'gfx90a:xnack-`, as it is not currently supported there. Use it with an offload arch containing 'xnack+' instead [-Woption-ignored]`.
-```
-
-The binary compiled above will run, but the GPU code will not be instrumented and the `m < n1 * n2` error will not be detected. Switching to `--offload-arch=gfx90a:xnack+` in the command above results in a warning-free compilation and an instrumented application. After setting `PATH`, `LD_LIBRARY_PATH` and `HSA_XNACK` as described earlier, a check of the binary with `ldd` yields the following,
-
-```bash
-$ ldd mini
-        linux-vdso.so.1 (0x00007ffd1a5ae000)
-        libclang_rt.asan-x86_64.so => /opt/rocm-6.1.0-99999/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so (0x00007fb9c14b6000)
-        libamdhip64.so.5 => /opt/rocm-6.1.0-99999/lib/asan/libamdhip64.so.5 (0x00007fb9bedd3000)
-        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb9beba8000)
-        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb9bea59000)
-        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb9bea3e000)
-        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb9be84a000)
-        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb9be844000)
-        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb9be821000)
-        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb9be817000)
-        libamd_comgr.so.2 => /opt/rocm-6.1.0-99999/lib/asan/libamd_comgr.so.2 (0x00007fb9b4382000)
-        libhsa-runtime64.so.1 => /opt/rocm-6.1.0-99999/lib/asan/libhsa-runtime64.so.1 (0x00007fb9b3b00000)
-        libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fb9b3af3000)
-        /lib64/ld-linux-x86-64.so.2 (0x00007fb9c2027000)
-        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fb9b3ad7000)
-        libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007fb9b3aa7000)
-        libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007fb9b3a89000)
-        libdrm.so.2 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2 (0x00007fb9b3a70000)
-        libdrm_amdgpu.so.1 => /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1 (0x00007fb9b3a62000)
-
-```
-
-This confirms that the address sanitizer runtime is linked in, and the ASan instrumented version of the runtime libraries are used.
-Checking the `PATH` yields
-
-```bash
-$ which llvm-symbolizer
-/opt/rocm-6.1.0-99999/llvm/bin/llvm-symbolizer
-```
-
-Lastly, a check of the OS kernel version yields
-
-```bash
-$ uname -rv
-5.15.0-73-generic #80~20.04.1-Ubuntu SMP Wed May 17 14:58:14 UTC 2023
-```
-
-which indicates that the required HMM support (kernel version > 5.6) is available. This completes the necessary setup. Running with `m = 100`, `n1 = 11`, `n2 = 10` and `c = 100` should produce
-a report for an invalid access by the last 10 threads.
-
-```bash
-=================================================================
-==3141==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fb1410d2cc4
-WRITE of size 4 in workgroup id (10,0,0)
-  #0 0x7fb1410d2cc4 in set1(int*) at /home/dave/mini/mini.cpp:0:10
-
-Thread ids and accessed addresses:
-00 : 0x7fb14371d190 01 : 0x7fb14371d194 02 : 0x7fb14371d198 03 : 0x7fb14371d19c 04 : 0x7fb14371d1a0 05 : 0x7fb14371d1a4 06 : 0x7fb14371d1a8 07 : 0x7fb14371d1ac
-08 : 0x7fb14371d1b0 09 : 0x7fb14371d1b4
-
-0x7fb14371d190 is located 0 bytes after 400-byte region [0x7fb14371d000,0x7fb14371d190)
-allocated by thread T0 here:
-    #0 0x7fb151c76828 in hsa_amd_memory_pool_allocate /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:692:3
-    #1 ...
-
-    #12 0x7fb14fb99ec4 in hipMalloc /work/dave/git/compute/external/clr/hipamd/src/hip_memory.cpp:568:3
-    #13 0x226630 in hipError_t hipMalloc<int>(int**, unsigned long) /opt/rocm-6.1.0-99999/include/hip/hip_runtime_api.h:8367:12
-    #14 0x226630 in main /home/dave/mini/mini.cpp:19:5
-    #15 0x7fb14ef02082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
-
-Shadow bytes around the buggy address:
-  0x7fb14371cf00: ...
-
-=>0x7fb14371d180: 00 00[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa
-  0x7fb14371d200: ...
-
-Shadow byte legend (one shadow byte represents 8 application bytes):
-  Addressable:           00
-  Partially addressable: 01 02 03 04 05 06 07
-  Heap left redzone:       fa
-  ...
-==3141==ABORTING
-```
-
-Running with `m = 100`, `n1 = 10`, `n2 = 10` and `c = 99` should produce a report for an invalid copy.
-
-```shell
-=================================================================
-==2817==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x514000150dcc at pc 0x7f5509551aca bp 0x7ffc90a7ae50 sp 0x7ffc90a7a610
-WRITE of size 400 at 0x514000150dcc thread T0
-    #0 0x7f5509551ac9 in __asan_memcpy /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:61:3
-    #1 ...
-
-    #9 0x7f5507462a28 in hipMemcpy_common(void*, void const*, unsigned long, hipMemcpyKind, ihipStream_t*) /work/dave/git/compute/external/clr/hipamd/src/hip_memory.cpp:637:10
-    #10 0x7f5507464205 in hipMemcpy /work/dave/git/compute/external/clr/hipamd/src/hip_memory.cpp:642:3
-    #11 0x226844 in main /home/dave/mini/mini.cpp:22:5
-    #12 0x7f55067c3082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
-    #13 0x22605d in _start (/home/dave/mini/mini+0x22605d)
-
-0x514000150dcc is located 0 bytes after 396-byte region [0x514000150c40,0x514000150dcc)
-allocated by thread T0 here:
-    #0 0x7f5509553dcf in malloc /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:69:3
-    #1 0x226817 in main /home/dave/mini/mini.cpp:21:21
-    #2 0x7f55067c3082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
-
-SUMMARY: AddressSanitizer: heap-buffer-overflow /work/dave/git/compute/external/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:61:3 in __asan_memcpy
-Shadow bytes around the buggy address:
-  0x514000150b00: ...
-
-=>0x514000150d80: 00 00 00 00 00 00 00 00 00[04]fa fa fa fa fa fa
-  0x514000150e00: ...
-
-Shadow byte legend (one shadow byte represents 8 application bytes):
-  Addressable:           00
-  Partially addressable: 01 02 03 04 05 06 07
-  Heap left redzone:       fa
-  ...
-==2817==ABORTING
-```
-
-## Known issues with using GPU sanitizer
-
-* Red zones must have limited size. It is possible for an invalid access to completely miss a red zone and not be detected.
+* Red zones must have limited size, so it is possible for an invalid access to completely miss a red zone and not be detected.

 * Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows.

 * Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation accesses to "private" or "stack" variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as "local data store" or "LDS") are also not instrumented.

-* It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime.
+* It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime.
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -21,7 +21,6 @@ for template in templates:
    with open(os.path.splitext(template)[0], 'w') as file:
        file.write(rendered)

-shutil.copy2('../CONTRIBUTING.md','./contribute/index.md')
 shutil.copy2('../RELEASE.md','./about/release-notes.md')
 # Keep capitalization due to similar linking on GitHub's markdown preview.
 shutil.copy2('../CHANGELOG.md','./about/CHANGELOG.md')
@@ -39,8 +38,8 @@ latex_elements = {
 project = "ROCm Documentation"
 author = "Advanced Micro Devices, Inc."
 copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
-version = "6.0.0"
-release = "6.0.0"
+version = "6.0.1"
+release = "6.0.1"
 setting_all_article_info = True
 all_article_info_os = ["linux", "windows"]
 all_article_info_author = ""
@@ -48,9 +47,14 @@ all_article_info_author = ""
 # pages with specific settings
 article_pages = [
    {
-        "file":"release",
+        "file":"about/release-notes",
        "os":["linux", "windows"],
-        "date":"2024-01-09"
+        "date":"2024-01-31"
+    },
+    {
+        "file":"about/CHANGELOG",
+        "os":["linux", "windows"],
+        "date":"2024-01-31"
    },

    {"file":"install/windows/install-quick", "os":["windows"]},
@@ -81,8 +85,6 @@ article_pages = [
    {"file":"how-to/tuning-guides", "os":["linux", "windows"]},

    {"file":"rocm-a-z", "os":["linux", "windows"]},
-
-    {"file":"about/release-notes", "os":["linux"]},
 ]

 exclude_patterns = ['temp']
--- a/docs/contribute/building.md
+++ b/docs/contribute/building.md
@@ -31,11 +31,6 @@ Use the Python Virtual Environment (`venv`) and run the following commands from
 ```sh
 python3 -mvenv .venv

-# Windows
-.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
-.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
-
-# Linux
 .venv/bin/python     -m pip install -r docs/sphinx/requirements.txt
 .venv/bin/python     -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
 ```
@@ -129,12 +124,12 @@ documentation locally using Visual Studio (VS) Code. Follow these steps to confi
      }
    ```

-    > (Implementation detail: two problem matchers were needed to be defined,
+    > Implementation detail: two problem matchers had to be defined,
    > because VS Code doesn't tolerate some problem information being potentially
    > absent. While a single regex could match all types of errors, if a capture
    > group remains empty (the line number doesn't show up in all warning/error
    > messages) but the `pattern` references said empty capture group, VS Code
-    > discards the message completely.)
+    > discards the message completely.

 4. Configure the Python virtual environment (`venv`).

--- a/docs/contribute/contributing.md
+++ b/docs/contribute/contributing.md
@@ -0,0 +1,122 @@
+<head>
+  <meta charset="UTF-8">
+  <meta name="description" content="Contributing to ROCm">
+  <meta name="keywords" content="ROCm, contributing, contribute, maintainer, contributor">
+</head>
+
+# Contribute to ROCm documentation
+
+All ROCm projects are GitHub-based, so if you want to contribute, you can do so by:
+
+* [Submitting a pull request in the appropriate GitHub repository](#submit-a-pull-request)
+* [Creating an issue in the appropriate GitHub repository](#create-an-issue)
+* [Suggesting a new feature](#suggest-a-new-feature)
+
+```{important}
+By creating a pull request (PR), you agree to allow your contribution to be licensed under the terms of the
+LICENSE.txt file in the corresponding repository. Different repositories may use different licenses.
+```
+
+## Submit a pull request
+
+To make edits to our documentation via PR, follow these steps:
+
+1. Identify the repository and the file you want to update. For example, to update this page, you would
+  need to modify content located in this file:
+  `https://github.com/ROCm/ROCm/blob/develop/docs/contribute/contributing.md`
+
+2. (optional, but recommended) Fork the repository.
+
+3. Clone the repository locally and (optionally) add your fork. Select the green 'Code' button and copy
+   the URL (e.g., `git@github.com:ROCm/ROCm.git`).
+
+   * From your terminal, run:
+
+      ```bash
+      git clone git@github.com:ROCm/ROCm.git
+      ```
+
+   * Add your fork to this local copy of the repository. Run:
+
+      ```bash
+      git remote add <name-of-my-fork> <git@github.com:my-username/ROCm.git>
+      ```
+
+      To get the URL of your fork, go to your GitHub profile, select the fork and click the green 'Code'
+      button (the same process you followed to get the main GitHub repository URL).
+
+4. Check out the **develop** branch and run `git pull` (and/or `git pull origin develop`) to ensure your
+  local version has the most recent content.
+
+5. Create a new branch.
+
+    ```bash
+    git checkout -b my-new-branch
+    ```
+
+6. Make your changes locally using your preferred code editor. Follow the guidelines listed on the
+   [documentation structure](./doc-structure.md) page.
+
+7. (optional) We recommend running a local test build to ensure the content looks the way you expect.
+
+    In your terminal, run the following commands from within your cloned repository:
+
+     ```bash
+     cd docs/   # The other commands are run from within the ./docs folder
+     
+     pip3 install -r sphinx/requirements.txt  # You only need to run this command once
+     
+     python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
+     ```
+
+    The build files are located in the `docs/_build` folder. To preview your build, open the index file
+    (`docs/_build/html/index.html`). For more information, see
+    [Building documentation](building.md). To learn
+    more about our build tools, see
+    [Documentation toolchain](toolchain.md).
+
+8. Commit your changes and push them to GitHub. Run:
+
+    ```bash
+    git add <path-to-my-modified-file> # To add all modified files, you can use: git add .
+
+    git commit -m "my-updates"
+
+    git push <name-of-my-fork>
+    ```
+
+    After pushing, you will get a GitHub link in the terminal output. Copy this link and paste it into a
+    browser to create your PR.
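+
+Step 4 above describes its commands inline. As a copyable sketch, assuming the main ROCm repository
+is configured as the `origin` remote of your local clone (adjust the remote name if yours differs):
+
+```bash
+# Switch to the develop branch of your local clone.
+git checkout develop
+
+# Fetch and merge the latest develop content from the main repository.
+git pull origin develop
+```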
+
+## Create an issue
+
+1. To create a new GitHub issue, select the 'Issues' tab in the appropriate repository
+  (e.g., https://github.com/ROCm/ROCm/issues).
+2. Use the search bar to make sure the issue doesn't already exist.
+3. If your issue is not already listed, select the green 'New issue' button to the right of the page. Select
+  the type of issue and fill in the resulting template.
+
+### General issue guidelines
+
+* Use your best judgement for issue creation. If your issue is already listed, upvote the issue and
+  comment or post to provide additional details, such as how you reproduced this issue.
+* If you're not sure if your issue is the same, err on the side of caution and file your issue.
+  You can add a comment to include the issue number (and link) for the similar issue. If we evaluate
+  your issue as being the same as the existing issue, we'll close the duplicate.
+* If your issue doesn't exist, use the issue template to file a new issue.
+  * When filing an issue, be sure to provide as much information as possible, including script output so
+    we can collect information about your configuration. This helps reduce the time required to
+    reproduce your issue.
+  * Check your issue regularly, as we may require additional information to successfully reproduce the
+    issue.
+
+## Suggest a new feature
+
+Use the [GitHub Discussion forum](https://github.com/ROCm/ROCm/discussions)
+(Ideas category) to propose new features. Our maintainers are happy to provide direction and
+feedback on feature development.
+
+## Future development workflow
+
+The current ROCm development workflow is GitHub-based. If, in the future, we change this platform,
+the tools and links may change. In this instance, we will update contribution guidelines accordingly.
--- a/docs/contribute/contribute-docs.md
+++ b/docs/contribute/contribute-docs.md
@@ -1,14 +1,4 @@
-# Contributing to ROCm documentation
-
-AMD values and encourages contributions to our code and documentation. If you choose to
-contribute, we encourage you to be polite and respectful. Improving documentation is a long-term
-process, to which we are dedicated.
-
-If you have issues when trying to contribute, refer to the
-[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our GitHub
-repository.
-
-## Folder structure and naming convention
+# Documentation structure

 Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the
 `/docs` folder. Some special files (such as release, contributing, and changelog) are stored in the root
@@ -218,9 +208,9 @@ guide our content.

 Font size and type, page layout, white space control, and other formatting
 details are controlled via
-[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you want to notify us
+[rocm-docs-core](https://github.com/ROCm/rocm-docs-core). If you want to notify us
 of any formatting issues, create a pull request in our
-[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub repository.
+[rocm-docs-core](https://github.com/ROCm/rocm-docs-core) GitHub repository.

 ## Building our documentation

--- a/docs/contribute/feedback.md
+++ b/docs/contribute/feedback.md
@@ -4,9 +4,9 @@
  <meta name="keywords" content="documentation, pull request, GitHub, AMD, ROCm">
 </head>

-# Providing feedback for ROCm documentation
+# Providing feedback

-There are four standard ways to provide feedback for this repository.
+There are four standard ways to provide feedback on this repository.

 ## Pull request

@@ -15,18 +15,21 @@ All contributions to ROCm documentation should arrive via the
 targeting the develop branch of the repository. If you are unable to contribute
 via the GitHub Flow, feel free to email us at [rocm-feedback@amd.com](mailto:rocm-feedback@amd.com?subject=Documentation%20Feedback).

+For more in-depth information on creating a pull request (PR), see
+[Contributing](./contributing.md).
+
 ## GitHub discussions

 To ask questions or view answers to frequently asked questions, refer to
-[GitHub Discussions](https://github.com/RadeonOpenCompute/ROCm/discussions).
+[GitHub Discussions](https://github.com/ROCm/ROCm/discussions).
 On GitHub Discussions, in addition to asking and answering questions,
 members can share updates, have open-ended conversations,
 and follow along via public announcements.

 ## GitHub issue

-Issues on existing or absent docs can be filed as
-[GitHub Issues](https://github.com/RadeonOpenCompute/ROCm/issues).
+Issues on existing or absent documentation can be filed in
+[GitHub Issues](https://github.com/ROCm/ROCm/issues).

 ## Email

--- a/docs/contribute/toolchain.md
+++ b/docs/contribute/toolchain.md
@@ -10,68 +10,56 @@ Our documentation relies on several open source toolchains and sites.

 ## `rocm-docs-core`

-[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) is an AMD-maintained
-project that applies customization for our documentation. This
-project is the tool most ROCm repositories use as part of the documentation
-build. It is also available as a [pip package on PyPI](https://pypi.org/project/rocm-docs-core/).
+[rocm-docs-core](https://github.com/ROCm/rocm-docs-core) is an AMD-maintained
+project that applies customization for our documentation. This project is the tool most ROCm
+repositories use as part of the documentation build. It is also available as a
+[pip package on PyPI](https://pypi.org/project/rocm-docs-core/).

-See the user and developer guides for rocm-docs-core at {doc}`rocm-docs-core documentation<rocm-docs-core:index>`.
+See the user and developer guides for rocm-docs-core at
+{doc}`rocm-docs-core documentation<rocm-docs-core:index>`.

 ## Sphinx

-[Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator
-originally used for Python. It is now widely used in the open source community.
-Originally, Sphinx supported reStructuredText (RST) based documentation, but
-Markdown support is now available.
-ROCm documentation plans to default to Markdown for new projects.
-Existing projects using RST are under no obligation to convert to Markdown. New
-projects that believe Markdown is not suitable should contact the documentation
-team prior to selecting RST.
-
-## Read the Docs
-
-[Read the Docs](https://docs.readthedocs.io/en/stable/) is the service that builds
-and hosts the HTML documentation generated using Sphinx to our end users.
-
-## Doxygen
-
-[Doxygen](https://www.doxygen.nl/) is a documentation generator that extracts
-information from inline code.
-ROCm projects typically use Doxygen for public API documentation unless the
-upstream project uses a different tool.
-
-### Breathe
-
-[Breathe](https://www.breathe-doc.org/) is a Sphinx plugin to integrate Doxygen
-content.
-
-### MyST
-
-[Markedly Structured Text (MyST)](https://myst-tools.org/docs/spec) is an extended
-flavor of Markdown ([CommonMark](https://commonmark.org/)) influenced by reStructuredText (RST) and Sphinx.
-It is integrated into ROCm documentation by the Sphinx extension [`myst-parser`](https://myst-parser.readthedocs.io/en/latest/).
-A cheat sheet that showcases how to use the MyST syntax is available over at
-the [Jupyter reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html).
+[Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator originally used for
+Python. It is now widely used in the open source community.

 ### Sphinx External ToC

-[Sphinx External ToC](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html)
-is a Sphinx extension used for ROCm documentation navigation. This tool generates a navigation menu on the left
-based on a YAML file that specifies the table of contents.
-It was selected due to its flexibility that allows scripts to operate on the
-YAML file. Please transition to this file for the project's navigation. You can
-see the `_toc.yml.in` file in this repository in the `docs/sphinx` folder for an
-example.
+[Sphinx External ToC](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html) is a Sphinx
+extension used for ROCm documentation navigation. This tool generates a navigation menu on the left
+based on a YAML file (`_toc.yml.in`) that contains the table of contents.

 ### Sphinx-book-theme

-[Sphinx-book-theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme
-that defines the base appearance for ROCm documentation.
-ROCm documentation applies some customization,
-such as a custom header and footer on top of the Sphinx Book Theme.
+[Sphinx-book-theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme that
+defines the base appearance for ROCm documentation. ROCm documentation applies some
+customization, such as a custom header and footer on top of the Sphinx Book Theme.

-### Sphinx design
+### Sphinx Design

-[Sphinx design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that adds design
-functionality.
-ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs.
+[Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that
+adds design functionality. ROCm documentation uses Sphinx Design for grids, cards, and synchronized
+tabs.
+
+## Doxygen
+
+[Doxygen](https://www.doxygen.nl/) is a documentation generator that extracts information from inline
+code. ROCm projects typically use Doxygen for public API documentation (unless the upstream project
+uses a different tool).
+
+## Breathe
+
+[Breathe](https://www.breathe-doc.org/) is a Sphinx plugin to integrate Doxygen content.
+
+## MyST
+
+[Markedly Structured Text (MyST)](https://myst-tools.org/docs/spec) is an extended flavor of
+Markdown ([CommonMark](https://commonmark.org/)) influenced by reStructuredText (RST) and
+Sphinx. It's integrated into ROCm documentation by the Sphinx extension
+[`myst-parser`](https://myst-parser.readthedocs.io/en/latest/).
+A MyST syntax cheat sheet is available on the [Jupyter reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html) site.
+
+## Read the Docs
+
+[Read the Docs](https://docs.readthedocs.io/en/stable/) is the service that builds and hosts the HTML
+documentation generated using Sphinx for our end users.
--- a/docs/data/about/compatibility/floating-point-data-types.png
+++ b/docs/data/about/compatibility/floating-point-data-types.png
--- a/docs/data/banner-compatibility.jpg
+++ b/docs/data/banner-compatibility.jpg
--- a/docs/data/banner-conceptual.jpg
+++ b/docs/data/banner-conceptual.jpg
--- a/docs/data/banner-howto.jpg
+++ b/docs/data/banner-howto.jpg
--- a/docs/data/banner-installation.jpg
+++ b/docs/data/banner-installation.jpg
--- a/docs/data/banner-reference.jpg
+++ b/docs/data/banner-reference.jpg
--- a/docs/data/banner-text.xcf
+++ b/docs/data/banner-text.xcf
--- a/docs/data/banner.png
+++ b/docs/data/banner.png
--- a/docs/data/conceptual/gpu-arch/image007.png
+++ b/docs/data/conceptual/gpu-arch/image007.png
--- a/docs/data/conceptual/gpu-arch/image008.png
+++ b/docs/data/conceptual/gpu-arch/image008.png
--- a/docs/data/conceptual/gpu-arch/image009.png
+++ b/docs/data/conceptual/gpu-arch/image009.png
--- a/docs/data/contribute/clone-repo.png
+++ b/docs/data/contribute/clone-repo.png
--- a/docs/data/contribute/fork-repo.png
+++ b/docs/data/contribute/fork-repo.png
--- a/docs/data/install/magma-install/magma005.png
+++ b/docs/data/install/magma-install/magma005.png
--- a/docs/data/install/linux/linux001.png
+++ b/docs/data/install/linux/linux001.png
--- a/docs/data/install/linux/linux002.png
+++ b/docs/data/install/linux/linux002.png
--- a/docs/data/install/linux/linux003.png
+++ b/docs/data/install/linux/linux003.png
--- a/docs/data/install/linux/linux004.png
+++ b/docs/data/install/linux/linux004.png
--- a/docs/data/install/magma-install/magma006.png
+++ b/docs/data/install/magma-install/magma006.png
--- a/docs/data/install/windows/000-settings-dark.png
+++ b/docs/data/install/windows/000-settings-dark.png
--- a/docs/data/install/windows/000-settings-light.png
+++ b/docs/data/install/windows/000-settings-light.png
--- a/docs/data/install/windows/000-setup-icon.png
+++ b/docs/data/install/windows/000-setup-icon.png
--- a/docs/data/install/windows/001-about-dark.png
+++ b/docs/data/install/windows/001-about-dark.png
--- a/docs/data/install/windows/001-about-light.png
+++ b/docs/data/install/windows/001-about-light.png
--- a/docs/data/install/windows/001-uac-dark.png
+++ b/docs/data/install/windows/001-uac-dark.png
--- a/docs/data/install/windows/001-uac-light.png
+++ b/docs/data/install/windows/001-uac-light.png
--- a/docs/data/install/windows/002-initializing.png
+++ b/docs/data/install/windows/002-initializing.png
--- a/docs/data/install/windows/003-detecting-system-config.png
+++ b/docs/data/install/windows/003-detecting-system-config.png
--- a/docs/data/install/windows/004-installer-window.png
+++ b/docs/data/install/windows/004-installer-window.png
--- a/docs/data/install/windows/012-install-progress.png
+++ b/docs/data/install/windows/012-install-progress.png
--- a/docs/data/install/windows/013-install-complete.png
+++ b/docs/data/install/windows/013-install-complete.png
--- a/docs/data/install/windows/014-uninstall-dark.png
+++ b/docs/data/install/windows/014-uninstall-dark.png
--- a/docs/data/install/windows/014-uninstall-light.png
+++ b/docs/data/install/windows/014-uninstall-light.png
--- a/docs/data/reference/banner-ai.jpg
+++ b/docs/data/reference/banner-ai.jpg
--- a/docs/data/reference/banner-ai.png
+++ b/docs/data/reference/banner-ai.png
--- a/docs/data/reference/banner-communication.jpg
+++ b/docs/data/reference/banner-communication.jpg
--- a/docs/data/reference/banner-cpp-primitives.jpg
+++ b/docs/data/reference/banner-cpp-primitives.jpg
--- a/docs/data/reference/banner-development.jpg
+++ b/docs/data/reference/banner-development.jpg
--- a/docs/data/reference/banner-hip.jpg
+++ b/docs/data/reference/banner-hip.jpg
--- a/docs/data/reference/banner-math.jpg
+++ b/docs/data/reference/banner-math.jpg
--- a/docs/data/reference/banner-performance.jpg
+++ b/docs/data/reference/banner-performance.jpg
--- a/docs/data/reference/banner-random-number.jpg
+++ b/docs/data/reference/banner-random-number.jpg
--- a/docs/data/reference/banner-system.jpg
+++ b/docs/data/reference/banner-system.jpg
--- a/docs/how-to/deep-learning-rocm.md
+++ b/docs/how-to/deep-learning-rocm.md
@@ -13,7 +13,7 @@ the sequential flow for the use of each framework. Refer to the ROCm Compatible
 Frameworks Release Notes for each framework's most current release notes at
 {doc}`Third-party support<rocm-install-on-linux:reference/3rd-party-support-matrix>`.

-![ROCm Compatible Frameworks Flowchart](../data/install/magma-install/magma005.png "ROCm Compatible Frameworks")
+![ROCm Compatible Frameworks Flowchart](../data/how-to/magma005.png "ROCm Compatible Frameworks")

 ## Frameworks installation

--- a/docs/how-to/tuning-guides/mi100.md
+++ b/docs/how-to/tuning-guides/mi100.md
@@ -368,7 +368,7 @@ For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
 {doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. To verify that the installation was
 successful, refer to the
 {doc}`post-install instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
-[Validation Tools](../../reference/library-index.md). Should verification
+[system tools](../../reference/rocm-tools.md). Should verification
 fail, consult the [System Debugging Guide](../system-debugging.md).

 (mi100-hw-verification)=
@@ -455,7 +455,7 @@ sudo zypper install rocm-bandwidth-test
 ::::

 Alternatively, the source code can be downloaded and built from
-[source](https://github.com/RadeonOpenCompute/rocm_bandwidth_test).
+[source](https://github.com/ROCm/rocm_bandwidth_test).

 The output will list the available compute devices (CPUs and GPUs):

--- a/docs/how-to/tuning-guides/mi200.md
+++ b/docs/how-to/tuning-guides/mi200.md
@@ -353,7 +353,7 @@ For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
 {doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. For verifying that the
 installation was successful, refer to the
 {doc}`post-install instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
-[Validation Tools](../../reference/library-index.md). Should verification
+[system tools](../../reference/rocm-tools.md). Should verification
 fail, consult the [System Debugging Guide](../system-debugging.md).

 (mi200-hw-verification)=
@@ -440,7 +440,7 @@ sudo zypper install rocm-bandwidth-test
 ::::

 Alternatively, the source code can be downloaded and built from
-[source](https://github.com/RadeonOpenCompute/rocm_bandwidth_test).
+[source](https://github.com/ROCm/rocm_bandwidth_test).

 The output will list the available compute devices (CPUs and GPUs), including
 their device ID and PCIe ID:
--- a/docs/how-to/tuning-guides/w6000-v620.md
+++ b/docs/how-to/tuning-guides/w6000-v620.md
@@ -122,8 +122,7 @@ sudo reboot
 ```

 Install the GPU-IOV Module (GIM, where IOV is I/O Virtualization) driver and
-follow the steps below. To obtain the GIM driver, write to us
-[here](mailto:CloudGPUsupport@amd.com):
+follow the steps below.

 ```shell
 sudo dpkg -i <gim_driver>
@@ -167,6 +166,4 @@ First, assign GPU virtual function (VF) to VM using the following steps.
 Then start the VM.

 Finally install ROCm on the virtual machine (VM). For detailed instructions,
-refer to the {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`. For any
-issue encountered during installation, write to us
-[here](mailto:CloudGPUsupport@amd.com).
+refer to the {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`.
--- a/docs/index.md
+++ b/docs/index.md
@@ -10,20 +10,25 @@
 Welcome to the ROCm docs home page! If you're new to ROCm, you can review the following
 resources to learn more about our products and what we support:

-* [What is ROCm?](./what-is-rocm.md)
+* [What is ROCm?](./what-is-rocm.rst)
 * [Release notes](./about/release-notes.md)

+You can install ROCm on our Radeon™, Radeon™ PRO, and Instinct™ GPUs. If you're using Radeon
+GPUs, we recommend reading the
+{doc}`Radeon-specific ROCm documentation<radeon:index>`.
+
+For hands-on applications, refer to our [ROCm blogs](https://rocm.blogs.amd.com/) site.
+
 Our documentation is organized into the following categories:

 ::::{grid} 1 2 2 2
 :class-container: rocm-doc-grid

 :::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ./data/banner-installation.jpg
+:img-alt: Install documentation
 :padding: 2
-**Installation**
-
-Installation guides
-^^^

 * Linux
  * {doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`
@@ -37,31 +42,51 @@ Installation guides
 * {doc}`TensorFlow for ROCm<rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
 * {doc}`MAGMA for ROCm<rocm-install-on-linux:how-to/3rd-party/magma-install>`
 * {doc}`ROCm & Spack<rocm-install-on-linux:how-to/spack>`
-
 :::

 :::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ./data/banner-compatibility.jpg
+:img-alt: Compatibility information
 :padding: 2
-**Compatibility & support**
-
-ROCm compatibility information
-^^^

 * {doc}`System requirements (Linux)<rocm-install-on-linux:reference/system-requirements>`
 * {doc}`System requirements (Windows)<rocm-install-on-windows:reference/system-requirements>`
-* {doc}`Third-party<rocm-install-on-linux:reference/3rd-party-support-matrix>`
+* {doc}`Third-party support<rocm-install-on-linux:reference/3rd-party-support-matrix>`
 * {doc}`User/kernel space<rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`
 * {doc}`Docker<rocm-install-on-linux:reference/docker-image-support-matrix>`
 * [OpenMP](./about/compatibility/openmp.md)
-
+* [Precision support](./about/compatibility/data-type-support.rst)
+* {doc}`ROCm on Radeon GPUs<radeon:index>`
 :::

 :::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ./data/banner-reference.jpg
+:img-alt: Reference documentation
 :padding: 2
-**How-to**

-Task-oriented walkthroughs
-^^^
+* [API libraries](./reference/api-libraries.md)
+  * Artificial intelligence
+  * C++ primitives
+  * Communication
+  * Fast Fourier transforms
+  * HIP
+  * Linear algebra
+  * Random number generators
+* [Tools](./reference/rocm-tools.md)
+  * Development
+  * Performance analysis
+  * System
+* [GPU architectures](./reference/gpu-arch.rst)
+  * [GPU architecture hardware specification overview](./reference/gpu-arch/gpu-arch-spec-overview.rst)
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ./data/banner-howto.jpg
+:img-alt: How-to documentation
+:padding: 2

 * [System tuning for various architectures](./how-to/tuning-guides.md)
  * [MI100](./how-to/tuning-guides/mi100.md)
@@ -71,31 +96,18 @@ Task-oriented walkthroughs
 * [GPU-enabled MPI](./how-to/gpu-enabled-mpi.rst)
 * [System level debugging](./how-to/system-debugging.md)
 * [GitHub examples](https://github.com/amd/rocm-examples)
-
 :::

 :::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ./data/banner-conceptual.jpg
+:img-alt: Conceptual documentation
 :padding: 2
-**Reference**
-
-Collated information
-^^^
-
-* [API Libraries](./reference/library-index.md)
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Conceptual**
-
-Topic overviews & background information
-^^^

 * [GPU architecture](./conceptual/gpu-arch.md)
  * [MI100](./conceptual/gpu-arch/mi100.md)
-  * [MI200](./conceptual/gpu-arch/mi200-performance-counters.md)
  * [MI250](./conceptual/gpu-arch/mi250.md)
+  * [MI300](./conceptual/gpu-arch/mi300.md)
 * [GPU memory](./conceptual/gpu-memory.md)
 * [Compiler disambiguation](./conceptual/compiler-disambiguation.md)
 * [File structure (Linux FHS)](./conceptual/file-reorg.md)
@@ -106,13 +118,6 @@ Topic overviews & background information
 * [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
 * [Inference optimization with MIGraphX](./conceptual/ai-migraphx-optimization.md)
 * [OpenMP support in ROCm](./about/compatibility/openmp.md)
-
 :::

 ::::
-
-We welcome collaboration! If you'd like to contribute to our documentation, you can find instructions
-on our [Contribute to ROCm docs](./contribute/index.md) page. Known issues are listed on
-[GitHub](https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue).
-
-Licensing information for all ROCm components is listed on our [Licensing](./about/license.md) page.
--- a/docs/reference/api-libraries.md
+++ b/docs/reference/api-libraries.md
@@ -0,0 +1,87 @@
+<head>
+  <meta charset="UTF-8">
+  <meta name="description" content="ROCm API libraries & tools">
+  <meta name="keywords" content="ROCm, API, libraries, tools, artificial intelligence, development,
+  Communications, C++ primitives, Fast Fourier transforms, FFTs, random number generators, linear
+  algebra, AMD">
+</head>
+
+# ROCm API libraries
+
+::::{grid} 1 2 2 2
+:class-container: rocm-doc-grid
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-ai.jpg
+:img-alt: Artificial intelligence APIs
+:padding: 2
+
+* {doc}`Composable Kernel <composable_kernel:index>`
+* {doc}`MIGraphX <amdmigraphx:index>`
+* {doc}`MIOpen <miopen:index>`
+* {doc}`MIVisionX <mivisionx:doxygen/html/index>`
+* [ROCm Performance Primitives (RPP)](https://rocm.docs.amd.com/projects/rpp/en/latest/)
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-cpp-primitives.jpg
+:img-alt: C++ primitives
+:padding: 2
+
+* {doc}`hipCUB <hipcub:index>`
+* {doc}`hipTensor <hiptensor:index>`
+* {doc}`rocPRIM <rocprim:index>`
+* {doc}`rocThrust <rocthrust:index>`
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-communication.jpg
+:img-alt: Communication APIs
+:padding: 2
+
+* {doc}`RCCL <rccl:index>`
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-hip.jpg
+:img-alt: HIP APIs
+:padding: 2
+
+* {doc}`HIP runtime <hip:index>`
+* {doc}`HIPIFY <hipify:index>`
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-math.jpg
+:img-alt: Math APIs
+:padding: 2
+
+* [half](https://github.com/ROCm/half)
+* {doc}`hipBLAS <hipblas:index>` / {doc}`rocBLAS <rocblas:index>`
+* {doc}`hipBLASLt <hipblaslt:index>`
+* {doc}`hipFFT <hipfft:index>` / {doc}`rocFFT <rocfft:index>`
+* [hipfort](https://rocm.docs.amd.com/projects/hipfort/en/latest/)
+* {doc}`hipSOLVER <hipsolver:index>` / {doc}`rocSOLVER <rocsolver:index>`
+* {doc}`hipSPARSE <hipsparse:index>` / {doc}`rocSPARSE <rocsparse:index>`
+* {doc}`hipSPARSELt <hipsparselt:index>`
+* {doc}`rocALUTION <rocalution:index>`
+* {doc}`rocWMMA <rocwmma:index>`
+* [Tensile](https://github.com/ROCm/Tensile)
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-random-number.jpg
+:img-alt: Random number APIs
+:padding: 2
+
+* {doc}`hipRAND <hiprand:index>`
+* {doc}`rocRAND <rocrand:index>`
+:::
+
+::::
--- a/docs/reference/gpu-arch.rst
+++ b/docs/reference/gpu-arch.rst
@@ -0,0 +1,13 @@
+.. meta::
+    :description: GPU Architecture reference
+    :keywords: AMD, GPU, architecture, hardware, CDNA, Instinct, reference
+
+.. _gpu-arch-reference:
+
+GPU architecture reference
+##########################
+
+General overview
+""""""""""""""""
+
+* :doc:`GPU architecture hardware specifications overview<gpu-arch/gpu-arch-spec-overview>`
--- a/docs/reference/gpu-arch/gpu-arch-spec-overview.rst
+++ b/docs/reference/gpu-arch/gpu-arch-spec-overview.rst
@@ -0,0 +1,241 @@
+.. meta::
+   :description: AMD Instinct™ GPU architecture information
+   :keywords: Instinct, CDNA, GPU, architecture, VRAM, Compute Units, Cache, Registers, LDS, Register File
+
+GPU architecture hardware specifications
+########################################
+
+The following table provides an overview of the hardware specifications for the AMD Instinct accelerators.
+
+.. list-table:: AMD Instinct architecture specification table
+    :header-rows: 1
+    :name: instinct-arch-spec-table
+
+    *
+      - Model
+      - Architecture
+      - LLVM target name
+      - VRAM
+      - Compute Units
+      - Wavefront Size
+      - LDS
+      - L3 Cache
+      - L2 Cache
+      - L1 Vector Cache
+      - L1 Scalar Cache
+      - L1 Instruction Cache
+      - VGPR File
+      - SGPR File
+    *
+      - MI300X
+      - CDNA3
+      - gfx941 or gfx942
+      - 192 GiB
+      - 304
+      - 64
+      - 64 KiB
+      - 256 MiB
+      - 32 MiB
+      - 32 KiB
+      - 16 KiB per 2 CUs
+      - 64 KiB per 2 CUs
+      - 512 KiB
+      - 12.5 KiB
+    *
+      - MI300A
+      - CDNA3
+      - gfx940 or gfx942
+      - 128 GiB
+      - 228
+      - 64
+      - 64 KiB
+      - 256 MiB
+      - 24 MiB
+      - 32 KiB
+      - 16 KiB per 2 CUs
+      - 64 KiB per 2 CUs
+      - 512 KiB
+      - 12.5 KiB
+    *
+      - MI250X
+      - CDNA2
+      - gfx90a
+      - 128 GiB
+      - 220 (110 per GCD)
+      - 64
+      - 64 KiB
+      -
+      - 16 MiB (8 MiB per GCD)
+      - 16 KiB
+      - 16 KiB per 2 CUs
+      - 32 KiB per 2 CUs
+      - 512 KiB
+      - 12.5 KiB
+    *
+      - MI250
+      - CDNA2
+      - gfx90a
+      - 128 GiB
+      - 208
+      - 64
+      - 64 KiB
+      -
+      - 16 MiB (8 MiB per GCD)
+      - 16 KiB
+      - 16 KiB per 2 CUs
+      - 32 KiB per 2 CUs
+      - 512 KiB
+      - 12.5 KiB
+    *
+      - MI210
+      - CDNA2
+      - gfx90a
+      - 64 GiB
+      - 104
+      - 64
+      - 64 KiB
+      -
+      - 8 MiB
+      - 16 KiB
+      - 16 KiB per 2 CUs
+      - 32 KiB per 2 CUs
+      - 512 KiB
+      - 12.5 KiB
+    *
+      - MI100
+      - CDNA
+      - gfx908
+      - 32 GiB
+      - 120
+      - 64
+      - 64 KiB
+      -
+      - 8 MiB
+      - 16 KiB
+      - 16 KiB per 3 CUs
+      - 32 KiB per 3 CUs
+      - 256 KiB VGPR and 256 KiB AccVGPR
+      - 12.5 KiB
+    *
+      - MI60
+      - GCN 5.1
+      - gfx906
+      - 32 GiB
+      - 64
+      - 64
+      - 64 KiB
+      -
+      - 4 MiB
+      - 16 KiB
+      - 16 KiB per 3 CUs
+      - 32 KiB per 3 CUs
+      - 256 KiB
+      - 12.5 KiB
+    *
+      - MI50 (32GB)
+      - GCN 5.1
+      - gfx906
+      - 32 GiB
+      - 60
+      - 64
+      - 64 KiB
+      -
+      - 4 MiB
+      - 16 KiB
+      - 16 KiB per 3 CUs
+      - 32 KiB per 3 CUs
+      - 256 KiB
+      - 12.5 KiB
+    *
+      - MI50 (16GB)
+      - GCN 5.1
+      - gfx906
+      - 16 GiB
+      - 60
+      - 64
+      - 64 KiB
+      -
+      - 4 MiB
+      - 16 KiB
+      - 16 KiB per 3 CUs
+      - 32 KiB per 3 CUs
+      - 256 KiB
+      - 12.5 KiB
+    *
+      - MI25
+      - GCN 5.0
+      - gfx900
+      - 16 GiB
+      - 64
+      - 64
+      - 64 KiB
+      -
+      - 4 MiB
+      - 16 KiB
+      - 16 KiB per 3 CUs
+      - 32 KiB per 3 CUs
+      - 256 KiB
+      - 12.5 KiB
+    *
+      - MI8
+      - GCN 3.0
+      - gfx803
+      - 4 GiB
+      - 64
+      - 64
+      - 64 KiB
+      -
+      - 2 MiB
+      - 16 KiB
+      - 16 KiB per 4 CUs
+      - 32 KiB per 4 CUs
+      - 256 KiB
+      - 12.5 KiB
+    *
+      - MI6
+      - GCN 4.0
+      - gfx803
+      - 16 GiB
+      - 36
+      - 64
+      - 64 KiB
+      -
+      - 2 MiB
+      - 16 KiB
+      - 16 KiB per 4 CUs
+      - 32 KiB per 4 CUs
+      - 256 KiB
+      - 12.5 KiB
+
+Glossary
+########
+
+For a more detailed explanation, refer to the :ref:`specific documents and guides <gpu-arch-documentation>`.
+
+LLVM target name
+  Argument to pass to clang in `--offload-arch` to compile code for the given architecture.
+VRAM
+  Amount of memory available on the GPU.
+Compute Units
+  Number of compute units on the GPU.
+Wavefront Size
+  Number of work-items that execute in parallel on a single compute unit. This is equivalent to the warp size in HIP.
+LDS
+  The Local Data Share (LDS) is a low-latency, high-bandwidth scratch pad memory. It is local to the compute units, shared by all work-items in a work group. In HIP this is the shared memory, which is shared by all threads in a block.
+L3 Cache
+  Size of the level 3 cache. Shared by all compute units on the same GPU. Caches vector and scalar data and instructions.
+L2 Cache
+  Size of the level 2 cache. Shared by all compute units on the same GCD. Caches vector and scalar data and instructions.
+L1 Vector Cache
+  Size of the level 1 vector data cache. Local to a compute unit. Caches vector data.
+L1 Scalar Cache
+  Size of the level 1 scalar data cache. Usually shared by several compute units. Caches scalar data.
+L1 Instruction Cache
+  Size of the level 1 instruction cache. Usually shared by several compute units.
+VGPR File
+  Size of the Vector General Purpose Register (VGPR) file. Holds data used in vector instructions.
+  GPUs with matrix cores also have AccVGPRs, which are Accumulation Vector General Purpose Registers, specifically used in matrix instructions.
+SGPR File
+  Size of the Scalar General Purpose Register (SGPR) file. Holds data used in scalar instructions.
+GCD
+  Graphics Compute Die.
--- a/docs/reference/library-index.md
+++ b/docs/reference/library-index.md
@@ -1,145 +0,0 @@
-<head>
-  <meta charset="UTF-8">
-  <meta name="description" content="ROCm API libraries & tools">
-  <meta name="keywords" content="ROCm, API, libraries, tools, artificial intelligence, development,
-  Communications, C++ primitives, Fast Fourier transforms, FFTs, random number generators, linear
-  algebra, AMD">
-</head>
-
-# ROCm API libraries & tools
-
-::::{grid} 1 3 3 3
-:class-container: rocm-doc-grid
-
-:::{grid-item-card}
-:padding: 2
-**Artificial intelligence**
-
-^^^
-
-* {doc}`Composable Kernel <composable_kernel:index>`
-* {doc}`MIGraphX <amdmigraphx:index>`
-* {doc}`MIOpen <miopen:index>`
-* {doc}`MIVisionX <mivisionx:doxygen/html/index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**C++ primitives**
-
-^^^
-
-* {doc}`hipCUB <hipcub:index>`
-* {doc}`hipTensor <hiptensor:index>`
-* {doc}`rocPRIM <rocprim:index>`
-* {doc}`rocThrust <rocthrust:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Communication**
-
-^^^
-
-* {doc}`RCCL <rccl:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Development tools**
-
-^^^
-
-* {doc}`ROCdbgapi <rocdbgapi:index>`
-* [ROCmCC](./rocmcc.md)
-* {doc}`ROCm debugger (ROCgdb) <rocgdb:index>`
-* {doc}`ROCTracer <roctracer:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Fast Fourier transforms (FFTs)**
-
-^^^
-
-* {doc}`hipFFT <hipfft:index>`
-* {doc}`rocFFT <rocfft:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**HIP**
-
-^^^
-
-* {doc}`HIP runtime <hip:index>`
-* {doc}`HIPIFY <hipify:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Linear algebra**
-
-^^^
-
-* {doc}`hipBLAS <hipblas:index>`
-* {doc}`hipBLASLt <hipblaslt:index>`
-* {doc}`hipSOLVER <hipsolver:index>`
-* {doc}`hipSPARSE <hipsparse:index>`
-* {doc}`hipSPARSELt <hipsparselt:index>`
-* {doc}`rocALUTION <rocalution:index>`
-* {doc}`rocBLAS <rocblas:index>`
-* {doc}`rocSOLVER <rocsolver:index>`
-* {doc}`rocSPARSE <rocsparse:index>`
-* {doc}`rocWMMA <rocwmma:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Performance analysis**
-
-^^^
-
-* {doc}`ROCProfiler <rocprofiler:profiler_home_page>`
-* {doc}`ROCTracer <roctracer:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**Random number generators**
-
-^^^
-
-* {doc}`hipRAND <hiprand:index>`
-* {doc}`rocRAND <rocrand:index>`
-
-:::
-
-:::{grid-item-card}
-:padding: 2
-**System tools**
-
-^^^
-
-* {doc}`AMD SMI <amdsmi:index>`
-* {doc}`ROCm Data Center Tool <rdc:index>`
-* {doc}`ROCm SMI <rocm_smi_lib:index>`
-* {doc}`ROCm Validation Suite <rocmvalidationsuite:index>`
-* {doc}`TransferBench <transferbench:index>`
-
-:::
-::::
-
-We welcome collaboration! If you'd like to contribute to our documentation, you can find instructions
-on our [Contribute to ROCm docs](../contribute/index.md) page. Known issues are listed on
-[GitHub](https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue).
-
-Licensing information for all ROCm components is listed on our [Licensing](../about/license.md) page.
--- a/docs/reference/rocm-tools.md
+++ b/docs/reference/rocm-tools.md
@@ -0,0 +1,51 @@
+<head>
+  <meta charset="UTF-8">
+  <meta name="description" content="ROCm API libraries & tools">
+  <meta name="keywords" content="ROCm, API, libraries, tools, artificial intelligence, development,
+  Communications, C++ primitives, Fast Fourier transforms, FFTs, random number generators, linear
+  algebra, AMD">
+</head>
+
+# ROCm tools
+
+::::{grid} 1 2 2 2
+:class-container: rocm-doc-grid
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-development.jpg
+:img-alt: Development tools
+:padding: 2
+
+* {doc}`ROCdbgapi <rocdbgapi:index>`
+* [ROCmCC](./rocmcc.md)
+* [ROCm Debug Agent](https://github.com/ROCm/rocr_debug_agent)
+* {doc}`ROCm debugger (ROCgdb) <rocgdb:index>`
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-performance.jpg
+:img-alt: Performance tools
+:padding: 2
+
+* [RocBandwidthTest](https://github.com/ROCm/rocm_bandwidth_test)
+* {doc}`ROCProfiler <rocprofiler:profiler_home_page>`
+* {doc}`ROCTracer <roctracer:index>`
+:::
+
+:::{grid-item-card}
+:class-card: sd-text-black
+:img-top: ../data/reference/banner-system.jpg
+:img-alt: System tools
+:padding: 2
+
+* {doc}`AMD SMI <amdsmi:index>`
+* {doc}`ROCm Data Center Tool <rdc:index>`
+* [ROCm Info](https://github.com/ROCm/rocminfo)
+* {doc}`ROCm SMI <rocm_smi_lib:index>`
+* {doc}`ROCm Validation Suite <rocmvalidationsuite:index>`
+* {doc}`TransferBench <transferbench:index>`
+:::
+
+::::
--- a/docs/reference/rocmcc.md
+++ b/docs/reference/rocmcc.md
@@ -27,7 +27,7 @@ The differences are listed in [the table below](rocm-llvm-vs-alt).
 For more details, see:

 * AMD GPU usage: [llvm.org/docs/AMDGPUUsage.html](https://llvm.org/docs/AMDGPUUsage.html)
-* Releases and source: <https://github.com/RadeonOpenCompute/llvm-project>
+* Releases and source: <https://github.com/ROCm/llvm-project>

 ### ROCm compiler interfaces

@@ -56,7 +56,7 @@ The major differences between `hipcc` and `amdclang++` are listed below:
 | Finding a HIP installation         | Finds the HIP installation based on its own location and its knowledge about the ROCm directory structure                | First looks for HIP under the same parent directory as its own LLVM directory and then falls back on `/opt/rocm`. Users can use the `--rocm-path` option to instruct the compiler to use HIP from the specified ROCm installation. |
 | Linking to the HIP runtime library | Is configured to automatically link to the HIP runtime from the detected HIP installation                                | Requires the `--hip-link` flag to be specified to link to the HIP runtime. Alternatively, users can use the `-l<dir> -lamdhip64` option to link to a HIP runtime library. |
 | Device function inlining           | Inlines all GPU device functions, which provide greater performance and compatibility for codes that contain file scoped or device function scoped `__shared__` variables. However, it may increase compile time. | Relies on inlining heuristics to control inlining. Users experiencing performance or compilation issues with code using file scoped or device function scoped `__shared__` variables could try `-mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false` to work around the issue. There are plans to address these issues with future compiler improvements. |
-| Source code location               | <https://github.com/ROCm-Developer-Tools/HIPCC>                                                                          | <https://github.com/RadeonOpenCompute/llvm-project> |
+| Source code location               | <https://github.com/ROCm/HIPCC>                                                                          | <https://github.com/ROCm/llvm-project> |
 ::::

 ## Compiler options and features
@@ -461,7 +461,7 @@ supports ASM statements, their use is not recommended for the following reasons:
 :::{note}
 For developers who choose to include ASM statements in the code, AMD is
 interested in understanding the use case and appreciates feedback at
-[https://github.com/RadeonOpenCompute/ROCm/issues](https://github.com/RadeonOpenCompute/ROCm/issues)
+[https://github.com/ROCm/ROCm/issues](https://github.com/ROCm/ROCm/issues)
 :::

 ### Miscellaneous OpenMP compiler features
--- a/docs/release/versions.md
+++ b/docs/release/versions.md
@@ -8,6 +8,7 @@

 | Version | Release date |
 | ------- | ------------ |
+| [6.0.2](https://rocm.docs.amd.com/en/docs-6.0.2/) | Jan 31, 2024 |
 | [6.0.0](https://rocm.docs.amd.com/en/docs-6.0.0/) | Dec 15, 2023 |
 | [5.7.1](https://rocm.docs.amd.com/en/docs-5.7.1/) | Oct 13, 2023 |
 | [5.7.0](https://rocm.docs.amd.com/en/docs-5.7.0/) | Sep 15, 2023 |
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -6,14 +6,14 @@ defaults:
 root: index
 subtrees:
 - entries:
-  - file: what-is-rocm.md
+  - file: what-is-rocm.rst
  - file: about/release-notes.md
    title: Release notes
    subtrees:
    - entries:
      - file: about/CHANGELOG.md
        title: Changelog
-  - url: https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue
+  - url: https://github.com/ROCm/ROCm/labels/Verified%20Issue
    title: Known issues

 - caption: Install
@@ -23,17 +23,29 @@ subtrees:
  - url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/
    title: HIP SDK on Windows

- caption: Supported configurations
+- caption: Compatibility
  entries:
  - url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/reference/system-requirements.html
    title: Linux
  - url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/reference/system-requirements.html
    title: Windows
+  - file: about/compatibility/data-type-support.rst
+    title: Precision support
+  - url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/reference/3rd-party-support-matrix.html
+    title: Third-party

 - caption: Reference
  entries:
-    - file: reference/library-index.md
-      title: API libraries & tools
+    - file: reference/api-libraries.md
+      title: API libraries
+    - file: reference/rocm-tools.md
+      title: Tools
+    - file: reference/gpu-arch.rst
+      title: GPU architectures
+      subtrees:
+      - entries:
+        - file: reference/gpu-arch/gpu-arch-spec-overview.rst
+          title: Hardware specifications overview

 - caption: How-to
  entries:
@@ -62,6 +74,16 @@ subtrees:
    title: GPU architectures
    subtrees:
    - entries:
+      - file: conceptual/gpu-arch/mi300.md
+        title: MI300 microarchitecture
+        subtrees:
+        - entries:
+          - url: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf
+            title: AMD Instinct MI300/CDNA3 ISA
+          - url: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
+            title: White paper
+          - file: conceptual/gpu-arch/mi300-mi200-performance-counters.rst
+            title: MI300 and MI200 Performance counter
      - file: conceptual/gpu-arch/mi250.md
        title: MI250 microarchitecture
        subtrees:
@@ -70,8 +92,6 @@ subtrees:
            title: AMD Instinct MI200/CDNA2 ISA
          - url: https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf
            title: White paper
-          - file: conceptual/gpu-arch/mi200-performance-counters.md
-            title: Performance counter
      - file: conceptual/gpu-arch/mi100.md
        title: MI100 microarchitecture
        subtrees:
@@ -103,20 +123,18 @@ subtrees:

 - caption: Contribute
  entries:
-  - file: contribute/index.md
-    title: Contribute to ROCm
+  - file: contribute/contributing.md
+    title: Contribute to ROCm docs
    subtrees:
    - entries:
-      - file: contribute/contribute-docs.md
-        title: Contribute to ROCm docs
-        subtrees:
-        - entries:
-          - file: contribute/toolchain.md
-            title: Documentation tools
-          - file: contribute/building.md
-            title: Building documentation
-          - file: contribute/feedback.md
-            title: Provide feedback
+      - file: contribute/doc-structure.md
+        title: Documentation structure
+      - file: contribute/toolchain.md
+        title: Documentation toolchain
+      - file: contribute/building.md
+        title: Build our documentation
+  - file: contribute/feedback.md
+    title: Provide feedback
  - file: about/license.md
    title: ROCm license

--- a/docs/sphinx/requirements.in
+++ b/docs/sphinx/requirements.in
@@ -1,2 +1 @@
-rocm-docs-core==1.8.0
-sphinx-reredirects
+rocm-docs-core==0.37.1
--- a/docs/sphinx/requirements.txt
+++ b/docs/sphinx/requirements.txt
@@ -1,14 +1,14 @@
 #
-# This file is autogenerated by pip-compile with Python 3.10
+# This file is autogenerated by pip-compile with Python 3.8
 # by the following command:
 #
 #    pip-compile requirements.in
 #
-accessible-pygments==0.0.5
+accessible-pygments==0.0.4
    # via pydata-sphinx-theme
-alabaster==1.0.0
+alabaster==0.7.13
    # via sphinx
-babel==2.16.0
+babel==2.14.0
    # via
    #   pydata-sphinx-theme
    #   sphinx
@@ -16,91 +16,132 @@ beautifulsoup4==4.12.3
    # via pydata-sphinx-theme
 breathe==4.35.0
    # via rocm-docs-core
-certifi==2024.8.30
+certifi==2024.2.2
    # via requests
-cffi==1.17.1
+cffi==1.16.0
    # via
    #   cryptography
    #   pynacl
 charset-normalizer==3.3.2
    # via requests
 click==8.1.7
-    # via sphinx-external-toc
-cryptography==43.0.1
+    # via
+    #   click-log
+    #   doxysphinx
+    #   sphinx-external-toc
+click-log==0.4.0
+    # via doxysphinx
+cryptography==42.0.5
    # via pyjwt
 deprecated==1.2.14
    # via pygithub
-docutils==0.21.2
+docutils==0.17.1
    # via
    #   breathe
    #   myst-parser
+    #   pybtex-docutils
    #   pydata-sphinx-theme
    #   sphinx
-fastjsonschema==2.20.0
+    #   sphinxcontrib-bibtex
+doxysphinx==3.3.7
+    # via rocm-docs-core
+fastjsonschema==2.19.1
    # via rocm-docs-core
 gitdb==4.0.11
    # via gitpython
-gitpython==3.1.43
+gitpython==3.1.42
    # via rocm-docs-core
-idna==3.10
+idna==3.6
    # via requests
 imagesize==1.4.1
    # via sphinx
-jinja2==3.1.4
+importlib-metadata==7.0.2
+    # via
+    #   sphinx
+    #   sphinxcontrib-bibtex
+importlib-resources==6.3.2
+    # via
+    #   mpire
+    #   rocm-docs-core
+jinja2==3.1.3
    # via
    #   myst-parser
    #   sphinx
-markdown-it-py==3.0.0
+latexcodec==3.0.0
+    # via pybtex
+libsass==0.22.0
+    # via doxysphinx
+lxml==4.9.4
+    # via doxysphinx
+markdown-it-py==2.2.0
    # via
    #   mdit-py-plugins
    #   myst-parser
 markupsafe==2.1.5
    # via jinja2
-mdit-py-plugins==0.4.2
+mdit-py-plugins==0.3.5
    # via myst-parser
 mdurl==0.1.2
    # via markdown-it-py
-myst-parser==4.0.0
+mpire==2.10.1
+    # via doxysphinx
+myst-parser==1.0.0
    # via rocm-docs-core
-packaging==24.1
+numpy==1.24.4
+    # via -r requirements.in
+packaging==24.0
    # via
    #   pydata-sphinx-theme
    #   sphinx
-pycparser==2.22
+pybtex==0.24.0
+    # via
+    #   pybtex-docutils
+    #   sphinxcontrib-bibtex
+pybtex-docutils==1.0.3
+    # via sphinxcontrib-bibtex
+pycparser==2.21
    # via cffi
-pydata-sphinx-theme==0.15.4
+pydata-sphinx-theme==0.14.4
    # via
    #   rocm-docs-core
    #   sphinx-book-theme
-pygithub==2.4.0
+pygithub==2.2.0
    # via rocm-docs-core
-pygments==2.18.0
+pygments==2.17.2
    # via
    #   accessible-pygments
+    #   mpire
    #   pydata-sphinx-theme
    #   sphinx
-pyjwt[crypto]==2.9.0
+pyjwt[crypto]==2.6.0
    # via pygithub
 pynacl==1.5.0
    # via pygithub
-pyyaml==6.0.2
+pyparsing==3.1.2
+    # via doxysphinx
+pytz==2024.1
+    # via babel
+pyyaml==6.0.1
    # via
    #   myst-parser
+    #   pybtex
    #   rocm-docs-core
    #   sphinx-external-toc
-requests==2.32.3
+requests==2.31.0
    # via
    #   pygithub
    #   sphinx
-rocm-docs-core==1.8.0
+rocm-docs-core==0.37.1
    # via -r requirements.in
+six==1.16.0
+    # via pybtex
 smmap==5.0.1
    # via gitdb
 snowballstemmer==2.2.0
    # via sphinx
-soupsieve==2.6
+soupsieve==2.5
    # via beautifulsoup4
-sphinx==8.0.2
+sphinx==5.3.0
    # via
    #   breathe
    #   myst-parser
@@ -111,40 +152,44 @@ sphinx==8.0.2
    #   sphinx-design
    #   sphinx-external-toc
    #   sphinx-notfound-page
-    #   sphinx-reredirects
-sphinx-book-theme==1.1.3
+    #   sphinxcontrib-bibtex
+sphinx-book-theme==1.0.1
    # via rocm-docs-core
 sphinx-copybutton==0.5.2
    # via rocm-docs-core
-sphinx-design==0.6.1
+sphinx-design==0.5.0
    # via rocm-docs-core
-sphinx-external-toc==1.0.1
+sphinx-external-toc==0.3.1
    # via rocm-docs-core
-sphinx-notfound-page==1.0.4
+sphinx-notfound-page==1.0.0
    # via rocm-docs-core
-sphinx-reredirects==0.1.5
+sphinxcontrib-applehelp==1.0.4
+    # via sphinx
+sphinxcontrib-bibtex==2.6.2
    # via -r requirements.in
-sphinxcontrib-applehelp==2.0.0
+sphinxcontrib-devhelp==1.0.2
    # via sphinx
-sphinxcontrib-devhelp==2.0.0
-    # via sphinx
-sphinxcontrib-htmlhelp==2.1.0
+sphinxcontrib-htmlhelp==2.0.1
    # via sphinx
 sphinxcontrib-jsmath==1.0.1
    # via sphinx
-sphinxcontrib-qthelp==2.0.0
+sphinxcontrib-qthelp==1.0.3
    # via sphinx
-sphinxcontrib-serializinghtml==2.0.0
+sphinxcontrib-serializinghtml==1.1.5
    # via sphinx
-tomli==2.0.1
-    # via sphinx
-typing-extensions==4.12.2
+tqdm==4.66.2
+    # via mpire
+typing-extensions==4.10.0
    # via
    #   pydata-sphinx-theme
    #   pygithub
-urllib3==2.2.3
+urllib3==2.2.1
    # via
    #   pygithub
    #   requests
 wrapt==1.16.0
    # via deprecated
+zipp==3.18.1
+    # via
+    #   importlib-metadata
+    #   importlib-resources
--- a/docs/temp/rocm-a-z.md
+++ b/docs/temp/rocm-a-z.md
@@ -5,15 +5,15 @@

 | ROCm product | Description |
 | :---------------- | :------------ |
-| [AMD Compute Language Runtimes (CLR)](https://github.com/ROCm-Developer-Tools/clr) | Contains source code for AMD's compute languages runtimes: {doc}`HIP <hip:index>` and OpenCL |
-| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/) | A graph inference engine that accelerates machine learning model inference |
-| [AOMP](https://github.com/ROCm-Developer-Tools/aomp/) | A scripted build of [LLVM](https://github.com/RadeonOpenCompute/llvm-project) and supporting software |
-| [Asynchronous Task and Memory Interface (ATMI)](https://github.com/RadeonOpenCompute/atmi/) | A runtime framework for efficient task management in heterogeneous CPU-GPU systems |
+| [AMD Compute Language Runtimes (CLR)](https://github.com/ROCm/clr) | Contains source code for AMD's compute languages runtimes: {doc}`HIP <hip:index>` and OpenCL |
+| [AMDMIGraphX](https://github.com/ROCm/AMDMIGraphX/) | A graph inference engine that accelerates machine learning model inference |
+| [AOMP](https://github.com/ROCm/aomp/) | A scripted build of [LLVM](https://github.com/ROCm/llvm-project) and supporting software |
+| [Asynchronous Task and Memory Interface (ATMI)](https://github.com/ROCm/atmi/) | A runtime framework for efficient task management in heterogeneous CPU-GPU systems |
 | [Composable Kernel](https://rocm.docs.amd.com/projects/composable_kernel/en/latest/) | A library that aims to provide a programming model for writing performance critical kernels for machine learning workloads across multiple architectures |
-| [Flang](https://github.com/ROCm-Developer-Tools/flang/) | An out-of-tree Fortran compiler targeting LLVM |
-| [Half-precision floating point library (half)](https://github.com/ROCmSoftwarePlatform/half/) | A C++ header-only library that provides an IEEE 754 conformant, 16-bit half-precision floating-point type along with corresponding arithmetic operators, type conversions, and common mathematical functions |
+| [Flang](https://github.com/ROCm/flang/) | An out-of-tree Fortran compiler targeting LLVM |
+| [Half-precision floating point library (half)](https://github.com/ROCm/half/) | A C++ header-only library that provides an IEEE 754 conformant, 16-bit half-precision floating-point type along with corresponding arithmetic operators, type conversions, and common mathematical functions |
 | {doc}`HIP <hip:index>` | AMD’s GPU programming language extension and the GPU runtime |
-| [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS/) | A BLAS-marshaling library that supports [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/) and cuBLAS backends |
+| [hipBLAS](https://github.com/ROCm/hipBLAS/) | A BLAS-marshaling library that supports [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/) and cuBLAS backends |
 | [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/) | A compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure |
 | [hipCUB](https://rocm.docs.amd.com/projects/hipCUB/en/latest/) | A thin header-only wrapper library on top of [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/) or CUB that allows project porting using the CUB library to the HIP layer |
 | [hipFFT](https://rocm.docs.amd.com/projects/hipFFT/en/latest/) | An FFT-marshalling library that supports rocFFT or cuFFT backends |
@@ -23,42 +23,42 @@
 | [hipify-perl](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-perl.html) | An autogenerated, perl-based script that translates CUDA source code into portable HIP C++ |
 | [hipSOLVER](https://rocm.docs.amd.com/projects/hipSOLVER/en/latest/) | A LAPACK-marshalling library that supports [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) and cuSOLVER backends |
 | [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/latest/)  | A SPARSE-marshalling library that supports [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) and cuSPARSE backends |
-| [hipTensor](https://github.com/ROCmSoftwarePlatform/hipTensor) | AMD's C++ library for accelerating tensor primitives based on the composable kernel library |
-| [LLVM](https://github.com/RadeonOpenCompute/llvm-project) | A toolkit for the construction of highly optimized compilers, optimizers, and run-time environments |
+| [hipTensor](https://github.com/ROCm/hipTensor) | AMD's C++ library for accelerating tensor primitives based on the composable kernel library |
+| [LLVM](https://github.com/ROCm/llvm-project) | A toolkit for the construction of highly optimized compilers, optimizers, and run-time environments |
 | [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/) | A graph inference engine that accelerates machine learning model inference |
 | [MIOpen](https://rocm.docs.amd.com/projects/MIOpen/en/latest/) | An open source deep-learning library |
-| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM) | An OpenCL general matrix multiplication (GEMM) API and kernel generator |
-| [MIOpenTensile](https://github.com/ROCmSoftwarePlatform/MIOpenTensile) | Provides host-callable interfaces to Tensile library |
+| [MIOpenGEMM](https://github.com/ROCm/MIOpenGEMM) | An OpenCL general matrix multiplication (GEMM) API and kernel generator |
+| [MIOpenTensile](https://github.com/ROCm/MIOpenTensile) | Provides host-callable interfaces to Tensile library |
 | [MIVisionX](https://rocm.docs.amd.com/projects/MIVisionX/en/latest/doxygen/html/index.html) | A set of comprehensive computer vision and machine learning libraries, utilities, and applications |
 | [Radeon Compute Profiler (RCP)](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | A performance analysis tool that gathers data from the API run-time and GPU for OpenCL and ROCm/HSA applications |
 | [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) | A standalone library that provides multi-GPU and multi-node collective communication primitives |
 | [rocAL](https://rocm.docs.amd.com/projects/rocAL/en/latest/doxygen/html/index.html) | An augmentation library designed to decode and process images and videos |
 | [rocALUTION](https://rocm.docs.amd.com/projects/rocALUTION/en/latest/) | A sparse linear algebra library for exploring fine-grained parallelism on ROCm runtime and toolchains |
-| [RocBandwidthTest](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/) | Captures the performance characteristics of buffer copying and kernel read/write operations |
+| [RocBandwidthTest](https://github.com/ROCm/rocm_bandwidth_test/) | Captures the performance characteristics of buffer copying and kernel read/write operations |
 | [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/)| A BLAS implementation (in the HIP programming language) on the ROCm runtime and toolchains |
 | [rocFFT](https://rocm.docs.amd.com/projects/rocFFT/en/latest/) | A software library for computing fast Fourier transforms (FFTs) written in HIP |
-| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/) | An AMDGPU Driver with KFD that is used by ROCm |
+| [ROCK-Kernel-Driver](https://github.com/ROCm/ROCK-Kernel-Driver/) | An AMDGPU Driver with KFD that is used by ROCm |
 | [ROCm Augmentation Library (rocAL)](https://rocm.docs.amd.com/projects/rocAL/en/latest/doxygen/html/index.html) | An augmentation library designed to decode and process images and videos |
 | [ROCmCC](https://rocm.docs.amd.com/en/latest/reference/rocmcc/rocmcc.html) | A Clang/LLVM-based compiler |
-| [ROCm cmake](https://github.com/RadeonOpenCompute/rocm-cmake) | A collection of CMake modules for common build and development tasks |
+| [ROCm cmake](https://github.com/ROCm/rocm-cmake) | A collection of CMake modules for common build and development tasks |
 | [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/latest/) | Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments |
-| [ROCm Debug Agent Library (ROCdebug-agent)](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/) | A library that can print the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running |
+| [ROCm Debug Agent Library (ROCdebug-agent)](https://github.com/ROCm/rocr_debug_agent/) | A library that can print the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running |
 | [ROCm Debugger (ROCgdb)](https://rocm.docs.amd.com/projects/ROCgdb/en/latest/) | A source-level debugger for Linux, based on the GNU Debugger (GDB) |
 | [ROCm Debugger API (ROCdbgapi)](https://rocm.docs.amd.com/projects/ROCdbgapi/en/latest/) | The ROCm debugger library |
-| [rocminfo](https://github.com/RadeonOpenCompute/rocminfo/) | Reports system information |
-| [ROCm SMI](https://github.com/RadeonOpenCompute/rocm_smi_lib/) | A C library for Linux that provides a user space interface for applications to monitor and control GPU applications |
+| [rocminfo](https://github.com/ROCm/rocminfo/) | Reports system information |
+| [ROCm SMI](https://github.com/ROCm/rocm_smi_lib/) | A C library for Linux that provides a user space interface for applications to monitor and control GPU applications |
 | [ROCm Validation Suite](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/) | A tool for detecting and troubleshooting common problems affecting AMD GPUs running in a high-performance computing environment |
 | [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/) | A header-only library for HIP parallel primitives |
 | [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/profiler_home_page.html) | A profiling tool for HIP applications |
 | [rocRAND](https://rocm.docs.amd.com/projects/rocRAND/en/latest/) | Provides functions that generate pseudorandom and quasirandom numbers |
-| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents |
+| [ROCR-Runtime](https://github.com/ROCm/ROCR-Runtime/) | User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents |
 | [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on the ROCm platform, implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs |
 | [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) | Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language) |
 | [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/latest/) | A parallel algorithm library |
-| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | User-mode API interfaces used to interact with the ROCk driver |
+| [ROCT-Thunk-Interface](https://github.com/ROCm/ROCT-Thunk-Interface/) | User-mode API interfaces used to interact with the ROCk driver |
 | [ROCTracer](https://rocm.docs.amd.com/projects/roctracer/en/latest/) | Intercepts runtime API calls and traces asynchronous activity |
 | [rocWMMA](https://rocm.docs.amd.com/projects/rocWMMA/en/latest/index.html) | A C++ library for accelerating mixed-precision matrix multiply-accumulate (MMA) operations |
-| [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile) | A tool for creating benchmark-driven backend libraries for GEMMs, GEMM-like problems, and general N-dimensional tensor contractions |
+| [Tensile](https://github.com/ROCm/Tensile) | A tool for creating benchmark-driven backend libraries for GEMMs, GEMM-like problems, and general N-dimensional tensor contractions |
 | [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/) | A utility to benchmark simultaneous transfers between user-specified devices (CPUs/GPUs) |

 :::
--- a/docs/what-is-rocm.md
+++ b/docs/what-is-rocm.md
@@ -1,86 +0,0 @@
-<head>
-  <meta charset="UTF-8">
-  <meta name="description" content="What is ROCm">
-  <meta name="keywords" content="documentation, projects, introduction, ROCm, AMD">
-</head>
-
-# What is ROCm?
-
-ROCm is an open-source stack, composed primarily of open-source software, designed for
-graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
-tools, and APIs that enable GPU programming from low-level kernel to end-user applications.
-
-With ROCm, you can customize your GPU software to meet your specific needs. You can develop,
-collaborate, test, and deploy your applications in a free, open source, integrated, and secure software
-ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC),
-artificial intelligence (AI), scientific computing, and computer aided design (CAD).
-
-ROCm is powered by AMD’s
-[Heterogeneous-computing Interface for Portability (HIP)](https://rocm.docs.amd.com/projects/HIP/en/latest/index.html),
-an open-source software C++ GPU programming environment and its corresponding runtime. HIP
-allows ROCm developers to create portable applications on different platforms by deploying code on a
-range of platforms, from dedicated gaming GPUs to exascale HPC clusters.
-
-ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary open
-source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning
-(ML) frameworks, such as PyTorch and TensorFlow.
-
-## ROCm projects
-
-ROCm consists of the following drivers, development tools, and APIs.
-
-| Project | Description |
-| :---------------- | :------------ |
-| [AMD Compute Language Runtimes (CLR)](https://github.com/ROCm-Developer-Tools/clr) | Contains source code for AMD's compute languages runtimes: {doc}`HIP <hip:index>` and OpenCL |
-| [AOMP](https://github.com/ROCm-Developer-Tools/aomp/) | A scripted build of [LLVM](https://github.com/RadeonOpenCompute/llvm-project) and supporting software |
-| [Asynchronous Task and Memory Interface (ATMI)](https://github.com/RadeonOpenCompute/atmi/) | A runtime framework for efficient task management in heterogeneous CPU-GPU systems |
-| [Composable Kernel](https://rocm.docs.amd.com/projects/composable_kernel/en/latest/) | A library that aims to provide a programming model for writing performance critical kernels for machine learning workloads across multiple architectures |
-| [Flang](https://github.com/ROCm-Developer-Tools/flang/) | An out-of-tree Fortran compiler targeting LLVM |
-| [Half-precision floating point library (half)](https://github.com/ROCmSoftwarePlatform/half/) | A C++ header-only library that provides an IEEE 754 conformant, 16-bit half-precision floating-point type along with corresponding arithmetic operators, type conversions, and common mathematical functions |
-| {doc}`HIP <hip:index>` | AMD’s GPU programming language extension and the GPU runtime |
-| [hipBLAS](https://rocm.docs.amd.com/projects/hipBLAS/en/latest/) | A BLAS-marshaling library that supports [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/) and cuBLAS backends |
-| [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/) | A compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure |
-| [hipCUB](https://rocm.docs.amd.com/projects/hipCUB/en/latest/) | A thin header-only wrapper library on top of [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/) or CUB that allows project porting using the CUB library to the HIP layer |
-| [hipFFT](https://rocm.docs.amd.com/projects/hipFFT/en/latest/) | A fast Fourier transforms (FFT)-marshalling library that supports rocFFT or cuFFT backends |
-| [hipfort](https://rocm.docs.amd.com/projects/hipfort/en/latest/) | A Fortran interface library for accessing GPU Kernels |
-| {doc}`HIPIFY <hipify:index>` | A set of tools for translating CUDA source code into portable HIP C++ |
-| [hipify-clang](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-clang.html) | A Clang-based tool for translating CUDA sources into HIP sources |
-| [hipify-perl](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-perl.html) | An autogenerated, perl-based script that translates CUDA source code into portable HIP C++ |
-| [hipSOLVER](https://rocm.docs.amd.com/projects/hipSOLVER/en/latest/) | An LAPACK-marshalling library that supports [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) and cuSOLVER backends |
-| [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/latest/)  | A SPARSE-marshalling library that supports [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) and cuSPARSE backends |
-| [hipTensor](https://rocm.docs.amd.com/projects/hipTensor/en/latest/index.html) | AMD's C++ library for accelerating tensor primitives based on the composable kernel library |
-| [LLVM](https://github.com/RadeonOpenCompute/llvm-project) | A toolkit for the construction of highly optimized compilers, optimizers, and run-time environments |
-| [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/) | A graph inference engine that accelerates machine learning model inference |
-| [MIOpen](https://rocm.docs.amd.com/projects/MIOpen/en/latest/) | An open source deep-learning library |
-| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM) | An OpenCL general matrix multiplication (GEMM) API and kernel generator |
-| [MIOpenTensile](https://github.com/ROCmSoftwarePlatform/MIOpenTensile) | Provides host-callable interfaces to Tensile library |
-| [MIVisionX](https://rocm.docs.amd.com/projects/MIVisionX/en/latest/doxygen/html/index.html) | A set of comprehensive computer vision and machine learning libraries, utilities, and applications |
-| [Radeon Compute Profiler (RCP)](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | A performance analysis tool that gathers data from the API run-time and GPU for OpenCL and ROCm/HSA applications |
-| [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) | A standalone library that provides multi-GPU and multi-node collective communication primitives |
-| [rocAL](https://rocm.docs.amd.com/projects/rocAL/en/latest/doxygen/html/index.html) | An augmentation library designed to decode and process images and videos |
-| [rocALUTION](https://rocm.docs.amd.com/projects/rocALUTION/en/latest/) | A sparse linear algebra library for exploring fine-grained parallelism on ROCm runtime and toolchains |
-| [RocBandwidthTest](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/) | Captures the performance characteristics of buffer copying and kernel read/write operations |
-| [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/)| A BLAS implementation (in the HIP programming language) on the ROCm runtime and toolchains |
-| [rocFFT](https://rocm.docs.amd.com/projects/rocFFT/en/latest/) | A software library for computing fast Fourier transforms (FFTs) written in HIP |
-| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/) | An AMDGPU Driver with KFD that is used by ROCm |
-| [ROCmCC](https://rocm.docs.amd.com/en/latest/reference/rocmcc/rocmcc.html) | A Clang/LLVM-based compiler |
-| [ROCm cmake](https://github.com/RadeonOpenCompute/rocm-cmake) | A collection of CMake modules for common build and development tasks |
-| [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/latest/) | Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments |
-| [ROCm Debug Agent Library (ROCdebug-agent)](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/) | A library that can print the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running |
-| [ROCm Debugger (ROCgdb)](https://rocm.docs.amd.com/projects/ROCgdb/en/latest/) | A source-level debugger for Linux, based on the GNU Debugger (GDB) |
-| [ROCdbgapi](https://rocm.docs.amd.com/projects/ROCdbgapi/en/latest/) | The ROCm debugger API library |
-| [rocminfo](https://github.com/RadeonOpenCompute/rocminfo/) | Reports system information |
-| [ROCm SMI](https://github.com/RadeonOpenCompute/rocm_smi_lib/) | A C library for Linux that provides a user space interface for applications to monitor and control GPU applications |
-| [ROCm Validation Suite](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/) | A tool for detecting and troubleshooting common problems affecting AMD GPUs running in a high-performance computing environment |
-| [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/) | A header-only library for HIP parallel primitives |
-| [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/profiler_home_page.html) | A profiling tool for HIP applications |
-| [rocRAND](https://rocm.docs.amd.com/projects/rocRAND/en/latest/) | Provides functions that generate pseudorandom and quasirandom numbers |
-| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents |
-| [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on ROCm software, implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs |
-| [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) | Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language) |
-| [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/latest/) | A parallel algorithm library |
-| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | User-mode API interfaces used to interact with the ROCk driver |
-| [ROCTracer](https://rocm.docs.amd.com/projects/roctracer/en/latest/) | Intercepts runtime API calls and traces asynchronous activity |
-| [rocWMMA](https://rocm.docs.amd.com/projects/rocWMMA/en/latest/index.html) | A C++ library for accelerating mixed-precision matrix multiply-accumulate (MMA) operations |
-| [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile) | A tool for creating benchmark-driven backend libraries for GEMMs, GEMM-like problems, and general N-dimensional tensor contractions |
-| [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/) | A utility to benchmark simultaneous transfers between user-specified devices (CPUs/GPUs) |
--- a/docs/what-is-rocm.rst
+++ b/docs/what-is-rocm.rst
@@ -0,0 +1,82 @@
+.. meta::
+  :description: What is ROCm
+  :keywords: ROCm projects, introduction, ROCm, AMD, runtimes, compilers, tools, libraries, API
+
+***********************************************************
+What is ROCm?
+***********************************************************
+
+ROCm is a software stack, composed primarily of open-source software, designed for
+graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
+tools, and APIs that enable GPU programming from the low-level kernel to end-user applications.
+
+ROCm is powered by
+`Heterogeneous-computing Interface for Portability (HIP) <https://rocm.docs.amd.com/projects/HIP/en/latest/index.html>`_;
+it supports programming models, such as OpenMP and OpenCL, and includes all necessary
+open-source software compilers, debuggers, and libraries. It's fully integrated into machine learning (ML)
+frameworks, such as PyTorch and TensorFlow.
+
+.. tip::
+  If you're using Radeon GPUs, refer to the
+  :doc:`Radeon-specific ROCm documentation <radeon:index>`.
+
+ROCm project list
+===============================================
+
+ROCm consists of the following projects. For information on the license associated with each project,
+see :doc:`ROCm licensing <./about/license>`.
+
+.. csv-table::
+  :header: "Project", "Type", "Description"
+
+  "`AMD Compute Language Runtimes (CLR) <https://github.com/ROCm/clr>`_", "Runtime", "Contains source code for AMD's compute language runtimes: :doc:`HIP <hip:index>` and OpenCL"
+  ":doc:`AMD SMI <amdsmi:index>`", "Tool", "C library for Linux that provides a user space interface for applications to monitor and control AMD devices"
+  "`AOMP <https://github.com/ROCm/aomp/>`_", "Compiler", "Scripted build of `LLVM <https://github.com/ROCm/llvm-project>`_ and supporting software"
+  ":doc:`Composable Kernel <composable_kernel:index>`", "Library (AI/ML)", "Provides a programming model for writing performance critical kernels for machine learning workloads across multiple architectures"
+  "`FLANG <https://github.com/ROCm/flang/>`_", "Compiler", "An out-of-tree Fortran compiler targeting LLVM"
+  "`half <https://github.com/ROCm/half/>`_", "Library (math)", "C++ header-only library that provides an IEEE 754 conformant, 16-bit half-precision floating-point type, along with corresponding arithmetic operators, type conversions, and common mathematical functions"
+  ":doc:`HIP <hip:index>`", "Runtime", "AMD's GPU programming language extension and the GPU runtime"
+  ":doc:`hipBLAS <hipblas:index>`", "Library (math)", "BLAS-marshalling library that supports `rocBLAS <https://rocm.docs.amd.com/projects/rocBLAS/en/latest/>`_ and cuBLAS backends"
+  ":doc:`hipBLASLt <hipblaslt:index>`", "Library (math)", "Provides general matrix-matrix operations with a flexible API and extends functionalities beyond traditional BLAS library"
+  "`hipCC <https://github.com/ROCm/HIPCC>`_ ", "Compiler", "Compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure"
+  ":doc:`hipCUB <hipcub:index>`", "Library (C++ primitive)", "Thin header-only wrapper library on top of `rocPRIM <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ or CUB that allows project porting using the CUB library to the HIP layer"
+  ":doc:`hipFFT <hipfft:index>`", "Library (math)", "Fast Fourier transforms (FFT)-marshalling library that supports rocFFT or cuFFT backends"
+  ":doc:`hipfort <hipfort:index>`", "Library (math)", "Fortran interface library for accessing GPU kernels"
+  ":doc:`HIPIFY <hipify:index>`", "Compiler", "Translates CUDA source code into portable HIP C++"
+  ":doc:`hipRAND <hiprand:index>`", "Library (math)", "Ports CUDA applications that use the cuRAND library into the HIP layer"
+  ":doc:`hipSOLVER <hipsolver:index>`", "Library (math)", "LAPACK-marshalling library that supports `rocSOLVER <https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/>`_ and cuSOLVER backends"
+  ":doc:`hipSPARSE <hipsparse:index>`", "Library (math)", "SPARSE-marshalling library that supports `rocSPARSE <https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/>`_ and cuSPARSE backends"
+  ":doc:`hipSPARSELt <hipsparselt:index>`", "Library (math)", "SPARSE-marshalling library with multiple supported backends"
+  ":doc:`hipTensor <hiptensor:index>`", "Library (C++ primitive)", "AMD's C++ library for accelerating tensor primitives based on the composable kernel library"
+  "`LLVM (amdclang) <https://github.com/ROCm/llvm-project>`_ ", "Compiler", "Toolkit for the construction of highly optimized compilers, optimizers, and run-time environments"
+  ":doc:`MIGraphX <amdmigraphx:index>`", "Library (AI/ML)", "Graph inference engine that accelerates machine learning model inference"
+  ":doc:`MIOpen <miopen:index>`", "Library (AI/ML)", "An open source deep-learning library"
+  ":doc:`MIVisionX <mivisionx:doxygen/html/index>`", "Library (AI/ML)", "Set of comprehensive computer vision and machine learning libraries, utilities, and applications"
+  "`Radeon Compute Profiler (RCP) <https://github.com/GPUOpen-Tools/radeon_compute_profiler/>`_ ", "Tool", "Performance analysis tool that gathers data from the API run-time and GPU for OpenCL and ROCm/HSA applications"
+  ":doc:`RCCL <rccl:index>`", "Library (communication)", "Standalone library that provides multi-GPU and multi-node collective communication primitives"
+  ":doc:`rocAL <rocal:index>`", "Library (AI/ML)", "An augmentation library designed to decode and process images and videos"
+  ":doc:`rocALUTION <rocalution:index>`", "Library (math)", "Sparse linear algebra library for exploring fine-grained parallelism on ROCm runtime and toolchains"
+  "`RocBandwidthTest <https://github.com/ROCm/rocm_bandwidth_test/>`_ ", "Tool", "Captures the performance characteristics of buffer copying and kernel read/write operations"
+  ":doc:`rocBLAS <rocblas:index>`", "Library (math)", "BLAS implementation (in the HIP programming language) on the ROCm runtime and toolchains"
+  ":doc:`rocFFT <rocfft:index>`", "Library (math)", "Software library for computing fast Fourier transforms (FFTs) written in HIP"
+  ":doc:`ROCmCC <./reference/rocmcc>`", "Tool", "Clang/LLVM-based compiler"
+  "`ROCm CMake <https://github.com/ROCm/rocm-cmake>`_ ", "Tool", "Collection of CMake modules for common build and development tasks"
+  ":doc:`ROCm Data Center Tool <rdc:index>`", "Tool", "Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments"
+  "`ROCm Debug Agent (ROCdebug-agent) <https://github.com/ROCm/rocr_debug_agent/>`_ ", "Tool", "Prints the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running"
+  ":doc:`ROCm debugger (ROCgdb) <rocgdb:index>`", "Tool", "Source-level debugger for Linux, based on the GNU Debugger (GDB)"
+  ":doc:`ROCdbgapi <rocdbgapi:index>`", "Tool", "ROCm debugger API library"
+  "`rocminfo <https://github.com/ROCm/rocminfo/>`_ ", "Tool", "Reports system information"
+  ":doc:`ROCm Performance Primitives (RPP) <rpp:index>`", "Library (AI/ML)", "Comprehensive high-performance computer vision library for AMD processors with HIP/OpenCL/CPU back-ends"
+  ":doc:`ROCm SMI <rocm_smi_lib:index>`", "Tool", "C library for Linux that provides a user space interface for applications to monitor and control GPU applications"
+  ":doc:`ROCm Validation Suite <rocmvalidationsuite:index>`", "Tool", "Detects and troubleshoots common problems affecting AMD GPUs running in a high-performance computing environment"
+  ":doc:`rocPRIM <rocprim:index>`", "Library (C++ primitive)", "Header-only library for HIP parallel primitives"
+  ":doc:`ROCProfiler <rocprofiler:profiler_home_page>`", "Tool", "Profiling tool for HIP applications"
+  ":doc:`rocRAND <rocrand:index>`", "Library (math)", "Provides functions that generate pseudorandom and quasirandom numbers"
+  "`ROCR-Runtime <https://github.com/ROCm/ROCR-Runtime/>`_ ", "Runtime", "User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents"
+  ":doc:`rocSOLVER <rocsolver:index>`", "Library (math)", "An implementation of LAPACK routines on ROCm software, implemented in the HIP programming language and optimized for AMD's latest discrete GPUs"
+  ":doc:`rocSPARSE <rocsparse:index>`", "Library (math)", "Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language)"
+  ":doc:`rocThrust <rocthrust:index>`", "Library (C++ primitive)", "Parallel algorithm library"
+  ":doc:`ROCTracer <roctracer:index>`", "Tool", "Intercepts runtime API calls and traces asynchronous activity"
+  ":doc:`rocWMMA <rocwmma:index>`", "Library (math)", "C++ library for accelerating mixed-precision matrix multiply-accumulate (MMA) operations"
+  "`Tensile <https://github.com/ROCm/Tensile>`_ ", "Library (math)", "Creates benchmark-driven backend libraries for GEMMs, GEMM-like problems, and general N-dimensional tensor contractions"
+  ":doc:`TransferBench <transferbench:index>`", "Tool", "Utility to benchmark simultaneous transfers between user-specified devices (CPUs/GPUs)"
--- a/tools/autotag/README.md
+++ b/tools/autotag/README.md
@@ -18,16 +18,16 @@
 * Run this for 5.6.0 (change for whatever version you require)
 * `GITHUB_ACCESS_TOKEN=my_token_here`

-To generate the changelog from 5.0.0 up to and including 6.0.0:
+To generate the changelog from 5.0.0 up to and including 6.0.1:

 ```sh
-python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.0
+python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.1
 ```

-To generate the changelog only for 6.0.0:
+To generate the changelog only for 6.0.1:

 ```sh
-python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.0
+python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.1
 ```

 ### Notes
--- a/tools/autotag/compile_changelogs.sh
+++ b/tools/autotag/compile_changelogs.sh
@@ -10,7 +10,7 @@ sed -i 's/^#{3} /##### /g' .changelogs.txt
 sed -i 's/^# /#### /g' .changelogs.txt

 [ -d ROCm ] && rm -rf ROCm
-git clone git@github.com:RadeonOpenCompute/ROCm.git
+git clone git@github.com:ROCm/ROCm.git

 awk -f- ROCm/README.md .changelogs.txt >.tmp.txt <<-'EOF'
 BEGIN {
--- a/tools/autotag/templates/rocm_changes/5.0.0.md
+++ b/tools/autotag/templates/rocm_changes/5.0.0.md
@@ -42,7 +42,7 @@ Refer to the HIP API documentation for more details on managed memory APIs.

 For the application, see

-<https://github.com/ROCm-Developer-Tools/HIP/blob/rocm-4.5.x/tests/src/runtimeApi/memory/hipMallocManaged.cpp>
+<https://github.com/ROCm/HIP/blob/rocm-4.5.x/tests/src/runtimeApi/memory/hipMallocManaged.cpp>

 #### New environment variable

--- a/tools/autotag/templates/rocm_changes/5.1.0.md
+++ b/tools/autotag/templates/rocm_changes/5.1.0.md
@@ -131,7 +131,7 @@ in this release.

 For more information, refer to the following websites:

-* <https://github.com/RadeonOpenCompute/criu/blob/amdgpu_plugin-03252022/Documentation/amdgpu_plugin.txt>
+* <https://github.com/ROCm/criu/blob/amdgpu_plugin-03252022/Documentation/amdgpu_plugin.txt>

 * <https://criu.org/Main_Page>

--- a/tools/autotag/templates/rocm_changes/5.2.0.md
+++ b/tools/autotag/templates/rocm_changes/5.2.0.md
@@ -22,7 +22,7 @@ and can grow until the available free memory on the device is consumed.
 The test codes at the following link show how to implement applications using malloc and free
 functions in device kernels:

-<https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp>
+<https://github.com/ROCm/HIP/blob/d6224a55390bf2d8fd0180c21bae44f0d718d1eb/tests/src/deviceLib/hipDeviceMalloc.cpp>

 ##### New HIP APIs in this release

@@ -305,7 +305,7 @@ illustrate example usages of the C++ API. GEMM matrix multiplication is used as
 given the heavy precedent for the library. However, the usage portfolio is growing significantly and
 demonstrates different ways rocWMMA may be consumed.

-For more information, refer to [Communication Libraries](../reference/library-index.md)
+For more information, refer to [Communication libraries](../reference/api-libraries.md).

 #### OpenMP enhancements in this release

--- a/tools/autotag/templates/rocm_changes/5.2.3.md
+++ b/tools/autotag/templates/rocm_changes/5.2.3.md
@@ -69,4 +69,4 @@ debugger deployment and management tools
 No notable changes in this release for deployment and management tools.

 For release information for older ROCm releases, refer to
-<https://github.com/RadeonOpenCompute/ROCm/blob/master/CHANGELOG.md>
+<https://github.com/ROCm/ROCm/blob/master/CHANGELOG.md>