Merge branch 'develop' into merge-5.6.1

2026-01-09 14:48:06 -05:00 · 2023-09-05 16:19:10 -06:00
parent ddbe4cd38f 6c0419fb0d
commit d3049169de
39 changed files with 1464 additions and 471 deletions
--- a/.github/workflows/linting.yml
+++ b/.github/workflows/linting.yml
@@ -6,7 +6,7 @@ on:
    - develop
    - main
    - 'docs/*'
-    - 'roc**'    
+    - 'roc**'
  pull_request:
    branches: 
    - develop
@@ -14,47 +14,7 @@ on:
    - 'docs/*'
    - 'roc**'

-concurrency:
-  group: ${{ github.ref }}-${{ github.workflow }}
-  cancel-in-progress: true
-
 jobs:
-  lint-rest:
-    name: "RestructuredText"
-    runs-on: ubuntu-latest
-    steps:
-    - name: Checkout code
-      uses: actions/checkout@v3
-    - name: Install rst-lint
-      run: pip install restructuredtext-lint
-    - name: Lint ResT files
-      run: rst-lint ${{ join(github.workspace, '/docs') }}
-
-  lint-md:
-    name: "Markdown"
-    runs-on: ubuntu-latest
-    steps:
-    - name: Checkout code
-      uses: actions/checkout@v3
-    - name: Use markdownlint-cli2
-      uses: DavidAnson/markdownlint-cli2-action@v10.0.1
-      with:
-        globs: '**/*.md'
-
-  spelling:
-    name: "Spelling"
-    runs-on: ubuntu-latest
-    steps:
-    - name: Checkout code
-      uses: actions/checkout@v3
-    - name: Fetch config
-      shell: sh
-      run: |
-        curl --silent --show-error --fail --location https://raw.github.com/RadeonOpenCompute/rocm-docs-core/develop/.spellcheck.yaml -O
-        curl --silent --show-error --fail --location https://raw.github.com/RadeonOpenCompute/rocm-docs-core/develop/.wordlist.txt >> .wordlist.txt
-    - name: Run spellcheck
-      uses: rojopolis/spellcheck-github-actions@0.30.0
-    - name: On fail
-      if: failure()
-      run: |
-        echo "Please check for spelling mistakes or add them to '.wordlist.txt' in either the root of this project or in rocm-docs-core."
+  call-workflow-passing-data:
+    name: Documentation
+    uses: RadeonOpenCompute/rocm-docs-core/.github/workflows/linting.yml@develop
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -1,8 +1,10 @@
+# building
+matchers
 # file_reorg
 FHS
+incrementing
 Filesystem
 filesystem
-incrementing
 rocm
 # gpu_aware_mpi
 DMA
@@ -27,6 +29,71 @@ MMA
 backends
 cuSOLVER
 cuSPARSE
+# mi200_performance_counters
+ALU
+Arb
+CP
+CPC
+CPF
+CSC
+CSn
+DW
+DWORD
+GDS
+GMI
+GPR
+GRBM
+IOP
+LDS
+MEM
+MFMA
+Noncoherently
+Qcycles
+RW
+Req
+SALU
+SCA
+SENDMSG
+SGPRs
+SMEM
+SPI
+SQs
+TCA
+TCC
+TCI
+TCIU
+TCP
+TCR
+TrapStatus
+UC
+UTCL
+Uncached
+VALU
+VMEM
+VSkipped
+Workgroups
+Writebacks
+addr
+alloc
+cmd
+coalescable
+csn
+endpgm
+inflight
+mtypes
+perfcounter
+preq
+req
+sL
+sendmsg
+tagram
+tg
+uncached
+vL
+workgroups
+writeback
+writebacks
+wrreq
 # openmp
 ICV
 Multithreaded
@@ -47,3 +114,9 @@ precompiled
 # gpu_os_support
 HWE
 el
+# using_gpu_sanitizer
+LSAN
+deallocation
+detections
+tracebacks
+workgroup
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -2,7 +2,7 @@

 AMD values and encourages the ROCm community to contribute to our code and
 documentation. This repository is focused on ROCm documentation and this
-contribution guide describes the recommend method for creating and modifying our
+contribution guide describes the recommended method for creating and modifying our
 documentation.

 While interacting with ROCm Documentation, we encourage you to be polite and
@@ -13,59 +13,47 @@ itself, refer to
 [discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) on the
 GitHub repository.

+For additional information on documentation functionalities,
+see the user and developer guides for rocm-docs-core
+at {doc}`rocm-docs-core documentation <rocm-docs-core:index>`.
+
 ## Supported Formats

-Our documentation includes both markdown and rst files. Markdown is encouraged
-over rst due to the lower barrier to participation. GitHub flavored markdown is preferred
-for all submissions as it will render accurately on our GitHub repositories. For existing documentation,
-[MyST](https://myst-parser.readthedocs.io/en/latest/intro.html) markdown
-is used to implement certain features unsupported in GitHub markdown. This is
+Our documentation includes both Markdown and RST files. Markdown is encouraged
+over RST due to the lower barrier to participation. GitHub-flavored Markdown is preferred
+for all submissions as it renders accurately on our GitHub repositories. For existing documentation,
+[MyST](https://myst-parser.readthedocs.io/en/latest/intro.html) Markdown
+is used to implement certain features unsupported in GitHub Markdown. This is
 not encouraged for new documentation. AMD will transition
-to stricter use of GitHub flavored markdown with a few caveats. ROCm documentation
-also uses [sphinx-design](https://sphinx-design.readthedocs.io/en/latest/index.html)
-in our markdown and rst files. We also will use breathe syntax for doxygen documentation
-in our markdown files. Other design elements for effective HTML rendering of the documents
-may be added to our markdown files. Please see
+to stricter use of GitHub-flavored Markdown with a few caveats. ROCm documentation
+also uses [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html)
+in our Markdown and RST files. We also use Breathe syntax for Doxygen documentation
+in our Markdown files. See
 [GitHub](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github)'s
 guide on writing and formatting on GitHub as a starting point.

-ROCm documentation adds additional requirements to markdown and rst based files
+ROCm documentation adds requirements to Markdown- and RST-based files
 as follows:

 - Level one headers are only used for page titles. There must be only one level
  1 header per file for both Markdown and Restructured Text.
 - Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via
-  our automated github action on a Pull Request (PR).
+  our automated GitHub action on a Pull Request (PR).
+  See the {doc}`rocm-docs-core linting user guide <rocm-docs-core:user_guide/linting>` for more details.

 ## Filenames and folder structure

-Please use snake case for file names. Our documentation follows pitchfork for
-folder structure. All documentation is in /docs except for special files like
-the contributing guide in the / folder. All images used in the documentation are
-place in the /docs/data folder.
-
-## How to provide feedback for for ROCm documentation
-
-There are three standard ways to provide feedback for this repository.
-
-### Pull Request
-
-All contributions to ROCm documentation should arrive via the
-[GitHub Flow](https://docs.github.com/en/get-started/quickstart/github-flow)
-targetting the develop branch of the repository. If you are unable to contribute
-via the GitHub Flow, feel free to email us. TODO, confirm email address.
-
-### GitHub Issue
-
-Issues on existing or absent docs can be filed as [GitHub issues
-](https://github.com/RadeonOpenCompute/ROCm/issues).
-
-### Email Feedback
+Please use snake case (all lowercase letters and underscores instead of spaces)
+for file names. For example, `example_file_name.md`.
+Our documentation follows Pitchfork for folder structure.
+All documentation is in `/docs` except for special files like
+the contributing guide in the `/` folder. All images used in the documentation are
+placed in the `/docs/data` folder.

 ## Language and Style

-Adopting Microsoft CPP-Docs guidelines for [Voice and Tone
-](https://github.com/MicrosoftDocs/cpp-docs/blob/main/styleguide/voice-tone.md).
+Adopt Microsoft CPP-Docs guidelines for
+[Voice and Tone](https://github.com/MicrosoftDocs/cpp-docs/blob/main/styleguide/voice-tone.md).

 ROCm documentation templates to be made public shortly. ROCm templates dictate
 the recommended structure and flow of the documentation. Guidelines on how to
@@ -73,174 +61,11 @@ integrate figures, equations, and tables are all based off
 [MyST](https://myst-parser.readthedocs.io/en/latest/intro.html).

 Font size and selection, page layout, white space control, and other formatting
-details are controlled via rocm-docs-core, sphinx extention. Please raise issues
-in rocm-docs-core for any formatting concerns and changes requested.
+details are controlled via [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
+Raise issues in `rocm-docs-core` for any formatting concerns and changes requested.

-## Building Documentation
+## More

-While contributing, one may build the documentation locally on the command-line
-or rely on Continuous Integration for previewing the resulting HTML pages in a
-browser.
-
-### Command line documentation builds
-
-Python versions known to build documentation:
-
- 3.8
-
-To build the docs locally using Python Virtual Environment (`venv`), execute the
-following commands from the project root:
-
-```sh
-python3 -mvenv .venv
-# Windows
-.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
-.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
-# Linux
-.venv/bin/python     -m pip install -r docs/sphinx/requirements.txt
-.venv/bin/python     -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
-```
-
-Then open up `_build/html/index.html` in your favorite browser.
-
-### Pull Requests documentation builds
-
-When opening a PR to the `develop` branch on GitHub, the page corresponding to
-the PR (`https://github.com/RadeonOpenCompute/ROCm/pull/<pr_number>`) will have
-a summary at the bottom. This requires the user be logged in to GitHub.
-
- There, click `Show all checks` and `Details` of the Read the Docs pipeline. It
-  will take you to `https://readthedocs.com/projects/advanced-micro-devices-rocm/
-  builds/<some_build_num>/`
-  - The list of commands shown are the exact ones used by CI to produce a render
-    of the documentation.
- There, click on the small blue link `View docs` (which is not the same as the
-  bigger button with the same text). It will take you to the built HTML site with
-  a URL of the form `https://
-  advanced-micro-devices-demo--<pr_number>.com.readthedocs.build/projects/alpha/en
-  /<pr_number>/`.
-
-### Build the docs using VS Code
-
-One can put together a productive environment to author documentation and also
-test it locally using VS Code with only a handful of extensions. Even though the
-extension landscape of VS Code is ever changing, here is one example setup that
-proved useful at the time of writing. In it, one can change/add content, build a
-new version of the docs using a single VS Code Task (or hotkey), see all errors/
-warnings emitted by Sphinx in the Problems pane and immediately see the
-resulting website show up on a locally serving web server.
-
-#### Configuring VS Code
-
-1. Install the following extensions:
-
-    - Python (ms-python.python)
-    - Live Server (ritwickdey.LiveServer)
-
-2. Add the following entries in `.vscode/settings.json`
-
-    ```json
-    {
-      "liveServer.settings.root": "/.vscode/build/html",
-      "liveServer.settings.wait": 1000,
-      "python.terminal.activateEnvInCurrentTerminal": true
-    }
-    ```
-
-    The settings in order are set for the following reasons:
-    - Sets the root of the output website for live previews. Must be changed
-      alongside the `tasks.json` command.
-    - Tells live server to wait with the update to give time for Sphinx to
-      regenerate site contents and not refresh before all is don. (Empirical value)
-    - Automatic virtual env activation is a nice touch, should you want to build
-      the site from the integrated terminal.
-
-3. Add the following tasks in `.vscode/tasks.json`
-
-    ```json
-    {
-      "version": "2.0.0",
-      "tasks": [
-        {
-          "label": "Build Docs",
-          "type": "process",
-          "windows": {
-            "command": "${workspaceFolder}/.venv/Scripts/python.exe"
-          },
-          "command": "${workspaceFolder}/.venv/bin/python3",
-          "args": [
-            "-m",
-            "sphinx",
-            "-j",
-            "auto",
-            "-T",
-            "-b",
-            "html",
-            "-d",
-            "${workspaceFolder}/.vscode/build/doctrees",
-            "-D",
-            "language=en",
-            "${workspaceFolder}/docs",
-            "${workspaceFolder}/.vscode/build/html"
-          ],
-          "problemMatcher": [
-            {
-              "owner": "sphinx",
-              "fileLocation": "absolute",
-              "pattern": {
-                "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
-                "file": 1,
-                "line": 2,
-                "severity": 3,
-                "message": 4
-              },
-            },
-            {
-              "owner": "sphinx",
-              "fileLocation": "absolute",
-              "pattern": {
-                "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
-                "file": 1,
-                "severity": 2,
-                "message": 3
-              }
-            }
-          ],
-          "group": {
-            "kind": "build",
-            "isDefault": true
-          }
-        },
-      ],
-    }
-    ```
-
-    > (Implementation detail: two problem matchers were needed to be defined,
-    > because VS Code doesn't tolerate some problem information being potentially
-    > absent. While a single regex could match all types of errors, if a capture
-    > group remains empty (the line number doesn't show up in all warning/error
-    > messages) but the `pattern` references said empty capture group, VS Code
-    > discards the message completely.)
-
-4. Configure Python virtual environment (venv)
-
-    - From the Command Palette, run `Python: Create Environment`
-      - Select `venv` environment and the `docs/sphinx/requirements.txt` file.
-      _(Simply pressing enter while hovering over the file from the dropdown is
-      insufficient, one has to select the radio button with the 'Space' key if
-      using the keyboard.)_
-
-5. Build the docs
-
-    - Launch the default build Task using either:
-      - a hotkey _(default is 'Ctrl+Shift+B')_ or
-      - by issuing the `Tasks: Run Build Task` from the Command Palette.
-
-6. Open the live preview
-
-    - Navigate to the output of the site within VS Code, right-click on
-    `.vscode/build/html/index.html` and select `Open with Live Server`. The
-    contents should update on every rebuild without having to refresh the
-    browser.
-
-<!-- markdownlint-restore -->
+For more topics, such as submitting feedback and ways to build documentation,
+see the [Contributing Section](https://rocm.docs.amd.com/en/latest/contributing.html)
+at [rocm.docs.amd.com](https://rocm.docs.amd.com).
--- a/README.md
+++ b/README.md
@@ -1,42 +1,38 @@
 # AMD ROCm™ Platform

-ROCm™ is an open-source stack for GPU computation. ROCm is primarily Open-Source
-Software (OSS) that allows developers the freedom to customize and tailor their
-GPU software for their own needs while collaborating with a community of other
-developers, and helping each other find solutions in an agile, flexible, rapid
-and secure manner.
+ROCm is an open-source stack, composed primarily of open-source software (OSS), designed for
+graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
+tools, and APIs that enable GPU programming from the low-level kernel to end-user applications.

-ROCm is a collection of drivers, development tools and APIs enabling GPU
-programming from the low-level kernel to end-user applications. ROCm is powered
-by AMD’s Heterogeneous-computing Interface for Portability (HIP), an OSS C++ GPU
-programming environment and its corresponding runtime. HIP allows ROCm
-developers to create portable applications on different platforms by deploying
-code on a range of platforms, from dedicated gaming GPUs to exascale HPC
-clusters. ROCm supports programming models such as OpenMP and OpenCL, and
-includes all the necessary OSS compilers, debuggers and libraries. ROCm is fully
-integrated into ML frameworks such as PyTorch and TensorFlow. ROCm can be
-deployed in many ways, including through the use of containers such as Docker,
-Spack, and your own build from source.
+With ROCm, you can customize your GPU software to meet your specific needs. You can develop,
+collaborate, test, and deploy your applications in a free, open-source, integrated, and secure software
+ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC),
+artificial intelligence (AI), scientific computing, and computer-aided design (CAD).

-ROCm’s goal is to allow our users to maximize their GPU hardware investment.
-ROCm is designed to help develop, test and deploy GPU accelerated HPC, AI,
-scientific computing, CAD, and other applications in a free, open-source,
-integrated and secure software ecosystem.
+ROCm is powered by AMD’s
+[Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm-Developer-Tools/HIP),
+an OSS C++ GPU programming environment and its corresponding runtime. HIP allows ROCm
+developers to create portable applications on different platforms by deploying code on a range of
+platforms, from dedicated gaming GPUs to exascale HPC clusters.

-This repository contains the manifest file for ROCm™ releases, changelogs, and
-release information. The file default.xml contains information for all
-repositories and the associated commit used to build the current ROCm release.
-
-The default.xml file uses the repo Manifest format.
-
-The develop branch of this repository contains content for the next
-ROCm release.
+ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary OSS
+compilers, debuggers, and libraries. ROCm is fully integrated into machine learning (ML) frameworks,
+such as PyTorch and TensorFlow.

 ## ROCm Documentation

-ROCm Documentation is available online at
-[rocm.docs.amd.com](https://rocm.docs.amd.com). Source code for the documenation
-is located in the docs folder of most repositories that are part of ROCm.
+The ROCm Documentation site is [rocm.docs.amd.com](https://rocm.docs.amd.com).
+
+Source code for the documentation is located in the docs folder of most repositories that are part of
+ROCm.
+
+This repository contains the manifest file for ROCm releases, changelogs, and release information.
+The file `default.xml` contains information for all repositories and the associated commit used to build
+the current ROCm release.
+
+The `default.xml` file uses the repo Manifest Format.
+
+The develop branch of this repository contains content for the next ROCm release.

 ### How to build documentation via Sphinx

@@ -48,7 +44,7 @@ pip3 install -r sphinx/requirements.txt
 python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
 ```

-## Older ROCm™ Releases
+## Older ROCm Releases

-For release information for older ROCm™ releases, refer to
-[CHANGELOG](./CHANGELOG.md).
+For release information for older ROCm releases, refer to
+[`CHANGELOG.md`](./CHANGELOG.md).
--- a/default.xml
+++ b/default.xml
@@ -57,6 +57,7 @@ fetch="https://github.com/KhronosGroup/" />
    <project groups="mathlibs" name="rocSOLVER" remote="rocm-swplat" />
    <project groups="mathlibs" name="hipSOLVER" remote="rocm-swplat" />
    <project groups="mathlibs" name="hipSPARSE" remote="rocm-swplat" />
+    <project groups="mathlibs" name="hipSPARSELt" remote="rocm-swplat" />
    <project groups="mathlibs" name="rocALUTION" remote="rocm-swplat" />
    <project name="MIOpen" remote="rocm-swplat" />
    <project groups="mathlibs" name="rccl" remote="rocm-swplat" />
--- a/docs/404.md
+++ b/docs/404.md
@@ -1,6 +1,7 @@
-# 404 Page Not Found
+# 404 - Page Not Found

-Page could not be found.
+```{figure} ./data/AMD-404.png
+:align: center
+```

-Return to [home](./index) or please use the links from the sidebar to find what
-you are looking for.
+Return [home](./index) or use the sidebar navigation to get back on track.
--- a/docs/about.md
+++ b/docs/about.md
@@ -5,70 +5,70 @@ Documentation is built using open source toolchains. Contributions to our
 documentation is encouraged and welcome. As a contributor, please familiarize
 yourself with our documentation toolchain.

-## ReadTheDocs
+## `rocm-docs-core`

-[ReadTheDocs](https://docs.readthedocs.io/en/stable/) is our front end for the
-our documentation. By front end, this is the tool that serves our HTML based
-documentation to our end users.
+[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) is an AMD-maintained
+project that applies customization for our documentation. This
+project is the tool most ROCm repositories use as part of the documentation
+build. It is also available as a [pip package on PyPI](https://pypi.org/project/rocm-docs-core/).

-## Doxygen
-
-[Doxygen](https://www.doxygen.nl/) is the most common inline code documentation
-standard. ROCm projects are use Doxygen for public API documentation (unless the
-upstream project is using a different tool).
+See the user and developer guides for rocm-docs-core at {doc}`rocm-docs-core documentation <rocm-docs-core:index>`.

 ## Sphinx

 [Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator
-originally used for python. It is now widely used in the Open Source community.
-Originally, sphinx supported RST based documentation. Markdown support is now
-available. ROCm documentation plans to default to markdown for new projects.
-Existing projects using RST are under no obligation to convert to markdown. New
-projects that believe markdown is not suitable should contact the documentation
+originally used for Python. It is now widely used in the open-source community.
+Originally, Sphinx supported reStructuredText (RST) based documentation, but
+Markdown support is now available.
+ROCm documentation plans to default to Markdown for new projects.
+Existing projects using RST are under no obligation to convert to Markdown. New
+projects that believe Markdown is not suitable should contact the documentation
 team prior to selecting RST.

+## Read the Docs
+
+[Read the Docs](https://docs.readthedocs.io/en/stable/) is the service that builds
+and hosts the HTML documentation generated using Sphinx for our end users.
+
+## Doxygen
+
+[Doxygen](https://www.doxygen.nl/) is a documentation generator that extracts
+information from inline code.
+ROCm projects typically use Doxygen for public API documentation unless the
+upstream project uses a different tool.
+
+### Breathe
+
+[Breathe](https://www.breathe-doc.org/) is a Sphinx plugin to integrate Doxygen
+content.
+
 ### MyST

 [Markedly Structured Text (MyST)](https://myst-tools.org/docs/spec) is an extended
 flavor of Markdown ([CommonMark](https://commonmark.org/)) influenced by reStructuredText (RST) and Sphinx.
-It is integrated via [`myst-parser`](https://myst-parser.readthedocs.io/en/latest/).
-A cheat sheet that showcases how to use the MyST syntax is available over at [the Jupyter
-reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html).
-
-### Sphinx Theme
-
-ROCm is using the
-[Sphinx Book Theme](https://sphinx-book-theme.readthedocs.io/en/latest/). This
-theme is used by Jupyter books. ROCm documentation applies some customization
-include a header and footer on top of the Sphinx Book Theme. A future custom
-ROCm theme will be part of our documentation goals.
-
-### Sphinx Design
-
-Sphinx Design is an extension for sphinx based websites that add design
-functionality. Please see the documentation
-[here](https://sphinx-design.readthedocs.io/en/latest/index.html). ROCm
-documentation uses sphinx design for grids, cards, and synchronized tabs.
-Other features may be used in the future.
+It is integrated into ROCm documentation by the Sphinx extension [`myst-parser`](https://myst-parser.readthedocs.io/en/latest/).
+A cheat sheet that showcases how to use the MyST syntax is available over at
+the [Jupyter reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html).

 ### Sphinx External TOC

-ROCm uses the
-[sphinx-external-toc](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html)
-for our navigation. This tool allows a YAML file based left navigation menu. This
-tool was selected due to its flexibility that allows scripts to operate on the
+[Sphinx External Table of Contents (TOC)](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html)
+is a Sphinx extension used for ROCm documentation navigation. This tool generates a navigation menu on the left
+based on a YAML file that specifies the table of contents.
+It was selected due to its flexibility that allows scripts to operate on the
 YAML file. Please transition to this file for the project's navigation. You can
-see the `_toc.yml.in` file in this repository in the docs/sphinx folder for an
+see the `_toc.yml.in` file in this repository in the `docs/sphinx` folder for an
 example.

-### Breathe
+### Sphinx Book Theme

-Sphinx uses [Breathe](https://www.breathe-doc.org/) to integrate Doxygen
-content.
+[Sphinx Book Theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme
+that defines the base appearance for ROCm documentation.
+ROCm documentation applies some customization,
+such as a custom header and footer on top of the Sphinx Book Theme.

-## `rocm-docs-core` pip package
+### Sphinx Design

-[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) is an AMD
-maintained project that applies customization for our documentation. This
-project is the tool most ROCm repositories will use as part of the documentation
-build.
+[Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that adds design
+functionality.
+ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs.
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -20,9 +20,8 @@ latex_engine = "xelatex"
 project = "ROCm Documentation"
 author = "Advanced Micro Devices, Inc."
 copyright = "Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved."
-version = "5.6.1"
-release = "5.6.1"
-
+version = "5.7.0"
+release = "5.7.0"

 setting_all_article_info = True
 all_article_info_os = ["linux", "windows"]
--- a/docs/contribute/building.md
+++ b/docs/contribute/building.md
@@ -0,0 +1,165 @@
+# Building Documentation
+
+While contributing, one may build the documentation locally on the command line
+or rely on Continuous Integration for previewing the resulting HTML pages in a
+browser.
+
+## Pull Request documentation builds
+
+When opening a PR to the `develop` branch on GitHub, the page corresponding to
+the PR (`https://github.com/RadeonOpenCompute/ROCm/pull/<pr_number>`) will have
+a summary at the bottom. This requires the user be logged in to GitHub.
+
+- There, click `Show all checks` and `Details` of the Read the Docs pipeline. It
+  will take you to a URL of the form
+  `https://readthedocs.com/projects/advanced-micro-devices-rocm/builds/<some_build_num>/`
+  - The commands shown are the exact ones used by CI to produce a render
+    of the documentation.
+- There, click on the small blue link `View docs` (which is not the same as the
+  bigger button with the same text). It will take you to the built HTML site with
+  a URL of the form
+  `https://advanced-micro-devices-demo--<pr_number>.com.readthedocs.build/projects/alpha/en/<pr_number>/`.
+
+## Build documentation from the Command Line
+
+Python versions known to build documentation:
+
+- 3.8
+
+To build the docs locally using Python Virtual Environment (`venv`), execute the
+following commands from the project root:
+
+```sh
+python3 -m venv .venv
+# Windows
+.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
+.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
+# Linux
+.venv/bin/python     -m pip install -r docs/sphinx/requirements.txt
+.venv/bin/python     -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
+```
+
+Then open up `_build/html/index.html` in your favorite browser.
+
+## Build documentation using Visual Studio (VS) Code
+
+One can put together a productive environment to author documentation and also
+test it locally using VS Code with only a handful of extensions. Even though the
+extension landscape of VS Code is ever changing, here is one example setup that
+proved useful at the time of writing. In it, one can change/add content, build a
+new version of the docs using a single VS Code Task (or hotkey), see all errors/
+warnings emitted by Sphinx in the Problems pane and immediately see the
+resulting website show up on a locally served web server.
+
+### Configuring VS Code
+
+1. Install the following extensions:
+
+    - Python `(ms-python.python)`
+    - Live Server `(ritwickdey.LiveServer)`
+
+2. Add the following entries in `.vscode/settings.json`
+
+    ```json
+    {
+      "liveServer.settings.root": "/.vscode/build/html",
+      "liveServer.settings.wait": 1000,
+      "python.terminal.activateEnvInCurrentTerminal": true
+    }
+    ```
+
+    The settings above are used for the following reasons:
+    - `liveServer.settings.root`: Sets the root of the output website for live previews. Must be changed
+      alongside the `tasks.json` command.
+    - `liveServer.settings.wait`: Tells Live Server to delay the refresh so that Sphinx has time to
+      regenerate the site contents before the page reloads. (Empirical value)
+    - `python.terminal.activateEnvInCurrentTerminal`: Automatic virtual environment activation is a nice touch,
+      should you want to build the site from the integrated terminal.
+
+3. Add the following tasks in `.vscode/tasks.json`
+
+    ```json
+    {
+      "version": "2.0.0",
+      "tasks": [
+        {
+          "label": "Build Docs",
+          "type": "process",
+          "windows": {
+            "command": "${workspaceFolder}/.venv/Scripts/python.exe"
+          },
+          "command": "${workspaceFolder}/.venv/bin/python3",
+          "args": [
+            "-m",
+            "sphinx",
+            "-j",
+            "auto",
+            "-T",
+            "-b",
+            "html",
+            "-d",
+            "${workspaceFolder}/.vscode/build/doctrees",
+            "-D",
+            "language=en",
+            "${workspaceFolder}/docs",
+            "${workspaceFolder}/.vscode/build/html"
+          ],
+          "problemMatcher": [
+            {
+              "owner": "sphinx",
+              "fileLocation": "absolute",
+              "pattern": {
+                "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
+                "file": 1,
+                "line": 2,
+                "severity": 3,
+                "message": 4
+              }
+            },
+            {
+              "owner": "sphinx",
+              "fileLocation": "absolute",
+              "pattern": {
+                "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
+                "file": 1,
+                "severity": 2,
+                "message": 3
+              }
+            }
+          ],
+          "group": {
+            "kind": "build",
+            "isDefault": true
+          }
+        }
+      ]
+    }
+    ```
+
+    > (Implementation detail: two problem matchers had to be defined,
+    > because VS Code doesn't tolerate some problem information being potentially
+    > absent. While a single regex could match all types of errors, if a capture
+    > group remains empty (the line number doesn't show up in all warning/error
+    > messages) but the `pattern` references said empty capture group, VS Code
+    > discards the message completely.)
+
+4. Configure Python virtual environment (`venv`)
+
+    - From the Command Palette, run `Python: Create Environment`
+      - Select `venv` environment and the `docs/sphinx/requirements.txt` file.
+      _(Simply pressing Enter while hovering over the file in the drop-down is
+      insufficient; you have to select the radio button with the 'Space' key if
+      using the keyboard.)_
+
+5. Build the docs
+
+    - Launch the default build Task using either:
+      - a hotkey _(default is `Ctrl+Shift+B`)_ or
+      - by issuing the `Tasks: Run Build Task` command from the Command Palette.
+
+6. Open the live preview
+
+    - Navigate to the output of the site within VS Code, right-click on
+    `.vscode/build/html/index.html` and select `Open with Live Server`. The
+    contents should update on every rebuild without having to refresh the
+    browser.
--- a/docs/contribute/feedback.md
+++ b/docs/contribute/feedback.md
@@ -0,0 +1,27 @@
+# How to provide feedback for ROCm documentation
+
+There are four standard ways to provide feedback for this repository.
+
+## Pull Request
+
+All contributions to ROCm documentation should arrive via the
+[GitHub Flow](https://docs.github.com/en/get-started/quickstart/github-flow)
+targeting the develop branch of the repository. If you are unable to contribute
+via the GitHub Flow, feel free to email us.
+
+## GitHub Discussions
+
+To ask questions or view answers to frequently asked questions, refer to
+[GitHub Discussions](https://github.com/RadeonOpenCompute/ROCm/discussions).
+On GitHub Discussions, in addition to asking and answering questions,
+members can share updates, have open-ended conversations,
+and follow along via public announcements.
+
+## GitHub Issue
+
+Issues on existing or absent docs can be filed as
+[GitHub Issues](https://github.com/RadeonOpenCompute/ROCm/issues).
+
+## Email
+
+Send other feedback or questions to [rocm-feedback@amd.com](mailto:rocm-feedback@amd.com).
--- a/docs/data/AMD-404.png
+++ b/docs/data/AMD-404.png
--- a/docs/data/reference/openmp/openmp_toolchain.svg
+++ b/docs/data/reference/openmp/openmp_toolchain.svg
--- a/docs/deploy/linux/os-native/install.md
+++ b/docs/deploy/linux/os-native/install.md
@@ -166,6 +166,7 @@ section.
 # version
 ver=5.6.1

+
 sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -364,6 +365,7 @@ section.
 # version
 ver=5.6.1

+
 sudo tee /etc/zypp/repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -481,6 +483,7 @@ but are generally useful. Verification of the install is advised.

   ```shell
   export PATH=$PATH:/opt/rocm-5.6.1/bin:/opt/rocm-5.6.1/opencl/bin
+
   ```

   ```{attention}
--- a/docs/deploy/linux/os-native/upgrade.md
+++ b/docs/deploy/linux/os-native/upgrade.md
@@ -28,6 +28,7 @@ repository to the new release.
 # version
 version=5.6.1

+
 # amdgpu repository for focal
 echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/$version/ubuntu focal main" \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list
@@ -42,6 +43,7 @@ sudo apt update
 # version
 version=5.6.1

+
 # amdgpu repository for jammy
 echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/$version/ubuntu jammy main" \
    | sudo tee /etc/apt/sources.list.d/amdgpu.list
@@ -63,6 +65,7 @@ sudo apt update
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -84,6 +87,7 @@ sudo yum clean all
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -105,6 +109,7 @@ sudo yum clean all
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -126,6 +131,7 @@ sudo yum clean all
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -147,6 +153,7 @@ sudo yum clean all
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -173,6 +180,7 @@ sudo yum clean all
 # version
 version=5.6.1

+
 sudo tee /etc/zypp/repos.d/amdgpu.repo <<EOF
 [amdgpu]
 name=amdgpu
@@ -257,6 +265,7 @@ repository to the new release.
 # version
 version=5.6.1

+
 echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/$version focal main" \
    | sudo tee /etc/apt/sources.list.d/rocm.list
 echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
@@ -272,6 +281,7 @@ sudo apt update
 # version
 version=5.6.1

+
 echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/$version jammy main" \
    | sudo tee /etc/apt/sources.list.d/rocm.list
 echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
@@ -293,6 +303,7 @@ sudo apt update
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/rocm.repo <<EOF
 [ROCm-$ver]
 name=ROCm$ver
@@ -313,6 +324,7 @@ sudo yum clean all
 # version
 version=5.6.1

+
 sudo tee /etc/yum.repos.d/rocm.repo <<EOF
 [ROCm-$ver]
 name=ROCm$ver
--- a/docs/deploy/windows/index.md
+++ b/docs/deploy/windows/index.md
@@ -1,4 +1,4 @@
-# Deploy ROCm on Windows
+# Install ROCm (HIP SDK) on Windows

 Start with {doc}`/deploy/windows/quick_start` or follow the detailed
 instructions below.
@@ -39,6 +39,27 @@ Use the command line front-end of the installer.

 ::::

+## Post Installation
+
+::::{grid} 1 1 2 2
+:gutter: 1
+
+:::{grid-item-card} ROCm-Examples
+:link: https://github.com/amd/rocm-examples
+:link-type: url
+
+Learn how to use ROCm with descriptive examples for novice to intermediate users.
+:::
+
+:::{grid-item-card} Windows App Deployment Guidelines
+:link: ../../understand/windows-app-deployment-guidelines
+:link-type: doc
+
+Discusses strategies on how to bundle HIP libraries with an end user application.
+:::
+
+::::
+
 ## See Also

 - {doc}`/release/gpu_os_support`
--- a/docs/deploy/windows/prerequisites.md
+++ b/docs/deploy/windows/prerequisites.md
@@ -6,16 +6,16 @@ system meets all the requirements to proceed with the installation.
 ## Confirm the System Is Supported

 The ROCm installation is supported only on specific host architectures, Windows
-SKUs and update versions.
+Editions and update versions.

-### Check the Windows SKU and Update Version on Your System
+### Check the Windows Edition and Update Version on Your System

 This section discusses obtaining information about the host architecture,
-Windows SKU and update version.
+Windows Edition and update version.

 #### Command Line Check

-Verify the Windows SKU using the following steps:
+Verify the Windows Edition using the following steps:

 1. To obtain the Linux distribution information, type the following command on
   your system from a PowerShell Command Line Interface (CLI):
--- a/docs/how_to/gpu_aware_mpi.md
+++ b/docs/how_to/gpu_aware_mpi.md
@@ -66,11 +66,8 @@ cd ucx
 ./autogen.sh
 mkdir build
 cd build
-../contrib/configure-release -prefix=$UCX_DIR \
-    --with-rocm=/opt/rocm \
-    --without-cuda -enable-optimizations -disable-logging \
-    --disable-debug -disable-assertions \
-    --disable-params-check -without-java
+../configure --prefix=$UCX_DIR \
+    --with-rocm=/opt/rocm
 make -j $(nproc)
 make -j $(nproc) install
 ```
@@ -93,9 +90,7 @@ cd ompi
 mkdir build
 cd build
 ../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
-    --with-rocm=/opt/rocm \
-    --enable-mca-no-build=btl-uct --enable-mpi1-compatibility \
-    CC=clang CXX=clang++ FC=flang
+    --with-rocm=/opt/rocm
 make -j $(nproc)
 make -j $(nproc) install
 ```
@@ -165,7 +160,12 @@ Inter-GPU bandwidth with various payload sizes.
 Collective Operations on GPU buffers are best handled through the
 Unified Collective Communication Library (UCC) component in Open MPI.
 For this, the UCC library has to be configured and compiled with ROCm
-support. An example for configuring UCC and Open MPI with ROCm support
+support.
+
+See the compatibility [table](../release/3rd_party_support_matrix.md#communication-libraries)
+for which UCC versions work with which ROCm versions.
+
+An example for configuring UCC and Open MPI with ROCm support
 is shown below:

 ```shell
--- a/docs/how_to/system_debugging.md
+++ b/docs/how_to/system_debugging.md
@@ -64,5 +64,4 @@ Debug messages when developing/debugging base ROCm driver. You could enable the

 ## PCIe-Debug

-Refer to ROCm PCIe Debug, <a href="https://rocmdocs.amd.com/en/latest/Other_Solutions/PCIe-Debug.html#pcie-debug" target="_blank">https://rocmdocs.amd.com/en/latest/Other_Solutions/PCIe-Debug.html#pcie-debug</a>.
 For information on how to debug and profile HIP applications, see {doc}`hip:how_to_guides/debugging`
--- a/docs/index.md
+++ b/docs/index.md
@@ -5,11 +5,10 @@

 ::::{grid-item}
 :::{dropdown} [What is ROCm?](rocm)
-ROCm is an open-source stack for GPU computation. ROCm is primarily
-Open-Source Software (OSS) that allows developers the freedom to customize and
-tailor their GPU software for their own needs while collaborating with a
-community of other developers, and helping each other find solutions in an
-agile, flexible, rapid and secure manner. [more...](rocm)
+ROCm is an open-source stack, composed primarily of open-source software (OSS), designed for
+graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
+tools, and APIs that enable GPU programming from low-level kernel to end-user applications.
+[more...](rocm)

 ::::

--- a/docs/license.md
+++ b/docs/license.md
@@ -0,0 +1,6 @@
+# License
+
+> Note: This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm), which primarily contains documentation. For other licensing information, see the [Licensing Terms page](./release/licensing).
+
+```{include} ../LICENSE
+```
--- a/docs/reference/all.md
+++ b/docs/reference/all.md
@@ -29,6 +29,7 @@ ROCm template libraries for C++ primitives and algorithms are as follows:
 - {doc}`rocPRIM <rocprim:index>`
 - {doc}`rocThrust <rocthrust:index>`
 - {doc}`hipCUB <hipcub:index>`
+- {doc}`hipTensor <hiptensor:index>`

 :::

--- a/docs/reference/gpu_libraries/c++_primitives.md
+++ b/docs/reference/gpu_libraries/c++_primitives.md
@@ -40,4 +40,14 @@ interface. It's back-end is rocPRIM.

 :::

+:::{grid-item-card} {doc}`hipTensor <hiptensor:index>`
+hipTensor is AMD's C++ library for accelerating tensor primitives
+based on the composable kernel library,
+through general purpose kernel languages, like HIP C++.
+
+- {doc}`Documentation <hiptensor:index>`
+- [GitHub](https://github.com/ROCmSoftwarePlatform/hipTensor)
+
+:::
+
 :::::
--- a/docs/reference/gpu_libraries/linear_algebra.md
+++ b/docs/reference/gpu_libraries/linear_algebra.md
@@ -97,4 +97,13 @@ supporting both `rocSPARSE` and `cuSPARSE` as backends.

 :::

+:::{grid-item-card} {doc}`hipSPARSELt <hipsparselt:index>`
+`hipSPARSELt` is a marshalling library to provide sparse BLAS functionality,
+supporting both `rocSPARSELt` and `cuSPARSELt` as backends.
+
+- {doc}`Documentation <hipsparselt:index>`
+- [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSELt)
+
+:::
+
 :::::
--- a/docs/reference/gpu_libraries/math.md
+++ b/docs/reference/gpu_libraries/math.md
@@ -1,6 +1,6 @@
 # Math Libraries

-AMD provides various math domain and support libraries as part of the ROCm.
+AMD provides various math domain and support libraries as part of ROCm.

 ## rocLIB vs. hipLIB

@@ -26,6 +26,7 @@ at compile-time of the hipLIB in question. For dynamic dispatch between vendor i
 - {doc}`hipSOLVER <hipsolver:index>`
 - {doc}`rocSPARSE <rocsparse:index>`
 - {doc}`hipSPARSE <hipsparse:index>`
+- {doc}`hipSPARSELt <hipsparselt:index>`

 :::

--- a/docs/reference/openmp/openmp.md
+++ b/docs/reference/openmp/openmp.md
@@ -11,6 +11,11 @@ OpenMP toolchain, example usage of device offloading, and usage of `rocprof`
 with OpenMP applications. The GPUs supported are the same as those supported by
 this ROCm release. See the list of supported GPUs in {doc}`/release/gpu_os_support`.

+The ROCm OpenMP compiler is implemented using LLVM compiler technology.
+{numref}`openmp-toolchain` illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.
+
+![OpenMP Toolchain](../../data/reference/openmp/openmp_toolchain.svg)
+
 ### Installation

 The OpenMP toolchain is automatically installed as part of the standard ROCm
@@ -107,8 +112,7 @@ code compiled with AOMP:
   options --list-basic and --list-derived. `rocprof` accepts either a text or
   an XML file as an input.

-For more details on `rocprof`, refer to the ROCm Profiling Tools document on
-{doc}`rocprofiler:rocprof`.
+For more details on `rocprof`, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.

 ### Using Tracing Options

@@ -134,20 +138,21 @@ Google Chrome at chrome://tracing/ or [Perfetto](https://perfetto.dev/).
 Navigate to Chrome or Perfetto and load the JSON file to see the timeline of the
 HSA calls.

-For more details on tracing, refer to the ROCm Profiling Tools document on
-{doc}`rocprofiler:rocprof`.
+For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.

 ### Environment Variables

 :::{table}
 :widths: auto
-| Environment Variable        | Description                  |
+| Environment Variable        | Purpose                  |
 | --------------------------- | ---------------------------- |
-| `OMP_NUM_TEAMS`             | The implementation chooses the number of teams for kernel launch. The user can change this number for performance tuning using this environment variable, subject to implementation limits. |
-| `LIBOMPTARGET_KERNEL_TRACE` | This environment variable is used to print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
-| `LIBOMPTARGET_INFO`         | This environment variable is used to print informational messages from the device runtime as the program executes. Users can request fine-grain information by setting it to the value of 1 or higher and can set the value of -1 for complete information. |
-| `LIBOMPTARGET_DEBUG`        | If a debug version of the device library is present, setting this environment variable to 1 and using that library emits further detailed debugging information about data transfer operations and kernel launch. |
-| `GPU_MAX_HW_QUEUES`         | This environment variable is used to set the number of HSA queues in the OpenMP runtime. |
+| `OMP_NUM_TEAMS`             | To set the number of teams for kernel launch, which the implementation otherwise chooses. You can set this number (subject to implementation limits) for performance tuning. |
+| `LIBOMPTARGET_KERNEL_TRACE` | To print useful statistics for device operations. Setting it to 1 and running the program emits the name of every kernel launched, the number of teams and threads used, and the corresponding register usage. Setting it to 2 additionally emits timing information for kernel launches and data transfer operations between the host and the device. |
+| `LIBOMPTARGET_INFO`         | To print informational messages from the device runtime as the program executes. Setting it to a value of 1 or higher prints fine-grained information, and setting it to -1 prints complete information. |
+| `LIBOMPTARGET_DEBUG`        | To get detailed debugging information about data transfer operations and kernel launch when using a debug version of the device library. Set this environment variable to 1 to get the detailed information from the library. |
+| `GPU_MAX_HW_QUEUES`         | To set the number of HSA queues in the OpenMP runtime. The HSA queues are created on demand up to the maximum value supplied here. The queue creation starts with a single initialized queue to avoid unnecessary allocation of resources. The provided value is capped if it exceeds the recommended, device-specific value. |
+| `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` | To set the threshold size up to which data transfers are initiated asynchronously. The default threshold size is `1*1024*1024` bytes (1 MB). |
+| `OMPX_FORCE_SYNC_REGIONS` | To force the runtime to execute all operations synchronously, i.e., wait for an operation to complete immediately. This affects data transfers and kernel execution. While it is mainly designed for debugging, it may have a minor positive effect on performance in certain situations. |
 :::

 ## OpenMP: Features
@@ -159,10 +164,17 @@ implemented in the past releases.

 ### Asynchronous Behavior in OpenMP Target Regions

- Multithreaded offloading on the same device
+- Controlling Asynchronous Behavior
+
+The OpenMP offloading runtime executes in an asynchronous fashion by default, allowing multiple data transfers to start concurrently. However, if the data to be transferred exceeds the default threshold of 1 MB, the runtime falls back to a synchronous data transfer. Transfers involving buffers that are already locked are always executed asynchronously.
+You can override this default behavior by setting `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` and `OMPX_FORCE_SYNC_REGIONS`. See the [Environment Variables](#environment-variables) table for details.
+
+- Multithreaded Offloading on the Same Device
+
 The `libomptarget` plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.

- Parallel memory copy invocations
+- Parallel Memory Copy Invocations
+
 Implicit asynchronous execution of single target region enables parallel memory copy invocations.

 ### Unified Shared Memory
@@ -317,8 +329,10 @@ double a = 0.0;
 a = a + 1.0;
 ```

-NOTE `AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and
+:::{note}
+`AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and
 `AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.
+:::

 To disable the generation of fast floating-point atomic instructions at the file
 level, build using the option `-msafe-fp-atomics` or use a hint clause on a
--- a/docs/reference/rocmcc/rocmcc.md
+++ b/docs/reference/rocmcc/rocmcc.md
@@ -1109,7 +1109,7 @@ The following table lists the other Clang options and their support status.
 |-ftime-trace|Supported|Turns on time profiler. Generates JSON file based on output filename|
 |-ftrap-function= \<value\>|Unsupported|Issues call to specified function rather than a trap instruction|
 |-ftrapv-handler= \<function name\>|Unsupported|Specifies the function to be called on overflow|
- |-ftrapv|Unsupported|Traps on integer overflow|
+ |-ftrapv|Supported|Traps on integer overflow|
 |-ftrigraphs|Supported|Processes trigraph sequences|
 |-ftrivial-auto-var-init-stop-after= \<value\>|Supported|Stops initializing trivial automatic stack variables after the specified number of instances|
 |-ftrivial-auto-var-init= \<value\>|Supported|Initializes trivial automatic stack variables. Values: uninitialized (default) / pattern|
--- a/docs/release/3rd_party_support_matrix.md
+++ b/docs/release/3rd_party_support_matrix.md
@@ -32,6 +32,14 @@ UCX version | ROCm 5.4 and older | ROCm 5.5 and newer |
 | -1.14.0   | COMPATIBLE         | INCOMPATIBLE       |
 |  1.14.1+  | COMPATIBLE         | COMPATIBLE         |

+The Unified Collective Communication Library [UCC](https://github.com/openucx/ucc)
+also has support for ROCm devices.
+
+UCC version | ROCm 5.5 and older | ROCm 5.6 and newer |
+|:----------|:------------------:|:------------------:|
+| -1.1.0    | COMPATIBLE         | INCOMPATIBLE       |
+|  1.2.0+   | COMPATIBLE         | COMPATIBLE         |
+
 ## Algorithm libraries

 ROCm releases provide algorithm libraries with interfaces compatible with
--- a/docs/release/gpu_os_support.md
+++ b/docs/release/gpu_os_support.md
@@ -58,8 +58,6 @@ ROCm supports virtualization for select GPUs only as shown below.
 | VMWare         | ESXi 8   | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
 | VMWare         | ESXi 7   | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |

-(supported_gpus)=
-
 ## Linux Supported GPUs

 The table below shows supported GPUs for Instinct™, Radeon Pro™ and Radeon™
--- a/docs/release/licensing.md
+++ b/docs/release/licensing.md
@@ -8,61 +8,63 @@ The table shows ROCm components, the name of license and link to the license ter
 The table is ordered to follow ROCm's manifest file.

 <!-- spellcheck-disable -->
-| Component                                                                                        | License                                                                                                                    |
+| Component | License |
 |:------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
-| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/)                   | [GPL 2.0 WITH Linux-syscall-note](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/COPYING)             |
-| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/)               | [MIT](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/LICENSE.md)                                    |
-| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/)                               | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/LICENSE.txt)               |
-| [rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib/)                               | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/License.txt)               |
-| [rocm-cmake](https://github.com/RadeonOpenCompute/rocm-cmake/)                                   | [MIT](https://github.com/RadeonOpenCompute/rocm-cmake/blob/develop/LICENSE)                                                |
-| [rocminfo](https://github.com/RadeonOpenCompute/rocminfo/)                                       | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocminfo/blob/master/License.txt)                   |
-| [rocprofiler](https://github.com/ROCm-Developer-Tools/rocprofiler/)                              | [MIT](https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/LICENSE)                                         |
-| [roctracer](https://github.com/ROCm-Developer-Tools/roctracer/)                                  | [MIT](https://github.com/ROCm-Developer-Tools/roctracer/blob/amd-master/LICENSE)                                           |
-| [ROCm-OpenCL-Runtime](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/)                 | [MIT](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/develop/LICENSE.txt)                                   |
-| [ROCm-OpenCL-Runtime/api/opencl/khronos/icd](https://github.com/KhronosGroup/OpenCL-ICD-Loader/) | [Apache 2.0](https://github.com/KhronosGroup/OpenCL-ICD-Loader/blob/main/LICENSE)                                          |
-| [clang-ocl](https://github.com/RadeonOpenCompute/clang-ocl/)                                     | [MIT](https://github.com/RadeonOpenCompute/clang-ocl/blob/master/LICENSE)                                                  |
-| [HIP](https://github.com/ROCm-Developer-Tools/HIP/)                                              | [MIT](https://github.com/ROCm-Developer-Tools/HIP/blob/develop/LICENSE.txt)                                                |
-| [hipamd](https://github.com/ROCm-Developer-Tools/hipamd/)                                        | [MIT](https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/LICENSE.txt)                                             |
-| [ROCclr](https://github.com/ROCm-Developer-Tools/ROCclr/)                                        | [MIT](https://github.com/ROCm-Developer-Tools/ROCclr/blob/develop/LICENSE.txt)                                             |
-| [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/)                                        | [MIT](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/LICENSE.txt)                                         |
-| [HIPCC](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt)                  | [MIT](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt)                                              |
-| [llvm-project](https://github.com/ROCm-Developer-Tools/llvm-project/)                            | [Apache](https://github.com/ROCm-Developer-Tools/llvm-project/blob/main/LICENSE.TXT)                                       |
-| rocm-llvm-alt                                                                                    | [AMD Proprietary License](https://www.amd.com/en/support/amd-software-eula)
-| [ROCm-Device-Libs](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/)                       | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/LICENSE.TXT)     |
-| [atmi](https://github.com/RadeonOpenCompute/atmi/)                                               | [MIT](https://github.com/RadeonOpenCompute/atmi/blob/master/LICENSE.txt)                                                   |
-| [ROCm-CompilerSupport](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/)               | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/blob/amd-stg-open/LICENSE.txt) |
-| [rocr_debug_agent](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/)                    | [The University of Illinois/NCSA](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/blob/master/LICENSE.txt)        |
-| [rocm_bandwidth_test](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/)                 | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/blob/master/LICENSE.txt)        |
-| [half](https://github.com/ROCmSoftwarePlatform/half/)                                            | [MIT](https://github.com/ROCmSoftwarePlatform/half/blob/master/LICENSE.txt)                                                |
-| [RCP](https://github.com/GPUOpen-Tools/radeon_compute_profiler/)                                 | [MIT](https://github.com/GPUOpen-Tools/radeon_compute_profiler/blob/master/LICENSE)                                        |
-| [ROCgdb](https://github.com/ROCm-Developer-Tools/ROCgdb/)                                        | [GNU General Public License v2.0](https://github.com/ROCm-Developer-Tools/ROCgdb/blob/amd-master/COPYING)                  |
-| [ROCdbgapi](https://github.com/ROCm-Developer-Tools/ROCdbgapi/)                                  | [MIT](https://github.com/ROCm-Developer-Tools/ROCdbgapi/blob/amd-master/LICENSE.txt)                                       |
-| [rdc](https://github.com/RadeonOpenCompute/rdc/)                                                 | [MIT](https://github.com/RadeonOpenCompute/rdc/blob/master/LICENSE)                                                        |
-| [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/LICENSE.md)                                             |
-| [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/Tensile/blob/develop/LICENSE.md)                                             |
-| [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/LICENSE.md)                                             |
-| [rocFFT](https://github.com/ROCmSoftwarePlatform/rocFFT/)                                        | [MIT](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/LICENSE.md)                                              |
-| [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT/)                                        | [MIT](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/LICENSE.md)                                              |
-| [rocRAND](https://github.com/ROCmSoftwarePlatform/rocRAND/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/LICENSE.txt)                                            |
-| [rocSPARSE](https://github.com/ROCmSoftwarePlatform/rocSPARSE/)                                  | [MIT](https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/develop/LICENSE.md)                                           |
-| [rocSOLVER](https://github.com/ROCmSoftwarePlatform/rocSOLVER/)                                  | [BSD-2-Clause](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/LICENSE.md)                                           |
-| [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER/)                                  | [MIT](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/LICENSE.md)                                           |
-| [hipSPARSE](https://github.com/ROCmSoftwarePlatform/hipSPARSE/)                                  | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSE/blob/develop/LICENSE.md)                                           |
-| [rocALUTION](https://github.com/ROCmSoftwarePlatform/rocALUTION/)                                | [MIT](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/LICENSE.md)                                          |
-| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/)                                | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/blob/master/LICENSE.txt)                                          |
-| [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen/)                                        | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/LICENSE.txt)                                              |
-| [rccl](https://github.com/ROCmSoftwarePlatform/rccl/)                                            | [Custom](https://github.com/ROCmSoftwarePlatform/rccl/blob/develop/LICENSE.txt)                                            |
-| [MIVisionX](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/)                 | [MIT](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/LICENSE.txt)                          |
-| [rocThrust](https://github.com/ROCmSoftwarePlatform/rocThrust/)                                  | [Apache 2.0](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/LICENSE)                                       |
-| [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB/)                                        | [Custom](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/LICENSE.txt)                                          |
-| [rocPRIM](https://github.com/ROCmSoftwarePlatform/rocPRIM/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/LICENSE.txt)                                            |
-| [rocWMMA](https://github.com/ROCmSoftwarePlatform/rocWMMA/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/LICENSE.md)                                             |
-| [hipfort](https://github.com/ROCmSoftwarePlatform/hipfort/)                                      | [MIT](https://github.com/ROCmSoftwarePlatform/hipfort/blob/master/LICENSE)                                                 |
-| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/)                              | [MIT](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/LICENSE)                                            |
-| [ROCmValidationSuite](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/)              | [MIT](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/LICENSE)                                     |
-| [aomp](https://github.com/ROCm-Developer-Tools/aomp/)                                            | [Apache 2.0](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/LICENSE)                                           |
-| [aomp-extras](https://github.com/ROCm-Developer-Tools/aomp-extras/)                              | [MIT](https://github.com/ROCm-Developer-Tools/aomp-extras/blob/aomp-dev/LICENSE)                                           |
-| [flang](https://github.com/ROCm-Developer-Tools/flang/)                                          | [Apache 2.0](https://github.com/ROCm-Developer-Tools/flang/blob/master/LICENSE.txt)                                        |
+| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/) | [MIT](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/LICENSE) |
+| [HIPCC](https://github.com/ROCm-Developer-Tools/HIPCC/) | [MIT](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) |
+| [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/) | [MIT](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/LICENSE.txt) |
+| [HIP](https://github.com/ROCm-Developer-Tools/HIP/) | [MIT](https://github.com/ROCm-Developer-Tools/HIP/blob/develop/LICENSE.txt) |
+| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/) | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/blob/master/LICENSE.txt) |
+| [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen/) | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/LICENSE.txt) |
+| [MIVisionX](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/) | [MIT](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/LICENSE.txt) |
+| [RCP](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | [MIT](https://github.com/GPUOpen-Tools/radeon_compute_profiler/blob/master/LICENSE) |
+| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/) | [GPL 2.0 WITH Linux-syscall-note](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/COPYING) |
+| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/LICENSE.txt) |
+| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | [MIT](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/LICENSE.md) |
+| [ROCclr](https://github.com/ROCm-Developer-Tools/ROCclr/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCclr/blob/develop/LICENSE.txt) |
+| [ROCdbgapi](https://github.com/ROCm-Developer-Tools/ROCdbgapi/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCdbgapi/blob/amd-master/LICENSE.txt) |
+| [ROCgdb](https://github.com/ROCm-Developer-Tools/ROCgdb/) | [GNU General Public License v2.0](https://github.com/ROCm-Developer-Tools/ROCgdb/blob/amd-master/COPYING) |
+| [ROCm-CompilerSupport](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/blob/amd-stg-open/LICENSE.txt) |
+| [ROCm-Device-Libs](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/LICENSE.TXT) |
+| [ROCm-OpenCL-Runtime/api/opencl/khronos/icd](https://github.com/KhronosGroup/OpenCL-ICD-Loader/) | [Apache 2.0](https://github.com/KhronosGroup/OpenCL-ICD-Loader/blob/main/LICENSE) |
+| [ROCm-OpenCL-Runtime](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/) | [MIT](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/develop/LICENSE.txt) |
+| [ROCmValidationSuite](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/LICENSE) |
+| [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile/) | [MIT](https://github.com/ROCmSoftwarePlatform/Tensile/blob/develop/LICENSE.md) |
+| [aomp-extras](https://github.com/ROCm-Developer-Tools/aomp-extras/) | [MIT](https://github.com/ROCm-Developer-Tools/aomp-extras/blob/aomp-dev/LICENSE) |
+| [aomp](https://github.com/ROCm-Developer-Tools/aomp/) | [Apache 2.0](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/LICENSE) |
+| [atmi](https://github.com/RadeonOpenCompute/atmi/) | [MIT](https://github.com/RadeonOpenCompute/atmi/blob/master/LICENSE.txt) |
+| [clang-ocl](https://github.com/RadeonOpenCompute/clang-ocl/) | [MIT](https://github.com/RadeonOpenCompute/clang-ocl/blob/master/LICENSE) |
+| [flang](https://github.com/ROCm-Developer-Tools/flang/) | [Apache 2.0](https://github.com/ROCm-Developer-Tools/flang/blob/master/LICENSE.txt) |
+| [half](https://github.com/ROCmSoftwarePlatform/half/) | [MIT](https://github.com/ROCmSoftwarePlatform/half/blob/master/LICENSE.txt) |
+| [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/LICENSE.md) |
+| [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB/) | [Custom](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/LICENSE.txt) |
+| [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/LICENSE.md) |
+| [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/LICENSE.md) |
+| [hipSPARSELt](https://github.com/ROCmSoftwarePlatform/hipSPARSELt/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSELt/blob/develop/LICENSE.md) |
+| [hipSPARSE](https://github.com/ROCmSoftwarePlatform/hipSPARSE/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSE/blob/develop/LICENSE.md) |
+| [hipTensor](https://github.com/ROCmSoftwarePlatform/hipTensor) | [MIT](https://github.com/ROCmSoftwarePlatform/hipTensor/blob/develop/LICENSE) |
+| [hipamd](https://github.com/ROCm-Developer-Tools/hipamd/) | [MIT](https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/LICENSE.txt) |
+| [hipfort](https://github.com/ROCmSoftwarePlatform/hipfort/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipfort/blob/master/LICENSE) |
+| [llvm-project](https://github.com/ROCm-Developer-Tools/llvm-project/) | [Apache](https://github.com/ROCm-Developer-Tools/llvm-project/blob/main/LICENSE.TXT) |
+| [rccl](https://github.com/ROCmSoftwarePlatform/rccl/) | [Custom](https://github.com/ROCmSoftwarePlatform/rccl/blob/develop/LICENSE.txt) |
+| [rdc](https://github.com/RadeonOpenCompute/rdc/) | [MIT](https://github.com/RadeonOpenCompute/rdc/blob/master/LICENSE) |
+| [rocALUTION](https://github.com/ROCmSoftwarePlatform/rocALUTION/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/LICENSE.md) |
+| [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/LICENSE.md) |
+| [rocFFT](https://github.com/ROCmSoftwarePlatform/rocFFT/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/LICENSE.md) |
+| [rocPRIM](https://github.com/ROCmSoftwarePlatform/rocPRIM/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/LICENSE.txt) |
+| [rocRAND](https://github.com/ROCmSoftwarePlatform/rocRAND/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/LICENSE.txt) |
+| [rocSOLVER](https://github.com/ROCmSoftwarePlatform/rocSOLVER/) | [BSD-2-Clause](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/LICENSE.md) |
+| [rocSPARSE](https://github.com/ROCmSoftwarePlatform/rocSPARSE/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/develop/LICENSE.md) |
+| [rocThrust](https://github.com/ROCmSoftwarePlatform/rocThrust/) | [Apache 2.0](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/LICENSE) |
+| [rocWMMA](https://github.com/ROCmSoftwarePlatform/rocWMMA/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/LICENSE.md) |
+| [rocm-cmake](https://github.com/RadeonOpenCompute/rocm-cmake/) | [MIT](https://github.com/RadeonOpenCompute/rocm-cmake/blob/develop/LICENSE) |
+| [rocm_bandwidth_test](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/blob/master/LICENSE.txt) |
+| [rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/License.txt) |
+| [rocminfo](https://github.com/RadeonOpenCompute/rocminfo/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocminfo/blob/master/License.txt) |
+| [rocprofiler](https://github.com/ROCm-Developer-Tools/rocprofiler/) | [MIT](https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/LICENSE) |
+| [rocr_debug_agent](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/) | [The University of Illinois/NCSA](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/blob/master/LICENSE.txt) |
+| [roctracer](https://github.com/ROCm-Developer-Tools/roctracer/) | [MIT](https://github.com/ROCm-Developer-Tools/roctracer/blob/amd-master/LICENSE) |
+| rocm-llvm-alt | [AMD Proprietary License](https://www.amd.com/en/support/amd-software-eula) |

 Open sourced ROCm components are released via public GitHub
 repositories, packages on https://repo.radeon.com and other distribution channels.
--- a/docs/release/versions.md
+++ b/docs/release/versions.md
@@ -0,0 +1,23 @@
+# ROCm Release History
+
+| Version | Release Date |
+| ------- | ------------ |
+| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 |
+| [5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 |
+| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 |
+| [5.4.3](https://rocm.docs.amd.com/en/docs-5.4.3/) | Feb 7, 2023 |
+| [5.4.2](https://rocm.docs.amd.com/en/docs-5.4.2/) | Jan 13, 2023 |
+| [5.4.1](https://rocm.docs.amd.com/en/docs-5.4.1/) | Dec 15, 2022 |
+| [5.4.0](https://rocm.docs.amd.com/en/docs-5.4.0/) | Nov 30, 2022 |
+| [5.3.3](https://rocm.docs.amd.com/en/docs-5.3.3/) | Nov 17, 2022 |
+| [5.3.2](https://rocm.docs.amd.com/en/docs-5.3.2/) | Nov 9, 2022 |
+| [5.3.0](https://rocm.docs.amd.com/en/docs-5.3.0/) | Oct 4, 2022 |
+| [5.2.3](https://rocm.docs.amd.com/en/docs-5.2.3/) | Aug 18, 2022 |
+| [5.2.1](https://rocm.docs.amd.com/en/docs-5.2.1/) | Jul 21, 2022 |
+| [5.2.0](https://rocm.docs.amd.com/en/docs-5.2.0/) | Jun 28, 2022 |
+| [5.1.3](https://rocm.docs.amd.com/en/docs-5.1.3/) | May 20, 2022 |
+| [5.1.1](https://rocm.docs.amd.com/en/docs-5.1.1/) | Apr 8, 2022 |
+| [5.1.0](https://rocm.docs.amd.com/en/docs-5.1.0/) | Mar 30, 2022 |
+| [5.0.2](https://rocm.docs.amd.com/en/docs-5.0.2/) | Mar 4, 2022 |
+| [5.0.1](https://rocm.docs.amd.com/en/docs-5.0.1/) | Feb 16, 2022 |
+| [5.0.0](https://rocm.docs.amd.com/en/docs-5.0.0/) | Feb 9, 2022 |
--- a/docs/rocm.md
+++ b/docs/rocm.md
@@ -1,27 +1,23 @@
 # What is ROCm?

-ROCm is an open-source stack for GPU computation. ROCm is primarily Open-Source
-Software (OSS) that allows developers the freedom to customize and tailor their
-GPU software for their own needs while collaborating with a community of other
-developers, and helping each other find solutions in an agile, flexible, rapid
-and secure manner.
+ROCm is an open-source stack, composed primarily of open-source software (OSS), designed for
+graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
+tools, and APIs that enable GPU programming from the low-level kernel to end-user applications.

-ROCm is a collection of drivers, development tools and APIs enabling GPU
-programming from the low-level kernel to end-user applications. ROCm is powered
-by AMD’s Heterogeneous-computing Interface for Portability (HIP), an OSS C++ GPU
-programming environment and its corresponding runtime. HIP allows ROCm
-developers to create portable applications on different platforms by deploying
-code on a range of platforms, from dedicated gaming GPUs to exascale HPC
-clusters. ROCm supports programming models such as OpenMP and OpenCL, and
-includes all the necessary OSS compilers, debuggers and libraries. ROCm is fully
-integrated into ML frameworks such as PyTorch and TensorFlow. ROCm can be
-deployed in many ways, including through the use of containers such as Docker,
-Spack, and your own build from source.
+With ROCm, you can customize your GPU software to meet your specific needs. You can develop,
+collaborate, test, and deploy your applications in a free, open-source, integrated, and secure software
+ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC),
+artificial intelligence (AI), scientific computing, and computer-aided design (CAD).

-ROCm’s goal is to allow our users to maximize their GPU hardware investment.
-ROCm is designed to help develop, test and deploy GPU accelerated HPC, AI,
-scientific computing, CAD, and other applications in a free, open-source,
-integrated and secure software ecosystem.
+ROCm is powered by AMD’s
+[Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm-Developer-Tools/HIP),
+an OSS C++ GPU programming environment and its corresponding runtime. HIP allows ROCm
+developers to create portable applications by deploying code on a range of platforms,
+from dedicated gaming GPUs to exascale HPC clusters.
+
+ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary OSS
+compilers, debuggers, and libraries. ROCm is fully integrated into machine learning (ML) frameworks,
+such as PyTorch and TensorFlow.

 ## ROCm on Windows

--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -74,6 +74,7 @@ subtrees:
    title: Changelog
  - file: release/gpu_os_support
  - file: release/windows_support
+  - file: release/versions
  - url: https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue
    title: Known Issues
  - file: release/compatibility
@@ -120,6 +121,8 @@ subtrees:
                url: ${project:rocsparse}
              - title: hipSPARSE
                url: ${project:hipsparse}
+              - title: hipSPARSELt
+                url: ${project:hipsparselt}
          - file: reference/gpu_libraries/fft
            subtrees:
            - entries: 
@@ -140,12 +143,15 @@ subtrees:
        - entries:
          - title: rocPRIM
            url: ${project:rocprim}
+        - entries:
+          - title: rocThrust
+            url: ${project:rocthrust}
        - entries:
          - title: hipCUB 
            url: ${project:hipcub}
        - entries:
-          - title: rocThrust
-            url: ${project:rocthrust}
+          - title: hipTensor 
+            url: ${project:hiptensor}
    - file: reference/gpu_libraries/communication
      title: Communication Libraries
      subtrees:
@@ -181,9 +187,9 @@ subtrees:
          - url: ${project:rocgdb}
            title: ROCgdb
          - url: ${project:rocprofiler}
-            title: rocprofiler
+            title: ROCProfiler
          - url: ${project:roctracer}
-            title: roctracer
+            title: ROCTracer
          - url: ${project:rocdbgapi}
            title: ROCdbgapi
    - file: reference/management_tools
@@ -217,8 +223,12 @@ subtrees:
        - entries:
            - file: understand/gpu_arch/mi250
              title: MI250
+            - file: understand/gpu_arch/mi200_performance_counters
+              title: MI200 Performance Counters and Metrics
            - file: understand/gpu_arch/mi100
              title: MI100
+    - file: understand/using_gpu_sanitizer
+      title: Using GPU Sanitizer
    - file: understand/More-about-how-ROCm-uses-PCIe-Atomics
 - caption: How to Guides
  entries:
@@ -258,3 +268,8 @@ subtrees:
  entries:
    - file: about
    - file: contributing
+      subtrees:
+        - entries: 
+          - file: contribute/building.md
+          - file: contribute/feedback.md
+    - file: license.md
--- a/docs/sphinx/requirements.in
+++ b/docs/sphinx/requirements.in
@@ -1 +1 @@
-rocm-docs-core==0.21.0
+rocm-docs-core==0.23.0
--- a/docs/sphinx/requirements.txt
+++ b/docs/sphinx/requirements.txt
@@ -1,8 +1,8 @@
 #
-# This file is autogenerated by pip-compile with Python 3.11
+# This file is autogenerated by pip-compile with Python 3.8
 # by the following command:
 #
-#    pip-compile --resolver=backtracking requirements.in
+#    pip-compile requirements.in
 #
 accessible-pygments==0.0.3
    # via pydata-sphinx-theme
@@ -16,7 +16,7 @@ beautifulsoup4==4.11.2
    # via pydata-sphinx-theme
 breathe==4.34.0
    # via rocm-docs-core
-certifi==2022.12.7
+certifi==2023.7.22
    # via requests
 cffi==1.15.1
    # via
@@ -46,12 +46,14 @@ idna==3.4
    # via requests
 imagesize==1.4.1
    # via sphinx
+importlib-metadata==6.8.0
+    # via sphinx
+importlib-resources==6.0.1
+    # via rocm-docs-core
 jinja2==3.1.2
    # via
    #   myst-parser
    #   sphinx
-linkify-it-py==1.0.3
-    # via myst-parser
 markdown-it-py==2.2.0
    # via
    #   mdit-py-plugins
@@ -62,7 +64,7 @@ mdit-py-plugins==0.3.4
    # via myst-parser
 mdurl==0.1.2
    # via markdown-it-py
-myst-parser[linkify]==1.0.0
+myst-parser==1.0.0
    # via rocm-docs-core
 packaging==23.0
    # via
@@ -76,7 +78,7 @@ pydata-sphinx-theme==0.13.3
    #   sphinx-book-theme
 pygithub==1.58.1
    # via rocm-docs-core
-pygments==2.14.0
+pygments==2.15.0
    # via
    #   accessible-pygments
    #   pydata-sphinx-theme
@@ -92,11 +94,11 @@ pyyaml==6.0
    #   myst-parser
    #   rocm-docs-core
    #   sphinx-external-toc
-requests==2.28.1
+requests==2.31.0
    # via
    #   pygithub
    #   sphinx
-rocm-docs-core==0.21.0
+rocm-docs-core==0.23.0
    # via -r requirements.in
 smmap==5.0.0
    # via gitdb
@@ -139,9 +141,11 @@ sphinxcontrib-serializinghtml==1.1.5
    # via sphinx
 typing-extensions==4.5.0
    # via pydata-sphinx-theme
-uc-micro-py==1.0.1
-    # via linkify-it-py
 urllib3==1.26.13
    # via requests
 wrapt==1.14.1
    # via deprecated
+zipp==3.16.2
+    # via
+    #   importlib-metadata
+    #   importlib-resources
--- a/docs/understand/More-about-how-ROCm-uses-PCIe-Atomics.rst
+++ b/docs/understand/More-about-how-ROCm-uses-PCIe-Atomics.rst
@@ -30,9 +30,9 @@ If your system has a PCIe Express Switch it needs to support AtomicsOp routing.

 Atomic Operation is a Non-Posted transaction supporting 32-bit and 64-bit address formats, there must be a response for Completion containing the result of the operation. Errors associated with the operation (uncorrectable error accessing the target location or carrying out the Atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor, they are set to to Completer Abort (CA) or Unsupported Request (UR).

-To understand more about how PCIe Atomic operations work `PCIe Atomics <https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf>`_
+To understand more about how PCIe Atomic operations work, see `PCIe Atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_

-`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/patch/7261731/>`_
+`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_

 There are also a number of papers which talk about these new capabilities:

@@ -50,7 +50,7 @@ Other I/O devices with PCIe Atomics support

 Future bus technology with richer I/O Atomics Operation Support

-  * `GenZ <http://genzconsortium.org/faq/gen-z-technology/#33/>`_
+  * GenZ

 New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPU’s with PCIe Generation 3.0 support.

@@ -65,8 +65,6 @@ In ROCm, we also take advantage of PCIe ID based ordering technology for P2P whe

 They are routed off to different ends of the computer but we want to make sure the write to system memory to indicate transfer complete occurs AFTER P2P write to GPU has complete.

-`Good Paper on Understanding PCIe Generation 3 Throughput <https://www.altera.com/en_US/pdfs/literature/an/an690.pdf>`_
-
 BAR Memory Overview
 *******************
 On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing, if so he need to set MMIO Base address ( MMIOH Base) and Range ( MMIO High Size) in the BIOS.
--- a/docs/understand/gpu_arch/mi100.md
+++ b/docs/understand/gpu_arch/mi100.md
@@ -21,7 +21,7 @@ fabric.

 <img src="../../data/reference/gpu_arch/image.004.png" alt="Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.">

-Structure of a single GCD in the AMD Instinct MI250 accelerator.
+Structure of a single GCD in the AMD Instinct MI100 accelerator.
 :::

 In a typical node configuration, each processor can host up to four AMD
--- a/docs/understand/gpu_arch/mi200_performance_counters.md
+++ b/docs/understand/gpu_arch/mi200_performance_counters.md
@@ -0,0 +1,457 @@
+# MI200 Performance Counters and Metrics
+<!-- markdownlint-disable no-duplicate-header -->
+
+This document lists and describes the hardware performance counters and the derived metrics available on the AMD Instinct™ MI200 GPU. All hardware performance counters and the derived performance metrics are accessible via the AMD ROCm™ Profiler tool.
+
+## MI200 Performance Counters List
+
+:::{note}
+Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
+:::
+
+### Graphics Register Bus Management (GRBM)
+
+#### GRBM Counters
+
+| Hardware Counter   | Unit   | Definition                                                                |
+|--------------------|--------| --------------------------------------------------------------------------|
+| `grbm_count`       | Cycles | Free-running GPU clock                                                    |
+| `grbm_gui_active`  | Cycles | GPU active cycles                                                         |
+| `grbm_cp_busy`     | Cycles | Any of the CP (CPC/CPF) blocks are busy.                                  |
+| `grbm_spi_busy`    | Cycles | Any of the Shader Processor Input (SPI) blocks are busy in the shader engine(s). |
+| `grbm_ta_busy`     | Cycles | Any of the Texture Addressing Unit (TA) blocks are busy in the shader engine(s). |
+| `grbm_tc_busy`     | Cycles | Any of the Texture Cache Blocks (TCP/TCI/TCA/TCC) are busy.               |
+| `grbm_cpc_busy`    | Cycles | The Command Processor - Compute (CPC) is busy.                            |
+| `grbm_cpf_busy`    | Cycles | The Command Processor - Fetcher (CPF) is busy.                            |
+| `grbm_utcl2_busy`  | Cycles | The Unified Translation Cache - Level 2 (UTCL2) block is busy.            |
+| `grbm_ea_busy`     | Cycles | The Efficiency Arbiter (EA) block is busy.                                |
+
+### Command Processor (CP)
+
+The command processor counters are further classified into fetcher and compute.
+
+#### Command Processor - Fetcher (CPF)
+
+##### CPF Counters
+
+| Hardware Counter                     | Unit   | Definition                                                   |
+|--------------------------------------|--------|--------------------------------------------------------------|
+| `cpf_cmp_utcl1_stall_on_translation` | Cycles | One of the Compute UTCL1s is stalled waiting on translation. |
+| `cpf_cpf_stat_idle[∗]`               | Cycles | CPF idle                                                   |
+| `cpf_cpf_stat_stall`                 | Cycles | CPF stall                                                  |
+| `cpf_cpf_tciu_busy`                  | Cycles | CPF TCIU interface busy                                    |
+| `cpf_cpf_tciu_idle`                  | Cycles | CPF TCIU interface idle                                    |
+| `cpf_cpf_tciu_stall[∗]`              | Cycles | CPF TCIU interface is stalled waiting on free tags.        |
+
+#### Command Processor - Compute (CPC)
+
+##### CPC Counters
+
+| Hardware Counter                 | Unit   | Definition                                          |
+| ---------------------------------| -------| --------------------------------------------------- |
+| `cpc_me1_busy_for_packet_decode` | Cycles | CPC ME1 busy decoding packets                       |
+| `cpc_utcl1_stall_on_translation` | Cycles | One of the UTCL1s is stalled waiting on translation |
+| `cpc_cpc_stat_busy`              | Cycles | CPC busy                                            |
+| `cpc_cpc_stat_idle`              | Cycles | CPC idle                                            |
+| `cpc_cpc_stat_stall`             | Cycles | CPC stalled                                         |
+| `cpc_cpc_tciu_busy`              | Cycles | CPC TCIU interface busy                             |
+| `cpc_cpc_tciu_idle`              | Cycles | CPC TCIU interface idle                             |
+| `cpc_cpc_utcl2iu_busy`           | Cycles | CPC UTCL2 interface busy                            |
+| `cpc_cpc_utcl2iu_idle`           | Cycles | CPC UTCL2 interface idle                            |
+| `cpc_cpc_utcl2iu_stall[∗]`       | Cycles | CPC UTCL2 interface stalled waiting                 |
+| `cpc_me1_dci0_spi_busy`          | Cycles | CPC ME1 Processor busy                              |
+
+### Shader Processor Input (SPI)
+
+#### SPI Counters
+
+| Hardware Counter             | Unit        | Definition                                                   |
+| :----------------------------| :-----------| -----------------------------------------------------------: |
+| `spi_csn_busy`                 | Cycles      | Number of clocks with outstanding waves                      |
+| `spi_csn_window_valid`         | Cycles      | Clock count enabled by perfcounter_start event               |
+| `spi_csn_num_threadgroups`     | Workgroups  | Total number of dispatched workgroups                        |
+| `spi_csn_wave`                 | Wavefronts  | Total number of dispatched wavefronts                        |
+| `spi_ra_req_no_alloc`          | Cycles      | Arb cycles with requests but no allocation (need to multiply this value by 4) |
+| `spi_ra_req_no_alloc_csn`      | Cycles      | Arb cycles with CSn req and no CSn alloc (need to multiply this value by 4) |
+| `spi_ra_res_stall_csn`         | Cycles      | Arb cycles with CSn req and no CSn fits (need to multiply this value by 4) |
+| `spi_ra_tmp_stall_csn[∗]`      | Cycles      | Cycles where CSn wants to req but does not fit in temp space |
+| `spi_ra_wave_simd_full_csn`    | SIMD-cycles | Sum of SIMDs where WAVE cannot take a csn wave because it does not fit    |
+| `spi_ra_vgpr_simd_full_csn[∗]` | SIMD-cycles | Sum of SIMDs where VGPR cannot take a csn wave because it does not fit    |
+| `spi_ra_sgpr_simd_full_csn[∗]` | SIMD-cycles | Sum of SIMDs where SGPR cannot take a csn wave because it does not fit    |
+| `spi_ra_lds_cu_full_csn`       | CUs         | Sum of CUs where LDS cannot take a csn wave because it does not fit       |
+| `spi_ra_bar_cu_full_csn[∗]`    | CUs         | Sum of CUs where BARRIER cannot take a csn wave because it does not fit   |
+| `spi_ra_bulky_cu_full_csn[∗]`  | CUs         | Sum of CUs where BULKY cannot take a csn wave because it does not fit     |
+| `spi_ra_tglim_cu_full_csn[∗]`  | Cycles      | Cycles where csn wants to req but all CUs are at tg_limit    |
+| `spi_ra_wvlim_cu_full_csn[∗]`  | Cycles      | Number of clocks csn is stalled due to WAVE LIMIT            |
+| `spi_vwc_csc_wr`               | Cycles      | Number of clocks to write CSC waves to VGPRs (need to multiply this value by 4) |
+| `spi_swc_csc_wr`               | Cycles      | Number of clocks to write CSC waves to SGPRs (need to multiply this value by 4) |
+
+### Compute Unit
+
+The compute unit counters are further classified into instruction mix, MFMA operation counters, level counters, wavefront counters, wavefront cycle counters, local data share counters, and others.
+
+#### Instruction Mix
+
+| Hardware Counter        | Unit   | Definition                                                               |
+| :-----------------------| :-----:| -----------------------------------------------------------------------: |
+| `sq_insts`                | Instr | Number of instructions issued                                             |
+| `sq_insts_valu`           | Instr | Number of VALU instructions issued, including MFMA                        |
+| `sq_insts_valu_add_f16`   | Instr | Number of VALU F16 Add instructions issued                                |
+| `sq_insts_valu_mul_f16`   | Instr | Number of VALU F16 Multiply instructions issued                           |
+| `sq_insts_valu_fma_f16`   | Instr | Number of VALU F16 FMA instructions issued                                |
+| `sq_insts_valu_trans_f16` | Instr | Number of VALU F16 Transcendental instructions issued                     |
+| `sq_insts_valu_add_f32`   | Instr | Number of VALU F32 Add instructions issued                                |
+| `sq_insts_valu_mul_f32`   | Instr | Number of VALU F32 Multiply instructions issued                           |
+| `sq_insts_valu_fma_f32`   | Instr | Number of VALU F32 FMA instructions issued                                |
+| `sq_insts_valu_trans_f32` | Instr | Number of VALU F32 Transcendental instructions issued                     |
+| `sq_insts_valu_add_f64`   | Instr | Number of VALU F64 Add instructions issued                                |
+| `sq_insts_valu_mul_f64`   | Instr | Number of VALU F64 Multiply instructions issued                           |
+| `sq_insts_valu_fma_f64`   | Instr | Number of VALU F64 FMA instructions issued                                |
+| `sq_insts_valu_trans_f64` | Instr | Number of VALU F64 Transcendental instructions issued                     |
+| `sq_insts_valu_int32`     | Instr | Number of VALU 32-bit integer instructions issued (signed or unsigned)    |
+| `sq_insts_valu_int64`     | Instr | Number of VALU 64-bit integer instructions issued (signed or unsigned)    |
+| `sq_insts_valu_cvt`       | Instr | Number of VALU Conversion instructions issued                             |
+| `sq_insts_valu_mfma_i8`   | Instr | Number of 8-bit Integer MFMA instructions issued                          |
+| `sq_insts_valu_mfma_f16`  | Instr | Number of F16 MFMA instructions issued                                    |
+| `sq_insts_valu_mfma_bf16` | Instr | Number of BF16 MFMA instructions issued                                   |
+| `sq_insts_valu_mfma_f32`  | Instr | Number of F32 MFMA instructions issued                                    |
+| `sq_insts_valu_mfma_f64`  | Instr | Number of F64 MFMA instructions issued                                    |
+| `sq_insts_mfma`           | Instr | Number of MFMA instructions issued                                        |
+| `sq_insts_vmem_wr`        | Instr | Number of VMEM Write instructions issued                                  |
+| `sq_insts_vmem_rd`        | Instr | Number of VMEM Read instructions issued                                   |
+| `sq_insts_vmem`           | Instr | Number of VMEM instructions issued, including both FLAT and Buffer instructions |
+| `sq_insts_salu`           | Instr | Number of SALU instructions issued                                        |
+| `sq_insts_smem`           | Instr | Number of SMEM instructions issued                                        |
+| `sq_insts_smem_norm`      | Instr | Number of SMEM instructions issued, normalized to match `smem_level`; used in measuring SMEM latency |
+| `sq_insts_flat`           | Instr | Number of FLAT instructions issued                                        |
+| `sq_insts_flat_lds_only`  | Instr | Number of FLAT instructions issued that read/write only from/to LDS       |
+| `sq_insts_lds`            | Instr | Number of LDS instructions issued                                         |
+| `sq_insts_gds`            | Instr | Number of GDS instructions issued                                         |
+| `sq_insts_exp_gds`        | Instr | Number of EXP and GDS instructions issued, excluding skipped export instructions |
+| `sq_insts_branch`         | Instr | Number of Branch instructions issued                                      |
+| `sq_insts_sendmsg`        | Instr | Number of SENDMSG instructions issued, including `s_endpgm`               |
+| `sq_insts_vskipped[∗]`    | Instr | Number of VSkipped instructions issued                                    |
+
+#### MFMA Operation Counters
+
+| Hardware Counter             | Unit  | Definition                                      |
+| :----------------------------| :-----| ----------------------------------------------: |
+| `sq_insts_valu_mfma_mops_I8`   | IOP   | Number of 8-bit integer MFMA ops in units of 512       |
+| `sq_insts_valu_mfma_mops_F16`  | FLOP  | Number of F16 floating-point MFMA ops in units of 512  |
+| `sq_insts_valu_mfma_mops_BF16` | FLOP  | Number of BF16 floating-point MFMA ops in units of 512 |
+| `sq_insts_valu_mfma_mops_F32`  | FLOP  | Number of F32 floating-point MFMA ops in units of 512  |
+| `sq_insts_valu_mfma_mops_F64`  | FLOP  | Number of F64 floating-point MFMA ops in units of 512  |
+
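Reviewer note: the MFMA operation counters are reported in units of 512, so a raw reading has to be scaled before it can be compared with a FLOP or IOP budget. The sketch below assumes that each counter increment stands for 512 operations, which is one reading of the table wording rather than something the table states outright. The helper names and the sample reading are hypothetical.

```python
# Assumption: one counter increment represents 512 operations,
# following the "units of 512" wording in the MFMA table.
OPS_PER_COUNT = 512


def mfma_ops(counter_value: int) -> int:
    """Convert a raw sq_insts_valu_mfma_mops_* reading into an operation count."""
    return counter_value * OPS_PER_COUNT


def mfma_ops_per_second(counter_value: int, elapsed_seconds: float) -> float:
    """Operations per second over a measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return mfma_ops(counter_value) / elapsed_seconds


# Hypothetical reading: 2,000,000 counts of sq_insts_valu_mfma_mops_F16 in 0.5 s.
print(mfma_ops(2_000_000))                  # 1024000000
print(mfma_ops_per_second(2_000_000, 0.5))  # 2048000000.0
```
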
+#### Level Counters
+
+| Hardware Counter    | Unit  | Definition                             |
+| :-------------------| :-----| -------------------------------------: |
+| `sq_accum_prev`       | Count | Accumulated counter sample value where accumulation takes place once every four cycles |
+| `sq_accum_prev_hires` | Count | Accumulated counter sample value where accumulation takes place once every cycle |
+| `sq_level_waves`      | Waves | Number of inflight waves               |
+| `sq_insts_level_vmem` | Instr | Number of inflight VMEM instructions   |
+| `sq_insts_level_smem` | Instr | Number of inflight SMEM instructions   |
+| `sq_insts_level_lds`  | Instr | Number of inflight LDS instructions    |
+| `sq_ifetch_level`     | Instr | Number of inflight instruction fetches |
+
+#### Wavefront Counters
+
+| Hardware Counter     | Unit  | Definition                                                        |
+| :--------------------| :-----| ----------------------------------------------------------------: |
+| `sq_waves`             | Waves | Number of wavefronts dispatched to SQs, including both new and restored wavefronts |
+| `sq_waves_saved[∗]`    | Waves | Number of context-saved wavefronts                                |
+| `sq_waves_restored[∗]` | Waves | Number of context-restored wavefronts                             |
+| `sq_waves_eq_64`       | Waves | Number of wavefronts with exactly 64 active threads sent to SQs   |
+| `sq_waves_lt_64`       | Waves | Number of wavefronts with fewer than 64 active threads sent to SQs |
+| `sq_waves_lt_48`       | Waves | Number of wavefronts with fewer than 48 active threads sent to SQs |
+| `sq_waves_lt_32`       | Waves | Number of wavefronts with fewer than 32 active threads sent to SQs |
+| `sq_waves_lt_16`       | Waves | Number of wavefronts with fewer than 16 active threads sent to SQs |
+
+#### Wavefront Cycle Counters
+
+| Hardware Counter         | Unit    | Definition                                                            |
+| :------------------------| :-------| --------------------------------------------------------------------: |
+| `sq_cycles`                | Cycles  | Free-running SQ clocks                                                |
+| `sq_busy_cycles`           | Cycles  | Number of cycles while SQ reports it to be busy                       |
+| `sq_busy_cu_cycles`        | Qcycles | Number of quad-cycles each CU is busy                                 |
+| `sq_valu_mfma_busy_cycles` | Cycles  | Number of cycles the MFMA ALU is busy                                 |
+| `sq_wave_cycles`           | Qcycles | Number of quad-cycles spent by waves in the CUs                       |
+| `sq_wait_any`              | Qcycles | Number of quad-cycles spent waiting for anything                      |
+| `sq_wait_inst_any`         | Qcycles | Number of quad-cycles spent waiting for an issued instruction         |
+| `sq_active_inst_any`       | Qcycles | Number of quad-cycles spent by each wave to work on an instruction    |
+| `sq_active_inst_vmem`      | Qcycles | Number of quad-cycles spent by each wave to work on a non-FLAT VMEM instruction |
+| `sq_active_inst_lds`       | Qcycles | Number of quad-cycles spent by each wave to work on an LDS instruction |
+| `sq_active_inst_valu`      | Qcycles | Number of quad-cycles spent by each wave to work on a VALU instruction |
+| `sq_active_inst_sca`       | Qcycles | Number of quad-cycles spent by each wave to work on an SCA instruction |
+| `sq_active_inst_exp_gds`   | Qcycles | Number of quad-cycles spent by each wave to work on an EXP or GDS instruction |
+| `sq_active_inst_misc`      | Qcycles | Number of quad-cycles spent by each wave to work on a MISC instruction, including branch and sendmsg |
+| `sq_active_inst_flat`      | Qcycles | Number of quad-cycles spent by each wave to work on a FLAT instruction |
+| `sq_inst_cycles_vmem_wr`   | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Write instructions, including both FLAT and Buffer |
+| `sq_inst_cycles_vmem_rd`   | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Read instructions, including both FLAT and Buffer |
+| `sq_inst_cycles_smem`      | Qcycles | Number of quad-cycles spent to execute scalar memory reads            |
+| `sq_inst_cycles_salu`      | Cycles  | Number of cycles spent to execute scalar operations other than memory reads |
+| `sq_thread_cycles_valu`    | Cycles  | Number of thread-cycles spent to execute VALU operations              |
+
+#### Local Data Share
+
+| Hardware Counter           | Unit   | Definition                                                |
+| :--------------------------| :------| --------------------------------------------------------: |
+| `sq_lds_atomic_return`       | Cycles | Number of atomic return cycles in LDS                     |
+| `sq_lds_bank_conflict`       | Cycles | Number of cycles LDS is stalled by bank conflicts         |
+| `sq_lds_addr_conflict[∗]`    | Cycles | Number of cycles LDS is stalled by address conflicts      |
+| `sq_lds_unaligned_stalls[∗]` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
+| `sq_lds_mem_violations[∗]`   | Count  | Number of threads that have a memory violation in the LDS |
+
+#### Miscellaneous
+
+##### Miscellaneous Counters
+
+| Hardware Counter | Unit    | Definition                                                |
+| :----------------| :-------| --------------------------------------------------------: |
+| `sq_ifetch`        | Count   | Number of fetch requests from L1I cache, in 32-byte width |
+| `sq_items`         | Threads | Number of valid threads                                   |
+
+### L1I and sL1D Caches
+
+#### L1I and sL1D Cache Counters
+
+| Hardware Counter             | Unit   | Definition                                                        |
+| :----------------------------| :------| ----------------------------------------------------------------: |
+| `sqc_icache_req`               | Req    | Number of L1I cache requests                                      |
+| `sqc_icache_hits`              | Count  | Number of L1I cache lookup-hits                                   |
+| `sqc_icache_misses`            | Count  | Number of L1I cache non-duplicate lookup-misses                   |
+| `sqc_icache_misses_duplicate`  | Count  | Number of L1I cache duplicate lookup-misses whose previous lookup-miss on the same cache line is not fulfilled yet |
+| `sqc_dcache_req`               | Req    | Number of sL1D cache requests                                       |
+| `sqc_dcache_input_valid_readb` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
+| `sqc_dcache_hits`              | Count  | Number of sL1D cache lookup-hits                                  |
+| `sqc_dcache_misses`            | Count  | Number of sL1D non-duplicate lookup-misses                        |
+| `sqc_dcache_misses_duplicate`  | Count  | Number of sL1D duplicate lookup-misses                            |
+| `sqc_dcache_req_read_1`        | Req    | Number of Read requests in a single 32-bit Data Word, DWORD (DW)  |
+| `sqc_dcache_req_read_2`        | Req    | Number of Read requests in 2 DW                                   |
+| `sqc_dcache_req_read_4`        | Req    | Number of Read requests in 4 DW                                   |
+| `sqc_dcache_req_read_8`        | Req    | Number of Read requests in 8 DW                                   |
+| `sqc_dcache_req_read_16`       | Req    | Number of Read requests in 16 DW                                  |
+| `sqc_dcache_atomic[∗]`         | Req    | Number of Atomic requests                                         |
+| `sqc_tc_req`                   | Req    | Number of L2 cache requests that were issued by instruction and constant caches |
+| `sqc_tc_inst_req`              | Req    | Number of instruction cache line requests to L2 cache             |
+| `sqc_tc_data_read_req`         | Req    | Number of data Read requests to the L2 cache                      |
+| `sqc_tc_data_write_req[∗]`     | Req    | Number of data Write requests to the L2 cache                     |
+| `sqc_tc_data_atomic_req[∗]`    | Req    | Number of data Atomic requests to the L2 cache                    |
+| `sqc_tc_stall[∗]`              | Cycles | Number of cycles while the valid requests to L2 Cache are stalled |
+
+### Vector L1 Cache Subsystem
+
+The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data unit, vector L1D cache, and texture cache arbiter.
+
+#### Texture Addressing Unit
+
+##### Texture Addressing Unit Counters
+
+| Hardware Counter                 | Unit   | Definition                                        |
+| :--------------------------------| :------| ------------------------------------------------: |
+| `ta_ta_busy`                       | Cycles | TA busy cycles                                    |
+| `ta_total_wavefronts`              | Instr  | Number of wavefront instructions                  |
+| `ta_buffer_wavefronts`             | Instr  | Number of Buffer wavefront instructions           |
+| `ta_buffer_read_wavefronts`        | Instr  | Number of Buffer Read wavefront instructions      |
+| `ta_buffer_write_wavefronts`       | Instr  | Number of Buffer Write wavefront instructions     |
+| `ta_buffer_atomic_wavefronts[∗]`   | Instr  | Number of Buffer Atomic wavefront instructions    |
+| `ta_buffer_total_cycles`           | Cycles | Number of Buffer cycles, including Read and Write |
+| `ta_buffer_coalesced_read_cycles`  | Cycles | Number of coalesced Buffer read cycles            |
+| `ta_buffer_coalesced_write_cycles` | Cycles | Number of coalesced Buffer write cycles           |
+| `ta_addr_stalled_by_tc`            | Cycles | Number of cycles TA address is stalled by TCP     |
+| `ta_data_stalled_by_tc`            | Cycles | Number of cycles TA data is stalled by TCP        |
+| `ta_addr_stalled_by_td_cycles[∗]`  | Cycles | Number of cycles TA address is stalled by TD      |
+| `ta_flat_wavefronts`               | Instr  | Number of Flat wavefront instructions             |
+| `ta_flat_read_wavefronts`          | Instr  | Number of Flat Read wavefront instructions        |
+| `ta_flat_write_wavefronts`         | Instr  | Number of Flat Write wavefront instructions       |
+| `ta_flat_atomic_wavefronts`        | Instr  | Number of Flat Atomic wavefront instructions      |
+
+#### Texture Data Unit
+
+##### Texture Data Unit Counters
+
+| Hardware Counter         | Unit  | Definition                                           |
+| :------------------------| :-----| ---------------------------------------------------: |
+| `td_td_busy`               | Cycles | TD busy cycles                                      |
+| `td_tc_stall`              | Cycles | Number of cycles TD is stalled by TCP               |
+| `td_spi_stall[∗]`          | Cycles | Number of cycles TD is stalled by SPI               |
+| `td_load_wavefront`        | Instr | Number of wavefront instructions (Read/Write/Atomic) |
+| `td_store_wavefront`       | Instr | Number of Write wavefront instructions               |
+| `td_atomic_wavefront`      | Instr | Number of Atomic wavefront instructions              |
+| `td_coalescable_wavefront` | Instr | Number of coalescable instructions                   |
+
+#### Vector L1D Cache
+
+| Hardware Counter                    | Unit   | Definition                                                  |
+| :-----------------------------------| :------| ----------------------------------------------------------: |
+| `tcp_gate_en1`                        | Cycles | Number of cycles vL1D interface clocks are turned on      |
+| `tcp_gate_en2`                        | Cycles | Number of cycles vL1D core clocks are turned on           |
+| `tcp_td_tcp_stall_cycles`             | Cycles | Number of cycles TD stalls vL1D                           |
+| `tcp_tcr_tcp_stall_cycles`            | Cycles | Number of cycles TCR stalls vL1D                           |
+| `tcp_read_tagconflict_stall_cycles`   | Cycles | Number of cycles tagram conflict stalls on a Read          |
+| `tcp_write_tagconflict_stall_cycles`  | Cycles | Number of cycles tagram conflict stalls on a Write         |
+| `tcp_atomic_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on an Atomic       |
+| `tcp_pending_stall_cycles`            | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 Cache |
+| `tcp_ta_tcp_state_read`               | Req    | Number of wavefront instruction requests to vL1D           |
+| `tcp_volatile[∗]`                     | Req    | Number of L1 volatile pixels/buffers from TA               |
+| `tcp_total_accesses`                  | Req    | Number of vL1D accesses                                    |
+| `tcp_total_read`                      | Req    | Number of vL1D Read accesses                               |
+| `tcp_total_write`                     | Req    | Number of vL1D Write accesses                              |
+| `tcp_total_atomic_with_ret`           | Req    | Number of vL1D Atomic with return                          |
+| `tcp_total_atomic_without_ret`        | Req    | Number of vL1D Atomic without return                       |
+| `tcp_total_writeback_invalidates`     | Count  | Number of vL1D Writebacks and Invalidates                  |
+| `tcp_utcl1_request`                   | Req    | Number of address translation requests to UTCL1            |
+| `tcp_utcl1_translation_hit`           | Req    | Number of UTCL1 translation hits                            |
+| `tcp_utcl1_translation_miss`          | Req    | Number of UTCL1 translation misses                          |
+| `tcp_utcl1_permission_miss`           | Req    | Number of UTCL1 permission misses                           |
+| `tcp_total_cache_accesses`            | Req    | Number of vL1D cache accesses                               |
+| `tcp_tcp_latency`                     | Cycles | Accumulated wave access latency to vL1D over all wavefronts |
+| `tcp_tcc_read_req_latency`            | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for Reads and Atomics with return |
+| `tcp_tcc_write_req_latency`           | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for Writes and Atomics without return |
+| `tcp_tcc_read_req`                    | Req    | Number of Read requests to L2 Cache                        |
+| `tcp_tcc_write_req`                   | Req    | Number of Write requests to L2 Cache                       |
+| `tcp_tcc_atomic_with_ret_req`         | Req    | Number of Atomic requests to L2 Cache with return          |
+| `tcp_tcc_atomic_without_ret_req`      | Req    | Number of Atomic requests to L2 Cache without return       |
+| `tcp_tcc_nc_read_req`                 | Req    | Number of NC Read requests to L2 Cache                     |
+| `tcp_tcc_uc_read_req`                 | Req    | Number of UC Read requests to L2 Cache                     |
+| `tcp_tcc_cc_read_req`                 | Req    | Number of CC Read requests to L2 Cache                     |
+| `tcp_tcc_rw_read_req`                 | Req    | Number of RW Read requests to L2 Cache                     |
+| `tcp_tcc_nc_write_req`                | Req    | Number of NC Write requests to L2 Cache                    |
+| `tcp_tcc_uc_write_req`                | Req    | Number of UC Write requests to L2 Cache                    |
+| `tcp_tcc_cc_write_req`                | Req    | Number of CC Write requests to L2 Cache                    |
+| `tcp_tcc_rw_write_req`                | Req    | Number of RW Write requests to L2 Cache                    |
+| `tcp_tcc_nc_atomic_req`               | Req    | Number of NC Atomic requests to L2 Cache                   |
+| `tcp_tcc_uc_atomic_req`               | Req    | Number of UC Atomic requests to L2 Cache                   |
+| `tcp_tcc_cc_atomic_req`               | Req    | Number of CC Atomic requests to L2 Cache                   |
+| `tcp_tcc_rw_atomic_req`               | Req    | Number of RW Atomic requests to L2 Cache                   |
+
+#### Texture Cache Arbiter (TCA)
+
+| Hardware Counter | Unit   | Definition                                  |
+| :----------------| :------| ------------------------------------------: |
+| `tca_cycle`        | Cycles | TCA cycles                                  |
+| `tca_busy`         | Cycles | Number of cycles TCA has a pending request  |
+
+### L2 Cache Access
+
+#### L2 Cache Access Counters
+
+| Hardware Counter                 | Unit   | Definition                                                     |
+| :--------------------------------| :------| -------------------------------------------------------------: |
+| `tcc_cycle`                        | Cycles | L2 Cache free-running clocks                                  |
+| `tcc_busy`                         | Cycles | L2 Cache busy cycles                                          |
+| `tcc_req`                          |Req     | Number of L2 Cache requests                                   |
+| `tcc_streaming_req[∗]`             |Req     | Number of L2 Cache Streaming requests                         |
+| `tcc_NC_req`                       |Req     | Number of NC requests                                         |
+| `tcc_UC_req`                       |Req     | Number of UC requests                                         |
+| `tcc_CC_req`                       |Req     | Number of CC requests                                         |
+| `tcc_RW_req`                       |Req     | Number of RW requests                                         |
+| `tcc_probe`                        |Req     | Number of L2 Cache probe requests                             |
+| `tcc_probe_all[∗]`                 |Req     | Number of external probe requests with `EA_TCC_preq_all == 1` |
+| `tcc_read_req`                     |Req     | Number of L2 Cache Read requests                              |
+| `tcc_write_req`                    |Req     | Number of L2 Cache Write requests                             |
+| `tcc_atomic_req`                   |Req     | Number of L2 Cache Atomic requests                            |
+| `tcc_hit`                          |Req     | Number of L2 Cache lookup-hits                                |
+| `tcc_miss`                         |Req     | Number of L2 cache lookup-misses                              |
+| `tcc_writeback`                    |Req     | Number of lines written back to main memory, including writebacks of dirty lines and uncached Write/Atomic requests |
+| `tcc_ea_wrreq`                     |Req     | Total number of 32-byte and 64-byte Write requests to EA      |
+| `tcc_ea_wrreq_64B`                 |Req     | Total number of 64-byte Write requests to EA                  |
+| `tcc_ea_wr_uncached_32B`           |Req     | Number of 32-byte Write/Atomic going over the TC_EA_wrreq interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
+| `tcc_ea_wrreq_stall`               | Cycles | Number of cycles a Write request was stalled                  |
+| `tcc_ea_wrreq_io_credit_stall[∗]`  | Cycles | Number of cycles an EA Write request runs out of IO credits   |
+| `tcc_ea_wrreq_gmi_credit_stall[∗]` | Cycles | Number of cycles an EA Write request runs out of GMI credits  |
+| `tcc_ea_wrreq_dram_credit_stall`   | Cycles | Number of cycles an EA Write request runs out of DRAM credits |
+| `tcc_too_many_ea_wrreqs_stall[∗]`  | Cycles | Number of cycles the L2 Cache reaches maximum number of pending EA Write requests |
+| `tcc_ea_wrreq_level`               | Req    | Accumulated number of L2 Cache-EA Write requests in flight    |
+| `tcc_ea_atomic`                    | Req    | Number of 32-byte and 64-byte Atomic requests to EA           |
+| `tcc_ea_atomic_level`              | Req    | Accumulated number of L2 Cache-EA Atomic requests in flight   |
+| `tcc_ea_rdreq`                     | Req    | Total number of 32-byte and 64-byte Read requests to EA       |
+| `tcc_ea_rdreq_32B`                 | Req    | Total number of 32-byte Read requests to EA                   |
+| `tcc_ea_rd_uncached_32B`           | Req    | Number of 32-byte L2 Cache-EA Read due to uncached traffic. A 64-byte request is counted as 2. |
+| `tcc_ea_rdreq_io_credit_stall[∗]`  | Cycles | Number of cycles Read request interface runs out of IO credits  |
+| `tcc_ea_rdreq_gmi_credit_stall[∗]` | Cycles | Number of cycles Read request interface runs out of GMI credits |
+| `tcc_ea_rdreq_dram_credit_stall`   | Cycles | Number of cycles Read request interface runs out of DRAM credits |
+| `tcc_ea_rdreq_level`               | Req    | Accumulated number of L2 Cache-EA Read requests in flight     |
+| `tcc_ea_rdreq_dram`                | Req    | Number of 32-byte and 64-byte Read requests to HBM            |
+| `tcc_ea_wrreq_dram`                | Req    | Number of 32-byte and 64-byte Write requests to HBM           |
+| `tcc_tag_stall`                    | Cycles | Number of cycles the normal request pipeline in the tag was stalled for any reason |
+| `tcc_normal_writeback`             | Req    | Number of L2 cache normal writebacks                          |
+| `tcc_all_tc_op_wb_writeback[∗]`    | Req    | Number of instruction-triggered writeback requests            |
+| `tcc_normal_evict`                 | Req    | Number of L2 cache normal evictions                           |
+| `tcc_all_tc_op_inv_evict[∗]`       | Req    | Number of instruction-triggered eviction requests             |
+
+## MI200 Derived Metrics List
+
+### Derived Metrics on MI200 GPUs
+
+| Derived Metric   | Description                                                                            |
+| :----------------| -------------------------------------------------------------------------------------: |
+| `VFetchInsts`      | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory               |
+| `VWriteInsts`      | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory                 |
+| `FlatVMemInsts`    | The average number of FLAT instructions that read from or write to the video memory executed per work-item (affected by flow control). Includes FLAT instructions that read from or write to scratch |
+| `LDSInsts`         | The average number of LDS read/write instructions executed per work-item (affected by flow control). Excludes FLAT instructions that read from or write to LDS |
+| `FlatLDSInsts`     | The average number of FLAT instructions that read from or write to LDS executed per work-item (affected by flow control) |
+| `VALUUtilization`  | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad) to 100% (ideal, no thread divergence) |
+| `VALUBusy`         | The percentage of GPU time vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal) |
+| `SALUBusy`         | The percentage of GPU time scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal) |
+| `MemWrites32B`     | The total number of effective 32B write transactions to the memory                      |
+| `L2CacheHit`       | The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal) |
+| `MemUnitStalled`   | The percentage of GPU time the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad) |
+| `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% (optimal) to 100% (bad) |
+| `LDSBankConflict`  | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) |
+
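Reviewer note: the derived metrics are computed from the hardware counters listed earlier, but the table does not give the formulas. As an illustration only, the sketch below assumes that `L2CacheHit` can be approximated as the share of L2 lookups that hit, using `tcc_hit` and `tcc_miss`. The exact expression used by the profiler may differ, and the function name is hypothetical.

```python
def l2_cache_hit_percent(tcc_hit: int, tcc_miss: int) -> float:
    """Approximate L2 hit percentage from lookup-hit and lookup-miss counts.

    Assumption: hits / (hits + misses); not confirmed by the table itself.
    """
    lookups = tcc_hit + tcc_miss
    if lookups == 0:
        # No lookups recorded, so report 0% rather than dividing by zero.
        return 0.0
    return 100.0 * tcc_hit / lookups


# Hypothetical counter readings summed over all L2 channels.
print(l2_cache_hit_percent(tcc_hit=75, tcc_miss=25))  # 75.0
```
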
+## Abbreviations
+
+### MI200 Abbreviations
+
+| Abbreviation | Meaning                                                                           |
+| :------------| --------------------------------------------------------------------------------: |
+| `ALU`          | Arithmetic Logic Unit                                                             |
+| `Arb`          | Arbiter                                                                           |
+| `BF16`         | Brain Floating Point – 16 bits                                                    |
+| `CC`           | Coherently Cached                                                                 |
+| `CP`           | Command Processor                                                                 |
+| `CPC`          | Command Processor – Compute                                                       |
+| `CPF`          | Command Processor – Fetcher                                                       |
+| `CS`           | Compute Shader                                                                    |
+| `CSC`          | Compute Shader Controller                                                         |
+| `CSn`          | Compute Shader, the n-th pipe                                                     |
+| `CU`           | Compute Unit                                                                      |
+| `DW`           | 32-bit Data Word, DWORD                                                           |
+| `EA`           | Efficiency Arbiter                                                                |
+| `F16`          | Half Precision Floating Point                                                     |
+| `FLAT`         | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>•   Global Memory<br>•   Scratch (“private”)<br>•   LDS (“shared”)<br>•   Invalid – MEM_VIOL TrapStatus |
+| `FMA`          | Fused Multiply Add                                                                |
+| `GDS`          | Global Data Share                                                                 |
+| `GRBM`         | Graphics Register Bus Manager                                                     |
+| `HBM`          | High Bandwidth Memory                                                             |
+| `Instr`        | Instructions                                                                      |
+| `IOP`          | Integer Operation                                                                 |
+| `L2`           | Level-2 Cache                                                                     |
+| `LDS`          | Local Data Share                                                                  |
+| `ME1`          | Micro Engine, running packet processing firmware on CPC                           |
+| `MFMA`         | Matrix Fused Multiply Add                                                         |
+| `NC`           | Noncoherently Cached                                                              |
+| `RW`           | Coherently Cached with Write                                                      |
+| `SALU`         | Scalar ALU                                                                        |
+| `SGPR`         | Scalar GPR                                                                        |
+| `SIMD`         | Single Instruction Multiple Data                                                  |
+| `sL1D`         | Scalar Level-1 Data Cache                                                         |
+| `SMEM`         | Scalar Memory                                                                     |
+| `SPI`          | Shader Processor Input                                                            |
+| `SQ`           | Sequencer                                                                         |
+| `TA`           | Texture Addressing Unit                                                           |
+| `TC`           | Texture Cache                                                                     |
+| `TCA`          | Texture Cache Arbiter                                                             |
+| `TCC`          | Texture Cache per Channel, known as L2 Cache                                      |
+| `TCIU`         | Texture Cache Interface Unit, Command Processor (CP)’s interface to memory system |
+| `TCP`          | Texture Cache per Pipe, known as vector L1 Cache                                  |
+| `TCR`          | Texture Cache Router                                                              |
+| `TD`           | Texture Data Unit                                                                 |
+| `UC`           | Uncached                                                                          |
+| `UTCL1`        | Unified Translation Cache – Level 1                                               |
+| `UTCL2`        | Unified Translation Cache – Level 2                                               |
+| `VALU`         | Vector ALU                                                                        |
+| `VGPR`         | Vector GPR                                                                        |
+| `vL1D`         | Vector Level-1 Data Cache                                                         |
+| `VMEM`         | Vector Memory                                                                     |
--- a/docs/understand/using_gpu_sanitizer.md
+++ b/docs/understand/using_gpu_sanitizer.md
@@ -0,0 +1,225 @@
+### Using the LLVM Address Sanitizer (ASAN) on the GPU
+
+The LLVM Address Sanitizer provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
+
+Until now, the LLVM Address Sanitizer process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications exactly like pure CPU applications. However, this simplicity has not been achieved yet.
+
+This document describes how to use the ROCm Address Sanitizer.
+For information about LLVM Address Sanitizer, see [the LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
+
+### Compile for Address Sanitizer
+
+The address sanitizer process begins by compiling the application of interest with the address sanitizer instrumentation.
+
+Recommendations for doing this are:
+
+* Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as `amdclang++`.
+* Add the following options to the existing compiler and linker options:
+  + `-fsanitize=address` - enables instrumentation
+  + `-shared-libsan` - uses the shared version of the runtime
+  + `-g` - adds debug info for improved reporting
+* Explicitly use `xnack+` in the offload architecture option, for example `--offload-arch=gfx90a:xnack+`.
+  Other architectures are allowed, but their device code will not be instrumented and a warning will be emitted.
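+
+The recommendations above can be combined into a single build command. The following is a minimal sketch rather than an official recipe: the file names `my_app.hip` and `my_app`, the use of `-x hip`, and the `gfx90a` target are assumptions chosen for illustration.
+
+```bash
+# Hypothetical source and output names, used for illustration only.
+SRC=my_app.hip
+OUT=my_app
+
+# Options recommended above: instrumentation, shared runtime, debug info.
+ASAN_FLAGS="-fsanitize=address -shared-libsan -g"
+
+# Request xnack+ explicitly so that device code is instrumented.
+ARCH_FLAGS="--offload-arch=gfx90a:xnack+"
+
+# Print the command for review, then run it only if compiler and source exist.
+CMD="amdclang++ -x hip $ASAN_FLAGS $ARCH_FLAGS -o $OUT $SRC"
+echo "$CMD"
+if command -v amdclang++ >/dev/null 2>&1 && [ -f "$SRC" ]; then
+  $CMD
+fi
+```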
+
+It is not an error to compile some files without address sanitizer instrumentation, but doing so reduces the ability of the process to detect addressing errors. However, if the main program "`a.out`" does not directly depend on the Address Sanitizer runtime (`libclang_rt.asan-x86_64.so`) after the build completes (check by running `ldd` (List Dynamic Dependencies) or `readelf`), the application will immediately report an error at runtime as described in the next section.
+
+#### About Compilation Time
+
+When `-fsanitize=address` is used, the LLVM compiler adds instrumentation code around every memory operation. This added code must be handled by all of the downstream components of the compiler toolchain and results in increased overall compilation time. This increase is especially evident in the AMDGPU device compiler and has in a few instances raised the compile time to an unacceptable level.
+
+There are a few options if the compile time becomes unacceptable:
+
+* Avoid instrumentation of the files which have the worst compile times. This will reduce the effectiveness of the address sanitizer process.
+* Add the option `-fsanitize-recover=address` to the compiles with the worst compile times. This option simplifies the added instrumentation, resulting in faster compilation. See below for more information.
+* Disable instrumentation on a per-function basis by adding `__attribute__((no_sanitize("address")))` to functions found to be responsible for the large compile time. Again, this will reduce the effectiveness of the process.
+
+### Use AMD Supplied Address Sanitizer Instrumented Libraries
+
+ROCm releases provide optional packages containing address sanitizer instrumented builds of a subset of those ROCm libraries usually found in `/opt/rocm-<version>/lib`. These optional packages are typically named `<library>-asan`. However, the instrumented libraries themselves have the same names as the regular uninstrumented libraries and are located in `/opt/rocm-<version>/lib/asan`. It is expected that the subset of address sanitizer instrumented ROCm libraries will be expanded in future releases. They are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used, but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan` and `-g`.
+
+These additional libraries avoid additional developer effort to locate repositories, identify the correct branch, check out the correct tags, and other efforts needed to build the libraries from the source. And they extend the ability of the process to detect addressing errors into the ROCm libraries themselves.
+
+When adjusting an application build to add instrumentation, linking against these instrumented libraries is unnecessary. For example, any `-L` `/opt/rocm-<version>/lib` compiler options need not be changed. However, the instrumented libraries should be used when the application is run. It is particularly important that the instrumented language runtimes, like `libamdhip64.so` and `librocm-core.so`, are used; otherwise, device invalid access detections may not be reported.
+
+### Running Address Sanitizer Instrumented Applications
+
+#### Preparing to Run an Instrumented Application
+
+Here are a few recommendations to consider before running an address sanitizer instrumented heterogeneous application.
+
+* Ensure the Linux kernel running on the system has Heterogeneous Memory Management (HMM) support. A kernel version of 5.6 or higher should be sufficient.
+* Ensure XNACK is enabled:
+  + For `gfx90a` (MI-2X0) or `gfx940` (MI-3X0), set the environment variable `HSA_XNACK=1`.
+  + For `gfx906` (MI-50) or `gfx908` (MI-100), set the environment variable `HSA_XNACK=1` and also ensure the amdgpu kernel module is loaded with the module argument `noretry=0`.
+    This requirement exists because the XNACK setting for these GPUs is system-wide.
+
+* Ensure that the application will use the instrumented libraries when it runs. The output from the shell command `ldd <application name>` can be used to see which libraries will be used.
+  If the instrumented libraries are not listed by `ldd`, the environment variable `LD_LIBRARY_PATH` may need to be adjusted, or in some cases an `RPATH` compiled into the application may need to be changed and the application recompiled.
+
+* Ensure that the application depends on the address sanitizer runtime. This can be checked by running the command `readelf -d <application name> | grep NEEDED` and verifying that the shared library `libclang_rt.asan-x86_64.so` appears in the output.
+If it does not appear, when executed the application will quickly output an address sanitizer error that looks like:
+
+```bash
+==3210==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
+```
+
+* Ensure that the application `llvm-symbolizer` can be executed, and that it is located in `/opt/rocm-<version>/llvm/bin`. This executable is not strictly required, but if found it is used to translate ("symbolize") a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information).
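+
+The dependency checks in the list above can be scripted. This is a minimal sketch under stated assumptions: `needs_asan_runtime` is a hypothetical helper name, and `./my_app` is a placeholder path. The helper reads `readelf -d` output on standard input, so it can also be tried on saved output.
+
+```bash
+# Succeeds if the dynamic section lists the ASan runtime as a NEEDED library.
+needs_asan_runtime() {
+  grep NEEDED | grep -q 'libclang_rt\.asan'
+}
+
+APP=./my_app   # hypothetical application path
+if [ -x "$APP" ]; then
+  if readelf -d "$APP" | needs_asan_runtime; then
+    echo "ASan runtime is a direct dependency"
+  else
+    echo "ASan runtime missing: relink with -fsanitize=address -shared-libsan"
+  fi
+  # Show which shared libraries will be resolved at run time.
+  ldd "$APP"
+fi
+```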
+
+There is an environment variable, `ASAN_OPTIONS` which can be used to adjust the runtime behavior of the ASAN runtime itself. There are more than a hundred "flags" that can be adjusted (see an old list at [flags](https://github.com/google/sanitizers/wiki/AddressSanitizerFlags)) but the default settings are correct and should be used in most cases. It must be noted that these options only affect the host ASAN runtime. The device runtime only currently supports the default settings for the few relevant options.
+
+There are two `ASAN_OPTIONS` flags of particular note.
+
+* `halt_on_error=0/1 default 1`.
+
+  This tells the ASAN runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option `-fsanitize-recover=address`. Note that the ROCm optional address sanitizer instrumented libraries are not compiled with this option. If an error is detected within one of them while `halt_on_error` is set to 0, more undefined behavior will occur.
+
+* `detect_leaks=0/1 default 1`.
+
+  This option directs the address sanitizer runtime to enable the [Leak Sanitizer](https://clang.llvm.org/docs/LeakSanitizer.html) (LSAN). Unfortunately, for heterogeneous applications, this default will result in significant output from the leak sanitizer when the application exits, due to allocations made by the language runtime which are not considered to be leaks. This output can be avoided by adding `detect_leaks=0` to the `ASAN_OPTIONS`, or alternatively by producing an LSAN suppression file (syntax described [here](https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer)) and activating it with the environment variable `LSAN_OPTIONS=suppressions=/path/to/suppression/file`. When using a suppression file, a suppression report is printed by default. The suppression report can be disabled by using the `LSAN_OPTIONS` flag `print_suppressions=0`.
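+
+The settings discussed in this section can be collected into a small environment setup. This is a sketch, assuming a shell session that will launch the application afterwards. `ROCM_PATH` with a fallback of `/opt/rocm` is an assumption for illustration; substitute the actual `/opt/rocm-<version>` directory.
+
+```bash
+# Enable XNACK, as required for device-side detection.
+export HSA_XNACK=1
+
+# Keep the default halt_on_error=1 and silence leak reports from the language runtime.
+export ASAN_OPTIONS=detect_leaks=0
+
+# Alternative: keep leak detection and use a suppression file instead.
+# export ASAN_OPTIONS=detect_leaks=1
+# export LSAN_OPTIONS=suppressions=/path/to/suppression/file:print_suppressions=0
+
+# Prefer the instrumented ROCm libraries at run time.
+# ROCM_PATH and its /opt/rocm fallback are assumed names, not taken from this document.
+ASAN_LIB_DIR="${ROCM_PATH:-/opt/rocm}/lib/asan"
+export LD_LIBRARY_PATH="$ASAN_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+```
+
+Running `ldd <application name>` afterwards shows whether the libraries under `lib/asan` are the ones that will be loaded.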
+
+### Runtime Overhead
+
+Running an address sanitizer instrumented application incurs
+overheads which may result in unacceptably long runtimes
+or failure to run at all.
+
+#### Higher Execution Time
+
+Address sanitizer detection works by checking each address at runtime
+before the address is actually accessed by a load, store, or atomic
+instruction.
+This checking involves an additional load to "shadow" memory which
+records whether the address is "poisoned" or not, and additional logic
+that decides whether to produce a detection report or not.
+
+This extra runtime work can cause the application to slow down by
+a factor of three or more, depending on how many memory accesses are
+executed.
+For heterogeneous applications, the shadow memory must be accessible by all devices
+and this can mean that shadow accesses from some devices may be more costly
+than non-shadow accesses.
+
+#### Higher Memory Use
+
+The address checking described above relies on the compiler to surround
+each program variable with a red zone and on address sanitizer
+runtime to surround each runtime memory allocation with a red zone and
+fill the shadow corresponding to each red zone with poison.
+The added memory for the red zones is additional overhead on top
+of the 13% overhead for the shadow memory itself.
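+
+The shadow overhead follows from the mapping used by the LLVM Address Sanitizer, in which one shadow byte describes eight application bytes (1/8, or 12.5%, which rounds to the 13% quoted above). The sketch below illustrates the arithmetic only. The offset `0x7fff8000` is the default for host code on x86_64 Linux; whether device code uses the same constants is not stated in this document, so treat them as assumptions. `shadow_addr` is a hypothetical helper name.
+
+```bash
+# One shadow byte covers 8 application bytes (shadow scale of 3).
+SHADOW_SCALE=3
+# Default host shadow offset on x86_64 Linux (assumed here, may differ on device).
+SHADOW_OFFSET=$((0x7fff8000))
+
+# shadow = (addr >> scale) + offset
+shadow_addr() {
+  printf '0x%x\n' $(( ($1 >> SHADOW_SCALE) + SHADOW_OFFSET ))
+}
+
+# Shadow location for an example application address.
+shadow_addr 0x7f0000001000
+
+# Shadow bytes needed to cover a 1 GiB allocation: one eighth of its size.
+echo $(( (1 << 30) >> SHADOW_SCALE ))
+```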
+
+Applications which consume most of one or more available memory pools when
+run normally are likely to encounter allocation failures when run with
+instrumentation.
+
+### Runtime Reporting
+
+It is not the intention of this document to provide a detailed explanation of all of the types of reports that can be output by the address sanitizer runtime. Instead, the focus is on the differences between the standard reports for CPU issues, and reports for GPU issues.
+
+An invalid address detection report for the CPU always starts with
+
+```bash
+==<PID>==ERROR: AddressSanitizer: <problem type> on address <memory address> at pc <pc> bp <bp> sp <sp> <access> of size <N> at <memory address> thread T0
+```
+
+and continues with a stack trace for the access, a stack trace for the allocation and deallocation, if relevant, and a dump of the shadow near the `<memory address>`.
+
+In contrast, an invalid address detection report for the GPU always starts with
+
+```bash
+==<PID>==ERROR: AddressSanitizer: <problem type> on amdgpu device <device> at pc <pc> <access> of size <n> in workgroup id (<X>,<Y>,<Z>)
+```
+
+Above, `<device>` is the integer device ID, and `(<X>, <Y>, <Z>)` is the ID of the workgroup or block where the invalid address was detected.
+
+While the CPU report includes a call stack for the thread attempting the invalid access, the GPU report is currently limited to a call stack of size one, i.e. the (symbolized) address of the invalid access, e.g.
+
+```bash
+#0 <pc> in <function signature> at /path/to/file.hip:<line>:<column>
+```
+
+This short call stack is followed by a GPU unique section that looks like
+
+```bash
+Thread ids and accessed addresses:
+<lid0> <maddr0> : <lid1> <maddr1> : ...
+```
+
+where each `<lid j> <maddr j>` indicates the lane ID and the invalid memory address held by lane `j` of the wavefront attempting the invalid access.
+
+Additionally, reports for invalid GPU accesses to memory allocated by GPU code via `malloc` or `new`, starting with, for example,
+
+```bash
+==1234==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fa9f5c92dcc
+```
+
+or
+
+```bash
+==5678==ERROR: AddressSanitizer: heap-use-after-free on amdgpu device 3 at pc 0x7f4c10062d74
+```
+
+currently may include one or two surprising CPU-side tracebacks mentioning "`hostcall`". This is due to how `malloc` and `free` are implemented for GPU code, and these call stacks can be ignored.
+
+### Running with `rocgdb`
+
+`rocgdb` can be used to further investigate address sanitizer detected errors, with some preparation.
+
+Currently, the address sanitizer runtime complains when starting `rocgdb` without preparation.
+
+```bash
+$ rocgdb my_app
+==1122==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
+```
+
+This is solved by setting environment variable `LD_PRELOAD` to the path to the address sanitizer runtime, whose path can be obtained using the command
+
+```bash
+amdclang++ -print-file-name=libclang_rt.asan-x86_64.so
+```
+
+It is also recommended to set the environment variable `HIP_ENABLE_DEFERRED_LOADING=0` before debugging HIP applications.
+
+After starting `rocgdb`, breakpoints can be set on the address sanitizer runtime error reporting entry points of interest. For example, if an address sanitizer error report includes
+
+```bash
+WRITE of size 4 in workgroup id (10,0,0)
+```
+
+the `rocgdb` command needed to stop the program before the report is printed is
+
+```bash
+(gdb) break __asan_report_store4
+```
+
+Similarly, the appropriate command for a report including
+
+```bash
+READ of size <N> in workgroup id (1,2,3)
+```
+
+is
+
+```bash
+(gdb) break __asan_report_load<N>
+```
+
+It is possible to set breakpoints on all address sanitizer report functions using these commands:
+
+```bash
+$ rocgdb <path to application>
+(gdb) start <command line arguments>
+(gdb) rbreak ^__asan_report
+(gdb) c
+```
+
+### Using Address Sanitizer with a Short HIP Application (LINK NEEDED HERE)
+
+### Known Issues with Using GPU Sanitizer
+
+* Red zones must have limited size and it is possible for an invalid access to completely miss a red zone and not be detected.
+
+* Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows.
+
+* Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation accesses to "private" or "stack" variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as "local data store" or "LDS") are also not instrumented.
+
+* It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime.
--- a/docs/understand/windows-app-deployment-guidelines.md
+++ b/docs/understand/windows-app-deployment-guidelines.md
@@ -0,0 +1,71 @@
+# Application Deployment Guidelines for Windows
+
+ISVs deploying applications using the HIP SDK depend on the AMD GPU Drivers, HIP
+Runtime Library and HIP SDK Libraries. A compatibility matrix table provides
+details on AMD’s support model. AMD GPU Drivers are distributed with a HIP
+Runtime included. Each HIP Runtime is associated with a HIP compiler version.
+Applications built with a particular HIP compiler should document the associated
+HIP Runtime version and AMD GPU Driver as minimum version requirements for their
+end users. Applications do not distribute the HIP Runtime. Instead, end users
+will use the HIP Runtime provided by an AMD GPU Driver. AMD provides backward
+compatibility for applications dynamically linked to the HIP Runtime based on
+our Driver and HIP support policy. ISV applications using the HIP SDK Libraries,
+for example hipBLAS, should distribute the HIP SDK Library as part of their
+installer package. It is recommended not to require end users to install the
+HIP SDK. AMD provides backward compatibility for AMD Driver and HIP Runtime for
+the HIP SDK Libraries based on our support policy. AMD’s support policy for
+Visual Studio and other third-party compilers is documented here.
+
+## Usage Scenario
+
+This guide is intended for Independent Software Vendors (ISVs) and other
+software developers intending to build applications with the HIP SDK for
+Windows. The HIP SDK is intended for developer distribution in contrast to the
+AMD GPU driver which is intended for all end users. The guide discusses how to
+use and distribute components from the HIP SDK. The HIP SDK is the collection of
+the AMD GPU Driver, HIP Runtime and the HIP Libraries. These three parts are
+distributed in the HIP SDK installer. The compatibility and versioning relation
+between these three parts is documented here. AMD’s support policies for the
+developer tools give ISVs the stability to plan the usage of a tool chain.
+
+## Recommended Library Distribution Model
+
+The HIP SDK is distributed via a Windows installer. This distribution system is
+only intended for software developers and testers. AMD recommends that end users
+of programs built against HIP SDK components not be required to install the HIP
+SDK. There are two types of ISV applications that use the HIP SDK, as described
+below.
+
+The first group of ISV applications has a dependency on the HIP Runtime and
+select HIP Header Only Libraries (rocPRIM, hipCUB and rocThrust). This group of
+ISV applications needs to require that end users install an AMD GPU Driver. Each
+AMD GPU driver has a HIP runtime library bundled with it. The ISV application
+should ensure that the HIP runtime library has a minimum version associated with
+it. As the HIP runtime library does not have semantic versioning, the ISV
+application cannot check for compatibility. However, AMD is committed to not
+breaking API/ABI compatibility unless the major version number of the HIP
+runtime is incremented. ISV applications may run without user warning if the HIP
+major version available in the driver is the same as the HIP major version
+associated with the compiler it was built with. The ISV at its discretion may
+throw a warning if the HIP major version is higher than the associated HIP major
+version of the compiler it was built with.
+
+The second group of ISV applications has a dependency on the HIP Runtime and one
+or more Dynamically Linked HIP Libraries including the HIP RT library. ISV
+applications with this dependency need to ensure the end user installs an AMD
+GPU Driver and are recommended to distribute the dynamically linked HIP library
+in their installer packages. This allows end users to avoid installing the HIP
+SDK. One benefit of this model is a smaller disk footprint, as only the required
+binaries are distributed by the ISV application. It also spares the end user
+from having to agree to licensing agreements for the entire HIP SDK.
+The version checks recommended for the ISV application including dynamically
+linked HIP Libraries follow the same requirements as the ISV applications that
+only have the HIP Runtime and header only library. In addition, each dynamically
+linked HIP library also has a minimum HIP runtime requirement. Checks for the
+minimum HIP version for each dynamically linked HIP library may be added at the
+ISV’s discretion. Usually, the minimum HIP version check for the HIP runtime is
+sufficient if dynamically linked HIP libraries come from the same SDK package as
+the HIP compiler.
+
+Please note AMD does not support static linking to any components distributed in
+the HIP SDK.