Link and formatting fixes (#2482)

Lisa
2023-09-20 09:55:21 -06:00
committed by GitHub
parent acde6284a0
commit 940d2933ff
27 changed files with 1057 additions and 528 deletions

.gitignore vendored
View File

@@ -13,7 +13,6 @@ _doxygen/
 _readthedocs/
 # avoid duplicating contributing.md due to conf.py
-CHANGELOG.md
-docs/contributing.md
-docs/release.md
+docs/CHANGELOG.md
+docs/contribute/index.md
 docs/about/release-notes.md

View File

@@ -21,76 +21,78 @@ The release notes for the ROCm platform.
### Release Highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md)
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU in the [LLVM AddressSanitizer User Guide](./docs/conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP-directed catch2 tests with XNACK+
#### HIP 5.7.0
@@ -98,79 +100,78 @@ The following defects are fixed in ROCm v5.7,
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and the rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* HIP API `hipPointerGetAttributes` could return an invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
  * maxTexture1D
  * maxTexture2D
  * maxTexture1DLayered
  * maxTexture2DLayered
  * sharedMemPerMultiprocessor
  * deviceOverlap
  * asyncEngineCount
  * surfaceAlignment
  * unifiedAddressing
  * computePreemptionSupported
  * hostRegisterSupported
  * uuid
* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Correct hipGetLastError to return the last error instead of the last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Remove hiparray* and make it opaque with hipArray_t
### Library changes in ROCm 5.7.0
| Library | Version |
|---------|---------|
@@ -211,11 +212,11 @@ hipCUB 2.13.1 for ROCm 5.7.0
##### Changed
* CUB backend references CUB and Thrust version 2.0.1.
* Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
* Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
##### Known issues
* `debug_synchronous` no longer works on the CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
* `DeviceReduce::Sum` does not compile on the CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
@@ -228,11 +229,11 @@ hipFFT 1.0.12 for ROCm 5.7.0
##### Added
* Implemented the `hipfftXtMakePlanMany`, `hipfftXtGetSizeMany`, and `hipfftXtExec` APIs to allow requesting half-precision transforms.
##### Changed
* Added a `--precision` argument to benchmark/test clients. `--double` is still accepted but is deprecated as a method to request a double-precision transform.
#### hipSOLVER 1.8.1
@@ -240,7 +241,7 @@ hipSOLVER 1.8.1 for ROCm 5.7.0
##### Changed
* Changed hipsolver-test sparse input data search paths to be relative to the test executable
#### hipSPARSE 2.3.8
@@ -248,10 +249,10 @@ hipSPARSE 2.3.8 for ROCm 5.7.0
##### Improved
* Fix compilation failures when using cusparse 12.1.0 backend
* Fix compilation failures when using cusparse 12.0.0 backend
* Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
* Minor improvements
#### MIOpen 2.19.0
@@ -259,17 +260,17 @@ MIOpen 2.19.0 for ROCm 5.7.0
##### Added
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changed
* Tuning results for MLIR on ROCm 5.5
* Bumping MLIR commit to 5.5.0 release tag
##### Fixed
* Fix 3D convolution Host API bug
* [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
#### RCCL 2.17.1-1
@@ -277,19 +278,19 @@ RCCL 2.17.1-1 for ROCm 5.7.0
##### Changed
* Compatibility with NCCL 2.17.1-1
* Performance tuning for some collective operations
##### Added
* Minor improvements to MSCCL codepath
* NCCL_NCHANNELS_PER_PEER support
* Improved compilation performance
* Support for gfx94x
##### Fixed
* Potential race condition during ncclSocketClose()
#### rocALUTION 2.1.11
@@ -297,11 +298,11 @@ rocALUTION 2.1.11 for ROCm 5.7.0
##### Added
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed OpenMP runtime issue with Windows toolchain
#### rocBLAS 3.1.0
@@ -309,33 +310,33 @@ rocBLAS 3.1.0 for ROCm 5.7.0
##### Added
* YAML lock-step argument scanning for rocblas-bench and rocblas-test clients. See the Programmer's Guide for details.
* rocblas-gemm-tune is used to find the best-performing GEMM kernel for each of a given set of GEMM problems.
##### Fixed
* Make offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  * Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  * Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  * Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  * General: set_vector, get_vector, set_matrix, get_matrix
  * Related fixes: internal scalar loads with > 32-bit offsets
* Fix in-place functionality for all trtri sizes
##### Changed
* `dot` when using `rocblas_pointer_mode_host` is now synchronous to match legacy BLAS, as it stores results in host memory
* Enhanced reporting of installation issues caused by runtime libraries (Tensile)
* Standardized internal rocblas C++ interface across most functions
##### Deprecated
* Removal of `__STDC_WANT_IEC_60559_TYPES_EXT__` define in a future release
##### Dependencies
* Optional use of AOCL BLIS 4.0 on Linux for clients
* Optional build-tool-only dependency on Python psutil
#### rocFFT 1.0.24
@@ -343,16 +344,16 @@ rocFFT 1.0.24 for ROCm 5.7.0
##### Optimizations
* Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
##### Added
* Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
##### Changed
* Moved the rocfft_rtc_helper executable to the lib/rocFFT directory on Linux.
* Moved the library kernel cache to the lib/rocFFT directory.
#### rocm-cmake 0.10.0
@@ -360,8 +361,8 @@ rocm-cmake 0.10.0 for ROCm 5.7.0
##### Added
* Added ROCMTest module
* ROCMCreatePackage: Added support for ASAN packages
#### rocPRIM 2.13.1
@@ -369,14 +370,14 @@ rocPRIM 2.13.1 for ROCm 5.7.0
##### Changed
* Deprecated configuration `radix_sort_config` for device-level radix sort, as it no longer matches the algorithm's parameters. The new configuration `radix_sort_config_v2` is preferred instead.
* Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan is now the only available implementation.
* The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
* Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. The new configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
##### Fixed
* Fixed build issue caused by a missing header in `thread/thread_search.hpp`.
#### rocRAND 2.10.17
@@ -384,14 +385,14 @@ rocRAND 2.10.17 for ROCm 5.7.0
##### Added
* MT19937 pseudorandom number generator based on M. Matsumoto and T. Nishimura, 1998, "Mersenne Twister: a 623-dimensionally equidistributed uniform pseudorandom number generator".
* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
* Experimental HIP-CPU feature
* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
##### Changed
* Python 2.7 is no longer officially supported.
#### rocSOLVER 3.23.0
@@ -399,19 +400,19 @@ rocSOLVER 3.23.0 for ROCm 5.7.0
##### Added
* LU factorization without pivoting for block tridiagonal matrices:
  * GEBLTTRF_NPVT now supports interleaved\_batched format
* Linear system solver without pivoting for block tridiagonal matrices:
  * GEBLTTRS_NPVT now supports interleaved\_batched format
##### Fixed
* Fixed stack overflow in sparse tests on Windows
##### Changed
* Changed rocsolver-test sparse input data search paths to be relative to the test executable
* Changed build scripts to default to compressed debug symbols in Debug builds
#### rocSPARSE 2.5.4
@@ -419,12 +420,12 @@ rocSPARSE 2.5.4 for ROCm 5.7.0
##### Added
* Added more mixed precisions for SpMV: (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed a bug in csrsm and bsrsm
##### Known issues
@@ -434,15 +435,15 @@ In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accur
rocThrust 2.18.0 for ROCm 5.7.0
##### Fixed
* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
* Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
##### Changed
* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
* Removed references to and workarounds for deprecated hcc
#### rocWMMA 1.2.0
@@ -450,8 +451,8 @@ rocWMMA 1.2.0 for ROCm 5.7.0
##### Changed
* Fixed a bug with synchronization
* Updated rocWMMA cmake versioning
#### Tensile 4.38.0
@@ -459,29 +460,29 @@ Tensile 4.38.0 for ROCm 5.7.0
##### Added
* Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
* Added user-driven solution selection feature
##### Optimizations
* Enabled LocalSplitU with MFMA for I8 data type
* Optimized K mask code in mfmaIter
* Enabled TailLoop code in NoLoadLoop to prefetch global/local read
* Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
* Optimized DirectToLds test cases to reduce the test duration
##### Changed
* Removed DGEMM NT custom kernels and related test cases
* Changed noTailLoop logic to apply noTailLoop only for NT
* Changed the range of AssertFree0ElementMultiple and Free1
* Unified aStr, bStr generation code in mfmaIter
##### Fixed
* Fixed LocalSplitU mismatch issue for SGEMM
* Fixed BufferStore=0 and Ldc != Ldd case
* Fixed mismatch issue with TailLoop + MatrixInstB > 1
-------------------
@@ -2282,7 +2283,7 @@ This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
- Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.

View File

@@ -21,7 +21,7 @@ The release notes for the ROCm platform.
### What's new in this release
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.

View File

@@ -15,6 +15,477 @@ The release notes for the ROCm platform.
-------------------
## ROCm 5.7.0
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### Release Highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md)
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
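As an illustration, a minimal kernel such as the following sketch exercises the buffered path when compiled with, for example, `hipcc -mprintf-kind=buffered demo.hip` (the file and kernel names here are illustrative, not from the release):
```
// demo.hip -- illustrative sketch of kernel printf under the buffered scheme
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello() {
    // Under -mprintf-kind=buffered, arguments are stored in a fixed-size,
    // non-circular buffer and printed by the runtime after the kernel
    // finishes; the call returns 0 on success or -1 if the buffer is full.
    int rc = printf("hello from thread %d\n", static_cast<int>(threadIdx.x));
    (void)rc;
}

int main() {
    hello<<<dim3(1), dim3(4)>>>();
    hipDeviceSynchronize();  // output appears once the kernel completes
    return 0;
}
```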
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU in the [LLVM AddressSanitizer User Guide](./docs/conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
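For a sense of what the sanitizer catches, the sketch below contains a deliberate out-of-bounds device write; the compile line in the comment is an assumption based on common clang ASan flags and an XNACK-enabled offload target, so verify the exact invocation against the user guide linked above.
```
// asan_demo.hip -- deliberate out-of-bounds write for device ASan to flag.
// Hypothetical compile line (check the user guide for the exact flags):
//   hipcc -g -fsanitize=address -shared-libsan --offload-arch=gfx90a:xnack+ asan_demo.hip
#include <hip/hip_runtime.h>

__global__ void oob(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {      // off-by-one: the thread with i == n writes past the end
        buf[i] = i;
    }
}

int main() {
    int* d = nullptr;
    hipMalloc(&d, 256 * sizeof(int));
    oob<<<1, 257>>>(d, 256);  // thread 256 triggers a heap-buffer-overflow report
    hipDeviceSynchronize();
    hipFree(d);
    return 0;
}
```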
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP-directed catch2 tests with XNACK+
#### HIP 5.7.0
##### Optimizations
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and the rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
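As a rough sketch of how these calls fit together, assuming the signatures mirror their CUDA counterparts (as HIP APIs generally do):
```
// Sketch (Windows-only in 5.7): allocate a small mipmapped array,
// fetch its base level, then free it. Signatures are assumed to mirror
// the CUDA mipmapped-array API; error handling omitted for brevity.
#include <hip/hip_runtime.h>

int main() {
    hipChannelFormatDesc desc = hipCreateChannelDesc<float>();
    hipExtent extent = make_hipExtent(256, 256, 0);  // 2D: depth of 0

    hipMipmappedArray_t mip = nullptr;
    hipMallocMipmappedArray(&mip, &desc, extent, /*numLevels=*/2, 0);

    hipArray_t level0 = nullptr;
    hipGetMipmappedArrayLevel(&level0, mip, /*level=*/0);  // base level

    hipFreeMipmappedArray(mip);
    return 0;
}
```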
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* HIP API `hipPointerGetAttributes` could return an invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
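Until this is resolved, a defensive pattern is to trust the returned attributes only when the call itself succeeds, as in this sketch (note that the `memoryType` field shown here is renamed to `type` in ROCm 6.0, per the list below):
```
// Sketch: only trust hipPointerGetAttributes output when it returns
// hipSuccess; for pointers not allocated through HIP, the attributes
// may be invalid (the known issue above).
#include <hip/hip_runtime.h>

bool is_hip_device_pointer(const void* p) {
    hipPointerAttribute_t attr{};
    if (hipPointerGetAttributes(&attr, p) != hipSuccess) {
        (void)hipGetLastError();  // clear the sticky error state
        return false;             // not a HIP allocation; don't read attr
    }
    return attr.memoryType == hipMemoryTypeDevice;  // 'memoryType' until ROCm 6.0
}
```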
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
  * maxTexture1D
  * maxTexture2D
  * maxTexture1DLayered
  * maxTexture2DLayered
  * sharedMemPerMultiprocessor
  * deviceOverlap
  * asyncEngineCount
  * surfaceAlignment
  * unifiedAddressing
  * computePreemptionSupported
  * hostRegisterSupported
  * uuid
* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Correct hipGetLastError to return the last error instead of the last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Remove hiparray* and make it opaque with hipArray_t
### Library changes in ROCm 5.7.0
| Library | Version |
|---------|---------|
| hipBLAS | 0.54.0 ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.0) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.0) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.0) |
| hipSOLVER | 1.8.0 ⇒ [1.8.1](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.0) |
| hipSPARSE | 2.3.7 ⇒ [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.0) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.0) |
| rccl | 2.15.5 ⇒ [2.17.1-1](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.7.0) |
| rocALUTION | 2.1.9 ⇒ [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.0) |
| rocBLAS | 3.0.0 ⇒ [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.0) |
| rocFFT | 1.0.23 ⇒ [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.0) |
| rocm-cmake | 0.9.0 ⇒ [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.0) |
| rocPRIM | 2.13.0 ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.0) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.0) |
| rocSOLVER | 3.22.0 ⇒ [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.0) |
| rocSPARSE | 2.5.2 ⇒ [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.0) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.0) |
| rocWMMA | [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.0) |
| Tensile | 4.37.0 ⇒ [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.0) |
#### hipBLAS 1.1.0
hipBLAS 1.1.0 for ROCm 5.7.0
##### Changed
* Updated documentation requirements
##### Dependencies
* The rocSOLVER dependency now additionally depends on rocSPARSE
#### hipCUB 2.13.1
hipCUB 2.13.1 for ROCm 5.7.0
##### Changed
* CUB backend references CUB and Thrust version 2.0.1.
* Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
* Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
##### Known issues
* `debug_synchronous` no longer works on the CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
* `DeviceReduce::Sum` does not compile on the CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
* `DeviceHistogram::HistogramEven` fails on the CUDA platform for `[LevelT, SampleIteratorT] = [int, int]`.
* `DeviceHistogram::MultiHistogramEven` fails on the CUDA platform for `[LevelT, SampleIteratorT] = [int, int/unsigned short/float/double]` and `[LevelT, SampleIteratorT] = [float, double]`.
#### hipFFT 1.0.12
hipFFT 1.0.12 for ROCm 5.7.0
##### Added
* Implemented the `hipfftXtMakePlanMany`, `hipfftXtGetSizeMany`, and `hipfftXtExec` APIs to allow requesting half-precision transforms.
##### Changed
* Added a `--precision` argument to benchmark/test clients. `--double` is still accepted but is deprecated as a method to request a double-precision transform.
#### hipSOLVER 1.8.1
hipSOLVER 1.8.1 for ROCm 5.7.0
##### Changed
* Changed hipsolver-test sparse input data search paths to be relative to the test executable
#### hipSPARSE 2.3.8
hipSPARSE 2.3.8 for ROCm 5.7.0
##### Improved
* Fix compilation failures when using cusparse 12.1.0 backend
* Fix compilation failures when using cusparse 12.0.0 backend
* Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
* Minor improvements
#### MIOpen 2.19.0
MIOpen 2.19.0 for ROCm 5.7.0
##### Added
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changed
* Tuning results for MLIR on ROCm 5.5
* Bumping MLIR commit to 5.5.0 release tag
##### Fixed
* Fix 3D convolution Host API bug
* [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
#### RCCL 2.17.1-1
RCCL 2.17.1-1 for ROCm 5.7.0
##### Changed
* Compatibility with NCCL 2.17.1-1
* Performance tuning for some collective operations
##### Added
* Minor improvements to MSCCL codepath
* NCCL_NCHANNELS_PER_PEER support
* Improved compilation performance
* Support for gfx94x
##### Fixed
* Potential race condition during ncclSocketClose()
#### rocALUTION 2.1.11
rocALUTION 2.1.11 for ROCm 5.7.0
##### Added
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed OpenMP runtime issue with Windows toolchain
#### rocBLAS 3.1.0
rocBLAS 3.1.0 for ROCm 5.7.0
##### Added
* YAML lock-step argument scanning for rocblas-bench and rocblas-test clients. See the Programmer's Guide for details.
* rocblas-gemm-tune is used to find the best-performing GEMM kernel for each of a given set of GEMM problems.
##### Fixed
* Make offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  * Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  * Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  * Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  * General: set_vector, get_vector, set_matrix, get_matrix
  * Related fixes: internal scalar loads with > 32-bit offsets
* Fix in-place functionality for all trtri sizes
##### Changed
* `dot` when using `rocblas_pointer_mode_host` is now synchronous to match legacy BLAS, as it stores results in host memory (see the sketch below)
* Enhanced reporting of installation issues caused by runtime libraries (Tensile)
* Standardized internal rocblas C++ interface across most functions
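The `dot` change above means a host-mode result is valid as soon as the call returns, as in this sketch (error checks omitted; `dx` and `dy` are assumed to be populated device arrays):
```
// Sketch: with rocblas_pointer_mode_host, rocblas_sdot now blocks and
// the host result is valid as soon as the call returns (legacy BLAS
// semantics); no explicit stream synchronization is needed.
#include <rocblas/rocblas.h>

float host_mode_dot(const float* dx, const float* dy, int n) {
    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    float result = 0.0f;  // host memory for the result
    rocblas_sdot(handle, n, dx, 1, dy, 1, &result);

    rocblas_destroy_handle(handle);
    return result;  // usable immediately
}
```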
##### Deprecated
* Removal of `__STDC_WANT_IEC_60559_TYPES_EXT__` define in a future release
##### Dependencies
* Optional use of AOCL BLIS 4.0 on Linux for clients
* Optional build-tool-only dependency on Python psutil
#### rocFFT 1.0.24
rocFFT 1.0.24 for ROCm 5.7.0
##### Optimizations
* Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
##### Added
* Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
##### Changed
* Moved the rocfft_rtc_helper executable to the lib/rocFFT directory on Linux.
* Moved the library kernel cache to the lib/rocFFT directory.
#### rocm-cmake 0.10.0
rocm-cmake 0.10.0 for ROCm 5.7.0
##### Added
* Added ROCMTest module
* ROCMCreatePackage: Added support for ASAN packages
#### rocPRIM 2.13.1
rocPRIM 2.13.1 for ROCm 5.7.0
##### Changed
* Deprecated configuration `radix_sort_config` for device-level radix sort, as it no longer matches the algorithm's parameters. The new configuration `radix_sort_config_v2` is preferred instead.
* Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan is now the only available implementation.
* The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
* Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. The new configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
##### Fixed
* Fixed build issue caused by a missing header in `thread/thread_search.hpp`.
#### rocRAND 2.10.17
rocRAND 2.10.17 for ROCm 5.7.0
##### Added
* MT19937 pseudorandom number generator based on M. Matsumoto and T. Nishimura, 1998, "Mersenne Twister: a 623-dimensionally equidistributed uniform pseudorandom number generator".
* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
* Experimental HIP-CPU feature
* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
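A minimal host-API sketch using the new MT19937 generator (the generator constant follows the existing `ROCRAND_RNG_PSEUDO_*` naming convention; verify against the rocRAND headers):
```
// Sketch: fill a device buffer with uniforms from the new MT19937
// generator via the rocRAND host API. Error handling omitted.
#include <rocrand/rocrand.h>
#include <hip/hip_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float* d_out = nullptr;
    hipMalloc(&d_out, n * sizeof(float));

    rocrand_generator gen;
    rocrand_create_generator(&gen, ROCRAND_RNG_PSEUDO_MT19937);
    rocrand_set_seed(gen, 5489ULL);           // MT19937's classic default seed
    rocrand_generate_uniform(gen, d_out, n);  // uniforms in (0, 1]

    rocrand_destroy_generator(gen);
    hipFree(d_out);
    return 0;
}
```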
##### Changed
* Python 2.7 is no longer officially supported.
#### rocSOLVER 3.23.0
rocSOLVER 3.23.0 for ROCm 5.7.0
##### Added
* LU factorization without pivoting for block tridiagonal matrices:
  * GEBLTTRF_NPVT now supports interleaved\_batched format
* Linear system solver without pivoting for block tridiagonal matrices:
  * GEBLTTRS_NPVT now supports interleaved\_batched format
##### Fixed
* Fixed stack overflow in sparse tests on Windows
##### Changed
* Changed rocsolver-test sparse input data search paths to be relative to the test executable
* Changed build scripts to default to compressed debug symbols in Debug builds
#### rocSPARSE 2.5.4
rocSPARSE 2.5.4 for ROCm 5.7.0
##### Added
* Added more mixed precisions for SpMV: (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed a bug in csrsm and bsrsm
##### Known issues
In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues when XNACK is enabled, which are under investigation. The fallback is rocsparse_itilu0_alg_sync_split.
#### rocThrust 2.18.0
rocThrust 2.18.0 for ROCm 5.7.0
##### Fixed
* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
* Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
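The `transform_iterator` fix covers patterns like the following sketch, where the functor's call operator is `__device__`-only:
```
// Sketch: a transform_iterator wrapping a __device__-only functor,
// which previously failed to compile (fixed in 2.18.0).
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

struct square {
    __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> v(100, 2.0f);
    auto first = thrust::make_transform_iterator(v.begin(), square{});
    auto last  = thrust::make_transform_iterator(v.end(), square{});
    float sum = thrust::reduce(first, last, 0.0f);  // 100 * 2^2 = 400
    return sum == 400.0f ? 0 : 1;
}
```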
##### Changed
* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
* Removed references to and workarounds for deprecated hcc
#### rocWMMA 1.2.0
rocWMMA 1.2.0 for ROCm 5.7.0
##### Changed
* Fixed a bug with synchronization
* Updated rocWMMA cmake versioning
#### Tensile 4.38.0
Tensile 4.38.0 for ROCm 5.7.0
##### Added
* Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
* Added user-driven solution selection feature
##### Optimizations
* Enabled LocalSplitU with MFMA for I8 data type
* Optimized K mask code in mfmaIter
* Enabled TailLoop code in NoLoadLoop to prefetch global/local read
* Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
* Optimized DirectToLds test cases to reduce the test duration
##### Changed
* Removed DGEMM NT custom kernels and related test cases
* Changed noTailLoop logic to apply noTailLoop only for NT
* Changed the range of AssertFree0ElementMultiple and Free1
* Unified aStr, bStr generation code in mfmaIter
##### Fixed
* Fixed LocalSplitU mismatch issue for SGEMM
* Fixed BufferStore=0 and Ldc != Ldd case
* Fixed mismatch issue with TailLoop + MatrixInstB > 1
-------------------
## ROCm 5.6.1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
@@ -32,6 +503,36 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
* Using *hipGraphAddMemFreeNode* no longer results in a crash
### Library changes in ROCm 5.6.1
| Library | Version |
|---------|---------|
| hipBLAS | [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.6.1) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.6.1) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.6.1) |
| hipSOLVER | [1.8.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.6.1) |
| hipSPARSE | 2.3.6 ⇒ [2.3.7](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.6.1) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.6.1) |
| rccl | [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.6.1) |
| rocALUTION | [2.1.9](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.6.1) |
| rocBLAS | [3.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.6.1) |
| rocFFT | [1.0.23](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.6.1) |
| rocm-cmake | [0.9.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.6.1) |
| rocPRIM | [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.6.1) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.6.1) |
| rocSOLVER | [3.22.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.6.1) |
| rocSPARSE | [2.5.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.6.1) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.6.1) |
| Tensile | [4.37.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.6.1) |
#### hipSPARSE 2.3.7
hipSPARSE 2.3.7 for ROCm 5.6.1
##### Bugfix
* Reverted an undocumented API change in hipSPARSE 2.3.6 that affected the `hipsparseSpSV_solve` function
-------------------
## ROCm 5.6.0
@@ -531,7 +1032,7 @@ rocWMMA 1.1.0 for ROCm 5.6.0
* Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
* Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
* Added performance gemm samples for half, single, and double precision
* Added rocWMMA cmake versioning
* Added vectorized support in coordinate transforms
* Included ROCm smi for runtime clock rate detection
@@ -1782,7 +2283,7 @@ This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
- Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.

View File

@@ -15,15 +15,15 @@ The release notes for the ROCm platform.
------------------- -------------------
## ROCm 5.6.1 ## ROCm 5.7.0
<!-- markdownlint-disable first-line-h1 --> <!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header --> <!-- markdownlint-disable no-duplicate-header -->
### What's new in this release ### What's new in this release
ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime. ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
## HIP 5.6.1 (for ROCm 5.6.1) Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
### Fixed defects ### Fixed defects

View File

@@ -3,7 +3,7 @@ How ROCm uses PCIe atomics
=========================== ===========================
ROCm PCIe feature and overview base address register (BAR) memory ROCm PCIe feature and overview of BAR memory
====================================================================== ======================================================================
@@ -57,7 +57,7 @@ New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_ * `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_ * `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
In ROCm, we also take advantage of PCIe ID-based ordering technology for P2P when the GPU originates two writes to two different targets: In ROCm, we also take advantage of PCIe ID-based ordering technology for P2P when the GPU originates two writes to two different targets:
| 1. write to another GPU memory, | 1. write to another GPU memory,
@@ -86,7 +86,7 @@ For GFX9 and Vega10 which have Physical Address up 44 bit and 48 bit Virtual add
* BAR4 register: Optional, not a boot device. * BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB. * BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
Here is how our BAR works on GFX 8 GPUs with 40 bit Physical Address Limit :: Here is how our base address register (BAR) works on GFX8 GPUs with a 40-bit physical address limit ::
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1) 11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)
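To inspect how these BARs were actually assigned on a live system, something like the following works (the bus ID ``11:00.0`` comes from the listing above; substitute the one reported for your GPU)::

    # Show the PCI memory regions (BARs) assigned to the GPU at bus ID 11:00.0
    sudo lspci -v -s 11:00.0 | grep -i "memory at"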

View File

@@ -467,11 +467,11 @@ Plotting the train and test loss shows both metrics reducing over training epoch
### Custom model with CIFAR-10 on PyTorch ### Custom model with CIFAR-10 on PyTorch
The CIFAR-10 (Canadian Institute for Advanced Research) dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below. The Canadian Institute for Advanced Research (CIFAR)-10 dataset is a subset of the Tiny Images dataset (which contains 80 million 32x32 images collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck (pickup trucks are excluded). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework, step by step, as illustrated below.
Follow these steps: Follow these steps:
1. Import dependencies, including torch, os, and [torchvision](https://github.com/pytorch/vision). 1. Import dependencies, including `torch`, `os`, and [`torchvision`](https://github.com/pytorch/vision).
```py ```py
import torch import torch
@@ -662,13 +662,13 @@ Follow these steps:
print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy)) print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy))
``` ```
### Case study: TensorFlow with Fashion MNIST ### Case study: TensorFlow with Fashion-MNIST
Fashion MNIST is a dataset that contains 70,000 grayscale images in 10 categories. Fashion-MNIST is a dataset that contains 70,000 grayscale images in 10 categories.
Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts. Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts.
The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion MNIST dataset can be accessed via TensorFlow internal libraries. The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion-MNIST dataset can be accessed via TensorFlow internal libraries.
Access the source code from the following repository: Access the source code from the following repository:
@@ -690,7 +690,7 @@ To understand the code step by step, follow these steps:
print(tf.__version__) print(tf.__version__)
``` ```
3. Load the dataset from the available internal libraries to analyze and train a neural network upon the MNIST Fashion Dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn. 3. Load the dataset from the available internal libraries to analyze and train a neural network upon the Fashion-MNIST dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn.
4. The model is tested against the test set, test_images, and test_labels arrays. 4. The model is tested against the test set, test_images, and test_labels arrays.
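As a quick, standalone check of the dataset-loading step above, the following one-liner is a sketch that assumes TensorFlow is already installed; it simply prints the shapes of two of the four returned NumPy arrays:

```bash
# loads Fashion-MNIST and prints (60000, 28, 28) (10000, 28, 28)
python3 -c 'import tensorflow as tf; (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data(); print(train_images.shape, test_images.shape)'
```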

View File

@@ -1,4 +1,4 @@
# ROCm Linux Filesystem Hierarchy Standard (FHS) reorganization # ROCm Linux Filesystem Hierarchy Standard reorganization
## Introduction ## Introduction

View File

@@ -9,7 +9,7 @@ This document lists and describes the hardware performance counters and the deri
Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation. Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
``` ```
### Graphics register bus management (GRBM) ### GRBM
#### GRBM counters #### GRBM counters
@@ -30,7 +30,7 @@ Preliminary validation of all MI200 performance counters is in progress. Those w
The command processor counters are further classified into fetcher and compute. The command processor counters are further classified into fetcher and compute.
#### Command processor - fetcher (CPF) #### CPF
##### CPF counters ##### CPF counters
@@ -43,7 +43,7 @@ The command processor counters are further classified into fetcher and compute.
| `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle | | `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle |
| `cpf_cpf_tciu_stall[]` | Cycles | CPF TCIU interface is stalled waiting on free tags. | | `cpf_cpf_tciu_stall[]` | Cycles | CPF TCIU interface is stalled waiting on free tags. |
#### Command processor - compute (CPC) #### CPC
##### CPC counters ##### CPC counters
@@ -61,7 +61,7 @@ The command processor counters are further classified into fetcher and compute.
| `cpc_cpc_utcl2iu_stall[]` | Cycles | CPC UTCL2 interface stalled waiting | | `cpc_cpc_utcl2iu_stall[]` | Cycles | CPC UTCL2 interface stalled waiting |
| `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy | | `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy |
### Shader processor input (SPI) ### SPI
#### SPI counters #### SPI counters
@@ -325,7 +325,7 @@ The vector L1 cache subsystem counters are further classified into texture addre
| `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache | | `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache |
| `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache | | `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache |
#### Texture cache arbiter (TCA) #### TCA
| Hardware Counter | Unit | Definition | | Hardware Counter | Unit | Definition |
| :----------------| :------| ------------------------------------------: | | :----------------| :------| ------------------------------------------: |
@@ -400,9 +400,7 @@ The vector L1 cache subsystem counters are further classified into texture addre
| `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) | | `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) |
| `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) | | `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) |
## Abbreviations ## MI200 acronyms
### MI200 abbreviations
| Abbreviation | Meaning | | Abbreviation | Meaning |
| :------------| --------------------------------------------------------------------------------: | | :------------| --------------------------------------------------------------------------------: |

View File

@@ -1,4 +1,4 @@
# GPU Memory # GPU memory
For the HIP reference documentation, see: For the HIP reference documentation, see:

View File

@@ -44,9 +44,9 @@ product lines.
::: :::
:::{grid-item-card} GPU Memory :::{grid-item-card}
:link: gpu_memory **[GPU memory](./gpu-memory.md)**
:link-type: doc
Learn about the different types of memory allocations. Learn about the different types of memory allocations.
::: :::

View File

@@ -1,4 +1,4 @@
# Using the LLVM AddressSanitizer (ASan) on a GPU (beta release) # Using the LLVM ASan on a GPU (beta release)
The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement. The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
@@ -7,7 +7,7 @@ Until now, the LLVM ASan process was only available for traditional purely CPU a
This document provides documentation on using ROCm ASan. This document provides documentation on using ROCm ASan.
For information about LLVM ASan, see [the LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html). For information about LLVM ASan, see [the LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
**Note**: The beta release of LLVM Address Sanitizer for ROCm is currently tested and validated on Ubuntu 20.04. **Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
## Compiling for ASan ## Compiling for ASan
@@ -51,11 +51,11 @@ For a complete ROCm GPU Sanitizer installation, the following must be installed
sudo apt-get install hipfft-asan hipsparse-asan migraphx-asan miopen-hip-asan rocalution-asan rocblas-asan rocfft-asan rocm-core-asan rocsparse-asan hipblaslt-asan mivisionx-asan rocsolver-asan sudo apt-get install hipfft-asan hipsparse-asan migraphx-asan miopen-hip-asan rocalution-asan rocblas-asan rocfft-asan rocm-core-asan rocsparse-asan hipblaslt-asan mivisionx-asan rocsolver-asan
``` ```
**Note**: It is recommended to install all address sanitizer packages. If the optional instrumented math libraries are not installed, the address sanitizer cannot find issues within those libraries. **Note**: It is recommended to install all ASan packages. If the optional instrumented math libraries are not installed, the address sanitizer cannot find issues within those libraries.
## Using AMD-supplied ASan instrumented libraries ## Using AMD-supplied ASan instrumented libraries
ROCm releases have optional packages containing additional address sanitizer instrumented builds of the ROCm libraries usually found in `/opt/rocm-<version>/lib`. The instrumented libraries have names identical to the regular uninstrumented libraries and are located in `/opt/rocm-<version>/lib/asan`. ROCm releases have optional packages containing additional ASan instrumented builds of the ROCm libraries usually found in `/opt/rocm-<version>/lib`. The instrumented libraries have names identical to the regular uninstrumented libraries and are located in `/opt/rocm-<version>/lib/asan`.
These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used, but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan`, and `-g`. These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used, but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan`, and `-g`.
These additional libraries save developers the effort of locating repositories, identifying the correct branch, checking out the correct tags, and otherwise building the libraries from source. They also extend the process's ability to detect addressing errors into the ROCm libraries themselves. These additional libraries save developers the effort of locating repositories, identifying the correct branch, checking out the correct tags, and otherwise building the libraries from source. They also extend the process's ability to detect addressing errors into the ROCm libraries themselves.
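As a rough sketch of how the compile flags and the instrumented libraries fit together (the source file name is illustrative, and pointing the loader at the `asan` directory is an assumption about a typical setup):

```bash
# Instrument host and device code with the options listed above
hipcc -fsanitize=address -shared-libsan -g -o mem_bug mem_bug.hip

# Run against the ASan-instrumented ROCm libraries
LD_LIBRARY_PATH=/opt/rocm-<version>/lib/asan ./mem_bug
```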
@@ -235,7 +235,7 @@ $ rocgdb <path to application>
### Using ASan with a short HIP application ### Using ASan with a short HIP application
Refer to the following example to use address sanitizer with a short HIP application, Refer to the following example to use ASan with a short HIP application:
https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md

View File

@@ -41,10 +41,10 @@ python3 -mvenv .venv
Then open up `_build/html/index.html` in your favorite browser. Then open up `_build/html/index.html` in your favorite browser.
## Build documentation using Visual Studio (VS) Code ## Build documentation using Visual Studio Code
One can put together a productive environment to author documentation and also One can put together a productive environment to author documentation and also
test it locally using VS Code with only a handful of extensions. Even though the test it locally using Visual Studio (VS) Code with only a handful of extensions. Even though the
extension landscape of VS Code is ever-changing, here is one example setup that extension landscape of VS Code is ever-changing, here is one example setup that
proved useful at the time of writing. In it, one can change/add content, build a proved useful at the time of writing. In it, one can change/add content, build a
new version of the docs using a single VS Code Task (or hotkey), see all errors/ new version of the docs using a single VS Code Task (or hotkey), see all errors/

View File

@@ -17,8 +17,7 @@ memory with RDMA capabilities. These interfaces are currently registered as a
These interfaces are used to optimize inter-node MPI message communication. These interfaces are used to optimize inter-node MPI message communication.
This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open
MPI project is an open source implementation of the Message Passing Interface MPI project is an open-source implementation of MPI that is developed and maintained by a consortium of academic, research,
(MPI) that is developed and maintained by a consortium of academic, research,
and industry partners. and industry partners.
Several MPI implementations can be made ROCm-aware by compiling them with Several MPI implementations can be made ROCm-aware by compiling them with
@@ -72,7 +71,7 @@ make -j $(nproc)
make -j $(nproc) install make -j $(nproc) install
``` ```
The [communication libraries tables](#communication-libraries) The [communication libraries tables](../reference/libraries/gpu-libraries/communication.md)
document the compatibility of UCX versions with ROCm versions. document the compatibility of UCX versions with ROCm versions.
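For reference, the UCX build above is typically configured with ROCm support along these lines (the install prefix is illustrative, and `--with-rocm` assumes a default ROCm install location):

```bash
./configure --prefix=$HOME/ucx-rocm --with-rocm=/opt/rocm
```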
## Install Open MPI ## Install Open MPI
@@ -157,7 +156,7 @@ Unified Collective Communication Library (UCC) component in Open MPI.
For this, the UCC library has to be configured and compiled with ROCm For this, the UCC library has to be configured and compiled with ROCm
support. support.
Please note the compatibility [tables](#communication-libraries) Please note the compatibility tables in the [communication libraries](../reference/libraries/gpu-libraries/communication.md)
for UCC versions with the various ROCm versions. for UCC versions with the various ROCm versions.
An example of configuring UCC and Open MPI with ROCm support An example of configuring UCC and Open MPI with ROCm support

View File

@@ -405,7 +405,7 @@ SIMD pipelines, memory information, and instruction set architecture:
![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system") ![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system")
For a complete list of architecture (LLVM target) names, refer to For a complete list of architecture (LLVM target) names, refer to
[Linux Support](../../about/compatibility/linux-support.md) and [Windows Support](../../about/compatibility/windows-support.md). [Linux support](../../about/compatibility/linux-support.md) and [Windows support](../../about/compatibility/windows-support.md).
### Testing inter-device bandwidth ### Testing inter-device bandwidth

View File

@@ -1,4 +1,4 @@
# AI libraries # Artificial intelligence libraries
::::{grid} 1 1 2 2 ::::{grid} 1 1 2 2
:gutter: 1 :gutter: 1

View File

@@ -9,7 +9,6 @@ ROCm libraries for fast Fourier transforms (FFTs) are as follows:
rocFFT is an AMD GPU-optimized library for FFT. rocFFT is an AMD GPU-optimized library for FFT.
* {doc}`Documentation <rocfft:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT) * [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md) * [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md)
@@ -21,7 +20,6 @@ hipFFT is a compatibility layer for GPU accelerated FFT optimized for AMD GPUs
using rocFFT. hipFFT allows for a common interface for other non-AMD GPU using rocFFT. hipFFT allows for a common interface for other non-AMD GPU
FFT libraries. FFT libraries.
* {doc}`Documentation <hipfft:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT) * [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md) * [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md)

View File

@@ -436,7 +436,7 @@ See the complete sample code for global buffer overflow
You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options: You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options:
* `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including No-loop and Cross-team reductions. * `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels, including no-loop and cross-team reductions.
* `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution. * `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution.
@@ -448,13 +448,13 @@ You can use the clang compiler option `-fopenmp-target-fast` for kernel optimiza
Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported: Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:
* No-Loop * No-loop
* Big-Jump-Loop * Big-jump-loop
* Cross-Team (Xteam) Reductions * Cross-team reductions
To enable the generation of specialized kernels, follow these guidelines: To enable the generation of specialized kernels, follow these guidelines:
* Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the No-Loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the Big-Jump-Loop kernel instead of the No-Loop kernel. * Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the no-loop kernel. If the `num_teams` clause is a user requirement, Clang tries to generate the big-jump-loop kernel instead of the no-loop kernel.
* Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`. * Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`.
@@ -464,11 +464,11 @@ To enable the generation of specialized kernels, follow these guidelines:
#### No-loop kernel generation #### No-loop kernel generation
The No-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance. The no-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as `target teams distribute parallel for`. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
#### Big-jump-loop kernel generation #### Big-jump-loop kernel generation
A No-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the Big-Jump-Loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause. A no-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the big-jump-loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
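As a sketch of how these specialized kernels are requested in practice, assuming the ROCm-provided `clang` and an illustrative source file and target architecture:

```bash
# saxpy.c contains a '#pragma omp target teams distribute parallel for' loop;
# -fopenmp-target-fast lets the compiler emit the no-loop or big-jump-loop kernel
clang -O3 -fopenmp --offload-arch=gfx90a -fopenmp-target-fast saxpy.c -o saxpy
```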
#### Cross-team optimized reduction kernel generation #### Cross-team optimized reduction kernel generation

View File

@@ -656,7 +656,7 @@ of target triple and the target GPU (along with the associated target features).
modified to query this structure to identify a compatible image based on the modified to query this structure to identify a compatible image based on the
capability of the current system. capability of the current system.
#### Unified shared memory (USM) #### Unified shared memory
The following OpenMP pragma is available on MI200, and it must be executed with The following OpenMP pragma is available on MI200, and it must be executed with
`xnack+` support. `xnack+` support.
@@ -665,7 +665,7 @@ The following OpenMP pragma is available on MI200, and it must be executed with
omp requires unified_shared_memory omp requires unified_shared_memory
``` ```
For more details on USM refer to the {ref}`openmp_usm` section of the OpenMP For more details on unified shared memory, refer to the {ref}`openmp_usm` section of the OpenMP
Guide. Guide.
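A minimal sketch of running such an application follows, assuming `HSA_XNACK=1` is the runtime switch that enables `xnack+` on supported GPUs (the binary name is illustrative):

```bash
# enable xnack+ so that 'omp requires unified_shared_memory' can be honored
HSA_XNACK=1 ./my_usm_app
```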
### Support status of other Clang options ### Support status of other Clang options

View File

@@ -119,7 +119,7 @@ subtrees:
- file: tutorials/install/docker.md - file: tutorials/install/docker.md
title: ROCm Docker containers title: ROCm Docker containers
- file: tutorials/install/pytorch-install.md - file: tutorials/install/pytorch-install.md
title: Pytorch for ROCm title: PyTorch for ROCm
- file: tutorials/install/tensorflow-install.md - file: tutorials/install/tensorflow-install.md
title: TensorFlow for ROCm title: TensorFlow for ROCm
- file: tutorials/install/magma-install.md - file: tutorials/install/magma-install.md
@@ -199,6 +199,8 @@ subtrees:
title: MI200 title: MI200
- file: conceptual/gpu-arch/mi250.md - file: conceptual/gpu-arch/mi250.md
title: MI250 title: MI250
- file: conceptual/gpu-memory.md
title: GPU memory
- file: conceptual/compiler-disambiguation.md - file: conceptual/compiler-disambiguation.md
title: Compiler disambiguation title: Compiler disambiguation
- file: conceptual/file-reorg.md - file: conceptual/file-reorg.md
@@ -214,7 +216,7 @@ subtrees:
- file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst - file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
title: ROCm & PCIe atomics title: ROCm & PCIe atomics
- file: conceptual/ai-pytorch-inception.md - file: conceptual/ai-pytorch-inception.md
title: Inception v3 with Pytorch title: Inception v3 with PyTorch
- file: conceptual/ai-migraphx-optimization.md - file: conceptual/ai-migraphx-optimization.md
title: Inference optimization with MIGraphX title: Inference optimization with MIGraphX

View File

@@ -43,8 +43,7 @@ described in the ROCm Installation Guide at {ref}`linux_group_permissions`.
**Q: Can I install PyTorch directly on bare metal?** **Q: Can I install PyTorch directly on bare metal?**
Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to
Option 2: Install PyTorch Using Wheels Package in the section the wheels package option described in [Installing PyTorch](../tutorials/install/pytorch-install.md) for more information.
{ref}`install-pytorch-using-wheels` of this guide for more information.
**Q: How do I profile PyTorch workloads?** **Q: How do I profile PyTorch workloads?**

View File

@@ -1,370 +1,399 @@
# PyTorch Installation for ROCm # Installing PyTorch for ROCm
## PyTorch [PyTorch](https://pytorch.org/) is an open-source tensor library designed for deep learning. PyTorch on
ROCm provides mixed-precision and large-scale training using our
[MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) and
[RCCL](https://github.com/ROCmSoftwarePlatform/rccl) libraries.
PyTorch is an open-source machine learning Python library, primarily To install [PyTorch for ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), you have the following options:
differentiated by Tensor computing with GPU acceleration and a type-based
automatic differentiation. Other advanced features include:
* Support for distributed training * [Use a Docker image with PyTorch pre-installed](#using-a-docker-image-with-pytorch-pre-installed)
* Native ONNX support (recommended)
* C++ front-end * [Use a wheels package](#using-a-wheels-package)
* The ability to deploy at scale using TorchServe * [Use the PyTorch ROCm base Docker image](#using-the-pytorch-rocm-base-docker-image)
* A production-ready deployment mechanism through TorchScript * [Use the PyTorch upstream Docker file](#using-the-pytorch-upstream-docker-file)
### Installing PyTorch For hardware, software, and third-party framework compatibility between ROCm and PyTorch, refer to:
To install ROCm on bare metal, refer to the sections * [GPU and OS support (Linux)](../../about/compatibility/linux-support.md)
[GPU and OS Support (Linux)](../../about/compatibility/linux-support.md) and * [Compatibility](../../about/compatibility/3rd-party-support-matrix.md)
[Compatibility](../../about/compatibility/index.md) for hardware, software and
3rd-party framework compatibility between ROCm and PyTorch. The recommended
option to get a PyTorch environment is through Docker. However, installing the
PyTorch wheels package on bare metal is also supported.
#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-Installed ## Using a Docker image with PyTorch pre-installed
Using Docker gives you portability and access to a prebuilt Docker container 1. Download the latest public PyTorch Docker image
that has been rigorously tested within AMD. This might also save on the ([https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch)).
compilation time and should perform as it did when tested without facing
potential installation issues.
Follow these steps:
1. Pull the latest public PyTorch Docker image.
```bash ```bash
docker pull rocm/pytorch:latest docker pull rocm/pytorch:latest
``` ```
Optionally, you may download a specific and supported configuration with You can also download a specific and supported configuration with different user-space ROCm
different user-space ROCm versions, PyTorch versions, and supported operating versions, PyTorch versions, and operating systems.
systems. To download the PyTorch Docker image, refer to
[https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch).
2. Start a Docker container using the downloaded image. 2. Start a Docker container using the image.
```bash ```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size 8G rocm/pytorch:latest
``` ```
```{note} :::{note}
This will automatically download the image if it does not exist on the host. This will automatically download the image if it does not exist on the host. You can also pass the `-v`
You can also pass the -v argument to mount any data directories from the host argument to mount any data directories from the host onto the container.
onto the container. :::
(install_pytorch_wheels)=
## Using a wheels package
PyTorch supports the ROCm platform by providing tested wheels packages. To access this feature, go
to [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/). In the interactive
table, choose ROCm from the _Compute Platform_ row.
1. Choose one of the following three options:
**Option 1:**
a. Download a base Docker image with the correct user-space ROCm version.
| Base OS | Docker image | Link to Docker image |
|----------------|-----------------------------|----------------|
| Ubuntu 20.04 | `rocm/dev-ubuntu-20.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-20.04](https://hub.docker.com/r/rocm/dev-ubuntu-20.04) |
| Ubuntu 22.04 | `rocm/dev-ubuntu-22.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-22.04](https://hub.docker.com/r/rocm/dev-ubuntu-22.04) |
| CentOS 7 | `rocm/dev-centos-7` | [https://hub.docker.com/r/rocm/dev-centos-7](https://hub.docker.com/r/rocm/dev-centos-7) |
b. Pull the selected image.
```bash
docker pull rocm/dev-ubuntu-20.04:latest
``` ```
(install-pytorch-using-wheels)= c. Start a Docker container using the downloaded image.
#### Option 2: Install PyTorch Using Wheels Package ```bash
PyTorch supports the ROCm platform by providing tested wheels packages. To
access this feature, refer to
[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
and choose the "ROCm" compute platform. The following image is a matrix from <https://pytorch.org/> that illustrates the installation compatibility between ROCm and the PyTorch build.
![Pytorch installation matrix](../../data/tutorials/install/magma-install/magma006.png "Pytorch installation matrix")
To install PyTorch using the wheels package, follow these installation steps:
1. Choose one of the following options:
a. Obtain a base Docker image with the correct user-space ROCm version
installed from
[https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04).
or
b. Download a base OS Docker image and install ROCm following the
installation directions in the [Installation](../../tutorials/install/linux/index) section. ROCm 5.2 is installed in
this example, as supported by the installation matrix from
<https://pytorch.org/>.
or
c. Install on bare metal. Skip to Step 3.
```bash
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
```
2. Start the Docker container, if not installing on bare metal.
```dockerfile
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
``` ```
3. Install any dependencies needed for installing the wheels package. **Option 2:**
Select a base OS Docker image (Check [OS compatibility](../../about/compatibility/linux-support.md))
Pull selected base OS image (Ubuntu 20.04 for example)
```docker
docker pull ubuntu:20.04
```
Start a Docker container using the downloaded image
```docker
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04
```
Install ROCm using the directions in the [Installation section](../install/linux/install-options.md).
**Option 3:**
Install on bare metal. Check [OS compatibility](../../about/compatibility/linux-support.md) and install ROCm using the
directions in the [Installation section](../install/linux/install-options.md).
2. Install the required dependencies for the wheels package.
```bash ```bash
sudo apt update sudo apt update
sudo apt install libjpeg-dev python3-dev sudo apt install libjpeg-dev python3-dev python3-pip
pip3 install wheel setuptools pip3 install wheel setuptools
``` ```
4. Install Torch, TorchVision, and TorchAudio as specified by the installation 3. Install `torch`, `torchvision`, and `torchaudio`, as specified in the [installation matrix](https://pytorch.org/get-started/locally/).
matrix.
```{note} :::{note}
ROCm 5.2 PyTorch wheel in the command below is shown for reference. The following command uses the ROCm 5.6 PyTorch wheel. If you want a different version of ROCm,
``` modify the command accordingly.
:::
```bash ```bash
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6/
``` ```
#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image 4. (Optional) Use MIOpen kdb files with ROCm PyTorch wheels.
A prebuilt base Docker image is used to build PyTorch in this option. The base PyTorch uses [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) for machine learning
Docker has all dependencies installed, including: primitives, which are compiled into kernels at runtime. Runtime compilation causes a small warm-up
phase when starting PyTorch, and MIOpen kdb files contain precompiled kernels that can speed up
application warm-up phases. For more information, refer to the
{doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in
a specific location with respect to the PyTorch installation path. A helper script simplifies this task by
taking the ROCm version and GPU architecture as inputs. This works for Ubuntu and CentOS.
You can download the helper script here:
[install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/ install_kdb_files_for_pytorch_wheels.sh), or use:
`wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh`
After installing ROCm PyTorch wheels, run the following code:
```bash
#Optional; replace 'gfx90a' with your architecture
export GFX_ARCH=gfx90a
#Optional
export ROCM_VERSION=5.5
./install_kdb_files_for_pytorch_wheels.sh
```
## Using the PyTorch ROCm base Docker image
The pre-built base Docker image has all dependencies installed, including:
* ROCm * ROCm
* Torchvision * Torchvision
* Conda packages * Conda packages
* Compiler toolchain * The compiler toolchain
Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, which is used by the build
the build scripts utilize that to determine the build environment configuration. scripts to determine the configuration of the build environment.
Follow these steps: 1. Download the Docker image. This is the base image, which does not contain PyTorch.
1. Obtain the Docker image.
```bash ```bash
docker pull rocm/pytorch:latest-base docker pull rocm/pytorch:latest-base
``` ```
The above will download the base container, which does not contain PyTorch. 2. Start a Docker container using the downloaded image.
2. Start a Docker container using the image.
```bash ```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
``` ```
You can also pass the -v argument to mount any data directories from the host You can also pass the `-v` argument to mount any data directories from the host onto the container.
onto the container.
3. Clone the PyTorch repository. 3. Clone the PyTorch repository.
```bash ```bash
cd ~ cd ~
git clone https://github.com/pytorch/pytorch.git git clone https://github.com/pytorch/pytorch.git
cd pytorch cd pytorch
git submodule update --init --recursive git submodule update --init --recursive
``` ```
4. Build PyTorch for ROCm. 4. Set the ROCm architecture (optional). The Docker image tag is `rocm/pytorch:latest-base`.
```{note} :::{note}
By default in the `rocm/pytorch:latest-base`, PyTorch builds for these By default in the `rocm/pytorch:latest-base` image, PyTorch builds simultaneously for the following
architectures simultaneously: architectures:
* gfx900 * gfx900
* gfx906 * gfx906
* gfx908 * gfx908
* gfx90a * gfx90a
* gfx1030 * gfx1030
``` :::
5. To determine your AMD uarch, run: If you want to compile _only_ for your microarchitecture (uarch), run:
```bash
rocminfo | grep gfx
```
6. In the event you want to compile only for your uarch, use:
```bash ```bash
export PYTORCH_ROCM_ARCH=<uarch> export PYTORCH_ROCM_ARCH=<uarch>
``` ```
`<uarch>` is the architecture reported by the `rocminfo` command. Where `<uarch>` is the architecture reported by the `rocminfo` command.
7. Build PyTorch using the following command: To find your uarch, run:
```bash ```bash
./.jenkins/pytorch/build.sh rocminfo | grep gfx
``` ```
This will first convert PyTorch sources for HIP compatibility and build the 5. Build PyTorch.
```bash
./.ci/pytorch/build.sh
```
This converts PyTorch sources for
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework. PyTorch framework.
8. Alternatively, build PyTorch by issuing the following commands: To check if your build is successful, run:
```bash ```bash
python3 tools/amd_build/build_amd.py echo $? # returns 0 on success
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
``` ```
#### Option 4: Install Using PyTorch Upstream Docker File ## Using the PyTorch upstream Docker file
Instead of using a prebuilt base Docker image, you can build a custom base If you don't want to use a prebuilt base Docker image, you can build a custom base Docker image
Docker image using scripts from the PyTorch repository. This will utilize a using scripts from the PyTorch repository. This uses a standard Docker image from operating system
standard Docker image from operating system maintainers and install all the maintainers and installs all the required dependencies, including:
dependencies required to build PyTorch, including
* ROCm * ROCm
* Torchvision * Torchvision
* Conda packages * Conda packages
* Compiler toolchain * The compiler toolchain
Follow these steps: 1. Clone the PyTorch repository.
1. Clone the PyTorch repository on the host.
```bash ```bash
cd ~ cd ~
git clone https://github.com/pytorch/pytorch.git git clone https://github.com/pytorch/pytorch.git
cd pytorch cd pytorch
git submodule update --init --recursive git submodule update --init --recursive
``` ```
2. Build the PyTorch Docker image. 2. Build the PyTorch Docker image.
```bash ```bash
cd.circleci/docker cd .ci/docker
./build.sh pytorch-linux-bionic-rocm<version>-py3.7 ./build.sh pytorch-linux-<os-version>-rocm<rocm-version>-py<python-version> -t rocm/pytorch:build_from_dockerfile
# eg. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7
``` ```
This should be complete with a message "Successfully build `<image_id>`." Where:
* `<os-version>`: `ubuntu20.04` (or `focal`), `ubuntu22.04` (or `jammy`), `centos7.5`, or `centos9`
* `<rocm-version>`: `5.4`, `5.5`, or `5.6`
* `<python-version>`: `3.8`-`3.11`
3. Start a Docker container using the image: To verify that your image was successfully created, run:
`docker image ls rocm/pytorch:build_from_dockerfile`
If successful, the output looks like this:
```bash ```bash
docker run -it --cap-add=SYS_PTRACE --security-opt REPOSITORY TAG IMAGE ID CREATED SIZE
seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add rocm/pytorch build_from_dockerfile 17071499be47 2 minutes ago 32.8GB
video --ipc=host --shm-size 8G <image_id>
``` ```
You can also pass -v argument to mount any data directories from the host 3. Start a Docker container using the image with the mounted PyTorch folder.
onto the container.
```bash
4. Clone the PyTorch repository. docker run -it --user root --cap-add=SYS_PTRACE --security-opt \
seccomp=unconfined --device=/dev/kfd --device=/dev/dri \
--group-add video --ipc=host --shm-size 8G \
-v ~/pytorch:/pytorch rocm/pytorch:build_from_dockerfile
```
You can also pass the `-v` argument to mount any data directories from the host onto the container.
4. Go to the PyTorch directory.
```bash ```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd pytorch cd pytorch
git submodule update --init --recursive
``` ```
5. Build PyTorch for ROCm. 5. Set the ROCm architecture.
```{note} To determine your AMD architecture, run:
By default in the `rocm/pytorch:latest-base`, PyTorch builds for these
architectures simultaneously:
* gfx900
* gfx906
* gfx908
* gfx90a
* gfx1030
```
6. To determine your AMD uarch, run:
```bash ```bash
rocminfo | grep gfx rocminfo | grep gfx
``` ```
7. If you want to compile only for your uarch: The result looks like this (for `gfx1030` architecture):
```bash
Name: gfx1030
Name: amdgcn-amd-amdhsa--gfx1030
```
Set the `PYTORCH_ROCM_ARCH` environment variable to specify the architectures you want to
build PyTorch for.
```bash ```bash
export PYTORCH_ROCM_ARCH=<uarch> export PYTORCH_ROCM_ARCH=<uarch>
``` ```
`<uarch>` is the architecture reported by the `rocminfo` command. where `<uarch>` is the architecture reported by the `rocminfo` command.
8. Build PyTorch using: 6. Build PyTorch.
```bash ```bash
./.jenkins/pytorch/build.sh ./.ci/pytorch/build.sh
``` ```
This will first convert PyTorch sources to be HIP compatible and then build the This converts PyTorch sources for
PyTorch framework. [HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework.
Alternatively, build PyTorch by issuing the following commands: To check if your build is successful, run:
```bash ```bash
python3 tools/amd_build/build_amd.py echo $? # returns 0 on success
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
```
### Test the PyTorch Installation
You can use PyTorch unit tests to validate a PyTorch installation. If using a
prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official
wheels package, these tests are already run on those configurations.
Alternatively, you can manually run the unit tests to validate the PyTorch
installation fully.
Follow these steps:
1. Test if PyTorch is installed and accessible by importing the torch package in
Python.
```{note}
Do not run in the PyTorch git folder.
``` ```
## Testing the PyTorch installation
You can use PyTorch unit tests to validate your PyTorch installation. If you used a
**prebuilt PyTorch Docker image from AMD ROCm DockerHub** or installed an
**official wheels package**, validation tests are not necessary.
If you want to manually run unit tests to validate your PyTorch installation fully, follow these steps:
1. Import the torch package in Python to test if PyTorch is installed and accessible.
:::{note}
Do not run the following command in the PyTorch git folder.
:::
```bash ```bash
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure' python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
``` ```
2. Test if the GPU is accessible from PyTorch. In the PyTorch framework, 2. Check if the GPU is accessible from PyTorch. In the PyTorch framework, `torch.cuda` is a generic way
`torch.cuda` is a generic mechanism to access the GPU; it will access an AMD to access the GPU. This can only access an AMD GPU if one is available.
GPU only if available.
```bash ```bash
python3 -c 'import torch; print(torch.cuda.is_available())' python3 -c 'import torch; print(torch.cuda.is_available())'
``` ```
3. Run the unit tests to validate the PyTorch installation fully. Run the 3. Run unit tests to validate the PyTorch installation fully.
following command from the PyTorch home directory:
:::{note}
You must run the following command from the PyTorch home directory.
:::
```bash ```bash
BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh PYTORCH_TEST_WITH_ROCM=1 python3 test/run_test.py --verbose \
--include test_nn test_torch test_cuda test_ops \
test_unary_ufuncs test_binary_ufuncs test_autograd
``` ```
This ensures that even for wheel installs in a non-controlled environment, This command ensures that the required environment variable is set to skip certain unit tests for
the required environment variable will be set to skip certain unit tests for ROCm. This also applies to wheel installs in a non-controlled environment.
ROCm.
```{note} :::{note}
Make sure the PyTorch source code is corresponding to the PyTorch wheel or Make sure your PyTorch source code corresponds to the PyTorch wheel or the installation in the
installation in the Docker image. Incompatible PyTorch source code might give Docker image. Incompatible PyTorch source code can give errors when running unit tests.
errors when running the unit tests. :::
```
This will first install some dependencies, such as a supported [torchvision](https://pytorch.org/vision/stable/index.html) Some tests may be skipped, as appropriate, based on your system configuration. ROCm doesn't
version for PyTorch. Torchvision is used in some PyTorch tests for loading support all PyTorch features; tests that evaluate unsupported features are skipped. Other tests might
models. Next, this will run all the unit tests. be skipped, depending on the host or GPU memory and the number of available GPUs.
```{note} If the compilation and installation are correct, all tests will pass.
Some tests may be skipped, as appropriate, based on your system
configuration. All features of PyTorch are not supported on ROCm, and the
tests that evaluate these features are skipped. In addition, depending on the
host memory, or the number of available GPUs, other tests may be skipped. No
test should fail if the compilation and installation are correct.
```
4. Run individual unit tests with the following command: 4. Run individual unit tests.
```bash ```bash
PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
``` ```
`test_nn.py` can be replaced with any other test set. You can replace `test_nn.py` with any other test set.
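Beyond the unit tests, a one-line smoke test can confirm that tensors actually land on the GPU. This sketch assumes at least one AMD GPU is visible to PyTorch:

```bash
# allocate a tensor on the GPU, double it, and print the result (8.0)
python3 -c 'import torch; x = torch.ones(4, device="cuda"); print((x * 2).sum().item())'
```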
### Run a Basic PyTorch Example ## Running a basic PyTorch example
The PyTorch examples repository provides basic examples that exercise the The PyTorch examples repository provides basic examples that exercise the functionality of your
functionality of the framework. MNIST (Modified National Institute of Standards functionality of the framework.
and Technology) database is a collection of handwritten digits that may be used
to train a Convolutional Neural Network for handwriting recognition.
Alternatively, ImageNet is a database of images used to train a network for
visual object recognition.
Follow these steps: Two of our favorite testing databases are:
* **MNIST** (Modified National Institute of Standards and Technology): A database of handwritten
digits that can be used to train a Convolutional Neural Network for **handwriting recognition**.
* **ImageNet**: A database of images that can be used to train a network for
**visual object recognition**.
### MNIST PyTorch example
1. Clone the PyTorch examples repository. 1. Clone the PyTorch examples repository.
@@ -372,44 +401,45 @@ Follow these steps:
git clone https://github.com/pytorch/examples.git git clone https://github.com/pytorch/examples.git
``` ```
2. Run the MNIST example. 2. Go to the MNIST example folder.
```bash ```bash
cd examples/mnist cd examples/mnist
``` ```
3. Follow the instructions in the `README` file in this folder. In this case: 3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash ```bash
pip3 install -r requirements.txt
python3 main.py python3 main.py
``` ```
4. Run the ImageNet example. This generates the following output:
```bash
...
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.010128
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.001348
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.005261
Test set: Average loss: 0.0252, Accuracy: 9921/10000 (99%)
```
### ImageNet PyTorch example
1. Clone the PyTorch examples repository (if you didn't already do this step in the preceding MNIST example).
```bash
git clone https://github.com/pytorch/examples.git
```
2. Go to the ImageNet example folder.
```bash ```bash
cd examples/imagenet cd examples/imagenet
``` ```
5. Follow the instructions in the `README` file in this folder. In this case: 3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash ```bash
pip3 install -r requirements.txt
python3 main.py python3 main.py
``` ```
## Using MIOpen kdb files with ROCm PyTorch wheels
PyTorch uses MIOpen for machine learning primitives. These primitives are compiled into kernels at runtime. Runtime compilation causes a small warm-up phase when starting PyTorch. MIOpen kdb files contain precompiled kernels that can speed up the warm-up phase of an application. More information is available on the {doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in a specific location with respect to the PyTorch installation path. A helper script simplifies this task for the user. The script takes in the ROCm version and the user's GPU architecture as inputs, and works for Ubuntu and CentOS.
Helper script: [install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh)
Usage:
After installing ROCm PyTorch wheels:
1. [Optional] `export GFX_ARCH=gfx90a`
2. [Optional] `export ROCM_VERSION=5.5`
3. `./install_kdb_files_for_pytorch_wheels.sh`
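Put together, a typical invocation looks like the following sketch (the architecture and ROCm version are illustrative):

```bash
wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh
export GFX_ARCH=gfx90a   # optional; match your GPU architecture
export ROCM_VERSION=5.5  # optional
bash ./install_kdb_files_for_pytorch_wheels.sh
```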

View File

@@ -26,7 +26,7 @@ version support matrix:
The following sections contain options for installing TensorFlow. The following sections contain options for installing TensorFlow.
#### Option 1: Using a Docker image #### Option 1: using a Docker image
To install ROCm on bare metal, follow the section To install ROCm on bare metal, follow the section
[Installation (Linux)](../../tutorials/install/linux/os-native/install). The recommended option to [Installation (Linux)](../../tutorials/install/linux/os-native/install). The recommended option to

View File

@@ -1,4 +1,4 @@
# Installation using the command-line interface (CLI) # Installation using the CLI
The steps to install the HIP SDK for Windows are described in this document. The steps to install the HIP SDK for Windows are described in this document.

View File

@@ -1,4 +1,4 @@
# Uninstallation using the command-line interface (CLI) # Uninstallation using the CLI
The steps to uninstall the HIP SDK for Windows are described in this document. The steps to uninstall the HIP SDK for Windows are described in this document.

View File

@@ -81,7 +81,7 @@ This release consists of the following OpenMP enhancements:
* Enable new device RTL in libomptarget as default.
* New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism` (see the sketch after this list).
* Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
* Initial implementation of optimized cross-team sum reduction for float and double type scalars.
* Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
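A sketch of how the new flag is used (this compile line is illustrative, not taken from the release notes; it assumes ROCm's `amdclang` and an MI200-class target):
```bash
# Offload compile with the aggressive device defaults enabled in one flag.
# -fopenmp-target-fast implies the three assume/ignore flags listed above.
amdclang -O3 -fopenmp --offload-arch=gfx90a -fopenmp-target-fast saxpy.c -o saxpy
```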

View File

@@ -1,78 +1,80 @@
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### Release highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* The ROCm 6.0 release will show MI50s as "under maintenance" for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md).
* No new features or performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance (EOM) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance period will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue as per the ROCm release cadence for gfx906 GPUs until EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: the currently available implementation, which relies on hostcalls and requires the system to support PCIe atomics. This is the default scheme.
* “buffered”: an implementation that leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during kernel execution, and the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
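A minimal sketch of the buffered scheme (the demo program and file name are illustrative; only the `-mprintf-kind` option comes from this release):
```bash
# Build and run a toy HIP program with the buffered printf scheme.
cat > printf_demo.hip <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello() {
    // Under the buffered scheme, printf returns 0 on success or -1 when
    // the fixed-size buffer is full, not the number of characters printed.
    printf("hello from thread %d\n", (int)threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();
    (void)hipDeviceSynchronize();  // printing happens after the kernel finishes
    return 0;
}
EOF
hipcc -mprintf-kind=buffered printf_demo.hip -o printf_demo
./printf_demo
```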
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU at [LLVM AddressSanitizer User Guide](./docs/conceptual/using_gpu_sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
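As a hedged sketch of enabling GPU ASan (the flags shown are the commonly documented ones; treat the user guide above as authoritative):
```bash
# Illustrative build of a HIP application with AddressSanitizer
# instrumentation for both host and device code. Assumes a gfx90a
# target with XNACK support; adjust --offload-arch for your GPU.
hipcc -g -fsanitize=address -shared-libsan \
      --offload-arch=gfx90a:xnack+ my_app.hip -o my_app
# XNACK must be enabled at run time for device-side checking.
HSA_XNACK=1 ./my_app
```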
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed Catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP directed Catch2 tests with XNACK+
#### HIP 5.7.0
@@ -80,73 +82,73 @@ The following defects are fixed in ROCm v5.7,
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and the rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux:
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* The HIP API `hipPointerGetAttributes` could return an invalid value if the input memory pointer was not allocated through any HIP API on device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from the `hipDeviceProp_t` structure
* Addition of new fields in the `hipDeviceProp_t` structure:
  * `maxTexture1D`
  * `maxTexture2D`
  * `maxTexture1DLayered`
  * `maxTexture2DLayered`
  * `sharedMemPerMultiprocessor`
  * `deviceOverlap`
  * `asyncEngineCount`
  * `surfaceAlignment`
  * `unifiedAddressing`
  * `computePreemptionSupported`
  * `hostRegisterSupported`
  * `uuid`
* Removal of deprecated `-hip-hcc` code from the HIP code tree
* Correct `hipArray` usage in HIP APIs such as `hipMemcpyAtoH` and `hipMemcpyHtoA`
* `HIPMEMCPY_3D` fields correction to avoid truncation of `size_t` to `unsigned int` inside `hipMemcpy3D()`
* Renaming of `memoryType` in the `hipPointerAttribute_t` structure to `type`
* Correct `hipGetLastError` to return the last error instead of the last API call's return code
* Update `hipExternalSemaphoreHandleDesc` to add `unsigned int reserved[16]`
* Correct handling of flag values in `hipIpcOpenMemHandle` for `hipIpcMemLazyEnablePeerAccess`
* Remove `hiparray*` and make it opaque with `hipArray_t`