Link and formatting fixes (#2482)

Lisa
2023-09-20 09:55:21 -06:00
committed by GitHub
parent acde6284a0
commit 940d2933ff
27 changed files with 1057 additions and 528 deletions

.gitignore vendored
View File

@@ -13,7 +13,6 @@ _doxygen/
 _readthedocs/
 # avoid duplicating contributing.md due to conf.py
-CHANGELOG.md
-docs/contributing.md
-docs/release.md
+docs/CHANGELOG.md
+docs/contribute/index.md
 docs/about/release-notes.md

View File

@@ -21,76 +21,78 @@ The release notes for the ROCm platform.
### Release Highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md)
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU in the [LLVM AddressSanitizer User Guide](./docs/conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP-directed catch2 tests with XNACK+
#### HIP 5.7.0
@@ -98,79 +100,78 @@ The following defects are fixed in ROCm v5.7,
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and the rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* HIP API `hipPointerGetAttributes` could return an invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
  * maxTexture1D
  * maxTexture2D
  * maxTexture1DLayered
  * maxTexture2DLayered
  * sharedMemPerMultiprocessor
  * deviceOverlap
  * asyncEngineCount
  * surfaceAlignment
  * unifiedAddressing
  * computePreemptionSupported
  * hostRegisterSupported
  * uuid
* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Correct hipGetLastError to return the last error instead of the last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Remove hiparray* and make it opaque with hipArray_t
### Library changes in ROCm 5.7.0
| Library | Version |
|---------|---------|
@@ -211,11 +212,11 @@ hipCUB 2.13.1 for ROCm 5.7.0
##### Changed
* CUB backend references CUB and Thrust version 2.0.1.
* Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
* Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
##### Known issues
* `debug_synchronous` no longer works on the CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
* `DeviceReduce::Sum` does not compile on the CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
@@ -228,11 +229,11 @@ hipFFT 1.0.12 for ROCm 5.7.0
##### Added
* Implemented the `hipfftXtMakePlanMany`, `hipfftXtGetSizeMany`, and `hipfftXtExec` APIs to allow requesting half-precision transforms.
##### Changed
* Added a `--precision` argument to benchmark/test clients. `--double` is still accepted but is deprecated as a method to request a double-precision transform.
#### hipSOLVER 1.8.1
@@ -240,7 +241,7 @@ hipSOLVER 1.8.1 for ROCm 5.7.0
##### Changed
* Changed hipsolver-test sparse input data search paths to be relative to the test executable
#### hipSPARSE 2.3.8
@@ -248,10 +249,10 @@ hipSPARSE 2.3.8 for ROCm 5.7.0
##### Improved
* Fix compilation failures when using cusparse 12.1.0 backend
* Fix compilation failures when using cusparse 12.0.0 backend
* Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
* Minor improvements
#### MIOpen 2.19.0
@@ -259,17 +260,17 @@ MIOpen 2.19.0 for ROCm 5.7.0
##### Added
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changed
* Tuning results for MLIR on ROCm 5.5
* Bumping MLIR commit to 5.5.0 release tag
##### Fixed
* Fix 3D convolution Host API bug
* [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
#### RCCL 2.17.1-1
@@ -277,19 +278,19 @@ RCCL 2.17.1-1 for ROCm 5.7.0
##### Changed
* Compatibility with NCCL 2.17.1-1
* Performance tuning for some collective operations
##### Added
* Minor improvements to MSCCL codepath
* NCCL_NCHANNELS_PER_PEER support
* Improved compilation performance
* Support for gfx94x
##### Fixed
* Potential race condition during ncclSocketClose()
#### rocALUTION 2.1.11
@@ -297,11 +298,11 @@ rocALUTION 2.1.11 for ROCm 5.7.0
##### Added
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed OpenMP runtime issue with Windows toolchain
#### rocBLAS 3.1.0
@@ -309,33 +310,33 @@ rocBLAS 3.1.0 for ROCm 5.7.0
##### Added
* YAML lock-step argument scanning for rocblas-bench and rocblas-test clients. See the Programmer's Guide for details.
* rocblas-gemm-tune is used to find the best-performing GEMM kernel for each of a given set of GEMM problems.
##### Fixed
* Make offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  * Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  * Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  * Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  * General: set_vector, get_vector, set_matrix, get_matrix
  * Related fixes: internal scalar loads with > 32-bit offsets
* Fix in-place functionality for all trtri sizes
##### Changed
* `dot` when using `rocblas_pointer_mode_host` is now synchronous to match legacy BLAS, as it stores results in host memory
* Enhanced reporting of installation issues caused by runtime libraries (Tensile)
* Standardized internal rocblas C++ interface across most functions
##### Deprecated
* Removal of `__STDC_WANT_IEC_60559_TYPES_EXT__` define in a future release
##### Dependencies
* Optional use of AOCL BLIS 4.0 on Linux for clients
* Optional build-tool-only dependency on Python psutil
#### rocFFT 1.0.24
@@ -343,16 +344,16 @@ rocFFT 1.0.24 for ROCm 5.7.0
##### Optimizations
* Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
##### Added
* Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
##### Changed
* Moved the rocfft_rtc_helper executable to the lib/rocFFT directory on Linux.
* Moved the library kernel cache to the lib/rocFFT directory.
#### rocm-cmake 0.10.0
@@ -360,8 +361,8 @@ rocm-cmake 0.10.0 for ROCm 5.7.0
##### Added
* Added ROCMTest module
* ROCMCreatePackage: Added support for ASAN packages
#### rocPRIM 2.13.1
@@ -369,14 +370,14 @@ rocPRIM 2.13.1 for ROCm 5.7.0
##### Changed
* Deprecated configuration `radix_sort_config` for device-level radix sort, as it no longer matches the algorithm's parameters. The new configuration `radix_sort_config_v2` is preferred instead.
* Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan is now the only available implementation.
* The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
* Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. The new configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
##### Fixed
* Fixed build issue caused by a missing header in `thread/thread_search.hpp`.
#### rocRAND 2.10.17
@@ -384,14 +385,14 @@ rocRAND 2.10.17 for ROCm 5.7.0
##### Added
* MT19937 pseudorandom number generator based on M. Matsumoto and T. Nishimura, 1998, "Mersenne Twister: a 623-dimensionally equidistributed uniform pseudorandom number generator".
* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
* Experimental HIP-CPU feature
* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
##### Changed
* Python 2.7 is no longer officially supported.
#### rocSOLVER 3.23.0
@@ -399,19 +400,19 @@ rocSOLVER 3.23.0 for ROCm 5.7.0
##### Added
* LU factorization without pivoting for block tridiagonal matrices:
  * GEBLTTRF_NPVT now supports interleaved\_batched format
* Linear system solver without pivoting for block tridiagonal matrices:
  * GEBLTTRS_NPVT now supports interleaved\_batched format
##### Fixed
* Fixed stack overflow in sparse tests on Windows
##### Changed
* Changed rocsolver-test sparse input data search paths to be relative to the test executable
* Changed build scripts to default to compressed debug symbols in Debug builds
#### rocSPARSE 2.5.4
@@ -419,12 +420,12 @@ rocSPARSE 2.5.4 for ROCm 5.7.0
##### Added
* Added more mixed precisions for SpMV: (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed a bug in csrsm and bsrsm
##### Known issues
@@ -434,15 +435,15 @@ In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accur
rocThrust 2.18.0 for ROCm 5.7.0
##### Fixed
* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
* Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
##### Changed
* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
* Removed references to and workarounds for deprecated hcc
#### rocWMMA 1.2.0
@@ -450,8 +451,8 @@ rocWMMA 1.2.0 for ROCm 5.7.0
##### Changed
* Fixed a bug with synchronization
* Updated rocWMMA cmake versioning
#### Tensile 4.38.0
@@ -459,29 +460,29 @@ Tensile 4.38.0 for ROCm 5.7.0
##### Added
* Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
* Added user-driven solution selection feature
##### Optimizations
* Enabled LocalSplitU with MFMA for I8 data type
* Optimized K mask code in mfmaIter
* Enabled TailLoop code in NoLoadLoop to prefetch global/local read
* Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
* Optimized DirectToLds test cases to reduce the test duration
##### Changed
* Removed DGEMM NT custom kernels and related test cases
* Changed noTailLoop logic to apply noTailLoop only for NT
* Changed the range of AssertFree0ElementMultiple and Free1
* Unified aStr, bStr generation code in mfmaIter
##### Fixed
* Fixed LocalSplitU mismatch issue for SGEMM
* Fixed BufferStore=0 and Ldc != Ldd case
* Fixed mismatch issue with TailLoop + MatrixInstB > 1
-------------------
@@ -2282,7 +2283,7 @@ This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
- Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.

View File

@@ -21,7 +21,7 @@ The release notes for the ROCm platform.
### What's new in this release
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.

View File

@@ -15,6 +15,477 @@ The release notes for the ROCm platform.
-------------------
## ROCm 5.7.0
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### Release Highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md)
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
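As an illustration, a minimal kernel such as the following sketch exercises the buffered path when compiled with, for example, `hipcc -mprintf-kind=buffered demo.hip` (the file and kernel names here are illustrative, not from the release):
```
// demo.hip -- illustrative sketch of kernel printf under the buffered scheme
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello() {
    // Under -mprintf-kind=buffered, arguments are stored in a fixed-size,
    // non-circular buffer and printed by the runtime after the kernel
    // finishes; the call returns 0 on success or -1 if the buffer is full.
    int rc = printf("hello from thread %d\n", static_cast<int>(threadIdx.x));
    (void)rc;
}

int main() {
    hello<<<dim3(1), dim3(4)>>>();
    hipDeviceSynchronize();  // output appears once the kernel completes
    return 0;
}
```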
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU in the [LLVM AddressSanitizer User Guide](./docs/conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
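For a sense of what the sanitizer catches, the sketch below contains a deliberate out-of-bounds device write; the compile line in the comment is an assumption based on common clang ASan flags and an XNACK-enabled offload target, so verify the exact invocation against the user guide linked above.
```
// asan_demo.hip -- deliberate out-of-bounds write for device ASan to flag.
// Hypothetical compile line (check the user guide for the exact flags):
//   hipcc -g -fsanitize=address -shared-libsan --offload-arch=gfx90a:xnack+ asan_demo.hip
#include <hip/hip_runtime.h>

__global__ void oob(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {      // off-by-one: the thread with i == n writes past the end
        buf[i] = i;
    }
}

int main() {
    int* d = nullptr;
    hipMalloc(&d, 256 * sizeof(int));
    oob<<<1, 257>>>(d, 256);  // thread 256 triggers a heap-buffer-overflow report
    hipDeviceSynchronize();
    hipFree(d);
    return 0;
}
```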
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP-directed catch2 tests with XNACK+
#### HIP 5.7.0
##### Optimizations
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and the rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
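As a rough sketch of how these calls fit together, assuming the signatures mirror their CUDA counterparts (as HIP APIs generally do):
```
// Sketch (Windows-only in 5.7): allocate a small mipmapped array,
// fetch its base level, then free it. Signatures are assumed to mirror
// the CUDA mipmapped-array API; error handling omitted for brevity.
#include <hip/hip_runtime.h>

int main() {
    hipChannelFormatDesc desc = hipCreateChannelDesc<float>();
    hipExtent extent = make_hipExtent(256, 256, 0);  // 2D: depth of 0

    hipMipmappedArray_t mip = nullptr;
    hipMallocMipmappedArray(&mip, &desc, extent, /*numLevels=*/2, 0);

    hipArray_t level0 = nullptr;
    hipGetMipmappedArrayLevel(&level0, mip, /*level=*/0);  // base level

    hipFreeMipmappedArray(mip);
    return 0;
}
```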
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* HIP API `hipPointerGetAttributes` could return an invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
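Until this is resolved, a defensive pattern is to trust the returned attributes only when the call itself succeeds, as in this sketch (note that the `memoryType` field shown here is renamed to `type` in ROCm 6.0, per the list below):
```
// Sketch: only trust hipPointerGetAttributes output when it returns
// hipSuccess; for pointers not allocated through HIP, the attributes
// may be invalid (the known issue above).
#include <hip/hip_runtime.h>

bool is_hip_device_pointer(const void* p) {
    hipPointerAttribute_t attr{};
    if (hipPointerGetAttributes(&attr, p) != hipSuccess) {
        (void)hipGetLastError();  // clear the sticky error state
        return false;             // not a HIP allocation; don't read attr
    }
    return attr.memoryType == hipMemoryTypeDevice;  // 'memoryType' until ROCm 6.0
}
```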
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
  * maxTexture1D
  * maxTexture2D
  * maxTexture1DLayered
  * maxTexture2DLayered
  * sharedMemPerMultiprocessor
  * deviceOverlap
  * asyncEngineCount
  * surfaceAlignment
  * unifiedAddressing
  * computePreemptionSupported
  * hostRegisterSupported
  * uuid
* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Correct hipGetLastError to return the last error instead of the last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Remove hiparray* and make it opaque with hipArray_t
### Library changes in ROCm 5.7.0
| Library | Version |
|---------|---------|
| hipBLAS | 0.54.0 ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.0) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.0) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.0) |
| hipSOLVER | 1.8.0 ⇒ [1.8.1](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.0) |
| hipSPARSE | 2.3.7 ⇒ [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.0) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.0) |
| rccl | 2.15.5 ⇒ [2.17.1-1](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.7.0) |
| rocALUTION | 2.1.9 ⇒ [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.0) |
| rocBLAS | 3.0.0 ⇒ [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.0) |
| rocFFT | 1.0.23 ⇒ [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.0) |
| rocm-cmake | 0.9.0 ⇒ [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.0) |
| rocPRIM | 2.13.0 ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.0) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.0) |
| rocSOLVER | 3.22.0 ⇒ [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.0) |
| rocSPARSE | 2.5.2 ⇒ [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.0) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.0) |
| rocWMMA | [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.0) |
| Tensile | 4.37.0 ⇒ [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.0) |
#### hipBLAS 1.1.0
hipBLAS 1.1.0 for ROCm 5.7.0
##### Changed
* Updated documentation requirements
##### Dependencies
* The rocSOLVER dependency now additionally depends on rocSPARSE
#### hipCUB 2.13.1
hipCUB 2.13.1 for ROCm 5.7.0
##### Changed
* CUB backend references CUB and Thrust version 2.0.1.
* Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
* Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
##### Known issues
* `debug_synchronous` no longer works on the CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
* `DeviceReduce::Sum` does not compile on the CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
* `DeviceHistogram::HistogramEven` fails on the CUDA platform for `[LevelT, SampleIteratorT] = [int, int]`.
* `DeviceHistogram::MultiHistogramEven` fails on the CUDA platform for `[LevelT, SampleIteratorT] = [int, int/unsigned short/float/double]` and `[LevelT, SampleIteratorT] = [float, double]`.
#### hipFFT 1.0.12
hipFFT 1.0.12 for ROCm 5.7.0
##### Added
* Implemented the `hipfftXtMakePlanMany`, `hipfftXtGetSizeMany`, and `hipfftXtExec` APIs to allow requesting half-precision transforms.
##### Changed
* Added a `--precision` argument to benchmark/test clients. `--double` is still accepted but is deprecated as a method to request a double-precision transform.
#### hipSOLVER 1.8.1
hipSOLVER 1.8.1 for ROCm 5.7.0
##### Changed
* Changed hipsolver-test sparse input data search paths to be relative to the test executable
#### hipSPARSE 2.3.8
hipSPARSE 2.3.8 for ROCm 5.7.0
##### Improved
* Fix compilation failures when using cusparse 12.1.0 backend
* Fix compilation failures when using cusparse 12.0.0 backend
* Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
* Minor improvements
#### MIOpen 2.19.0
MIOpen 2.19.0 for ROCm 5.7.0
##### Added
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changed
* Tuning results for MLIR on ROCm 5.5
* Bumping MLIR commit to 5.5.0 release tag
##### Fixed
* Fix 3D convolution Host API bug
* [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
#### RCCL 2.17.1-1
RCCL 2.17.1-1 for ROCm 5.7.0
##### Changed
* Compatibility with NCCL 2.17.1-1
* Performance tuning for some collective operations
##### Added
* Minor improvements to MSCCL codepath
* NCCL_NCHANNELS_PER_PEER support
* Improved compilation performance
* Support for gfx94x
##### Fixed
* Potential race condition during ncclSocketClose()
#### rocALUTION 2.1.11
rocALUTION 2.1.11 for ROCm 5.7.0
##### Added
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed OpenMP runtime issue with Windows toolchain
#### rocBLAS 3.1.0
rocBLAS 3.1.0 for ROCm 5.7.0
##### Added
* YAML lock-step argument scanning for rocblas-bench and rocblas-test clients. See the Programmer's Guide for details.
* rocblas-gemm-tune is used to find the best-performing GEMM kernel for each of a given set of GEMM problems.
##### Fixed
* Make offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  * Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  * Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  * Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  * General: set_vector, get_vector, set_matrix, get_matrix
  * Related fixes: internal scalar loads with > 32-bit offsets
* Fix in-place functionality for all trtri sizes
##### Changed
* `dot` when using `rocblas_pointer_mode_host` is now synchronous to match legacy BLAS, as it stores results in host memory (see the sketch below)
* Enhanced reporting of installation issues caused by runtime libraries (Tensile)
* Standardized internal rocblas C++ interface across most functions
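The `dot` change above means a host-mode result is valid as soon as the call returns, as in this sketch (error checks omitted; `dx` and `dy` are assumed to be populated device arrays):
```
// Sketch: with rocblas_pointer_mode_host, rocblas_sdot now blocks and
// the host result is valid as soon as the call returns (legacy BLAS
// semantics); no explicit stream synchronization is needed.
#include <rocblas/rocblas.h>

float host_mode_dot(const float* dx, const float* dy, int n) {
    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);

    float result = 0.0f;  // host memory for the result
    rocblas_sdot(handle, n, dx, 1, dy, 1, &result);

    rocblas_destroy_handle(handle);
    return result;  // usable immediately
}
```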
##### Deprecated
* Removal of `__STDC_WANT_IEC_60559_TYPES_EXT__` define in a future release
##### Dependencies
* Optional use of AOCL BLIS 4.0 on Linux for clients
* Optional build-tool-only dependency on Python psutil
#### rocFFT 1.0.24
rocFFT 1.0.24 for ROCm 5.7.0
##### Optimizations
* Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
##### Added
* Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
##### Changed
* Moved the rocfft_rtc_helper executable to the lib/rocFFT directory on Linux.
* Moved the library kernel cache to the lib/rocFFT directory.
#### rocm-cmake 0.10.0
rocm-cmake 0.10.0 for ROCm 5.7.0
##### Added
* Added ROCMTest module
* ROCMCreatePackage: Added support for ASAN packages
#### rocPRIM 2.13.1
rocPRIM 2.13.1 for ROCm 5.7.0
##### Changed
* Deprecated configuration `radix_sort_config` for device-level radix sort, as it no longer matches the algorithm's parameters. The new configuration `radix_sort_config_v2` is preferred instead.
* Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan is now the only available implementation.
* The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
* Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. The new configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
##### Fixed
* Fixed build issue caused by a missing header in `thread/thread_search.hpp`.
#### rocRAND 2.10.17
rocRAND 2.10.17 for ROCm 5.7.0
##### Added
* MT19937 pseudorandom number generator based on M. Matsumoto and T. Nishimura, 1998, "Mersenne Twister: a 623-dimensionally equidistributed uniform pseudorandom number generator".
* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
* Experimental HIP-CPU feature
* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
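A minimal host-API sketch using the new MT19937 generator (the generator constant follows the existing `ROCRAND_RNG_PSEUDO_*` naming convention; verify against the rocRAND headers):
```
// Sketch: fill a device buffer with uniforms from the new MT19937
// generator via the rocRAND host API. Error handling omitted.
#include <rocrand/rocrand.h>
#include <hip/hip_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float* d_out = nullptr;
    hipMalloc(&d_out, n * sizeof(float));

    rocrand_generator gen;
    rocrand_create_generator(&gen, ROCRAND_RNG_PSEUDO_MT19937);
    rocrand_set_seed(gen, 5489ULL);           // MT19937's classic default seed
    rocrand_generate_uniform(gen, d_out, n);  // uniforms in (0, 1]

    rocrand_destroy_generator(gen);
    hipFree(d_out);
    return 0;
}
```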
##### Changed
* Python 2.7 is no longer officially supported.
#### rocSOLVER 3.23.0
rocSOLVER 3.23.0 for ROCm 5.7.0
##### Added
* LU factorization without pivoting for block tridiagonal matrices:
  * GEBLTTRF_NPVT now supports interleaved\_batched format
* Linear system solver without pivoting for block tridiagonal matrices:
  * GEBLTTRS_NPVT now supports interleaved\_batched format
##### Fixed
* Fixed stack overflow in sparse tests on Windows
##### Changed
* Changed rocsolver-test sparse input data search paths to be relative to the test executable
* Changed build scripts to default to compressed debug symbols in Debug builds
#### rocSPARSE 2.5.4
rocSPARSE 2.5.4 for ROCm 5.7.0
##### Added
* Added more mixed precisions for SpMV: (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed a bug in csrsm and bsrsm
##### Known issues
In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues when XNACK is enabled, which are under investigation. The fallback is rocsparse_itilu0_alg_sync_split.
#### rocThrust 2.18.0
rocThrust 2.18.0 for ROCm 5.7.0
##### Fixed
* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
* Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
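The `transform_iterator` fix covers patterns like the following sketch, where the functor's call operator is `__device__`-only:
```
// Sketch: a transform_iterator wrapping a __device__-only functor,
// which previously failed to compile (fixed in 2.18.0).
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

struct square {
    __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> v(100, 2.0f);
    auto first = thrust::make_transform_iterator(v.begin(), square{});
    auto last  = thrust::make_transform_iterator(v.end(), square{});
    float sum = thrust::reduce(first, last, 0.0f);  // 100 * 2^2 = 400
    return sum == 400.0f ? 0 : 1;
}
```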
##### Changed
* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
* Removed references to and workarounds for deprecated hcc
#### rocWMMA 1.2.0
rocWMMA 1.2.0 for ROCm 5.7.0
##### Changed
* Fixed a bug with synchronization
* Updated rocWMMA cmake versioning
#### Tensile 4.38.0
Tensile 4.38.0 for ROCm 5.7.0
##### Added
* Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
* Added user-driven solution selection feature
##### Optimizations
* Enabled LocalSplitU with MFMA for I8 data type
* Optimized K mask code in mfmaIter
* Enabled TailLoop code in NoLoadLoop to prefetch global/local read
* Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
* Optimized DirectToLds test cases to reduce the test duration
##### Changed
* Removed DGEMM NT custom kernels and related test cases
* Changed noTailLoop logic to apply noTailLoop only for NT
* Changed the range of AssertFree0ElementMultiple and Free1
* Unified aStr, bStr generation code in mfmaIter
##### Fixed
* Fixed LocalSplitU mismatch issue for SGEMM
* Fixed BufferStore=0 and Ldc != Ldd case
* Fixed mismatch issue with TailLoop + MatrixInstB > 1
-------------------
## ROCm 5.6.1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
@@ -32,6 +503,36 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
* Using *hipGraphAddMemFreeNode* no longer results in a crash
### Library changes in ROCm 5.6.1
| Library | Version |
|---------|---------|
| hipBLAS | [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.6.1) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.6.1) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.6.1) |
| hipSOLVER | [1.8.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.6.1) |
| hipSPARSE | 2.3.6 ⇒ [2.3.7](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.6.1) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.6.1) |
| rccl | [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.6.1) |
| rocALUTION | [2.1.9](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.6.1) |
| rocBLAS | [3.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.6.1) |
| rocFFT | [1.0.23](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.6.1) |
| rocm-cmake | [0.9.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.6.1) |
| rocPRIM | [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.6.1) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.6.1) |
| rocSOLVER | [3.22.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.6.1) |
| rocSPARSE | [2.5.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.6.1) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.6.1) |
| Tensile | [4.37.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.6.1) |
#### hipSPARSE 2.3.7
hipSPARSE 2.3.7 for ROCm 5.6.1
##### Bugfix
* Reverted an undocumented API change in hipSPARSE 2.3.6 that affected the `hipsparseSpSV_solve` function
-------------------
## ROCm 5.6.0
@@ -531,7 +1032,7 @@ rocWMMA 1.1.0 for ROCm 5.6.0
* Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
* Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
* Added performance gemm samples for half, single, and double precision
* Added rocWMMA cmake versioning
* Added vectorized support in coordinate transforms
* Included ROCm smi for runtime clock rate detection
@@ -1782,7 +2283,7 @@ This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
- Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.

View File

@@ -15,15 +15,15 @@ The release notes for the ROCm platform.
------------------- -------------------
## ROCm 5.6.1 ## ROCm 5.7.0
<!-- markdownlint-disable first-line-h1 --> <!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header --> <!-- markdownlint-disable no-duplicate-header -->
### What's new in this release ### What's new in this release
ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime. ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
## HIP 5.6.1 (for ROCm 5.6.1) Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
### Fixed defects ### Fixed defects

View File

@@ -3,7 +3,7 @@ How ROCm uses PCIe atomics
=========================== ===========================
ROCm PCIe feature and overview base address register (BAR) memory ROCm PCIe feature and overview of BAR memory
====================================================================== ======================================================================
@@ -57,7 +57,7 @@ New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_ * `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_ * `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
In ROCm, we also take advantage of PCIe ID-based ordering technology for P2P when the GPU originates two writes to two different targets: In ROCm, we also take advantage of PCIe ID-based ordering technology for P2P when the GPU originates two writes to two different targets:
| 1. write to another GPU memory, | 1. write to another GPU memory,
@@ -86,7 +86,7 @@ For GFX9 and Vega10 which have Physical Address up 44 bit and 48 bit Virtual add
* BAR4 register: Optional, not a boot device. * BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB. * BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
Here is how our BAR works on GFX 8 GPUs with 40 bit Physical Address Limit :: Here is how our base address register (BAR) works on GFX8 GPUs with a 40-bit physical address limit ::
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1) 11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)
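To inspect how these BARs were actually assigned on a live system, something like the following works (the bus ID ``11:00.0`` comes from the listing above; substitute the one reported for your GPU)::

    # Show the PCI memory regions (BARs) assigned to the GPU at bus ID 11:00.0
    sudo lspci -v -s 11:00.0 | grep -i "memory at"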

View File

@@ -467,11 +467,11 @@ Plotting the train and test loss shows both metrics reducing over training epoch
### Custom model with CIFAR-10 on PyTorch ### Custom model with CIFAR-10 on PyTorch
The CIFAR-10 (Canadian Institute for Advanced Research) dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below. The Canadian Institute for Advanced Research (CIFAR)-10 dataset is a subset of the Tiny Images dataset (which contains 80 million 32x32 images collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck (pickup trucks are excluded). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework, step by step, as illustrated below.
Follow these steps: Follow these steps:
1. Import dependencies, including torch, os, and [torchvision](https://github.com/pytorch/vision). 1. Import dependencies, including `torch`, `os`, and [`torchvision`](https://github.com/pytorch/vision).
```py ```py
import torch import torch
@@ -662,13 +662,13 @@ Follow these steps:
print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy)) print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy))
``` ```
### Case study: TensorFlow with Fashion MNIST ### Case study: TensorFlow with Fashion-MNIST
Fashion MNIST is a dataset that contains 70,000 grayscale images in 10 categories. Fashion-MNIST is a dataset that contains 70,000 grayscale images in 10 categories.
Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts. Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts.
The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion MNIST dataset can be accessed via TensorFlow internal libraries. The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion-MNIST dataset can be accessed via TensorFlow internal libraries.
Access the source code from the following repository: Access the source code from the following repository:
@@ -690,7 +690,7 @@ To understand the code step by step, follow these steps:
print(tf.__version__) print(tf.__version__)
``` ```
3. Load the dataset from the available internal libraries to analyze and train a neural network upon the MNIST Fashion Dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn. 3. Load the dataset from the available internal libraries to analyze and train a neural network upon the Fashion-MNIST dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn.
4. The model is tested against the test set, test_images, and test_labels arrays. 4. The model is tested against the test set, test_images, and test_labels arrays.
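As a quick, standalone check of the dataset-loading step above, the following one-liner is a sketch that assumes TensorFlow is already installed; it simply prints the shapes of two of the four returned NumPy arrays:

```bash
# loads Fashion-MNIST and prints (60000, 28, 28) (10000, 28, 28)
python3 -c 'import tensorflow as tf; (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data(); print(train_images.shape, test_images.shape)'
```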

View File

@@ -1,4 +1,4 @@
# ROCm Linux Filesystem Hierarchy Standard (FHS) reorganization # ROCm Linux Filesystem Hierarchy Standard reorganization
## Introduction ## Introduction

View File

@@ -9,7 +9,7 @@ This document lists and describes the hardware performance counters and the deri
Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation. Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
``` ```
### Graphics register bus management (GRBM) ### GRBM
#### GRBM counters #### GRBM counters
@@ -30,7 +30,7 @@ Preliminary validation of all MI200 performance counters is in progress. Those w
The command processor counters are further classified into fetcher and compute. The command processor counters are further classified into fetcher and compute.
#### Command processor - fetcher (CPF) #### CPF
##### CPF counters ##### CPF counters
@@ -43,7 +43,7 @@ The command processor counters are further classified into fetcher and compute.
| `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle | | `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle |
| `cpf_cpf_tciu_stall[]` | Cycles | CPF TCIU interface is stalled waiting on free tags. | | `cpf_cpf_tciu_stall[]` | Cycles | CPF TCIU interface is stalled waiting on free tags. |
#### Command processor - compute (CPC) #### CPC
##### CPC counters ##### CPC counters
@@ -61,7 +61,7 @@ The command processor counters are further classified into fetcher and compute.
| `cpc_cpc_utcl2iu_stall[]` | Cycles | CPC UTCL2 interface stalled waiting | | `cpc_cpc_utcl2iu_stall[]` | Cycles | CPC UTCL2 interface stalled waiting |
| `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy | | `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy |
### Shader processor input (SPI) ### SPI
#### SPI counters #### SPI counters
@@ -325,7 +325,7 @@ The vector L1 cache subsystem counters are further classified into texture addre
| `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache | | `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache |
| `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache | | `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache |
#### Texture cache arbiter (TCA) #### TCA
| Hardware Counter | Unit | Definition | | Hardware Counter | Unit | Definition |
| :----------------| :------| ------------------------------------------: | | :----------------| :------| ------------------------------------------: |
@@ -400,9 +400,7 @@ The vector L1 cache subsystem counters are further classified into texture addre
| `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) | | `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) |
| `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) | | `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) |
## Abbreviations ## MI200 acronyms
### MI200 abbreviations
| Abbreviation | Meaning | | Abbreviation | Meaning |
| :------------| --------------------------------------------------------------------------------: | | :------------| --------------------------------------------------------------------------------: |

View File

@@ -1,4 +1,4 @@
# GPU Memory # GPU memory
For the HIP reference documentation, see: For the HIP reference documentation, see:

View File

@@ -44,9 +44,9 @@ product lines.
::: :::
:::{grid-item-card} GPU Memory :::{grid-item-card}
:link: gpu_memory **[GPU memory](./gpu-memory.md)**
:link-type: doc
Learn about the different types of memory allocations. Learn about the different types of memory allocations.
::: :::

View File

@@ -1,4 +1,4 @@
# Using the LLVM AddressSanitizer (ASan) on a GPU (beta release) # Using the LLVM ASan on a GPU (beta release)
The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement. The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
@@ -7,7 +7,7 @@ Until now, the LLVM ASan process was only available for traditional purely CPU a
This document provides documentation on using ROCm ASan. This document provides documentation on using ROCm ASan.
For information about LLVM ASan, see [the LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html). For information about LLVM ASan, see [the LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
**Note**: The beta release of LLVM Address Sanitizer for ROCm is currently tested and validated on Ubuntu 20.04. **Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
## Compiling for ASan ## Compiling for ASan
@@ -51,11 +51,11 @@ For a complete ROCm GPU Sanitizer installation, the following must be installed
sudo apt-get install hipfft-asan hipsparse-asan migraphx-asan miopen-hip-asan rocalution-asan rocblas-asan rocfft-asan rocm-core-asan rocsparse-asan hipblaslt-asan mivisionx-asan rocsolver-asan sudo apt-get install hipfft-asan hipsparse-asan migraphx-asan miopen-hip-asan rocalution-asan rocblas-asan rocfft-asan rocm-core-asan rocsparse-asan hipblaslt-asan mivisionx-asan rocsolver-asan
``` ```
**Note**: It is recommended to install all address sanitizer packages. If the optional instrumented math libraries are not installed, the address sanitizer cannot find issues within those libraries. **Note**: It is recommended to install all ASan packages. If the optional instrumented math libraries are not installed, the address sanitizer cannot find issues within those libraries.
## Using AMD-supplied ASan instrumented libraries ## Using AMD-supplied ASan instrumented libraries
ROCm releases have optional packages containing additional address sanitizer instrumented builds of the ROCm libraries usually found in `/opt/rocm-<version>/lib`. The instrumented libraries have names identical to the regular uninstrumented libraries and are located in `/opt/rocm-<version>/lib/asan`. ROCm releases have optional packages containing additional ASan instrumented builds of the ROCm libraries usually found in `/opt/rocm-<version>/lib`. The instrumented libraries have names identical to the regular uninstrumented libraries and are located in `/opt/rocm-<version>/lib/asan`.
These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used, but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan`, and `-g`. These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used, but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan`, and `-g`.
These additional libraries save developers the effort of locating repositories, identifying the correct branch, checking out the correct tags, and otherwise building the libraries from source. They also extend the process's ability to detect addressing errors into the ROCm libraries themselves. These additional libraries save developers the effort of locating repositories, identifying the correct branch, checking out the correct tags, and otherwise building the libraries from source. They also extend the process's ability to detect addressing errors into the ROCm libraries themselves.
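As a rough sketch of how the compile flags and the instrumented libraries fit together (the source file name is illustrative, and pointing the loader at the `asan` directory is an assumption about a typical setup):

```bash
# Instrument host and device code with the options listed above
hipcc -fsanitize=address -shared-libsan -g -o mem_bug mem_bug.hip

# Run against the ASan-instrumented ROCm libraries
LD_LIBRARY_PATH=/opt/rocm-<version>/lib/asan ./mem_bug
```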
@@ -235,7 +235,7 @@ $ rocgdb <path to application>
### Using ASan with a short HIP application ### Using ASan with a short HIP application
Refer to the following example to use address sanitizer with a short HIP application, Refer to the following example to use ASan with a short HIP application:
https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md

View File

@@ -41,10 +41,10 @@ python3 -mvenv .venv
Then open up `_build/html/index.html` in your favorite browser. Then open up `_build/html/index.html` in your favorite browser.
## Build documentation using Visual Studio (VS) Code ## Build documentation using Visual Studio Code
One can put together a productive environment to author documentation and also One can put together a productive environment to author documentation and also
test it locally using VS Code with only a handful of extensions. Even though the test it locally using Visual Studio (VS) Code with only a handful of extensions. Even though the
extension landscape of VS Code is ever-changing, here is one example setup that extension landscape of VS Code is ever-changing, here is one example setup that
proved useful at the time of writing. In it, one can change/add content, build a proved useful at the time of writing. In it, one can change/add content, build a
new version of the docs using a single VS Code Task (or hotkey), see all errors/ new version of the docs using a single VS Code Task (or hotkey), see all errors/

View File

@@ -17,8 +17,7 @@ memory with RDMA capabilities. These interfaces are currently registered as a
These interfaces are used to optimize inter-node MPI message communication. These interfaces are used to optimize inter-node MPI message communication.
This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open
MPI project is an open source implementation of the Message Passing Interface MPI project is an open-source implementation of MPI that is developed and maintained by a consortium of academic, research,
(MPI) that is developed and maintained by a consortium of academic, research,
and industry partners. and industry partners.
Several MPI implementations can be made ROCm-aware by compiling them with Several MPI implementations can be made ROCm-aware by compiling them with
@@ -72,7 +71,7 @@ make -j $(nproc)
make -j $(nproc) install make -j $(nproc) install
``` ```
The [communication libraries tables](#communication-libraries) The [communication libraries tables](../reference/libraries/gpu-libraries/communication.md)
document the compatibility of UCX versions with ROCm versions. document the compatibility of UCX versions with ROCm versions.
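For reference, the UCX build above is typically configured with ROCm support along these lines (the install prefix is illustrative, and `--with-rocm` assumes a default ROCm install location):

```bash
./configure --prefix=$HOME/ucx-rocm --with-rocm=/opt/rocm
```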
## Install Open MPI ## Install Open MPI
@@ -157,7 +156,7 @@ Unified Collective Communication Library (UCC) component in Open MPI.
For this, the UCC library has to be configured and compiled with ROCm For this, the UCC library has to be configured and compiled with ROCm
support. support.
Please note the compatibility [tables](#communication-libraries) Please note the compatibility tables in the [communication libraries](../reference/libraries/gpu-libraries/communication.md)
for UCC versions with the various ROCm versions. for UCC versions with the various ROCm versions.
An example of configuring UCC and Open MPI with ROCm support An example of configuring UCC and Open MPI with ROCm support

View File

@@ -405,7 +405,7 @@ SIMD pipelines, memory information, and instruction set architecture:
![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system") ![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system")
For a complete list of architecture (LLVM target) names, refer to For a complete list of architecture (LLVM target) names, refer to
[Linux Support](../../about/compatibility/linux-support.md) and [Windows Support](../../about/compatibility/windows-support.md). [Linux support](../../about/compatibility/linux-support.md) and [Windows support](../../about/compatibility/windows-support.md).
### Testing inter-device bandwidth ### Testing inter-device bandwidth

View File

@@ -1,4 +1,4 @@
# AI libraries # Artificial intelligence libraries
::::{grid} 1 1 2 2 ::::{grid} 1 1 2 2
:gutter: 1 :gutter: 1

View File

@@ -9,7 +9,6 @@ ROCm libraries for fast Fourier transforms (FFTs) are as follows:
rocFFT is an AMD GPU-optimized library for FFT. rocFFT is an AMD GPU-optimized library for FFT.
* {doc}`Documentation <rocfft:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT) * [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md) * [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md)
@@ -21,7 +20,6 @@ hipFFT is a compatibility layer for GPU accelerated FFT optimized for AMD GPUs
using rocFFT. hipFFT allows for a common interface for other non-AMD GPU using rocFFT. hipFFT allows for a common interface for other non-AMD GPU
FFT libraries. FFT libraries.
* {doc}`Documentation <hipfft:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT) * [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md) * [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md)

View File

@@ -436,7 +436,7 @@ See the complete sample code for global buffer overflow
You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options: You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options:
* `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including No-loop and Cross-team reductions. * `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels, including no-loop and cross-team reductions.
* `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution. * `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution.
@@ -448,13 +448,13 @@ You can use the clang compiler option `-fopenmp-target-fast` for kernel optimiza
Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported: Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:
* No-Loop * No-loop
* Big-Jump-Loop * Big-jump-loop
* Cross-Team (Xteam) Reductions * Cross-team reductions
To enable the generation of specialized kernels, follow these guidelines: To enable the generation of specialized kernels, follow these guidelines:
* Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the No-Loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the Big-Jump-Loop kernel instead of the No-Loop kernel. * Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the no-loop kernel. If the `num_teams` clause is a user requirement, Clang tries to generate the big-jump-loop kernel instead of the no-loop kernel.
* Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`. * Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`.
@@ -464,11 +464,11 @@ To enable the generation of specialized kernels, follow these guidelines:
#### No-loop kernel generation #### No-loop kernel generation
The No-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance. The no-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as `target teams distribute parallel for`. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
#### Big-jump-loop kernel generation #### Big-jump-loop kernel generation
A No-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the Big-Jump-Loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause. A no-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the big-jump-loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
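As a sketch of how these specialized kernels are requested in practice, assuming the ROCm-provided `clang` and an illustrative source file and target architecture:

```bash
# saxpy.c contains a '#pragma omp target teams distribute parallel for' loop;
# -fopenmp-target-fast lets the compiler emit the no-loop or big-jump-loop kernel
clang -O3 -fopenmp --offload-arch=gfx90a -fopenmp-target-fast saxpy.c -o saxpy
```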
#### Cross-team optimized reduction kernel generation #### Cross-team optimized reduction kernel generation

View File

@@ -656,7 +656,7 @@ of target triple and the target GPU (along with the associated target features).
modified to query this structure to identify a compatible image based on the modified to query this structure to identify a compatible image based on the
capability of the current system. capability of the current system.
#### Unified shared memory (USM) #### Unified shared memory
The following OpenMP pragma is available on MI200, and it must be executed with The following OpenMP pragma is available on MI200, and it must be executed with
`xnack+` support. `xnack+` support.
@@ -665,7 +665,7 @@ The following OpenMP pragma is available on MI200, and it must be executed with
omp requires unified_shared_memory omp requires unified_shared_memory
``` ```
For more details on USM refer to the {ref}`openmp_usm` section of the OpenMP For more details on unified shared memory, refer to the {ref}`openmp_usm` section of the OpenMP
Guide. Guide.
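A minimal sketch of running such an application follows, assuming `HSA_XNACK=1` is the runtime switch that enables `xnack+` on supported GPUs (the binary name is illustrative):

```bash
# enable xnack+ so that 'omp requires unified_shared_memory' can be honored
HSA_XNACK=1 ./my_usm_app
```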
### Support status of other Clang options ### Support status of other Clang options

View File

@@ -119,7 +119,7 @@ subtrees:
- file: tutorials/install/docker.md - file: tutorials/install/docker.md
title: ROCm Docker containers title: ROCm Docker containers
- file: tutorials/install/pytorch-install.md - file: tutorials/install/pytorch-install.md
title: Pytorch for ROCm title: PyTorch for ROCm
- file: tutorials/install/tensorflow-install.md - file: tutorials/install/tensorflow-install.md
title: TensorFlow for ROCm title: TensorFlow for ROCm
- file: tutorials/install/magma-install.md - file: tutorials/install/magma-install.md
@@ -199,6 +199,8 @@ subtrees:
title: MI200 title: MI200
- file: conceptual/gpu-arch/mi250.md - file: conceptual/gpu-arch/mi250.md
title: MI250 title: MI250
- file: conceptual/gpu-memory.md
title: GPU memory
- file: conceptual/compiler-disambiguation.md - file: conceptual/compiler-disambiguation.md
title: Compiler disambiguation title: Compiler disambiguation
- file: conceptual/file-reorg.md - file: conceptual/file-reorg.md
@@ -214,7 +216,7 @@ subtrees:
- file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst - file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
title: ROCm & PCIe atomics title: ROCm & PCIe atomics
- file: conceptual/ai-pytorch-inception.md - file: conceptual/ai-pytorch-inception.md
title: Inception v3 with Pytorch title: Inception v3 with PyTorch
- file: conceptual/ai-migraphx-optimization.md - file: conceptual/ai-migraphx-optimization.md
title: Inference optimization with MIGraphX title: Inference optimization with MIGraphX

View File

@@ -43,8 +43,7 @@ described in the ROCm Installation Guide at {ref}`linux_group_permissions`.
**Q: Can I install PyTorch directly on bare metal?** **Q: Can I install PyTorch directly on bare metal?**
Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to
Option 2: Install PyTorch Using Wheels Package in the section the wheels package option described in [Installing PyTorch](../tutorials/install/pytorch-install.md) for more information.
{ref}`install-pytorch-using-wheels` of this guide for more information.
**Q: How do I profile PyTorch workloads?** **Q: How do I profile PyTorch workloads?**

View File

@@ -1,370 +1,399 @@
# PyTorch Installation for ROCm # Installing PyTorch for ROCm
## PyTorch [PyTorch](https://pytorch.org/) is an open-source tensor library designed for deep learning. PyTorch on
ROCm provides mixed-precision and large-scale training using our
[MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) and
[RCCL](https://github.com/ROCmSoftwarePlatform/rccl) libraries.
PyTorch is an open-source machine learning Python library, primarily To install [PyTorch for ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), you have the following options:
differentiated by Tensor computing with GPU acceleration and a type-based
automatic differentiation. Other advanced features include:
* Support for distributed training * [Use a Docker image with PyTorch pre-installed](#using-a-docker-image-with-pytorch-pre-installed)
* Native ONNX support (recommended)
* C++ front-end * [Use a wheels package](#using-a-wheels-package)
* The ability to deploy at scale using TorchServe * [Use the PyTorch ROCm base Docker image](#using-the-pytorch-rocm-base-docker-image)
* A production-ready deployment mechanism through TorchScript * [Use the PyTorch upstream Docker file](#using-the-pytorch-upstream-docker-file)
### Installing PyTorch For hardware, software, and third-party framework compatibility between ROCm and PyTorch, refer to:
To install ROCm on bare metal, refer to the sections * [GPU and OS support (Linux)](../../about/compatibility/linux-support.md)
[GPU and OS Support (Linux)](../../about/compatibility/linux-support.md) and * [Compatibility](../../about/compatibility/3rd-party-support-matrix.md)
[Compatibility](../../about/compatibility/index.md) for hardware, software and
3rd-party framework compatibility between ROCm and PyTorch. The recommended
option to get a PyTorch environment is through Docker. However, installing the
PyTorch wheels package on bare metal is also supported.
#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-Installed ## Using a Docker image with PyTorch pre-installed
Using Docker gives you portability and access to a prebuilt Docker container 1. Download the latest public PyTorch Docker image
that has been rigorously tested within AMD. This might also save on the ([https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch)).
compilation time and should perform as it did when tested without facing
potential installation issues.
Follow these steps:
1. Pull the latest public PyTorch Docker image.
```bash ```bash
docker pull rocm/pytorch:latest docker pull rocm/pytorch:latest
``` ```
Optionally, you may download a specific and supported configuration with You can also download a specific and supported configuration with different user-space ROCm
different user-space ROCm versions, PyTorch versions, and supported operating versions, PyTorch versions, and operating systems.
systems. To download the PyTorch Docker image, refer to
[https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch).
2. Start a Docker container using the downloaded image. 2. Start a Docker container using the image.
```bash ```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size 8G rocm/pytorch:latest
``` ```
```{note} :::{note}
This will automatically download the image if it does not exist on the host. This will automatically download the image if it does not exist on the host. You can also pass the `-v`
You can also pass the -v argument to mount any data directories from the host argument to mount any data directories from the host onto the container.
onto the container. :::
(install_pytorch_wheels)=
## Using a wheels package
PyTorch supports the ROCm platform by providing tested wheels packages. To access this feature, go
to [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/). In the interactive
table, choose ROCm from the _Compute Platform_ row.
1. Choose one of the following three options:
**Option 1:**
a. Download a base Docker image with the correct user-space ROCm version.
| Base OS | Docker image | Link to Docker image |
|----------------|-----------------------------|----------------|
| Ubuntu 20.04 | `rocm/dev-ubuntu-20.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-20.04](https://hub.docker.com/r/rocm/dev-ubuntu-20.04) |
| Ubuntu 22.04 | `rocm/dev-ubuntu-22.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-22.04](https://hub.docker.com/r/rocm/dev-ubuntu-22.04) |
| CentOS 7 | `rocm/dev-centos-7` | [https://hub.docker.com/r/rocm/dev-centos-7](https://hub.docker.com/r/rocm/dev-centos-7) |
b. Pull the selected image.
```bash
docker pull rocm/dev-ubuntu-20.04:latest
``` ```
(install-pytorch-using-wheels)= c. Start a Docker container using the downloaded image.
#### Option 2: Install PyTorch Using Wheels Package ```bash
PyTorch supports the ROCm platform by providing tested wheels packages. To
access this feature, refer to
[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
and choose the "ROCm" compute platform. The following image is a matrix from <https://pytorch.org/> that illustrates the installation compatibility between ROCm and the PyTorch build.
![Pytorch installation matrix](../../data/tutorials/install/magma-install/magma006.png "Pytorch installation matrix")
To install PyTorch using the wheels package, follow these installation steps:
1. Choose one of the following options:
a. Obtain a base Docker image with the correct user-space ROCm version
installed from
[https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04).
or
b. Download a base OS Docker image and install ROCm following the
installation directions in the [Installation](../../tutorials/install/linux/index) section. ROCm 5.2 is installed in
this example, as supported by the installation matrix from
<https://pytorch.org/>.
or
c. Install on bare metal. Skip to Step 3.
```bash
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
```
2. Start the Docker container, if not installing on bare metal.
```dockerfile
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
``` ```
3. Install any dependencies needed for installing the wheels package. **Option 2:**
Select a base OS Docker image (Check [OS compatibility](../../about/compatibility/linux-support.md))
Pull selected base OS image (Ubuntu 20.04 for example)
```docker
docker pull ubuntu:20.04
```
Start a Docker container using the downloaded image
```docker
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04
```
Install ROCm using the directions in the [Installation section](../install/linux/install-options.md).
**Option 3:**
Install on bare metal. Check [OS compatibility](../../about/compatibility/linux-support.md) and install ROCm using the
directions in the [Installation section](../install/linux/install-options.md).
2. Install the required dependencies for the wheels package.
```bash ```bash
sudo apt update sudo apt update
sudo apt install libjpeg-dev python3-dev sudo apt install libjpeg-dev python3-dev python3-pip
pip3 install wheel setuptools pip3 install wheel setuptools
``` ```
4. Install Torch, TorchVision, and TorchAudio as specified by the installation 3. Install `torch`, `torchvision`, and `torchaudio`, as specified in the [installation matrix](https://pytorch.org/get-started/locally/).
matrix.
```{note} :::{note}
ROCm 5.2 PyTorch wheel in the command below is shown for reference. The following command uses the ROCm 5.6 PyTorch wheel. If you want a different version of ROCm,
``` modify the command accordingly.
:::
```bash ```bash
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6/
``` ```
#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image 4. (Optional) Use MIOpen kdb files with ROCm PyTorch wheels.
A prebuilt base Docker image is used to build PyTorch in this option. The base PyTorch uses [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) for machine learning
Docker has all dependencies installed, including: primitives, which are compiled into kernels at runtime. Runtime compilation causes a small warm-up
phase when starting PyTorch, and MIOpen kdb files contain precompiled kernels that can speed up
application warm-up phases. For more information, refer to the
{doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in
a specific location with respect to the PyTorch installation path. A helper script simplifies this task by
taking the ROCm version and GPU architecture as inputs. This works for Ubuntu and CentOS.
You can download the helper script here:
[install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/ install_kdb_files_for_pytorch_wheels.sh), or use:
`wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh`
After installing ROCm PyTorch wheels, run the following code:
```bash
#Optional; replace 'gfx90a' with your architecture
export GFX_ARCH=gfx90a
#Optional
export ROCM_VERSION=5.5
./install_kdb_files_for_pytorch_wheels.sh
```
## Using the PyTorch ROCm base Docker image
The pre-built base Docker image has all dependencies installed, including:
* ROCm * ROCm
* Torchvision * Torchvision
* Conda packages * Conda packages
* Compiler toolchain * The compiler toolchain
Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, which is used by the build
the build scripts utilize that to determine the build environment configuration. scripts to determine the configuration of the build environment.
Follow these steps: 1. Download the Docker image. This is the base image, which does not contain PyTorch.
1. Obtain the Docker image.
```bash ```bash
docker pull rocm/pytorch:latest-base docker pull rocm/pytorch:latest-base
``` ```
The above will download the base container, which does not contain PyTorch. 2. Start a Docker container using the downloaded image.
2. Start a Docker container using the image.
```bash ```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
``` ```
You can also pass the -v argument to mount any data directories from the host You can also pass the `-v` argument to mount any data directories from the host onto the container.
onto the container.
3. Clone the PyTorch repository. 3. Clone the PyTorch repository.
```bash ```bash
cd ~ cd ~
git clone https://github.com/pytorch/pytorch.git git clone https://github.com/pytorch/pytorch.git
cd pytorch cd pytorch
git submodule update --init --recursive git submodule update --init --recursive
``` ```
4. Build PyTorch for ROCm. 4. Set the ROCm architecture (optional). The Docker image tag is `rocm/pytorch:latest-base`.
```{note} :::{note}
By default in the `rocm/pytorch:latest-base`, PyTorch builds for these By default in the `rocm/pytorch:latest-base` image, PyTorch builds simultaneously for the following
architectures simultaneously: architectures:
* gfx900 * gfx900
* gfx906 * gfx906
* gfx908 * gfx908
* gfx90a * gfx90a
* gfx1030 * gfx1030
``` :::
5. To determine your AMD uarch, run: If you want to compile _only_ for your microarchitecture (uarch), run:
```bash
rocminfo | grep gfx
```
6. In the event you want to compile only for your uarch, use:
```bash ```bash
export PYTORCH_ROCM_ARCH=<uarch> export PYTORCH_ROCM_ARCH=<uarch>
``` ```
`<uarch>` is the architecture reported by the `rocminfo` command. Where `<uarch>` is the architecture reported by the `rocminfo` command.
7. Build PyTorch using the following command: To find your uarch, run:
```bash ```bash
./.jenkins/pytorch/build.sh rocminfo | grep gfx
``` ```
This will first convert PyTorch sources for HIP compatibility and build the 5. Build PyTorch.
```bash
./.ci/pytorch/build.sh
```
This converts PyTorch sources for
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework. PyTorch framework.
8. Alternatively, build PyTorch by issuing the following commands: To check if your build is successful, run:
```bash ```bash
python3 tools/amd_build/build_amd.py echo $? # returns 0 on success
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
``` ```
#### Option 4: Install Using PyTorch Upstream Docker File ## Using the PyTorch upstream Docker file
Instead of using a prebuilt base Docker image, you can build a custom base If you don't want to use a prebuilt base Docker image, you can build a custom base Docker image
Docker image using scripts from the PyTorch repository. This will utilize a using scripts from the PyTorch repository. This uses a standard Docker image from operating system
standard Docker image from operating system maintainers and install all the maintainers and installs all the required dependencies, including:
dependencies required to build PyTorch, including
* ROCm * ROCm
* Torchvision * Torchvision
* Conda packages * Conda packages
* Compiler toolchain * The compiler toolchain
Follow these steps: 1. Clone the PyTorch repository.
1. Clone the PyTorch repository on the host.
```bash ```bash
cd ~ cd ~
git clone https://github.com/pytorch/pytorch.git git clone https://github.com/pytorch/pytorch.git
cd pytorch cd pytorch
git submodule update --init --recursive git submodule update --init --recursive
``` ```
2. Build the PyTorch Docker image. 2. Build the PyTorch Docker image.
```bash ```bash
cd.circleci/docker cd .ci/docker
./build.sh pytorch-linux-bionic-rocm<version>-py3.7 ./build.sh pytorch-linux-<os-version>-rocm<rocm-version>-py<python-version> -t rocm/pytorch:build_from_dockerfile
# eg. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7
``` ```
This should be complete with a message "Successfully build `<image_id>`." Where:
* `<os-version>`: `ubuntu20.04` (or `focal`), `ubuntu22.04` (or `jammy`), `centos7.5`, or `centos9`
* `<rocm-version>`: `5.4`, `5.5`, or `5.6`
* `<python-version>`: `3.8`-`3.11`
3. Start a Docker container using the image: To verify that your image was successfully created, run:
`docker image ls rocm/pytorch:build_from_dockerfile`
If successful, the output looks like this:
```bash ```bash
docker run -it --cap-add=SYS_PTRACE --security-opt REPOSITORY TAG IMAGE ID CREATED SIZE
seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add rocm/pytorch build_from_dockerfile 17071499be47 2 minutes ago 32.8GB
video --ipc=host --shm-size 8G <image_id>
``` ```
You can also pass -v argument to mount any data directories from the host 3. Start a Docker container using the image with the mounted PyTorch folder.
onto the container.
```bash
4. Clone the PyTorch repository. docker run -it --user root --cap-add=SYS_PTRACE --security-opt \
seccomp=unconfined --device=/dev/kfd --device=/dev/dri \
--group-add video --ipc=host --shm-size 8G \
-v ~/pytorch:/pytorch rocm/pytorch:build_from_dockerfile
```
You can also pass the `-v` argument to mount any data directories from the host onto the container.
4. Go to the PyTorch directory.
```bash ```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd pytorch cd pytorch
git submodule update --init --recursive
``` ```
5. Build PyTorch for ROCm. 5. Set the ROCm architecture.
```{note} To determine your AMD architecture, run:
By default in the `rocm/pytorch:latest-base`, PyTorch builds for these
architectures simultaneously:
* gfx900
* gfx906
* gfx908
* gfx90a
* gfx1030
```
6. To determine your AMD uarch, run:
```bash ```bash
rocminfo | grep gfx rocminfo | grep gfx
``` ```
7. If you want to compile only for your uarch: The result looks like this (for `gfx1030` architecture):
```bash
Name: gfx1030
Name: amdgcn-amd-amdhsa--gfx1030
```
Set the `PYTORCH_ROCM_ARCH` environment variable to specify the architectures you want to
build PyTorch for.
```bash ```bash
export PYTORCH_ROCM_ARCH=<uarch> export PYTORCH_ROCM_ARCH=<uarch>
``` ```
`<uarch>` is the architecture reported by the `rocminfo` command. where `<uarch>` is the architecture reported by the `rocminfo` command.
8. Build PyTorch using: 6. Build PyTorch.
```bash ```bash
./.jenkins/pytorch/build.sh ./.ci/pytorch/build.sh
``` ```
This will first convert PyTorch sources to be HIP compatible and then build the This converts PyTorch sources for
PyTorch framework. [HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework.
Alternatively, build PyTorch by issuing the following commands: To check if your build is successful, run:
```bash ```bash
python3 tools/amd_build/build_amd.py echo $? # returns 0 on success
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
```
### Test the PyTorch Installation
You can use PyTorch unit tests to validate a PyTorch installation. If using a
prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official
wheels package, these tests are already run on those configurations.
Alternatively, you can manually run the unit tests to validate the PyTorch
installation fully.
Follow these steps:
1. Test if PyTorch is installed and accessible by importing the torch package in
Python.
```{note}
Do not run in the PyTorch git folder.
``` ```
## Testing the PyTorch installation
You can use PyTorch unit tests to validate your PyTorch installation. If you used a
**prebuilt PyTorch Docker image from AMD ROCm DockerHub** or installed an
**official wheels package**, validation tests are not necessary.
If you want to manually run unit tests to validate your PyTorch installation fully, follow these steps:
1. Import the torch package in Python to test if PyTorch is installed and accessible.
:::{note}
Do not run the following command in the PyTorch git folder.
:::
```bash ```bash
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure' python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
``` ```
2. Test if the GPU is accessible from PyTorch. In the PyTorch framework, 2. Check if the GPU is accessible from PyTorch. In the PyTorch framework, `torch.cuda` is a generic way
`torch.cuda` is a generic mechanism to access the GPU; it will access an AMD to access the GPU. This can only access an AMD GPU if one is available.
GPU only if available.
```bash ```bash
python3 -c 'import torch; print(torch.cuda.is_available())' python3 -c 'import torch; print(torch.cuda.is_available())'
``` ```
3. Run the unit tests to validate the PyTorch installation fully. Run the 3. Run unit tests to validate the PyTorch installation fully.
following command from the PyTorch home directory:
:::{note}
You must run the following command from the PyTorch home directory.
:::
```bash ```bash
BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh PYTORCH_TEST_WITH_ROCM=1 python3 test/run_test.py --verbose \
--include test_nn test_torch test_cuda test_ops \
test_unary_ufuncs test_binary_ufuncs test_autograd
``` ```
This ensures that even for wheel installs in a non-controlled environment, This command ensures that the required environment variable is set to skip certain unit tests for
the required environment variable will be set to skip certain unit tests for ROCm. This also applies to wheel installs in a non-controlled environment.
ROCm.
```{note} :::{note}
Make sure the PyTorch source code is corresponding to the PyTorch wheel or Make sure your PyTorch source code corresponds to the PyTorch wheel or the installation in the
installation in the Docker image. Incompatible PyTorch source code might give Docker image. Incompatible PyTorch source code can give errors when running unit tests.
errors when running the unit tests. :::
```
This will first install some dependencies, such as a supported [torchvision](https://pytorch.org/vision/stable/index.html) Some tests may be skipped, as appropriate, based on your system configuration. ROCm doesn't
version for PyTorch. Torchvision is used in some PyTorch tests for loading support all PyTorch features; tests that evaluate unsupported features are skipped. Other tests might
models. Next, this will run all the unit tests. be skipped, depending on the host or GPU memory and the number of available GPUs.
```{note} If the compilation and installation are correct, all tests will pass.
Some tests may be skipped, as appropriate, based on your system
configuration. All features of PyTorch are not supported on ROCm, and the
tests that evaluate these features are skipped. In addition, depending on the
host memory, or the number of available GPUs, other tests may be skipped. No
test should fail if the compilation and installation are correct.
```
4. Run individual unit tests with the following command: 4. Run individual unit tests.
```bash ```bash
PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
``` ```
`test_nn.py` can be replaced with any other test set. You can replace `test_nn.py` with any other test set.
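Beyond the unit tests, a one-line smoke test can confirm that tensors actually land on the GPU. This sketch assumes at least one AMD GPU is visible to PyTorch:

```bash
# allocate a tensor on the GPU, double it, and print the result (8.0)
python3 -c 'import torch; x = torch.ones(4, device="cuda"); print((x * 2).sum().item())'
```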
### Run a Basic PyTorch Example ## Running a basic PyTorch example
The PyTorch examples repository provides basic examples that exercise the The PyTorch examples repository provides basic examples that exercise the functionality of your
functionality of the framework. MNIST (Modified National Institute of Standards functionality of the framework.
and Technology) database is a collection of handwritten digits that may be used
to train a Convolutional Neural Network for handwriting recognition.
Alternatively, ImageNet is a database of images used to train a network for
visual object recognition.
Follow these steps: Two of our favorite testing databases are:
* **MNIST** (Modified National Institute of Standards and Technology): A database of handwritten
digits that can be used to train a Convolutional Neural Network for **handwriting recognition**.
* **ImageNet**: A database of images that can be used to train a network for
**visual object recognition**.
### MNIST PyTorch example
1. Clone the PyTorch examples repository. 1. Clone the PyTorch examples repository.
@@ -372,44 +401,45 @@ Follow these steps:
git clone https://github.com/pytorch/examples.git git clone https://github.com/pytorch/examples.git
``` ```
2. Run the MNIST example. 2. Go to the MNIST example folder.
```bash ```bash
cd examples/mnist cd examples/mnist
``` ```
3. Follow the instructions in the `README` file in this folder. In this case: 3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash ```bash
pip3 install -r requirements.txt
python3 main.py python3 main.py
``` ```
4. Run the ImageNet example. This generates the following output:
```bash
...
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.010128
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.001348
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.005261
Test set: Average loss: 0.0252, Accuracy: 9921/10000 (99%)
```
### ImageNet PyTorch example
1. Clone the PyTorch examples repository (if you didn't already do this step in the preceding MNIST example).
```bash
git clone https://github.com/pytorch/examples.git
```
2. Go to the ImageNet example folder.
```bash ```bash
cd examples/imagenet cd examples/imagenet
``` ```
5. Follow the instructions in the `README` file in this folder. In this case: 3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash ```bash
pip3 install -r requirements.txt
python3 main.py python3 main.py
``` ```
## Using MIOpen kdb files with ROCm PyTorch wheels
PyTorch uses MIOpen for machine learning primitives. These primitives are compiled into kernels at runtime. Runtime compilation causes a small warm-up phase when starting PyTorch. MIOpen kdb files contain precompiled kernels that can speed up the warm-up phase of an application. More information is available on the {doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in a specific location with respect to the PyTorch installation path. A helper script simplifies this task for the user. The script takes in the ROCm version and the user's GPU architecture as inputs, and works for Ubuntu and CentOS.
Helper script: [install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh)
Usage:
After installing ROCm PyTorch wheels:
1. [Optional] `export GFX_ARCH=gfx90a`
2. [Optional] `export ROCM_VERSION=5.5`
3. `./install_kdb_files_for_pytorch_wheels.sh`
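Put together, a typical invocation looks like the following sketch (the architecture and ROCm version are illustrative):

```bash
wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh
export GFX_ARCH=gfx90a   # optional; match your GPU architecture
export ROCM_VERSION=5.5  # optional
bash ./install_kdb_files_for_pytorch_wheels.sh
```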

View File

@@ -26,7 +26,7 @@ version support matrix:
The following sections contain options for installing TensorFlow. The following sections contain options for installing TensorFlow.
#### Option 1: Using a Docker image #### Option 1: using a Docker image
To install ROCm on bare metal, follow the section To install ROCm on bare metal, follow the section
[Installation (Linux)](../../tutorials/install/linux/os-native/install). The recommended option to [Installation (Linux)](../../tutorials/install/linux/os-native/install). The recommended option to

View File

@@ -1,4 +1,4 @@
# Installation using the command-line interface (CLI) # Installation using the CLI
The steps to install the HIP SDK for Windows are described in this document. The steps to install the HIP SDK for Windows are described in this document.

View File

@@ -1,4 +1,4 @@
# Uninstallation using the command-line interface (CLI) # Uninstallation using the CLI
The steps to uninstall the HIP SDK for Windows are described in this document. The steps to uninstall the HIP SDK for Windows are described in this document.

View File

@@ -81,7 +81,7 @@ This release consists of the following OpenMP enhancements:
* Enable new device RTL in libomptarget as default.
* New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism` (see the sketch after this list).
* Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
* Initial implementation of optimized cross-team sum reduction for float and double type scalars.
* Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
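A sketch of how the new flag is used (this compile line is illustrative, not taken from the release notes; it assumes ROCm's `amdclang` and an MI200-class target):
```bash
# Offload compile with the aggressive device defaults enabled in one flag.
# -fopenmp-target-fast implies the three assume/ignore flags listed above.
amdclang -O3 -fopenmp --offload-arch=gfx90a -fopenmp-target-fast saxpy.c -o saxpy
```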

View File

@@ -1,78 +1,80 @@
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### Release highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* The ROCm 6.0 release will show MI50s as "under maintenance" for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md).
* No new features or performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance (EOM) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance period will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue as per the ROCm release cadence for gfx906 GPUs until EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: the currently available implementation, which relies on hostcalls and requires the system to support PCIe atomics. This is the default scheme.
* “buffered”: an implementation that leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during kernel execution, and the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
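A minimal sketch of the buffered scheme (the demo program and file name are illustrative; only the `-mprintf-kind` option comes from this release):
```bash
# Build and run a toy HIP program with the buffered printf scheme.
cat > printf_demo.hip <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello() {
    // Under the buffered scheme, printf returns 0 on success or -1 when
    // the fixed-size buffer is full, not the number of characters printed.
    printf("hello from thread %d\n", (int)threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();
    (void)hipDeviceSynchronize();  // printing happens after the kernel finishes
    return 0;
}
EOF
hipcc -mprintf-kind=buffered printf_demo.hip -o printf_demo
./printf_demo
```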
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU at [LLVM AddressSanitizer User Guide](./docs/conceptual/using_gpu_sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
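As a hedged sketch of enabling GPU ASan (the flags shown are the commonly documented ones; treat the user guide above as authoritative):
```bash
# Illustrative build of a HIP application with AddressSanitizer
# instrumentation for both host and device code. Assumes a gfx90a
# target with XNACK support; adjust --offload-arch for your GPU.
hipcc -g -fsanitize=address -shared-libsan \
      --offload-arch=gfx90a:xnack+ my_app.hip -o my_app
# XNACK must be enabled at run time for device-side checking.
HSA_XNACK=1 ./my_app
```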
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed Catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP directed Catch2 tests with XNACK+
#### HIP 5.7.0
@@ -80,73 +82,73 @@ The following defects are fixed in ROCm v5.7,
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and the rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux:
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* The HIP API `hipPointerGetAttributes` could return an invalid value if the input memory pointer was not allocated through any HIP API on device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from the `hipDeviceProp_t` structure
* Addition of new fields in the `hipDeviceProp_t` structure:
  * `maxTexture1D`
  * `maxTexture2D`
  * `maxTexture1DLayered`
  * `maxTexture2DLayered`
  * `sharedMemPerMultiprocessor`
  * `deviceOverlap`
  * `asyncEngineCount`
  * `surfaceAlignment`
  * `unifiedAddressing`
  * `computePreemptionSupported`
  * `hostRegisterSupported`
  * `uuid`
* Removal of deprecated `-hip-hcc` code from the HIP code tree
* Correct `hipArray` usage in HIP APIs such as `hipMemcpyAtoH` and `hipMemcpyHtoA`
* `HIPMEMCPY_3D` fields correction to avoid truncation of `size_t` to `unsigned int` inside `hipMemcpy3D()`
* Renaming of `memoryType` in the `hipPointerAttribute_t` structure to `type`
* Correct `hipGetLastError` to return the last error instead of the last API call's return code
* Update `hipExternalSemaphoreHandleDesc` to add `unsigned int reserved[16]`
* Correct handling of flag values in `hipIpcOpenMemHandle` for `hipIpcMemLazyEnablePeerAccess`
* Remove `hiparray*` and make it opaque with `hipArray_t`