Link and formatting fixes (#2482)

Lisa
2023-09-20 09:55:21 -06:00
committed by GitHub
parent acde6284a0
commit 940d2933ff
27 changed files with 1057 additions and 528 deletions

.gitignore

@@ -13,7 +13,6 @@ _doxygen/
_readthedocs/
# avoid duplicating contributing.md due to conf.py
CHANGELOG.md
docs/contributing.md
docs/release.md
docs/CHANGELOG.md
docs/contribute/index.md
docs/about/release-notes.md


@@ -21,76 +21,78 @@ The release notes for the ROCm platform.
### Release Highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md)
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than hostcall-based implementation and is intended to be a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall scheme that returns the number of characters printed.
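The return-value semantics of the buffered scheme can be sketched with a small host-side toy model (plain Python, hypothetical names; the real buffer holds packed printf arguments in device memory and is drained by the runtime after the kernel completes):

```python
# Toy model of the "buffered" printf semantics described above --
# an illustration only, not the actual runtime implementation.

class BufferedPrintf:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes  # fixed size, chosen up front
        self.used = 0
        self.records = []

    def printf(self, fmt, *args):
        # Fixed-size and non-circular: once full, further calls
        # produce no additional output.
        size = len(fmt) + 8 * len(args)  # crude size estimate
        if self.used + size > self.capacity:
            return -1  # failure: buffer full (the hostcall scheme would
                       # instead return the number of characters printed)
        self.used += size
        self.records.append((fmt, args))
        return 0       # success

    def flush(self):
        # The runtime performs the actual printing after the kernel
        # finishes execution.
        out = [fmt % args for fmt, args in self.records]
        self.records.clear()
        return out

buf = BufferedPrintf(capacity_bytes=32)
assert buf.printf("x = %d\n", 42) == 0   # fits in the buffer
assert buf.printf("a very long message that overflows the buffer\n") == -1
print(buf.flush())  # ['x = 42\n']
```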
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU at [LLVM AddressSanitizer User Guide](./docs/conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed catch2 tests with XNACK+
* Multiple test failures and test hangs observed in hip-directed catch2 tests with xnack+
#### HIP 5.7.0
@@ -98,79 +100,78 @@ The following defects are fixed in ROCm v5.7,
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* HIP API `hipPointerGetAttributes` could return invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
  * maxTexture1D
  * maxTexture2D
  * maxTexture1DLayered
  * maxTexture2DLayered
  * sharedMemPerMultiprocessor
  * deviceOverlap
  * asyncEngineCount
  * surfaceAlignment
  * unifiedAddressing
  * computePreemptionSupported
  * hostRegisterSupported
  * uuid
* Removal of deprecated code -hip-hcc codes from hip code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Correct hipGetLastError to return the last error instead of last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Remove hiparray* and make it opaque with hipArray_t
### Library changes in ROCm 5.7.0
| Library | Version |
|---------|---------|
@@ -211,11 +212,11 @@ hipCUB 2.13.1 for ROCm 5.7.0
##### Changed
* CUB backend references CUB and Thrust version 2.0.1.
* Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
* Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
##### Known issues
- `debug_synchronous` no longer works on CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
- `DeviceReduce::Sum` does not compile on CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
@@ -228,11 +229,11 @@ hipFFT 1.0.12 for ROCm 5.7.0
##### Added
* Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
##### Changed
* Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
#### hipSOLVER 1.8.1
@@ -240,7 +241,7 @@ hipSOLVER 1.8.1 for ROCm 5.7.0
##### Changed
* Changed hipsolver-test sparse input data search paths to be relative to the test executable
#### hipSPARSE 2.3.8
@@ -248,10 +249,10 @@ hipSPARSE 2.3.8 for ROCm 5.7.0
##### Improved
* Fix compilation failures when using cusparse 12.1.0 backend
* Fix compilation failures when using cusparse 12.0.0 backend
* Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
* Minor improvements
#### MIOpen 2.19.0
@@ -259,17 +260,17 @@ MIOpen 2.19.0 for ROCm 5.7.0
##### Added
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changed
* Tuning results for MLIR on ROCm 5.5
* Bumping MLIR commit to 5.5.0 release tag
##### Fixed
* Fix 3d convolution Host API bug
* [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
#### RCCL 2.17.1-1
@@ -277,19 +278,19 @@ RCCL 2.17.1-1 for ROCm 5.7.0
##### Changed
* Compatibility with NCCL 2.17.1-1
* Performance tuning for some collective operations
##### Added
* Minor improvements to MSCCL codepath
* NCCL_NCHANNELS_PER_PEER support
* Improved compilation performance
* Support for gfx94x
##### Fixed
* Potential race-condition during ncclSocketClose()
#### rocALUTION 2.1.11
@@ -297,11 +298,11 @@ rocALUTION 2.1.11 for ROCm 5.7.0
##### Added
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed OpenMP runtime issue with Windows toolchain
#### rocBLAS 3.1.0
@@ -309,33 +310,33 @@ rocBLAS 3.1.0 for ROCm 5.7.0
##### Added
* yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details.
* rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems.
##### Fixed
* Make offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  * Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  * Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  * Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  * General: set_vector, get_vector, set_matrix, get_matrix
  * Related fixes: internal scalar loads with > 32-bit offsets
* Fix in-place functionality for all trtri sizes
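The overflow class addressed here can be shown with a short sketch (hypothetical sizes, purely for illustration; a column-major element offset is j * lda + i, which exceeds a signed 32-bit integer for sufficiently large leading dimensions):

```python
# Why 64-bit offset math matters: with a large leading dimension, the
# element offset j * lda + i no longer fits in a signed 32-bit integer.
# The values below are hypothetical, chosen only to demonstrate the wrap.

INT32_MAX = 2**31 - 1

def as_int32(x):
    """Value of x after truncation to a signed 32-bit integer."""
    return ((x + 2**31) % 2**32) - 2**31

lda, j, i = 1 << 20, 1 << 12, 0   # leading dimension 1,048,576; column 4096
offset = lda * j + i              # exact offset: 4,294,967,296

assert offset > INT32_MAX         # too large for 32-bit arithmetic...
assert as_int32(offset) == 0      # ...which would silently wrap it to 0
```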
##### Changed
* dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory
* enhanced reporting of installation issues caused by runtime libraries (Tensile)
* standardized internal rocblas C++ interface across most functions
##### Deprecated
* Removal of `__STDC_WANT_IEC_60559_TYPES_EXT__` define in a future release
##### Dependencies
* Optional use of AOCL BLIS 4.0 on Linux for clients
* Optional build tool only dependency on python psutil
#### rocFFT 1.0.24
@@ -343,16 +344,16 @@ rocFFT 1.0.24 for ROCm 5.7.0
##### Optimizations
* Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
##### Added
* Implemented a solution map version converter and finished the first conversion from version 0 to version 1. Version 1 removes some incorrect kernels (sbrc/sbcr using half_lds).
##### Changed
* Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux.
* Moved library kernel cache to lib/rocFFT directory.
#### rocm-cmake 0.10.0
@@ -360,8 +361,8 @@ rocm-cmake 0.10.0 for ROCm 5.7.0
##### Added
* Added ROCMTest module
* ROCMCreatePackage: Added support for ASAN packages
#### rocPRIM 2.13.1
@@ -369,14 +370,14 @@ rocPRIM 2.13.1 for ROCm 5.7.0
##### Changed
* Deprecated configuration `radix_sort_config` for device-level radix sort as it no longer matches the algorithm's parameters. New configuration `radix_sort_config_v2` is preferred instead.
* Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan now is the only available implementation.
* The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
* Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. New configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
##### Fixed
* Fixed build issue caused by missing header in `thread/thread_search.hpp`.
#### rocRAND 2.10.17
@@ -384,14 +385,14 @@ rocRAND 2.10.17 for ROCm 5.7.0
##### Added
* MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
* Experimental HIP-CPU feature
* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
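As an aside, MT19937 is the same Mersenne Twister algorithm that backs CPython's built-in `random` module, so its seedable, reproducible behavior is easy to sketch on the host (illustration only; this does not reproduce rocRAND's seeding or output stream):

```python
# Two MT19937 generators with the same seed produce identical streams.
# CPython's random.Random is documented to use the Mersenne Twister.
import random

gen_a = random.Random(1234)
gen_b = random.Random(1234)

seq_a = [gen_a.getrandbits(32) for _ in range(5)]
seq_b = [gen_b.getrandbits(32) for _ in range(5)]

assert seq_a == seq_b  # same seed -> identical 32-bit output stream
```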
##### Changed
* Python 2.7 is no longer officially supported.
#### rocSOLVER 3.23.0
@@ -399,19 +400,19 @@ rocSOLVER 3.23.0 for ROCm 5.7.0
##### Added
* LU factorization without pivoting for block tridiagonal matrices:
  * GEBLTTRF_NPVT now supports interleaved\_batched format
* Linear system solver without pivoting for block tridiagonal matrices:
  * GEBLTTRS_NPVT now supports interleaved\_batched format
##### Fixed
* Fixed stack overflow in sparse tests on Windows
##### Changed
* Changed rocsolver-test sparse input data search paths to be relative to the test executable
* Changed build scripts to default to compressed debug symbols in Debug builds
#### rocSPARSE 2.5.4
@@ -419,12 +420,12 @@ rocSPARSE 2.5.4 for ROCm 5.7.0
##### Added
* Added more mixed precisions for SpMV, (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed a bug in csrsm and bsrsm
##### Known Issues
@@ -434,15 +435,15 @@ In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accur
rocThrust 2.18.0 for ROCm 5.7.0
##### Fixed
* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
* Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
##### Changed
* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
* Removed references to and workarounds for deprecated hcc
#### rocWMMA 1.2.0
@@ -450,8 +451,8 @@ rocWMMA 1.2.0 for ROCm 5.7.0
##### Changed
* Fixed a bug with synchronization
* Updated rocWMMA cmake versioning
#### Tensile 4.38.0
@@ -459,29 +460,29 @@ Tensile 4.38.0 for ROCm 5.7.0
##### Added
* Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
* Added user-driven solution selection feature
##### Optimizations
* Enabled LocalSplitU with MFMA for I8 data type
* Optimized K mask code in mfmaIter
* Enabled TailLoop code in NoLoadLoop to prefetch global/local read
* Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
* Optimized DirectToLds test cases to reduce the test duration
##### Changed
* Removed DGEMM NT custom kernels and related test cases
* Changed noTailLoop logic to apply noTailLoop only for NT
* Changed the range of AssertFree0ElementMultiple and Free1
* Unified aStr, bStr generation code in mfmaIter
##### Fixed
* Fixed LocalSplitU mismatch issue for SGEMM
* Fixed BufferStore=0 and Ldc != Ldd case
* Fixed mismatch issue with TailLoop + MatrixInstB > 1
-------------------
@@ -2282,7 +2283,7 @@ This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
- Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.


@@ -21,7 +21,7 @@ The release notes for the ROCm platform.
### What's new in this release
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.


@@ -15,6 +15,477 @@ The release notes for the ROCm platform.
-------------------
## ROCm 5.7.0
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### Release Highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md)
* No new features or performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs through Q2 2024 (end of maintenance (EOM) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distro / operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than the hostcall-based implementation and is intended as a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* `hostcall`: The currently available implementation, which relies on hostcalls and therefore requires the system to support PCIe atomics. It is the default scheme.
* `buffered`: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during kernel execution, and the runtime then handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall scheme that returns the number of characters printed.
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU at [LLVM AddressSanitizer User Guide](./docs/conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
#### Fixed defects
The following defects are fixed in ROCm v5.7:
* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP-directed Catch2 tests with XNACK+
* Multiple test failures and test hangs observed in HIP-directed Catch2 tests with XNACK+
#### HIP 5.7.0
##### Optimizations
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
* `hipMallocMipmappedArray` for allocating a mipmapped array on the device
* `hipFreeMipmappedArray` for freeing a mipmapped array on the device
* `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
* `hipMipmappedArrayCreate` for creating a mipmapped array
* `hipMipmappedArrayDestroy` for destroying a mipmapped array
* `hipMipmappedArrayGetLevel` for getting a mipmap level of a mipmapped array
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* The HIP API `hipPointerGetAttributes` can return an invalid value if the input memory pointer was not allocated through any HIP API on the device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
* maxTexture1D
* maxTexture2D
* maxTexture1DLayered
* maxTexture2DLayered
* sharedMemPerMultiprocessor
* deviceOverlap
* asyncEngineCount
* surfaceAlignment
* unifiedAddressing
* computePreemptionSupported
* hostRegisterSupported
* uuid
* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of `size_t` to `unsigned int` inside hipMemcpy3D()
* Renaming of `memoryType` in the hipPointerAttribute_t structure to `type`
* Correct hipGetLastError to return the last error instead of the last API call's return code
* Update hipExternalSemaphoreHandleDesc to add `unsigned int reserved[16]`
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Removal of hiparray*, making it opaque with hipArray_t
### Library changes in ROCm 5.7.0
| Library | Version |
|---------|---------|
| hipBLAS | 0.54.0 ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.0) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.0) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.0) |
| hipSOLVER | 1.8.0 ⇒ [1.8.1](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.0) |
| hipSPARSE | 2.3.7 ⇒ [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.0) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.0) |
| rccl | 2.15.5 ⇒ [2.17.1-1](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.7.0) |
| rocALUTION | 2.1.9 ⇒ [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.0) |
| rocBLAS | 3.0.0 ⇒ [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.0) |
| rocFFT | 1.0.23 ⇒ [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.0) |
| rocm-cmake | 0.9.0 ⇒ [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.0) |
| rocPRIM | 2.13.0 ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.0) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.0) |
| rocSOLVER | 3.22.0 ⇒ [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.0) |
| rocSPARSE | 2.5.2 ⇒ [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.0) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.0) |
| rocWMMA | [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.0) |
| Tensile | 4.37.0 ⇒ [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.0) |
#### hipBLAS 1.1.0
hipBLAS 1.1.0 for ROCm 5.7.0
##### Changed
- Updated documentation requirements
##### Dependencies
- Dependency rocSOLVER now depends on rocSPARSE
#### hipCUB 2.13.1
hipCUB 2.13.1 for ROCm 5.7.0
##### Changed
* CUB backend references CUB and Thrust version 2.0.1.
* Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
* Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
##### Known issues
- `debug_synchronous` no longer works on CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
- `DeviceReduce::Sum` does not compile on CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
- `DeviceHistogram::HistogramEven` fails on CUDA platform for `[LevelT, SampleIteratorT] = [int, int]`.
- `DeviceHistogram::MultiHistogramEven` fails on CUDA platform for `[LevelT, SampleIteratorT] = [int, int/unsigned short/float/double]` and `[LevelT, SampleIteratorT] = [float, double]`.
#### hipFFT 1.0.12
hipFFT 1.0.12 for ROCm 5.7.0
##### Added
* Implemented the `hipfftXtMakePlanMany`, `hipfftXtGetSizeMany`, and `hipfftXtExec` APIs to allow requesting half-precision transforms.
##### Changed
* Added a `--precision` argument to the benchmark/test clients. `--double` is still accepted but is deprecated as a method to request a double-precision transform.
#### hipSOLVER 1.8.1
hipSOLVER 1.8.1 for ROCm 5.7.0
##### Changed
* Changed hipsolver-test sparse input data search paths to be relative to the test executable
#### hipSPARSE 2.3.8
hipSPARSE 2.3.8 for ROCm 5.7.0
##### Improved
* Fixed compilation failures when using the cuSPARSE 12.1.0 backend
* Fixed compilation failures when using the cuSPARSE 12.0.0 backend
* Fixed compilation failures when using cuSPARSE 10.1 (non-update versions) as the backend
* Minor improvements
#### MIOpen 2.19.0
MIOpen 2.19.0 for ROCm 5.7.0
##### Added
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changed
* Tuning results for MLIR on ROCm 5.5
* Bumping MLIR commit to 5.5.0 release tag
##### Fixed
* Fixed a 3D convolution Host API bug
* [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
#### RCCL 2.17.1-1
RCCL 2.17.1-1 for ROCm 5.7.0
##### Changed
* Compatibility with NCCL 2.17.1-1
* Performance tuning for some collective operations
##### Added
* Minor improvements to MSCCL codepath
* NCCL_NCHANNELS_PER_PEER support
* Improved compilation performance
* Support for gfx94x
##### Fixed
* Potential race-condition during ncclSocketClose()
#### rocALUTION 2.1.11
rocALUTION 2.1.11 for ROCm 5.7.0
##### Added
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed OpenMP runtime issue with Windows toolchain
#### rocBLAS 3.1.0
rocBLAS 3.1.0 for ROCm 5.7.0
##### Added
* YAML lock-step argument scanning for the rocblas-bench and rocblas-test clients. See the Programmer's Guide for details.
* rocblas-gemm-tune, used to find the best-performing GEMM kernel for each of a given set of GEMM problems.
##### Fixed
* Made offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
* Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
* Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
* Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
* General: set_vector, get_vector, set_matrix, get_matrix
* Related fixes: internal scalar loads with > 32-bit offsets
* Fixed in-place functionality for all trtri sizes
##### Changed
* `dot` when using `rocblas_pointer_mode_host` is now synchronous to match legacy BLAS, as it stores results in host memory
* Enhanced reporting of installation issues caused by runtime libraries (Tensile)
* Standardized the internal rocBLAS C++ interface across most functions
##### Deprecated
* Removal of the `__STDC_WANT_IEC_60559_TYPES_EXT__` define in a future release
##### Dependencies
* Optional use of AOCL BLIS 4.0 on Linux for clients
* Optional build tool only dependency on python psutil
#### rocFFT 1.0.24
rocFFT 1.0.24 for ROCm 5.7.0
##### Optimizations
* Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
##### Added
* Implemented a solution map version converter and finished the first conversion from version 0 to version 1, where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
##### Changed
* Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux.
* Moved library kernel cache to lib/rocFFT directory.
#### rocm-cmake 0.10.0
rocm-cmake 0.10.0 for ROCm 5.7.0
##### Added
* Added ROCMTest module
* ROCMCreatePackage: Added support for ASAN packages
#### rocPRIM 2.13.1
rocPRIM 2.13.1 for ROCm 5.7.0
##### Changed
* Deprecated configuration `radix_sort_config` for device-level radix sort, as it no longer matches the algorithm's parameters. The new configuration `radix_sort_config_v2` is preferred instead.
* Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan is now the only available implementation.
* The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
* Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. The new configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
##### Fixed
* Fixed build issue caused by missing header in `thread/thread_search.hpp`.
#### rocRAND 2.10.17
rocRAND 2.10.17 for ROCm 5.7.0
##### Added
* MT19937 pseudorandom number generator based on M. Matsumoto and T. Nishimura, 1998, "Mersenne Twister: a 623-dimensionally equidistributed uniform pseudorandom number generator".
* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
* Experimental HIP-CPU feature
* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
##### Changed
* Python 2.7 is no longer officially supported.
#### rocSOLVER 3.23.0
rocSOLVER 3.23.0 for ROCm 5.7.0
##### Added
* LU factorization without pivoting for block tridiagonal matrices:
* GEBLTTRF_NPVT now supports the `interleaved_batched` format
* Linear system solver without pivoting for block tridiagonal matrices:
* GEBLTTRS_NPVT now supports the `interleaved_batched` format
##### Fixed
* Fixed stack overflow in sparse tests on Windows
##### Changed
* Changed rocsolver-test sparse input data search paths to be relative to the test executable
* Changed build scripts to default to compressed debug symbols in Debug builds
#### rocSPARSE 2.5.4
rocSPARSE 2.5.4 for ROCm 5.7.0
##### Added
* Added more mixed precisions for SpMV: (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
* Added support for gfx940, gfx941 and gfx942
##### Improved
* Fixed a bug in csrsm and bsrsm
##### Known issues
* In csritilu0, the algorithm `rocsparse_itilu0_alg_sync_split_fusion` has some accuracy issues to investigate with XNACK enabled. The fallback is `rocsparse_itilu0_alg_sync_split`.
#### rocThrust 2.18.0
rocThrust 2.18.0 for ROCm 5.7.0
##### Fixed
* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
* Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
##### Changed
* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
* Removed references to and workarounds for deprecated hcc
#### rocWMMA 1.2.0
rocWMMA 1.2.0 for ROCm 5.7.0
##### Changed
* Fixed a bug with synchronization
* Updated rocWMMA cmake versioning
#### Tensile 4.38.0
Tensile 4.38.0 for ROCm 5.7.0
##### Added
* Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
* Added user-driven solution selection feature
##### Optimizations
* Enabled LocalSplitU with MFMA for I8 data type
* Optimized K mask code in mfmaIter
* Enabled TailLoop code in NoLoadLoop to prefetch global/local read
* Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
* Optimized DirectToLds test cases to reduce the test duration
##### Changed
* Removed DGEMM NT custom kernels and related test cases
* Changed noTailLoop logic to apply noTailLoop only for NT
* Changed the range of AssertFree0ElementMultiple and Free1
* Unified aStr, bStr generation code in mfmaIter
##### Fixed
* Fixed LocalSplitU mismatch issue for SGEMM
* Fixed BufferStore=0 and Ldc != Ldd case
* Fixed mismatch issue with TailLoop + MatrixInstB > 1
-------------------
## ROCm 5.6.1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### What's new in this release
ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
### Fixed defects
* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
* Using `hipGraphAddMemFreeNode` no longer results in a crash
### Library changes in ROCm 5.6.1
| Library | Version |
|---------|---------|
| hipBLAS | [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.6.1) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.6.1) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.6.1) |
| hipSOLVER | [1.8.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.6.1) |
| hipSPARSE | 2.3.6 ⇒ [2.3.7](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.6.1) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.6.1) |
| rccl | [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.6.1) |
| rocALUTION | [2.1.9](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.6.1) |
| rocBLAS | [3.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.6.1) |
| rocFFT | [1.0.23](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.6.1) |
| rocm-cmake | [0.9.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.6.1) |
| rocPRIM | [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.6.1) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.6.1) |
| rocSOLVER | [3.22.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.6.1) |
| rocSPARSE | [2.5.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.6.1) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.6.1) |
| Tensile | [4.37.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.6.1) |
#### hipSPARSE 2.3.7
hipSPARSE 2.3.7 for ROCm 5.6.1
##### Bugfix
* Reverted an undocumented API change in hipSPARSE 2.3.6 that affected hipsparseSpSV_solve function
-------------------
## ROCm 5.6.0
#### rocWMMA 1.1.0
rocWMMA 1.1.0 for ROCm 5.6.0
##### Added
* Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
* Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
* Added performance gemm samples for half, single, and double precision
* Added rocWMMA cmake versioning
* Added vectorized support in coordinate transforms
* Included ROCm smi for runtime clock rate detection
This release consists of the following OpenMP enhancements:
- Enable new device RTL in libomptarget as default.
- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
- Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
- Initial implementation of optimized cross-team sum reduction for float and double type scalars.
- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.


-------------------
## ROCm 5.7.0
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### What's new in this release
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
### Fixed defects


How ROCm uses PCIe atomics
===========================
ROCm PCIe feature and overview of BAR memory
======================================================================
New PCIe endpoints with support beyond AMD Ryzen and EPYC CPUs; Intel Haswell or newer CPUs:
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU originates two writes to two different targets:
| 1. write to another GPU memory,
For GFX9 and Vega10, which have physical addresses up to 44 bits and 48-bit virtual addresses:
* BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)


Plotting the train and test loss shows both metrics reducing over training epochs.
### Custom model with CIFAR-10 on PyTorch
The Canadian Institute for Advanced Research (CIFAR)-10 dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below.
Follow these steps:
1. Import dependencies, including Torch, OS, and [TorchVision](https://github.com/pytorch/vision).
```py
import torch
```

```py
print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy))
```
### Case study: TensorFlow with Fashion-MNIST
Fashion-MNIST is a dataset that contains 70,000 grayscale images in 10 categories.
Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts.
The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion-MNIST dataset can be accessed via TensorFlow internal libraries.
Access the source code from the following repository:
To understand the code step by step, follow these steps:

```py
print(tf.__version__)
```
3. Load the dataset from the available internal libraries to analyze and train a neural network upon the Fashion-MNIST dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn.
4. The model is tested against the test set, test_images, and test_labels arrays.


# ROCm Linux Filesystem Hierarchy Standard reorganization
## Introduction


This document lists and describes the hardware performance counters and the derived metrics.
Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
### GRBM
#### GRBM counters
The command processor counters are further classified into fetcher and compute.
#### CPF
##### CPF counters
| `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle |
| `cpf_cpf_tciu_stall[]` | Cycles | CPF TCIU interface is stalled waiting on free tags. |
#### CPC
##### CPC counters
| `cpc_cpc_utcl2iu_stall[]` | Cycles | CPC UTCL2 interface stalled waiting |
| `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy |
### SPI
#### SPI counters
| `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache |
| `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache |
#### TCA
| Hardware Counter | Unit | Definition |
| :----------------| :------| ------------------------------------------: |
| `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) |
| `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) |
## MI200 acronyms
| Abbreviation | Meaning |
| :------------| --------------------------------------------------------------------------------: |


# GPU memory
For the HIP reference documentation, see:


:::
:::{grid-item-card}
**[GPU memory](./gpu-memory.md)**
Learn about the different types of memory allocations.
:::


# Using the LLVM ASan on a GPU (beta release)
The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications.
This document provides documentation on using ROCm ASan.
For information about LLVM ASan, see [the LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
## Compiling for ASan
For a complete ROCm GPU Sanitizer installation, the following must be installed:

```shell
sudo apt-get install hipfft-asan hipsparse-asan migraphx-asan miopen-hip-asan rocalution-asan rocblas-asan rocfft-asan rocm-core-asan rocsparse-asan hipblaslt-asan mivisionx-asan rocsolver-asan
```
**Note**: It is recommended to install all ASan packages. If the optional instrumented math libraries are not installed, the address sanitizer cannot find issues within those libraries.
## Using AMD-supplied ASan instrumented libraries
ROCm releases have optional packages containing additional ASan instrumented builds of the ROCm libraries usually found in `/opt/rocm-<version>/lib`. The instrumented libraries have identical names as the regular uninstrumented libraries and are located in `/opt/rocm-<version>/lib/asan`.
These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used, but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan`, and `-g`.
These prebuilt libraries save developers the effort of locating repositories, identifying the correct branch, checking out the correct tags, and the other steps needed to build the libraries from source. They also extend the ability of the process to detect addressing errors into the ROCm libraries themselves.
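As a minimal sketch of putting these pieces together (the ROCm path and source file below are placeholders, not taken from a specific release), build with the same three options and point the loader at the instrumented `lib/asan` directory:

```bash
# Placeholder path: substitute your installed ROCm version.
ROCM_PATH=/opt/rocm-5.7.0

# Make the loader pick the instrumented libraries over the regular ones.
export LD_LIBRARY_PATH="$ROCM_PATH/lib/asan:$LD_LIBRARY_PATH"
echo "${LD_LIBRARY_PATH%%:*}"

# Compile with the same options used for the instrumented builds
# (shown as a comment; run it with your own source file):
#   hipcc -fsanitize=address -shared-libsan -g app.cpp -o app
```

At runtime, ASan writes its reports to stderr; redirect `2>` to a file to capture them.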
@@ -235,7 +235,7 @@ $ rocgdb <path to application>
### Using ASan with a short HIP application
Refer to the following example to use address sanitizer with a short HIP application,
Refer to the following example to use ASan with a short HIP application:
https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md

View File

@@ -41,10 +41,10 @@ python3 -mvenv .venv
Then open up `_build/html/index.html` in your favorite browser.
## Build documentation using Visual Studio (VS) Code
## Build documentation using Visual Studio Code
One can put together a productive environment to author documentation and also
test it locally using VS Code with only a handful of extensions. Even though the
test it locally using Visual Studio (VS) Code with only a handful of extensions. Even though the
extension landscape of VS Code is ever changing, here is one example setup that
proved useful at the time of writing. In it, one can change/add content, build a
new version of the docs using a single VS Code Task (or hotkey), see all errors/

View File

@@ -17,8 +17,7 @@ memory with RDMA capabilities. These interfaces are currently registered as a
These interfaces are used to optimize inter-node MPI message communication.
This chapter exemplifies how to set up Open MPI with the ROCm platform. The Open
MPI project is an open source implementation of the Message Passing Interface
(MPI) that is developed and maintained by a consortium of academic, research,
MPI project is an open source implementation of MPI that is developed and maintained by a consortium of academic, research,
and industry partners.
Several MPI implementations can be made ROCm-aware by compiling them with
@@ -72,7 +71,7 @@ make -j $(nproc)
make -j $(nproc) install
```
The [communication libraries tables](#communication-libraries)
The [communication libraries tables](../reference/libraries/gpu-libraries/communication.md)
documents the compatibility of UCX versions with ROCm versions.
## Install Open MPI
@@ -157,7 +156,7 @@ Unified Collective Communication Library (UCC) component in Open MPI.
For this, the UCC library has to be configured and compiled with ROCm
support.
Please note the compatibility [tables](#communication-libraries)
Please note the compatibility tables in the [communication libraries](../reference/libraries/gpu-libraries/communication.md)
for UCC versions with the various ROCm versions.
An example for configuring UCC and Open MPI with ROCm support

View File

@@ -405,7 +405,7 @@ SIMD pipelines, memory information, and instruction set architecture:
![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system")
For a complete list of architecture (LLVM target) names, refer to
[Linux Support](../../about/compatibility/linux-support.md) and [Windows Support](../../about/compatibility/windows-support.md).
[Linux support](../../about/compatibility/linux-support.md) and [Windows support](../../about/compatibility/windows-support.md).
### Testing inter-device bandwidth

View File

@@ -1,4 +1,4 @@
# AI libraries
# Artificial intelligence libraries
::::{grid} 1 1 2 2
:gutter: 1

View File

@@ -9,7 +9,6 @@ ROCm libraries for fast Fourier transforms (FFTs) are as follows:
rocFFT is an AMD GPU optimized library for FFT.
* {doc}`Documentation <rocfft:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT)
* [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md)
@@ -21,7 +20,6 @@ hipFFT is a compatibility layer for GPU accelerated FFT optimized for AMD GPUs
using rocFFT. hipFFT allows for a common interface for other non AMD GPU
FFT libraries.
* {doc}`Documentation <hipfft:index>`
* [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT)
* [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md)

View File

@@ -436,7 +436,7 @@ See the complete sample code for global buffer overflow
You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options:
* `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including No-loop and Cross-team reductions.
* `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels, including no-loop and cross-team reductions.
* `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution.
@@ -448,13 +448,13 @@ You can use the clang compiler option `-fopenmp-target-fast` for kernel optimiza
Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:
* No-Loop
* Big-Jump-Loop
* Cross-Team (Xteam) Reductions
* No-loop
* Big-jump-loop
* Cross-team reductions
To enable the generation of specialized kernels, follow these guidelines:
* Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the No-Loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the Big-Jump-Loop kernel instead of the No-Loop kernel.
* Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the no-loop kernel. If the `num_teams` clause is a user requirement, clang tries to generate the big-jump-loop kernel instead of the no-loop kernel.
* Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`.
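For illustration, a compile line consistent with these guidelines (the source file `saxpy.c` and the offload architecture are hypothetical; the flags are the ones described above; the command is only composed here, not run):

```bash
# -fopenmp-target-fast implies -fopenmp-target-ignore-env-vars and the other
# component options described above.
FLAGS="-O3 -fopenmp --offload-arch=gfx90a -fopenmp-target-fast"
echo "clang $FLAGS saxpy.c -o saxpy"
```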
@@ -464,11 +464,11 @@ To enable the generation of specialized kernels, follow these guidelines:
#### No-loop kernel generation
The No-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
The no-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
#### Big-jump-loop kernel generation
A No-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the Big-Jump-Loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
A no-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the big-jump-loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
#### Cross-team optimized reduction kernel generation

View File

@@ -656,7 +656,7 @@ of target triple and the target GPU (along with the associated target features).
modified to query this structure to identify a compatible image based on the
capability of the current system.
#### Unified shared memory (USM)
#### Unified shared memory
The following OpenMP pragma is available on MI200, and it must be executed with
`xnack+` support.
@@ -665,7 +665,7 @@ The following OpenMP pragma is available on MI200, and it must be executed with
omp requires unified_shared_memory
```
For more details on USM refer to the {ref}`openmp_usm` section of the OpenMP
For more details on unified shared memory refer to the {ref}`openmp_usm` section of the OpenMP
Guide.
### Support status of other Clang options

View File

@@ -119,7 +119,7 @@ subtrees:
- file: tutorials/install/docker.md
title: ROCm Docker containers
- file: tutorials/install/pytorch-install.md
title: Pytorch for ROCm
title: PyTorch for ROCm
- file: tutorials/install/tensorflow-install.md
title: TensorFlow for ROCm
- file: tutorials/install/magma-install.md
@@ -199,6 +199,8 @@ subtrees:
title: MI200
- file: conceptual/gpu-arch/mi250.md
title: MI250
- file: conceptual/gpu-memory.md
title: GPU memory
- file: conceptual/compiler-disambiguation.md
title: Compiler disambiguation
- file: conceptual/file-reorg.md
@@ -214,7 +216,7 @@ subtrees:
- file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
title: ROCm & PCIe atomics
- file: conceptual/ai-pytorch-inception.md
title: Inception v3 with Pytorch
title: Inception v3 with PyTorch
- file: conceptual/ai-migraphx-optimization.md
title: Inference optimization with MIGraphX

View File

@@ -43,8 +43,7 @@ described in the ROCm Installation Guide at {ref}`linux_group_permissions`.
**Q: Can I install PyTorch directly on bare metal?**
Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to
Option 2: Install PyTorch Using Wheels Package in the section
{ref}`install-pytorch-using-wheels` of this guide for more information.
Option 2: Install PyTorch Using Wheels Package. See [Installing PyTorch](../tutorials/install/pytorch-install.md) for more information.
**Q: How do I profile PyTorch workloads?**

View File

@@ -1,370 +1,399 @@
# PyTorch Installation for ROCm
# Installing PyTorch for ROCm
## PyTorch
[PyTorch](https://pytorch.org/) is an open-source tensor library designed for deep learning. PyTorch on
ROCm provides mixed-precision and large-scale training using our
[MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) and
[RCCL](https://github.com/ROCmSoftwarePlatform/rccl) libraries.
PyTorch is an open-source machine learning Python library, primarily
differentiated by Tensor computing with GPU acceleration and a type-based
automatic differentiation. Other advanced features include:
To install [PyTorch for ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), you have the following options:
* Support for distributed training
* Native ONNX support
* C++ front-end
* The ability to deploy at scale using TorchServe
* A production-ready deployment mechanism through TorchScript
* [Use a Docker image with PyTorch pre-installed](#using-a-docker-image-with-pytorch-pre-installed)
(recommended)
* [Use a wheels package](#using-a-wheels-package)
* [Use the PyTorch ROCm base Docker image](#using-the-pytorch-rocm-base-docker-image)
* [Use the PyTorch upstream Docker file](#using-the-pytorch-upstream-docker-file)
### Installing PyTorch
For hardware, software, and third-party framework compatibility between ROCm and PyTorch, refer to:
To install ROCm on bare metal, refer to the sections
[GPU and OS Support (Linux)](../../about/compatibility/linux-support.md) and
[Compatibility](../../about/compatibility/index.md) for hardware, software and
3rd-party framework compatibility between ROCm and PyTorch. The recommended
option to get a PyTorch environment is through Docker. However, installing the
PyTorch wheels package on bare metal is also supported.
* [GPU and OS support (Linux)](../../about/compatibility/linux-support.md)
* [Compatibility](../../about/compatibility/3rd-party-support-matrix.md)
#### Option 1 (Recommended): Use Docker Image with PyTorch Pre-Installed
## Using a Docker image with PyTorch pre-installed
Using Docker gives you portability and access to a prebuilt Docker container
that has been rigorously tested within AMD. This can also save compilation time,
and the container should perform as it did when tested, avoiding potential
installation issues.
Follow these steps:
1. Pull the latest public PyTorch Docker image.
1. Download the latest public PyTorch Docker image
([https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch)).
```bash
docker pull rocm/pytorch:latest
```
Optionally, you may download a specific and supported configuration with
different user-space ROCm versions, PyTorch versions, and supported operating
systems. To download the PyTorch Docker image, refer to
[https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch).
You can also download a specific and supported configuration with different user-space ROCm
versions, PyTorch versions, and operating systems.
2. Start a Docker container using the downloaded image.
2. Start a Docker container using the image.
```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size 8G rocm/pytorch:latest
```
```{note}
This will automatically download the image if it does not exist on the host.
You can also pass the -v argument to mount any data directories from the host
onto the container.
:::{note}
This will automatically download the image if it does not exist on the host. You can also pass the `-v`
argument to mount any data directories from the host onto the container.
:::
(install_pytorch_wheels)=
## Using a wheels package
PyTorch supports the ROCm platform by providing tested wheels packages. To access this feature, go
to [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/). In the interactive
table, choose ROCm from the _Compute Platform_ row.
1. Choose one of the following three options:
**Option 1:**
a. Download a base Docker image with the correct user-space ROCm version.
| Base OS | Docker image | Link to Docker image|
|----------------|-----------------------------|----------------|
| Ubuntu 20.04 | `rocm/dev-ubuntu-20.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-20.04](https://hub.docker.com/r/rocm/dev-ubuntu-20.04)
| Ubuntu 22.04 | `rocm/dev-ubuntu-22.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-22.04](https://hub.docker.com/r/rocm/dev-ubuntu-22.04)
| CentOS 7 | `rocm/dev-centos-7` | [https://hub.docker.com/r/rocm/dev-centos-7](https://hub.docker.com/r/rocm/dev-centos-7)
b. Pull the selected image.
```bash
docker pull rocm/dev-ubuntu-20.04:latest
```
(install-pytorch-using-wheels)=
c. Start a Docker container using the downloaded image.
#### Option 2: Install PyTorch Using Wheels Package
PyTorch supports the ROCm platform by providing tested wheels packages. To
access this feature, refer to
[https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/)
and choose the "ROCm" compute platform. The following image is a matrix from <https://pytorch.org/> that illustrates the installation compatibility between ROCm and the PyTorch build.
![Pytorch installation matrix](../../data/tutorials/install/magma-install/magma006.png "Pytorch installation matrix")
To install PyTorch using the wheels package, follow these installation steps:
1. Choose one of the following options:
a. Obtain a base Docker image with the correct user-space ROCm version
installed from
[https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04](https://hub.docker.com/repository/docker/rocm/dev-ubuntu-20.04).
or
b. Download a base OS Docker image and install ROCm following the
installation directions in the [Installation](../../tutorials/install/linux/index) section. ROCm 5.2 is installed in
this example, as supported by the installation matrix from
<https://pytorch.org/>.
or
c. Install on bare metal. Skip to Step 3.
```bash
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
```
2. Start the Docker container, if not installing on bare metal.
```dockerfile
```bash
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
```
3. Install any dependencies needed for installing the wheels package.
**Option 2:**
Select a base OS Docker image (check [OS compatibility](../../about/compatibility/linux-support.md)).
Pull the selected base OS image (Ubuntu 20.04, for example):
```docker
docker pull ubuntu:20.04
```
Start a Docker container using the downloaded image
```docker
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04
```
Install ROCm using the directions in the [Installation section](../install/linux/install-options.md).
**Option 3:**
Install on bare metal. Check [OS compatibility](../../about/compatibility/linux-support.md) and install ROCm using the
directions in the [Installation section](../install/linux/install-options.md).
2. Install the required dependencies for the wheels package.
```bash
sudo apt update
sudo apt install libjpeg-dev python3-dev
sudo apt install libjpeg-dev python3-dev python3-pip
pip3 install wheel setuptools
```
4. Install Torch, TorchVision, and TorchAudio as specified by the installation
matrix.
3. Install `torch`, `torchvision`, and `torchaudio`, as specified in the [installation matrix](https://pytorch.org/get-started/locally/).
```{note}
ROCm 5.2 PyTorch wheel in the command below is shown for reference.
```
:::{note}
The following command uses the ROCm 5.6 PyTorch wheel. If you want a different version of ROCm,
modify the command accordingly.
:::
```bash
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm5.2/
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6/
```
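The index URL in the command above follows a predictable pattern, so targeting a different ROCm version is a one-variable change; a sketch (verify that the chosen version actually exists on the PyTorch index):

```bash
# Compose the nightly wheel index URL for a chosen ROCm version.
ROCM_VER=5.6
INDEX_URL="https://download.pytorch.org/whl/nightly/rocm${ROCM_VER}/"
echo "$INDEX_URL"
# pip3 install --pre torch torchvision torchaudio --index-url "$INDEX_URL"
```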
#### Option 3: Install PyTorch Using PyTorch ROCm Base Docker Image
4. (Optional) Use MIOpen kdb files with ROCm PyTorch wheels.
A prebuilt base Docker image is used to build PyTorch in this option. The base
Docker has all dependencies installed, including:
PyTorch uses [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) for machine learning
primitives, which are compiled into kernels at runtime. Runtime compilation causes a small warm-up
phase when starting PyTorch, and MIOpen kdb files contain precompiled kernels that can speed up
application warm-up phases. For more information, refer to the
{doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in
a specific location with respect to the PyTorch installation path. A helper script simplifies this task by
taking the ROCm version and GPU architecture as inputs. This works for Ubuntu and CentOS.
You can download the helper script here:
[install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh), or use:
`wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh`
After installing ROCm PyTorch wheels, run the following code:
```bash
#Optional; replace 'gfx90a' with your architecture
export GFX_ARCH=gfx90a
#Optional
export ROCM_VERSION=5.5
./install_kdb_files_for_pytorch_wheels.sh
```
## Using the PyTorch ROCm base Docker image
The pre-built base Docker image has all dependencies installed, including:
* ROCm
* Torchvision
* Conda packages
* Compiler toolchain
* The compiler toolchain
Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and
the build scripts utilize that to determine the build environment configuration.
Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, which is used by the build
scripts to determine the configuration of the build environment.
Follow these steps:
1. Obtain the Docker image.
1. Download the Docker image. This is the base image, which does not contain PyTorch.
```bash
docker pull rocm/pytorch:latest-base
```
The above will download the base container, which does not contain PyTorch.
2. Start a Docker container using the image.
2. Start a Docker container using the downloaded image.
```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
```
You can also pass the -v argument to mount any data directories from the host
onto the container.
You can also pass the `-v` argument to mount any data directories from the host onto the container.
3. Clone the PyTorch repository.
```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd pytorch
cd pytorch
git submodule update --init --recursive
```
4. Build PyTorch for ROCm.
4. Set ROCm architecture (optional). The Docker image tag is `rocm/pytorch:latest-base`.
```{note}
By default in the `rocm/pytorch:latest-base`, PyTorch builds for these
architectures simultaneously:
:::{note}
By default in the `rocm/pytorch:latest-base` image, PyTorch builds simultaneously for the following
architectures:
* gfx900
* gfx906
* gfx908
* gfx90a
* gfx1030
```
:::
5. To determine your AMD uarch, run:
```bash
rocminfo | grep gfx
```
6. In the event you want to compile only for your uarch, use:
If you want to compile _only_ for your microarchitecture (uarch), run:
```bash
export PYTORCH_ROCM_ARCH=<uarch>
```
`<uarch>` is the architecture reported by the `rocminfo` command.
Where `<uarch>` is the architecture reported by the `rocminfo` command.
7. Build PyTorch using the following command:
To find your uarch, run:
```bash
./.jenkins/pytorch/build.sh
rocminfo | grep gfx
```
This will first convert PyTorch sources for HIP compatibility and build the
5. Build PyTorch.
```bash
./.ci/pytorch/build.sh
```
This converts PyTorch sources for
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework.
8. Alternatively, build PyTorch by issuing the following commands:
To check if your build is successful, run:
```bash
python3 tools/amd_build/build_amd.py
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
echo $? # should return 0 if success
```
#### Option 4: Install Using PyTorch Upstream Docker File
## Using the PyTorch upstream Docker file
Instead of using a prebuilt base Docker image, you can build a custom base
Docker image using scripts from the PyTorch repository. This will utilize a
standard Docker image from operating system maintainers and install all the
dependencies required to build PyTorch, including
If you don't want to use a prebuilt base Docker image, you can build a custom base Docker image
using scripts from the PyTorch repository. This uses a standard Docker image from operating system
maintainers and installs all the required dependencies, including:
* ROCm
* Torchvision
* Conda packages
* Compiler toolchain
* The compiler toolchain
Follow these steps:
1. Clone the PyTorch repository on the host.
1. Clone the PyTorch repository.
```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd pytorch
cd pytorch
git submodule update --init --recursive
```
2. Build the PyTorch Docker image.
```bash
cd.circleci/docker
./build.sh pytorch-linux-bionic-rocm<version>-py3.7
# eg. ./build.sh pytorch-linux-bionic-rocm3.10-py3.7
cd .ci/docker
./build.sh pytorch-linux-<os-version>-rocm<rocm-version>-py<python-version> -t rocm/pytorch:build_from_dockerfile
```
This should be complete with a message "Successfully build `<image_id>`."
Where:
* `<os-version>`: `ubuntu20.04` (or `focal`), `ubuntu22.04` (or `jammy`), `centos7.5`, or `centos9`
* `<rocm-version>`: `5.4`, `5.5`, or `5.6`
* `<python-version>`: `3.8`-`3.11`
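As a sketch, composing one concrete build target from the placeholders above (this particular combination is illustrative; confirm a matching Dockerfile exists under `.ci/docker` before relying on it):

```bash
# Compose a build target name from the placeholder pattern above.
OS_VERSION=focal
ROCM_VERSION=5.6
PYTHON_VERSION=3.8
TARGET="pytorch-linux-${OS_VERSION}-rocm${ROCM_VERSION}-py${PYTHON_VERSION}"
echo "$TARGET"
# ./build.sh "$TARGET" -t rocm/pytorch:build_from_dockerfile
```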
3. Start a Docker container using the image:
To verify that your image was successfully created, run:
`docker image ls rocm/pytorch:build_from_dockerfile`
If successful, the output looks like this:
```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 8G <image_id>
REPOSITORY TAG IMAGE ID CREATED SIZE
rocm/pytorch build_from_dockerfile 17071499be47 2 minutes ago 32.8GB
```
You can also pass -v argument to mount any data directories from the host
onto the container.
4. Clone the PyTorch repository.
3. Start a Docker container using the image with the mounted PyTorch folder.
```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --user root --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host --shm-size 8G \
  -v ~/pytorch:/pytorch rocm/pytorch:build_from_dockerfile
```
You can also pass the `-v` argument to mount any data directories from the host onto the container.
4. Go to the PyTorch directory.
```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd pytorch
git submodule update --init --recursive
```
5. Build PyTorch for ROCm.
5. Set ROCm architecture.
```{note}
By default in the `rocm/pytorch:latest-base`, PyTorch builds for these
architectures simultaneously:
* gfx900
* gfx906
* gfx908
* gfx90a
* gfx1030
```
6. To determine your AMD uarch, run:
To determine your AMD architecture, run:
```bash
rocminfo | grep gfx
```
7. If you want to compile only for your uarch:
The result looks like this (for `gfx1030` architecture):
```bash
Name: gfx1030
Name: amdgcn-amd-amdhsa--gfx1030
```
Set the `PYTORCH_ROCM_ARCH` environment variable to specify the architectures you want to
build PyTorch for.
```bash
export PYTORCH_ROCM_ARCH=<uarch>
```
`<uarch>` is the architecture reported by the `rocminfo` command.
where `<uarch>` is the architecture reported by the `rocminfo` command.
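Extracting the `gfx` target from `rocminfo` output can also be scripted; a sketch using sample output of the form shown above (on a real system, pipe `rocminfo` itself instead of the sample text):

```bash
# Sample rocminfo-style output for a gfx1030 system (stand-in for `rocminfo`).
sample='  Name:                    gfx1030
  Name:                    amdgcn-amd-amdhsa--gfx1030'

# Take the first gfx token as the build target.
arch=$(printf '%s\n' "$sample" | grep -o 'gfx[0-9a-f]*' | head -n 1)
echo "$arch"
export PYTORCH_ROCM_ARCH="$arch"
```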
8. Build PyTorch using:
6. Build PyTorch.
```bash
./.jenkins/pytorch/build.sh
./.ci/pytorch/build.sh
```
This will first convert PyTorch sources to be HIP compatible and then build the
PyTorch framework.
This converts PyTorch sources for
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework.
Alternatively, build PyTorch by issuing the following commands:
To check if your build is successful, run:
```bash
python3 tools/amd_build/build_amd.py
USE_ROCM=1 MAX_JOBS=4 python3 setup.py install --user
```
### Test the PyTorch Installation
You can use PyTorch unit tests to validate a PyTorch installation. If using a
prebuilt PyTorch Docker image from AMD ROCm DockerHub or installing an official
wheels package, these tests are already run on those configurations.
Alternatively, you can manually run the unit tests to validate the PyTorch
installation fully.
Follow these steps:
1. Test if PyTorch is installed and accessible by importing the torch package in
Python.
```{note}
Do not run in the PyTorch git folder.
```bash
echo $? # should return 0 if success
```
## Testing the PyTorch installation
You can use PyTorch unit tests to validate your PyTorch installation. If you used a
**prebuilt PyTorch Docker image from AMD ROCm DockerHub** or installed an
**official wheels package**, validation tests are not necessary.
If you want to manually run unit tests to validate your PyTorch installation fully, follow these steps:
1. Import the torch package in Python to test if PyTorch is installed and accessible.
:::{note}
Do not run the following command in the PyTorch git folder.
:::
```bash
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
```
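The same import-probe pattern works for any Python module; probing the stdlib `json` module and a deliberately nonexistent one shows both branches of the idiom:

```bash
python3 -c 'import json' 2> /dev/null && echo 'Success' || echo 'Failure'
python3 -c 'import not_a_real_module' 2> /dev/null && echo 'Success' || echo 'Failure'
```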
2. Test if the GPU is accessible from PyTorch. In the PyTorch framework,
`torch.cuda` is a generic mechanism to access the GPU; it will access an AMD
GPU only if available.
2. Check if the GPU is accessible from PyTorch. In the PyTorch framework, `torch.cuda` is a generic way
to access the GPU. This can only access an AMD GPU if one is available.
```bash
python3 -c 'import torch; print(torch.cuda.is_available())'
```
3. Run the unit tests to validate the PyTorch installation fully. Run the
following command from the PyTorch home directory:
3. Run unit tests to validate the PyTorch installation fully.
:::{note}
You must run the following command from the PyTorch home directory.
:::
```bash
BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT:-rocm} ./.jenkins/pytorch/test.sh
PYTORCH_TEST_WITH_ROCM=1 python3 test/run_test.py --verbose \
--include test_nn test_torch test_cuda test_ops \
test_unary_ufuncs test_binary_ufuncs test_autograd
```
This ensures that even for wheel installs in a non-controlled environment,
the required environment variable will be set to skip certain unit tests for
ROCm.
This command ensures that the required environment variable is set to skip certain unit tests for
ROCm. This also applies to wheel installs in a non-controlled environment.
```{note}
Make sure the PyTorch source code is corresponding to the PyTorch wheel or
installation in the Docker image. Incompatible PyTorch source code might give
errors when running the unit tests.
```
:::{note}
Make sure your PyTorch source code corresponds to the PyTorch wheel or the installation in the
Docker image. Incompatible PyTorch source code can give errors when running unit tests.
:::
This will first install some dependencies, such as a supported [torchvision](https://pytorch.org/vision/stable/index.html)
version for PyTorch. Torchvision is used in some PyTorch tests for loading
models. Next, this will run all the unit tests.
Some tests may be skipped, as appropriate, based on your system configuration. ROCm doesn't
support all PyTorch features; tests that evaluate unsupported features are skipped. Other tests might
be skipped, depending on the host or GPU memory and the number of available GPUs.
```{note}
Some tests may be skipped, as appropriate, based on your system
configuration. All features of PyTorch are not supported on ROCm, and the
tests that evaluate these features are skipped. In addition, depending on the
host memory, or the number of available GPUs, other tests may be skipped. No
test should fail if the compilation and installation are correct.
```
If the compilation and installation are correct, all tests will pass.
4. Run individual unit tests with the following command:
4. Run individual unit tests.
```bash
PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
```
`test_nn.py` can be replaced with any other test set.
You can replace `test_nn.py` with any other test set.
### Run a Basic PyTorch Example
## Running a basic PyTorch example
The PyTorch examples repository provides basic examples that exercise the functionality of your
framework.

Two of our favorite testing databases are:

* **MNIST** (Modified National Institute of Standards and Technology): A database of handwritten
  digits that can be used to train a Convolutional Neural Network for **handwriting recognition**.
* **ImageNet**: A database of images that can be used to train a network for
  **visual object recognition**.

### MNIST PyTorch example

Follow these steps:

1. Clone the PyTorch examples repository.
```bash
git clone https://github.com/pytorch/examples.git
```
2. Go to the MNIST example folder.
```bash
cd examples/mnist
```
3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash
pip3 install -r requirements.txt
python3 main.py
```
This generates the following output:
```bash
...
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.010128
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.001348
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.005261
Test set: Average loss: 0.0252, Accuracy: 9921/10000 (99%)
```
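Before running either example, you can confirm that PyTorch actually sees your ROCm device. On ROCm wheels, the HIP backend is exposed through the regular `torch.cuda` API, so the standard device check works unchanged on AMD GPUs. A minimal sketch (assumes the `torch` wheel is installed; it falls back to CPU, so it runs anywhere):

```python
import torch

# On ROCm builds of PyTorch, the HIP backend reports through torch.cuda,
# so this check selects the AMD GPU when one is visible.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# A tiny smoke test: one forward pass through a small linear layer.
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(8, 4, device=device)
y = model(x)
print(tuple(y.shape))  # (8, 2)
```

If this prints a `cuda` device on your ROCm system, the MNIST and ImageNet examples below will train on the GPU.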
### ImageNet PyTorch example
1. Clone the PyTorch examples repository (if you didn't already do this step in the preceding MNIST example).
```bash
git clone https://github.com/pytorch/examples.git
```
2. Go to the ImageNet example folder.
```bash
cd examples/imagenet
```
3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash
pip3 install -r requirements.txt
python3 main.py
```
## Using MIOpen kdb files with ROCm PyTorch wheels
PyTorch uses MIOpen for machine learning primitives. These primitives are compiled into kernels at runtime. Runtime compilation causes a small warm-up phase when starting PyTorch. MIOpen kdb files contain precompiled kernels that can speed up the warm-up phase of an application. More information is available in the {doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in a specific location with respect to the PyTorch installation path. A helper script simplifies this task for the user. The script takes in the ROCm version and user's GPU architecture as inputs, and works for Ubuntu and CentOS.
Helper script: [install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh)
Usage:
After installing ROCm PyTorch wheels:
1. [Optional] `export GFX_ARCH=gfx90a`
2. [Optional] `export ROCM_VERSION=5.5`
3. `./install_kdb_files_for_pytorch_wheels.sh`
The following sections contain options for installing TensorFlow.
#### Option 1: using a Docker image
To install ROCm on bare metal, follow the section
[Installation (Linux)](../../tutorials/install/linux/os-native/install). The recommended option to
# Installation using the CLI
The steps to install the HIP SDK for Windows are described in this document.
# Uninstallation using the CLI
The steps to uninstall the HIP SDK for Windows are described in this document.
This release consists of the following OpenMP enhancements:
* Enable new device RTL in libomptarget as default.
* New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
* Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
* Initial implementation of optimized cross-team sum reduction for float and double type scalars.
* Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### Release highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. AddressSanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/compatibility/linux-support.md) and [Windows](./about/compatibility/windows-support.md).
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
#### Feature updates
##### Non-hostcall HIP printf
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.

Note: This option is less robust than the hostcall-based implementation and is intended to be a workaround when hostcalls do not work.
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during kernel execution, and the runtime then handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
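The buffered scheme's semantics described above can be modeled in a few lines of Python. This is an illustrative sketch of the documented behavior only (a fixed-size, non-circular buffer; 0/-1 return codes; printing deferred until the kernel finishes), not HIP code:

```python
class BufferedPrintf:
    """Toy model of the 'buffered' printf scheme described above."""

    def __init__(self, capacity):
        self.capacity = capacity  # fixed-size, non-circular buffer
        self.buffer = []

    def printf(self, message):
        # Returns 0 on success or -1 once the buffer would overflow --
        # unlike the hostcall scheme, which returns characters printed.
        if len(self.buffer) + len(message) > self.capacity:
            return -1
        self.buffer.extend(message)
        return 0

    def flush(self):
        # The runtime handles the actual printing after kernel execution.
        out = "".join(self.buffer)
        self.buffer.clear()
        return out

p = BufferedPrintf(capacity=8)
print(p.printf("hello"))   # 0: fits in the buffer
print(p.printf("world!"))  # -1: would overflow; output is dropped
print(p.flush())           # hello
```

Note how output written after the buffer fills is simply lost, which is why the hostcall scheme remains the default where PCIe atomics are available.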
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU

The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.

Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.

Refer to the documentation on LLVM ASan with the GPU in the [LLVM AddressSanitizer User Guide](./docs/conceptual/using_gpu_sanitizer.md).

**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
#### Fixed defects

The following defects are fixed in ROCm v5.7:

* Test hangs observed in HMM RCCL
* NoGpuTst test of Catch2 fails with Docker
* Failures observed with non-HMM HIP directed catch2 tests with XNACK+
* Multiple test failures and test hangs observed in hip-directed catch2 tests with xnack+
#### HIP 5.7.0
##### Added
* Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition
* Added new APIs supporting Windows only, under development on Linux
  * `hipMallocMipmappedArray` for allocating a mipmapped array on the device
  * `hipFreeMipmappedArray` for freeing a mipmapped array on the device
  * `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
  * `hipMipmappedArrayCreate` for creating a mipmapped array
  * `hipMipmappedArrayDestroy` for destroying a mipmapped array
  * `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Fixed
##### Known issues
* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* The HIP API `hipPointerGetAttributes` could return an invalid value if the input memory pointer was not allocated through any HIP API on device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
* Removal of `gcnarch` from hipDeviceProp_t structure
* Addition of new fields in hipDeviceProp_t structure
  * maxTexture1D
  * maxTexture2D
  * maxTexture1DLayered
  * maxTexture2DLayered
  * sharedMemPerMultiprocessor
  * deviceOverlap
  * asyncEngineCount
  * surfaceAlignment
  * unifiedAddressing
  * computePreemptionSupported
  * hostRegisterSupported
  * uuid
* Removal of deprecated code -hip-hcc codes from hip code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Correct hipGetLastError to return the last error instead of the last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Remove hiparray* and make it opaque with hipArray_t
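The `hipGetLastError` correction in that list is easiest to see with a small model. The following is a hypothetical Python sketch of the old versus corrected behavior (illustrative only, not the HIP implementation; the names `api_call` and `ErrorState` are made up for this example):

```python
SUCCESS, ERROR_INVALID_VALUE = 0, 1

class ErrorState:
    """Toy model contrasting the old and corrected hipGetLastError behavior."""

    def __init__(self):
        self.last_call_status = SUCCESS  # what the old behavior reported
        self.last_error = SUCCESS        # what the corrected behavior reports

    def api_call(self, status):
        # Every call updates the "last return code"; only failures
        # update the "last error".
        self.last_call_status = status
        if status != SUCCESS:
            self.last_error = status
        return status

state = ErrorState()
state.api_call(ERROR_INVALID_VALUE)  # a failing call records an error
state.api_call(SUCCESS)              # a later call succeeds

# Old behavior: returns the last API call's return code (SUCCESS here),
# silently hiding the earlier failure.
print(state.last_call_status)  # 0

# Corrected ROCm 6.0 behavior: returns the last *error* recorded.
print(state.last_error)  # 1
```

Code that relied on the old return-code behavior may therefore start seeing earlier errors surface after moving to ROCm 6.0.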