diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 482b0d471..2aca9399a 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1 +1 @@
-* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus
+* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus @LisaDelaney
diff --git a/.gitignore b/.gitignore
index 31ae443c2..271c026c0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,6 +13,7 @@ _doxygen/
 _readthedocs/
 # avoid duplicating contributing.md due to conf.py
+CHANGELOG.md
 docs/contributing.md
 docs/release.md
-docs/CHANGELOG.md
+docs/about/release-notes.md
\ No newline at end of file
diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml
index fa6edfd27..e6921ba66 100644
--- a/.markdownlint-cli2.yaml
+++ b/.markdownlint-cli2.yaml
@@ -1,5 +1,7 @@
 config:
   default: true
+  MD004:
+    style: asterisk
   MD013: false
   MD026:
     punctuation: '.,;:!'
@@ -11,6 +13,6 @@ config:
   MD051: false
 ignores:
   - CHANGELOG.md
+  - docs/CHANGELOG.md
   - "{,docs/}{RELEASE,release}.md"
-  - docs/about/release/release_notes.md
   - tools/autotag/templates/**/*.md
diff --git a/.wordlist.txt b/.wordlist.txt
index 59904f9ae..f2576ebe6 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -1,3 +1,8 @@
+GEMM
+autogenerated
+cuFFT
+NVCC
+CPP
 ROCProfiler
 ROCTracer
 ROCdbgapi
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9a48f8466..8d9685d04 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -27,10 +27,10 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
 ### Fixed Defects
-- *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host
-- Enabled xnack+ check in HIP catch2 tests hang when executing tests
-- Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
-- Using *hipGraphAddMemFreeNode* no longer results in a crash
+* *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host
+* Enabled xnack+ check in HIP catch2 tests hang when executing tests
+* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
+* Using *hipGraphAddMemFreeNode* no longer results in a crash
 -------------------
@@ -38,110 +38,111 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
-#### Release Highlights
+
+### Release Highlights
 ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base. A few examples include:
-- New documentation portal at https://rocm.docs.amd.com
-- Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
-- OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
-- Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers
-- New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.
+* New documentation portal at https://rocm.docs.amd.com
+* Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
+* OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
+* Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers
+* New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.
-#### OS and GPU Support Changes
+### OS and GPU Support Changes
-- SLES15 SP5 support was added this release. SLES15 SP3 support was dropped.
-- AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date.
-  - No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7
-  - Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance [EOM])(will be aligned with the closest ROCm release)
-  - Bug fixes during the maintenance will be made to the next ROCm point release
-  - Bug fixes will not be back ported to older ROCm releases for this SKU
-  - Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
+* SLES15 SP5 support was added this release. SLES15 SP3 support was dropped.
+* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date.
+  * No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7
+  * Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance \[EOM])(will be aligned with the closest ROCm release)
+  * Bug fixes during the maintenance will be made to the next ROCm point release
+  * Bug fixes will not be back ported to older ROCm releases for this SKU
+  * Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
-#### AMDSMI CLI 23.0.0.4
+### AMDSMI CLI 23.0.0.4
-##### Added
+#### Added
-- AMDSMI CLI tool enabled for Linux Bare Metal & Guest
+* AMDSMI CLI tool enabled for Linux Bare Metal & Guest
-- Package: amd-smi-lib
-
-##### Known Issues
+* Package: amd-smi-lib
-- not all Error Correction Code (ECC) fields are currently supported
+#### Known Issues
-- RHEL 8 & SLES 15 have extra install steps
+* not all Error Correction Code (ECC) fields are currently supported
-#### Kernel Modules (DKMS)
+* RHEL 8 & SLES 15 have extra install steps
-##### Fixes
+### Kernel Modules (DKMS)
-- Stability fix for multi GPU system reproducilble via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198).
+#### Fixes
+
+* Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198).
 #### HIP 5.6 (For ROCm 5.6)
 ##### Optimizations
-- Consolidation of hipamd, rocclr and OpenCL projects in clr
-- Optimized lock for graph global capture mode
+* Consolidation of hipamd, rocclr and OpenCL projects in clr
+* Optimized lock for graph global capture mode
 ##### Added
-- Added hipRTC support for amd_hip_fp16
-- Added hipStreamGetDevice implementation to get the device associated with the stream
-- Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
-- hipArrayGetInfo for getting information about the specified array
-- hipArrayGetDescriptor for getting 1D or 2D array descriptor
-- hipArray3DGetDescriptor to get 3D array descriptor
+* Added hipRTC support for amd_hip_fp16
+* Added hipStreamGetDevice implementation to get the device associated with the stream
+* Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
+* hipArrayGetInfo for getting information about the specified array
+* hipArrayGetDescriptor for getting 1D or 2D array descriptor
+* hipArray3DGetDescriptor to get 3D array descriptor
 ##### Changed
-- hipMallocAsync to return success for zero size allocation to match hipMalloc
-- Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
-- Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
-- Removed hipBusBandwidth and hipCommander samples from hip-tests
+* hipMallocAsync to return success for zero size allocation to match hipMalloc
+* Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
+* Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
+* Removed hipBusBandwidth and hipCommander samples from hip-tests
 ##### Fixed
-- Fixed regression in hipMemCpyParam3D when offset is applied
+* Fixed regression in hipMemCpyParam3D when offset is applied
 ##### Known Issues
-- Limited testing on xnack+ configuration
-  - Multiple HIP tests failures (gpuvm fault or hangs)
-- hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU
-- Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release
+* Limited testing on xnack+ configuration
+  * Multiple HIP tests failures (gpuvm fault or hangs)
+* hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU
+* Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release
 ##### Upcoming changes in future release
-- Removal of gcnarch from hipDeviceProp_t structure
-- Addition of new fields in hipDeviceProp_t structure
-  - maxTexture1D
-  - maxTexture2D
-  - maxTexture1DLayered
-  - maxTexture2DLayered
-  - sharedMemPerMultiprocessor
-  - deviceOverlap
-  - asyncEngineCount
-  - surfaceAlignment
-  - unifiedAddressing
-  - computePreemptionSupported
-  - uuid
-- Removal of deprecated code
-  - hip-hcc codes from hip code tree
-- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
-- HIPMEMCPY_3D fields correction (unsigned int -> size_t)
-- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
+* Removal of gcnarch from hipDeviceProp_t structure
+* Addition of new fields in hipDeviceProp_t structure
+  * maxTexture1D
+  * maxTexture2D
+  * maxTexture1DLayered
+  * maxTexture2DLayered
+  * sharedMemPerMultiprocessor
+  * deviceOverlap
+  * asyncEngineCount
+  * surfaceAlignment
+  * unifiedAddressing
+  * computePreemptionSupported
+  * uuid
+* Removal of deprecated code
+  * hip-hcc codes from hip code tree
+* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
+* HIPMEMCPY_3D fields correction (unsigned int -> size_t)
+* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
 #### ROCgdb-13 (For ROCm 5.6.0)
 ##### Optimized
-- Improved performances when handling the end of a process with a large number of threads.
+* Improved performances when handling the end of a process with a large number of threads.
 Known Issues
-- On certain configurations, ROCgdb can show the following warning message:
+* On certain configurations, ROCgdb can show the following warning message:
 `warning: Probes-based dynamic linker interface failed. Reverting to original interface.`
@@ -215,17 +216,17 @@ The resulting `a.out` will depend on
 ##### Optimized
-- Improved Test Suite
+* Improved Test Suite
 ##### Added
-- 'end_time' need to be disabled in roctx_trace.txt
+* 'end_time' need to be disabled in roctx_trace.txt
 ##### Fixed
-- rocprof in ROcm/5.4.0 gpu selector broken.
-- rocprof in ROCm/5.4.1 fails to generate kernel info.
-- rocprof clobbers LD_PRELOAD.
+* rocprof in ROcm/5.4.0 gpu selector broken.
+* rocprof in ROCm/5.4.1 fails to generate kernel info.
+* rocprof clobbers LD_PRELOAD.
 ### Library Changes in ROCM 5.6.0
@@ -256,16 +257,16 @@ hipBLAS 1.0.0 for ROCm 5.6.0
 ##### Changed
-- added const qualifier to hipBLAS functions (swap, sbmv, spmv, symv, trsm) where missing
+* added const qualifier to hipBLAS functions (swap, sbmv, spmv, symv, trsm) where missing
 ##### Removed
-- removed support for deprecated hipblasInt8Datatype_t enum
-- removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions
+* removed support for deprecated hipblasInt8Datatype_t enum
+* removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions
 ##### Deprecated
-- in-place trmm is deprecated. It will be replaced by trmm which includes both in-place and
+* in-place trmm is deprecated. It will be replaced by trmm which includes both in-place and
 out-of-place functionality
 #### hipCUB 2.13.1
@@ -274,18 +275,18 @@ hipCUB 2.13.1 for ROCm 5.6.0
 ##### Added
-- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`.
+* Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`.
 ##### Changed
-- CUB backend references CUB and Thrust version 1.17.2.
-- Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`.
-- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
+* CUB backend references CUB and Thrust version 1.17.2.
+* Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`.
+* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
 ##### Known Issues
-- `BlockRadixRankMatch` is currently broken under the rocPRIM backend.
-- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend.
+* `BlockRadixRankMatch` is currently broken under the rocPRIM backend.
+* `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend.
 #### hipFFT 1.0.12
@@ -293,11 +294,11 @@ hipFFT 1.0.12 for ROCm 5.6.0
 ##### Added
-- Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
+* Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
 ##### Changed
-- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
+* Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
 #### hipSOLVER 1.8.0
@@ -305,7 +306,7 @@ hipSOLVER 1.8.0 for ROCm 5.6.0
 ##### Added
-- Added compatibility API with hipsolverRf prefix
+* Added compatibility API with hipsolverRf prefix
 #### hipSPARSE 2.3.6
@@ -313,11 +314,11 @@ hipSPARSE 2.3.6 for ROCm 5.6.0
 ##### Added
-- Added SpGEMM algorithms
+* Added SpGEMM algorithms
 ##### Changed
-- For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE
+* For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE
 #### MIOpen 2.19.0
@@ -325,17 +326,17 @@ MIOpen 2.19.0 for ROCm 5.6.0
 ##### Added
-- ROCm 5.5 support for gfx1101 (Navi32)
+* ROCm 5.5 support for gfx1101 (Navi32)
 ##### Changed
-- Tuning results for MLIR on ROCm 5.5
-- Bumping MLIR commit to 5.5.0 release tag
+* Tuning results for MLIR on ROCm 5.5
+* Bumping MLIR commit to 5.5.0 release tag
 ##### Fixed
-- Fix 3d convolution Host API bug
-- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
+* Fix 3d convolution Host API bug
+* \[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
 #### rccl 2.15.5
@@ -343,25 +344,25 @@ RCCL 2.15.5 for ROCm 5.6.0
 ##### Changed
-- Compatibility with NCCL 2.15.5
-- Unit test executable renamed to rccl-UnitTests
+* Compatibility with NCCL 2.15.5
+* Unit test executable renamed to rccl-UnitTests
 ##### Added
-- HW-topology aware binary tree implementation
-- Experimental support for MSCCL
-- New unit tests for hipGraph support
-- NPKit integration
+* HW-topology aware binary tree implementation
+* Experimental support for MSCCL
+* New unit tests for hipGraph support
+* NPKit integration
 ##### Fixed
-- rocm-smi ID conversion
-- Support for HIP_VISIBLE_DEVICES for unit tests
-- Support for p2p transfers to non (HIP) visible devices
+* rocm-smi ID conversion
+* Support for HIP_VISIBLE_DEVICES for unit tests
+* Support for p2p transfers to non (HIP) visible devices
 ##### Removed
-- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
+* Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench
 #### rocALUTION 2.1.9
@@ -369,7 +370,7 @@ rocALUTION 2.1.9 for ROCm 5.6.0
 ##### Improved
-- Fixed synchronization issues in level 1 routines
+* Fixed synchronization issues in level 1 routines
 #### rocBLAS 3.0.0
@@ -377,40 +378,40 @@ rocBLAS 3.0.0 for ROCm 5.6.0
 ##### Optimizations
-- Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
-- Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
+* Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256.
+* Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a.
 ##### Added
-- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
+* Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex.
 ##### Deprecated
-- trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality
-- rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
-- rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
-- rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
-- rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
+* trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality
+* rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release
+* rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release
+* rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size()
+* rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release
 ##### Removed
-- is_complex helper was deprecated and now removed. Use rocblas_is_complex instead.
-- The enum truncate_t and the value truncate was deprecated and now removed from. It was replaced by rocblas_truncate_t and rocblas_truncate, respectively.
-- rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
-- rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
+* is_complex helper was deprecated and now removed. Use rocblas_is_complex instead.
+* The enum truncate_t and the value truncate was deprecated and now removed from. It was replaced by rocblas_truncate_t and rocblas_truncate, respectively.
+* rocblas_set_int8_type_for_hipblas was deprecated and is now removed.
+* rocblas_get_int8_type_for_hipblas was deprecated and is now removed.
 ##### Dependencies
-- build only dependency on python joblib added as used by Tensile build
-- fix for cmake install on some OS when performed by install.sh -d --cmake_install
+* build only dependency on python joblib added as used by Tensile build
+* fix for cmake install on some OS when performed by install.sh -d --cmake_install
 ##### Fixed
-- make trsm offset calculations 64 bit safe
+* make trsm offset calculations 64 bit safe
 ##### Changed
-- refactor rotg test code
+* refactor rotg test code
 #### rocFFT 1.0.23
@@ -418,19 +419,19 @@ rocFFT 1.0.23 for ROCm 5.6.0
 ##### Added
-- Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
-- Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
-- Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.
+* Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create.
+ Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used.
+* Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems.
 ##### Changed
-- Replaced std::complex with hipComplex data types for data generator.
-- FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
-- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
+* Replaced std::complex with hipComplex data types for data generator.
+* FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example).
+* Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
 ##### Fixed
-- Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.
+* Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure.
 #### rocm-cmake 0.9.0
@@ -438,9 +439,9 @@ rocm-cmake 0.9.0 for ROCm 5.6.0
 ##### Added
-- Added the option ROCM_HEADER_WRAPPER_WERROR
-  - Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.
-  - Configure-time CMake option sets the default for the C macro.
+* Added the option ROCM_HEADER_WRAPPER_WERROR
+  * Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings.
+  * Configure-time CMake option sets the default for the C macro.
 #### rocPRIM 2.13.0
@@ -448,20 +449,20 @@ rocPRIM 2.13.0 for ROCm 5.6.0
 ##### Added
-- New block level `radix_rank` primitive.
-- New block level `radix_rank_match` primitive.
-- Added a stable block sorting implementation. This be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm.
+* New block level `radix_rank` primitive.
+* New block level `radix_rank_match` primitive.
+* Added a stable block sorting implementation. This be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm.
 ##### Changed
-- Improved the performance of `block_radix_sort` and `device_radix_sort`.
-- Improved the performance of `device_merge_sort`.
-- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). Contributed by: [v01dXYZ](https://github.com/v01dXYZ).
+* Improved the performance of `block_radix_sort` and `device_radix_sort`.
+* Improved the performance of `device_merge_sort`.
+* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). Contributed by: [v01dXYZ](https://github.com/v01dXYZ).
 ##### Known Issues
-- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
-- When `ROCPRIM_DISABLE_LOOKBACK_SCAN` is set, `device_scan` fails for input sizes bigger than `scan_config::size_limit`, which defaults to `std::numeric_limits<unsigned int>::max()`.
+* Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
+* When `ROCPRIM_DISABLE_LOOKBACK_SCAN` is set, `device_scan` fails for input sizes bigger than `scan_config::size_limit`, which defaults to `std::numeric_limits<unsigned int>::max()`.
 #### rocRAND 2.10.17
@@ -469,14 +470,14 @@ rocRAND 2.10.17 for ROCm 5.6.0
 ##### Added
-- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
-- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
-- experimental HIP-CPU feature
-- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
+* MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
+* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
+* experimental HIP-CPU feature
+* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
 ##### Changed
-- Python 2.7 is no longer officially supported.
+* Python 2.7 is no longer officially supported.
 #### rocSOLVER 3.22.0
@@ -484,22 +485,22 @@ rocSOLVER 3.22.0 for ROCm 5.6.0
 ##### Added
-- LU refactorization for sparse matrices
-  - CSRRF_ANALYSIS
-  - CSRRF_SUMLU
-  - CSRRF_SPLITLU
-  - CSRRF_REFACTLU
-- Linear system solver for sparse matrices
-  - CSRRF_SOLVE
-- Added type `rocsolver_rfinfo` for use with sparse matrix routines
+* LU refactorization for sparse matrices
+  * CSRRF_ANALYSIS
+  * CSRRF_SUMLU
+  * CSRRF_SPLITLU
+  * CSRRF_REFACTLU
+* Linear system solver for sparse matrices
+  * CSRRF_SOLVE
+* Added type `rocsolver_rfinfo` for use with sparse matrix routines
 ##### Optimized
-- Improved the performance of BDSQR and GESVD when singular vectors are requested
+* Improved the performance of BDSQR and GESVD when singular vectors are requested
 ##### Fixed
-- BDSQR and GESVD should no longer hang when the input contains `NaN` or `Inf`
+* BDSQR and GESVD should no longer hang when the input contains `NaN` or `Inf`
 #### rocSPARSE 2.5.2
@@ -507,20 +508,20 @@ rocSPARSE 2.5.2 for ROCm 5.6.0
 ##### Improved
-- Fixed a memory leak in csritsv
-- Fixed a bug in csrsm and bsrsm
+* Fixed a memory leak in csritsv
+* Fixed a bug in csrsm and bsrsm
 #### rocThrust 2.18.0
 rocThrust 2.18.0 for ROCm 5.6.0
-##### Fixed
+##### Fixed
-- `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
+* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
 ##### Changed
-- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
+* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
 #### rocWMMA 1.1.0
@@ -528,21 +529,21 @@ rocWMMA 1.1.0 for ROCm 5.6.0
 ##### Added
-- Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
-- Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
-- Added performance gemm samples for half, single and double precision
-- Added rocWMMA cmake versioning
-- Added vectorized support in coordinate transforms
-- Included ROCm smi for runtime clock rate detection
-- Added fragment transforms for transpose and change data layout
+* Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp)
+* Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation)
+* Added performance gemm samples for half, single and double precision
+* Added rocWMMA cmake versioning
+* Added vectorized support in coordinate transforms
+* Included ROCm smi for runtime clock rate detection
+* Added fragment transforms for transpose and change data layout
 ##### Changed
-- Default to GPU rocBLAS validation against rocWMMA
-- Re-enabled int8 gemm tests on gfx9
-- Upgraded to C++17
-- Restructured unit test folder for consistency
-- Consolidated rocWMMA samples common code
+* Default to GPU rocBLAS validation against rocWMMA
+* Re-enabled int8 gemm tests on gfx9
+* Upgraded to C++17
+* Restructured unit test folder for consistency
+* Consolidated rocWMMA samples common code
 #### Tensile 4.37.0
@@ -550,55 +551,55 @@ Tensile 4.37.0 for ROCm 5.6.0
 ##### Added
-- Added user driven tuning API
-- Added decision tree fallback feature
-- Added SingleBuffer + AtomicAdd option for GlobalSplitU
-- DirectToVgpr support for fp16 and Int8 with TN orientation
-- Added new test cases for various functions
-- Added SingleBuffer algorithm for ZGEMM/CGEMM
-- Added joblib for parallel map calls
-- Added support for MFMA + LocalSplitU + DirectToVgprA+B
-- Added asmcap check for MIArchVgpr
-- Added support for MFMA + LocalSplitU
-- Added frequency, power, and temperature data to the output
+* Added user driven tuning API
+* Added decision tree fallback feature
+* Added SingleBuffer + AtomicAdd option for GlobalSplitU
+* DirectToVgpr support for fp16 and Int8 with TN orientation
+* Added new test cases for various functions
+* Added SingleBuffer algorithm for ZGEMM/CGEMM
+* Added joblib for parallel map calls
+* Added support for MFMA + LocalSplitU + DirectToVgprA+B
+* Added asmcap check for MIArchVgpr
+* Added support for MFMA + LocalSplitU
+* Added frequency, power, and temperature data to the output
 ##### Optimizations
-- Improved the performance of GlobalSplitU with SingleBuffer algorithm
-- Reduced the running time of the extended and pre_checkin tests
-- Optimized the Tailloop section of the assembly kernel
-- Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
-- Improved the performance of the second kernel of MultipleBuffer algorithm
+* Improved the performance of GlobalSplitU with SingleBuffer algorithm
+* Reduced the running time of the extended and pre_checkin tests
+* Optimized the Tailloop section of the assembly kernel
+* Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
+* Improved the performance of the second kernel of MultipleBuffer algorithm
 ##### Changed
-- Updated custom kernels with 64-bit offsets
-- Adapted 64-bit offset arguments for assembly kernels
-- Improved temporary register re-use to reduce max sgpr usage
-- Removed some restrictions on VectorWidth and DirectToVgpr
-- Updated the dependency requirements for Tensile
-- Changed the range of AssertSummationElementMultiple
-- Modified the error messages for more clarity
-- Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used
-- Removed dummy vgpr for vectorStaticRemainder
-- Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder
-- Removed qReg parameter from vectorStaticRemainder
+* Updated custom kernels with 64-bit offsets
+* Adapted 64-bit offset arguments for assembly kernels
+* Improved temporary register re-use to reduce max sgpr usage
+* Removed some restrictions on VectorWidth and DirectToVgpr
+* Updated the dependency requirements for Tensile
+* Changed the range of AssertSummationElementMultiple
+* Modified the error messages for more clarity
+* Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used
+* Removed dummy vgpr for vectorStaticRemainder
+* Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder
+* Removed qReg parameter from vectorStaticRemainder
 ##### Fixed
-- Fixed tmp sgpr allocation to avoid over-writing values (alpha)
-- 64-bit offset parameters for post kernels
-- Fixed gfx908 CI test failures
-- Fixed offset calculation to prevent overflow for large offsets
-- Fixed issues when BufferLoad and BufferStore are equal to zero
-- Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
-- Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
-- Fixed the memory access error related to StaggerU + large stride
-- Fixed ZGEMM 4x4 MatrixInst mismatch
-- Fixed DGEMM 4x4 MatrixInst mismatch
-- Fixed ASEM + GSU + NoTailLoop opt mismatch
-- Fixed AssertSummationElementMultiple + GlobalSplitU issues
-- Fixed ASEM + GSU + TailLoop inner unroll
+* Fixed tmp sgpr allocation to avoid over-writing values (alpha)
+* 64-bit offset parameters for post kernels
+* Fixed gfx908 CI test failures
+* Fixed offset calculation to prevent overflow for large offsets
+* Fixed issues when BufferLoad and BufferStore are equal to zero
+* Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
+* Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
+* Fixed the memory access error related to StaggerU + large stride
+* Fixed ZGEMM 4x4 MatrixInst mismatch
+* Fixed DGEMM 4x4 MatrixInst mismatch
+* Fixed ASEM + GSU + NoTailLoop opt mismatch
+* Fixed AssertSummationElementMultiple + GlobalSplitU issues
+* Fixed ASEM + GSU + TailLoop inner unroll
 -------------------
@@ -626,7 +627,7 @@ The following HIP API is updated in the ROCm 5.5.1 release:
 ##### `hipDeviceSetCacheConfig`
-- The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to `hipSuccess`
+* The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to `hipSuccess`
 ### Library Changes in ROCM 5.5.1
@@ -671,40 +672,40 @@ Applications requiring to update the stack size can use hipDeviceSetLimit API.
 The following hipcc changes are implemented in this release:
-- `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link time dependence for HIP programs.  Applications that depend on these libraries must explicitly link to them.
-- `-use-staticlib` and `-use-sharedlib` options are deprecated.
+* `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link time dependence for HIP programs.  Applications that depend on these libraries must explicitly link to them.
+* `-use-staticlib` and `-use-sharedlib` options are deprecated.
 ##### Future Changes
-- Separation of `hipcc` binaries (Perl scripts) from HIP to `hipcc` project. Users will access separate `hipcc` package for installing `hipcc` binaries in future ROCm releases.
-- In a future ROCm release, the following samples will be removed from the `hip-tests` project.
-  - `hipBusbandWidth` at
-  - `hipCommander` at
+* Separation of `hipcc` binaries (Perl scripts) from HIP to `hipcc` project.
Users will access separate `hipcc` package for installing `hipcc` binaries in future ROCm releases. +* In a future ROCm release, the following samples will be removed from the `hip-tests` project. + * `hipBusbandWidth` at + * `hipCommander` at Note that the samples will continue to be available in previous release branches. -- Removal of gcnarch from hipDeviceProp_t structure -- Addition of new fields in hipDeviceProp_t structure - - maxTexture1D - - maxTexture2D - - maxTexture1DLayered - - maxTexture2DLayered - - sharedMemPerMultiprocessor - - deviceOverlap - - asyncEngineCount - - surfaceAlignment - - unifiedAddressing - - computePreemptionSupported - - hostRegisterSupported - - uuid -- Removal of deprecated code - - hip-hcc codes from hip code tree -- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA -- HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D() -- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' -- Correct hipGetLastError to return the last error instead of last API call's return code -- Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]" -- Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess -- Remove hiparray* and make it opaque with hipArray_t +* Removal of gcnarch from hipDeviceProp_t structure +* Addition of new fields in hipDeviceProp_t structure + * maxTexture1D + * maxTexture2D + * maxTexture1DLayered + * maxTexture2DLayered + * sharedMemPerMultiprocessor + * deviceOverlap + * asyncEngineCount + * surfaceAlignment + * unifiedAddressing + * computePreemptionSupported + * hostRegisterSupported + * uuid +* Removal of deprecated code + * hip-hcc codes from hip code tree +* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA +* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D() +* Renaming of 'memoryType' in 
hipPointerAttribute_t structure to 'type' +* Correct hipGetLastError to return the last error instead of last API call's return code +* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved\[16]" +* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess +* Remove hiparray* and make it opaque with hipArray_t ##### New HIP APIs in This Release @@ -716,9 +717,9 @@ The following hipcc changes are implemented in this release: The new memory management HIP API is as follows: -- Sets information on the specified pointer [BETA]. +* Sets information on the specified pointer \[BETA]. - ```h + ```cpp hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr); ``` @@ -726,15 +727,15 @@ The new memory management HIP API is as follows: The new module management HIP APIs are as follows: -- Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed to `kernelParams`, where thread blocks can cooperate and synchronize as they execute. +* Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed to `kernelParams`, where thread blocks can cooperate and synchronize as they execute. - ```h + ```cpp hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams); ``` -- Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute. +* Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute. 
- ```h + ```cpp hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags); ``` @@ -742,60 +743,60 @@ The new module management HIP APIs are as follows: The new HIP Graph Management APIs are as follows: -- Creates a memory allocation node and adds it to a graph [BETA] +* Creates a memory allocation node and adds it to a graph \[BETA] - ```h + ```cpp hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams); ``` -- Return parameters for memory allocation node [BETA] +* Return parameters for memory allocation node \[BETA] - ```h + ```cpp hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams); ``` -- Creates a memory free node and adds it to a graph [BETA] +* Creates a memory free node and adds it to a graph \[BETA] - ```h + ```cpp hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr); ``` -- Returns parameters for memory free node [BETA]. +* Returns parameters for memory free node \[BETA]. - ```h + ```cpp hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr); ``` -- Write a DOT file describing graph structure [BETA]. +* Write a DOT file describing graph structure \[BETA]. - ```h + ```cpp hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags); ``` -- Copies attributes from source node to destination node [BETA]. +* Copies attributes from source node to destination node \[BETA]. 
- ```h + ```cpp hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst); ``` -- Enables or disables the specified node in the given graphExec [BETA] +* Enables or disables the specified node in the given graphExec \[BETA] - ```h + ```cpp hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled); ``` -- Query whether a node in the given graphExec is enabled [BETA] +* Query whether a node in the given graphExec is enabled \[BETA] - ```h + ```cpp hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled); ``` ##### OpenMP Enhancements This release consists of the following OpenMP enhancements: -- Additional support for OMPT functions `get_device_time` and `get_record_type`. -- Add support for min/max fast fp atomics on AMD GPUs. -- Fix the use of the abs function in C device regions. +* Additional support for OMPT functions `get_device_time` and `get_record_type`. +* Add support for min/max fast fp atomics on AMD GPUs. +* Fix the use of the abs function in C device regions. ### Deprecations and Warnings @@ -863,7 +864,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. 
#include "hip/hip_runtime.h" @@ -871,10 +872,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -882,7 +883,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -896,7 +897,7 @@ For backward compatibility, the old CMake locations (`/opt/rocm-xxx// Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/cmake/hip/ total 0 lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake @@ -912,13 +913,13 @@ The following APIs and macros have been marked as deprecated. These are expected ##### API Changes -- `amd_comgr_action_info_set_options()` -- `amd_comgr_action_info_get_options()` +* `amd_comgr_action_info_set_options()` +* `amd_comgr_action_info_get_options()` ##### Actions and Data Types -- `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES` -- `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN` +* `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES` +* `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN` For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, and the `AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros. 
@@ -926,28 +927,28 @@ For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, an The following environment variables are removed in this ROCm release: -- `GPU_MAX_COMMAND_QUEUES` -- `GPU_MAX_WORKGROUP_SIZE_2D_X` -- `GPU_MAX_WORKGROUP_SIZE_2D_Y` -- `GPU_MAX_WORKGROUP_SIZE_3D_X` -- `GPU_MAX_WORKGROUP_SIZE_3D_Y` -- `GPU_MAX_WORKGROUP_SIZE_3D_Z` -- `GPU_BLIT_ENGINE_TYPE` -- `GPU_USE_SYNC_OBJECTS` -- `AMD_OCL_SC_LIB` -- `AMD_OCL_ENABLE_MESSAGE_BOX` -- `GPU_FORCE_64BIT_PTR` -- `GPU_FORCE_OCL20_32BIT` -- `GPU_RAW_TIMESTAMP` -- `GPU_SELECT_COMPUTE_RINGS_ID` -- `GPU_USE_SINGLE_SCRATCH` -- `GPU_ENABLE_LARGE_ALLOCATION` -- `HSA_LOCAL_MEMORY_ENABLE` -- `HSA_ENABLE_COARSE_GRAIN_SVM` -- `GPU_IFH_MODE` -- `OCL_SYSMEM_REQUIREMENT` -- `OCL_CODE_CACHE_ENABLE` -- `OCL_CODE_CACHE_RESET` +* `GPU_MAX_COMMAND_QUEUES` +* `GPU_MAX_WORKGROUP_SIZE_2D_X` +* `GPU_MAX_WORKGROUP_SIZE_2D_Y` +* `GPU_MAX_WORKGROUP_SIZE_3D_X` +* `GPU_MAX_WORKGROUP_SIZE_3D_Y` +* `GPU_MAX_WORKGROUP_SIZE_3D_Z` +* `GPU_BLIT_ENGINE_TYPE` +* `GPU_USE_SYNC_OBJECTS` +* `AMD_OCL_SC_LIB` +* `AMD_OCL_ENABLE_MESSAGE_BOX` +* `GPU_FORCE_64BIT_PTR` +* `GPU_FORCE_OCL20_32BIT` +* `GPU_RAW_TIMESTAMP` +* `GPU_SELECT_COMPUTE_RINGS_ID` +* `GPU_USE_SINGLE_SCRATCH` +* `GPU_ENABLE_LARGE_ALLOCATION` +* `HSA_LOCAL_MEMORY_ENABLE` +* `HSA_ENABLE_COARSE_GRAIN_SVM` +* `GPU_IFH_MODE` +* `OCL_SYSMEM_REQUIREMENT` +* `OCL_CODE_CACHE_ENABLE` +* `OCL_CODE_CACHE_RESET` ### Known Issues In This Release @@ -993,24 +994,24 @@ hipBLAS 0.54.0 for ROCm 5.5.0 ##### Added -- added option to opt-in to use __half for hipblasHalf type in the API for c++ users who define HIPBLAS_USE_HIP_HALF -- added scripts to plot performance for multiple functions -- data driven hipblas-bench and hipblas-test execution via external yaml format data files -- client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml +* added option to opt-in to use __half for hipblasHalf type in the API for c++ users who define 
HIPBLAS_USE_HIP_HALF +* added scripts to plot performance for multiple functions +* data driven hipblas-bench and hipblas-test execution via external yaml format data files +* client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml ##### Fixed -- fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes -- fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0 -- fixed getrs to return successfully when given nullptrs with corresponding size = 0 -- fixed getrs to give info = -1 when transpose is not an expected type -- fixed gels to return successfully when given nullptrs with corresponding size = 0 -- fixed gels to give info = -1 when transpose is not in ('N', 'T') for real cases or not in ('N', 'C') for complex cases +* fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes +* fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0 +* fixed getrs to return successfully when given nullptrs with corresponding size = 0 +* fixed getrs to give info = -1 when transpose is not an expected type +* fixed gels to return successfully when given nullptrs with corresponding size = 0 +* fixed gels to give info = -1 when transpose is not in ('N', 'T') for real cases or not in ('N', 'C') for complex cases ##### Changed -- changed reference code for Windows to OpenBLAS -- hipblas client executables all now begin with hipblas- prefix +* changed reference code for Windows to OpenBLAS +* hipblas client executables all now begin with hipblas- prefix #### hipCUB 2.13.1 @@ -1018,21 +1019,21 @@ hipCUB 2.13.1 for ROCm 5.5.0 ##### Added -- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. +* Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. ##### Changed -- CUB backend references CUB and Thrust version 1.17.2. 
-- Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. +* CUB backend references CUB and Thrust version 1.17.2. +* Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. ##### Fixed -- Windows HIP SDK support +* Windows HIP SDK support ##### Known Issues -- `BlockRadixRankMatch` is currently broken under the rocPRIM backend. -- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. +* `BlockRadixRankMatch` is currently broken under the rocPRIM backend. +* `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. #### hipFFT 1.0.11 @@ -1040,7 +1041,7 @@ hipFFT 1.0.11 for ROCm 5.5.0 ##### Fixed -- Fixed old version rocm include/lib folders not removed on upgrade. +* Fixed old version rocm include/lib folders not removed on upgrade. 
#### hipSOLVER 1.7.0 @@ -1048,13 +1049,13 @@ hipSOLVER 1.7.0 for ROCm 5.5.0 ##### Added -- Added functions - - gesvdj - - hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize - - hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj - - gesvdjBatched - - hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize - - hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched +* Added functions + * gesvdj + * hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize + * hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj + * gesvdjBatched + * hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize + * hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched #### hipSPARSE 2.3.5 @@ -1062,11 +1063,11 @@ hipSPARSE 2.3.5 for ROCm 5.5.0 ##### Improved -- Fixed an issue, where the rocm folder was not removed on upgrade of meta packages -- Fixed a compilation issue with cusparse backend -- Added more detailed messages on unit test failures due to missing input data -- Improved documentation -- Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis) +* Fixed an issue, where the rocm folder was not removed on upgrade of meta packages +* Fixed a compilation issue with cusparse backend +* Added more detailed messages on unit test failures due to missing input data +* Improved documentation +* Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis) #### MIOpen 2.19.0 @@ -1074,17 +1075,17 @@ MIOpen 2.19.0 for ROCm 5.5.0 ##### Added -- ROCm 5.5 support for gfx1101 (Navi32) +* ROCm 5.5 support for gfx1101 (Navi32) ##### Changed -- Tuning results 
for MLIR on ROCm 5.5 -- Bumping MLIR commit to 5.5.0 release tag +* Tuning results for MLIR on ROCm 5.5 +* Bumping MLIR commit to 5.5.0 release tag ##### Fixed -- Fix 3d convolution Host API bug -- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. +* Fix 3d convolution Host API bug +* \[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. #### rccl 2.15.5 @@ -1092,25 +1093,25 @@ RCCL 2.15.5 for ROCm 5.5.0 ##### Changed -- Compatibility with NCCL 2.15.5 -- Unit test executable renamed to rccl-UnitTests +* Compatibility with NCCL 2.15.5 +* Unit test executable renamed to rccl-UnitTests ##### Added -- HW-topology aware binary tree implementation -- Experimental support for MSCCL -- New unit tests for hipGraph support -- NPKit integration +* HW-topology aware binary tree implementation +* Experimental support for MSCCL +* New unit tests for hipGraph support +* NPKit integration ##### Fixed -- rocm-smi ID conversion -- Support for HIP_VISIBLE_DEVICES for unit tests -- Support for p2p transfers to non (HIP) visible devices +* rocm-smi ID conversion +* Support for HIP_VISIBLE_DEVICES for unit tests +* Support for p2p transfers to non (HIP) visible devices ##### Removed -- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench +* Removed TransferBench from tools. 
Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench #### rocALUTION 2.1.8 @@ -1118,24 +1119,24 @@ rocALUTION 2.1.8 for ROCm 5.5.0 ##### Added -- Added build support for Navi32 +* Added build support for Navi32 ##### Improved -- Fixed a typo in MPI backend -- Fixed a bug with the backend when HIP support is disabled -- Fixed a bug in SAAMG hierarchy building on HIP backend -- Improved SAAMG hierarchy build performance on HIP backend +* Fixed a typo in MPI backend +* Fixed a bug with the backend when HIP support is disabled +* Fixed a bug in SAAMG hierarchy building on HIP backend +* Improved SAAMG hierarchy build performance on HIP backend ##### Changed -- LocalVector::GetIndexValues(ValueType\*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector\*) instead -- LocalVector::SetIndexValues(const ValueType\*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead -- LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*) instead -- LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix\*) instead -- LocalMatrix::RugeStueben() is deprecated -- LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, int) instead -- LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*) instead +* LocalVector::GetIndexValues(ValueType\*) is deprecated, use LocalVector::GetIndexValues(const 
LocalVector&, LocalVector\*) instead +* LocalVector::SetIndexValues(const ValueType\*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead +* LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*) instead +* LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix\*) instead +* LocalMatrix::RugeStueben() is deprecated +* LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, int) instead +* LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*) instead #### rocBLAS 2.47.0 @@ -1143,33 +1144,33 @@ rocBLAS 2.47.0 for ROCm 5.5.0 ##### Added -- added functionality rocblas_geam_ex for matrix-matrix minimum operations -- added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions -- added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API -- added support for vector initialization in the rocBLAS test framework with negative increments -- added windows build documentation for forthcoming support using ROCm HIP SDK -- added scripts to plot performance for multiple functions +* added functionality rocblas_geam_ex for matrix-matrix minimum operations +* added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions +* added beta features API. 
Exposed using compiler define ROCBLAS_BETA_FEATURES_API +* added support for vector initialization in the rocBLAS test framework with negative increments +* added windows build documentation for forthcoming support using ROCm HIP SDK +* added scripts to plot performance for multiple functions ##### Optimizations -- improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU. -- improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU. -- improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs. +* improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU. +* improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU. +* improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs. 
##### Fixed -- fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench -- fixed deprecated API compatibility with Visual Studio compiler -- fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory +* fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench +* fixed deprecated API compatibility with Visual Studio compiler +* fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory ##### Changed -- install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help) -- rocblas client executables all now begin with rocblas- prefix +* install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help) +* rocblas client executables all now begin with rocblas- prefix ##### Removed -- install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default +* install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default #### rocFFT 1.0.22 @@ -1177,25 +1178,25 @@ rocFFT 1.0.22 for ROCm 5.5.0 ##### Optimizations -- Improved performance of 1D lengths < 2048 that use Bluestein's algorithm. -- Reduced time for generating code during plan creation. -- Optimized 3D R2C/C2R lengths 32, 84, 128. -- Optimized batched small 1D R2C/C2R cases. +* Improved performance of 1D lengths < 2048 that use Bluestein's algorithm. +* Reduced time for generating code during plan creation. +* Optimized 3D R2C/C2R lengths 32, 84, 128. +* Optimized batched small 1D R2C/C2R cases. 
##### Added -- Added gfx1101 to default AMDGPU_TARGETS. +* Added gfx1101 to default AMDGPU_TARGETS. ##### Changed -- Moved client programs to C++17. -- Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled. -- Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache. +* Moved client programs to C++17. +* Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled. +* Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache. ##### Fixed -- Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged. -- Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present. +* Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged. +* Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present. #### rocm-cmake 0.8.1 @@ -1203,11 +1204,11 @@ rocm-cmake 0.8.1 for ROCm 5.5.0 ##### Fixed -- ROCMInstallTargets: Added compatibility symlinks for included cmake files in `<ROCM>/lib/cmake/<PACKAGE>`. +* ROCMInstallTargets: Added compatibility symlinks for included cmake files in `<ROCM>/lib/cmake/<PACKAGE>`. ##### Changed -- ROCMHeaderWrapper: The wrapper header deprecation message is now a deprecation warning. +* ROCMHeaderWrapper: The wrapper header deprecation message is now a deprecation warning. #### rocPRIM 2.13.0 @@ -1215,20 +1216,20 @@ rocPRIM 2.13.0 for ROCm 5.5.0 ##### Added -- New block level `radix_rank` primitive. -- New block level `radix_rank_match` primitive. +* New block level `radix_rank` primitive. +* New block level `radix_rank_match` primitive. ##### Changed -- Improved the performance of `block_radix_sort` and `device_radix_sort`. +* Improved the performance of `block_radix_sort` and `device_radix_sort`. 
##### Known Issues -- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows. +* Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows. ##### Fixed -- Fixed benchmark build on Windows +* Fixed benchmark build on Windows #### rocRAND 2.10.17 @@ -1236,18 +1237,18 @@ rocRAND 2.10.17 for ROCm 5.5.0 ##### Added -- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. -- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. -- experimental HIP-CPU feature -- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". +* MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. +* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. +* experimental HIP-CPU feature +* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". ##### Changed -- Python 2.7 is no longer officially supported. 
+* Python 2.7 is no longer officially supported. ##### Fixed -- Windows HIP SDK support +* Windows HIP SDK support #### rocSOLVER 3.21.0 @@ -1255,30 +1256,30 @@ rocSOLVER 3.21.0 for ROCm 5.5.0 ##### Added -- SVD for general matrices using Jacobi algorithm: - - GESVDJ (with batched and strided\_batched versions) -- LU factorization without pivoting for block tridiagonal matrices: - - GEBLTTRF_NPVT (with batched and strided\_batched versions) -- Linear system solver without pivoting for block tridiagonal matrices: - - GEBLTTRS_NPVT (with batched and strided\_batched, versions) -- Product of triangular matrices - - LAUUM -- Added experimental hipGraph support for rocSOLVER functions +* SVD for general matrices using Jacobi algorithm: + * GESVDJ (with batched and strided\_batched versions) +* LU factorization without pivoting for block tridiagonal matrices: + * GEBLTTRF_NPVT (with batched and strided\_batched versions) +* Linear system solver without pivoting for block tridiagonal matrices: + * GEBLTTRS_NPVT (with batched and strided\_batched, versions) +* Product of triangular matrices + * LAUUM +* Added experimental hipGraph support for rocSOLVER functions ##### Optimized -- Improved the performance of SYEVJ/HEEVJ. +* Improved the performance of SYEVJ/HEEVJ. ##### Changed -- STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach. +* STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach. ##### Fixed -- SYEVJ/HEEVJ should now be invariant under matrix scaling. -- SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed. -- Fixed GETF2\_NPVT and GETRF\_NPVT input data initialization in tests and benchmarks. -- Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages. +* SYEVJ/HEEVJ should now be invariant under matrix scaling. +* SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed. 
+* Fixed GETF2\_NPVT and GETRF\_NPVT input data initialization in tests and benchmarks. +* Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages. #### rocSPARSE 2.5.1 @@ -1286,28 +1287,28 @@ rocSPARSE 2.5.1 for ROCm 5.5.0 ##### Added -- Added bsrgemm and spgemm for BSR format -- Added bsrgeam -- Added build support for Navi32 -- Added experimental hipGraph support for some rocSPARSE routines -- Added csritsv, spitsv csr iterative triangular solve -- Added mixed precisions for SpMV -- Added batched SpMM for transpose A in COO format with atomic atomic algorithm +* Added bsrgemm and spgemm for BSR format +* Added bsrgeam +* Added build support for Navi32 +* Added experimental hipGraph support for some rocSPARSE routines +* Added csritsv, spitsv csr iterative triangular solve +* Added mixed precisions for SpMV +* Added batched SpMM for transpose A in COO format with atomic atomic algorithm ##### Improved -- Optimization to csr2bsr -- Optimization to csr2csr_compress -- Optimization to csr2coo -- Optimization to gebsr2csr -- Optimization to csr2gebsr -- Fixes to documentation -- Fixes a bug in COO SpMV gridsize -- Fixes a bug in SpMM gridsize when using very large matrices +* Optimization to csr2bsr +* Optimization to csr2csr_compress +* Optimization to csr2coo +* Optimization to gebsr2csr +* Optimization to csr2gebsr +* Fixes to documentation +* Fixes a bug in COO SpMV gridsize +* Fixes a bug in SpMM gridsize when using very large matrices ##### Known Issues -- In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split. +* In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split. 
#### rocWMMA 1.0 @@ -1315,15 +1316,15 @@ rocWMMA 1.0 for ROCm 5.5.0 ##### Added -- Added support for wave32 on gfx11+ -- Added infrastructure changes to support hipRTC -- Added performance tracking system +* Added support for wave32 on gfx11+ +* Added infrastructure changes to support hipRTC +* Added performance tracking system ##### Changed -- Modified the assignment of hardware information -- Modified the data access for unsigned datatypes -- Added library config to support multiple architectures +* Modified the assignment of hardware information +* Modified the data access for unsigned datatypes +* Added library config to support multiple architectures #### Tensile 4.36.0 @@ -1331,59 +1332,59 @@ Tensile 4.36.0 for ROCm 5.5.0 ##### Added -- Add functions for user-driven tuning -- Add GFX11 support: HostLibraryTests yamls, rearragne FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimize, separate vscnt and vmcnt, dual mac -- Add binary search for Grid-Based algorithm -- Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2) -- Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4) -- Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1) -- Add GSU SingleBuffer algorithm for HSS/BSS -- Add gfx900:xnack-, gfx1032, gfx1034, gfx1035 -- Enable gfx1031 support +* Add functions for user-driven tuning +* Add GFX11 support: HostLibraryTests yamls, rearragne FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimize, separate vscnt and vmcnt, dual mac +* Add binary search for Grid-Based algorithm +* Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2) +* Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + 
GlobalLoadVectorWidth < 4) +* Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1) +* Add GSU SingleBuffer algorithm for HSS/BSS +* Add gfx900:xnack-, gfx1032, gfx1034, gfx1035 +* Enable gfx1031 support ##### Optimizations -- Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1 -- Improve InitAccVgprOpt +* Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1 +* Improve InitAccVgprOpt ##### Changed -- Use global_atomic for GSU instead of flat and global_store for debug code -- Replace flat_load/store with global_load/store -- Use global_load/store for BufferLoad/Store=0 and enable scheduling -- LocalSplitU support for HGEMM+HPA when MFMA disabled -- Update Code Object Version -- Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss -- Update asm cap cache arguments -- Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead -- Change checks, error messages, assembly syntax, and coverage for DirectToLds -- Remove unused cmake file -- Clean up the LLVM dependency code -- Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2 -- Update sgemm/hgemm test cases for DirectToLds and ThreadSepareteGlobalRead +* Use global_atomic for GSU instead of flat and global_store for debug code +* Replace flat_load/store with global_load/store +* Use global_load/store for BufferLoad/Store=0 and enable scheduling +* LocalSplitU support for HGEMM+HPA when MFMA disabled +* Update Code Object Version +* Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss +* Update asm cap cache arguments +* Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead +* Change checks, error messages, assembly syntax, and coverage for DirectToLds +* Remove unused cmake file +* Clean up the LLVM dependency code +* Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2 +* Update sgemm/hgemm test cases for 
DirectToLds and ThreadSepareteGlobalRead ##### Fixed -- Add build-id to header of compiled source kernels -- Fix solution index collisions -- Fix h beta vectorwidth4 correctness issue for WMMA -- Fix an error with BufferStore=0 -- Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2) -- Fix MoveMIoutToArch bug -- Fix flat load correctness issue on I8 and flat store correctness issue -- Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes -- Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop -- Fix issues with DirectToVgpr + ScheduleIterAlg<3 -- Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2 -- Fix mismatch issue with PrefetchGlobalRead=2 -- Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size -- Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1 -- Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case -- Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical -- Fix for failing CI tests due to CpuThreads=0 -- Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2 -- Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only) -- Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only) +* Add build-id to header of compiled source kernels +* Fix solution index collisions +* Fix h beta vectorwidth4 correctness issue for WMMA +* Fix an error with BufferStore=0 +* Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2) +* Fix MoveMIoutToArch bug +* Fix flat load correctness issue on I8 and flat store correctness issue +* Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes +* Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop +* Fix issues with DirectToVgpr + ScheduleIterAlg<3 +* Fix mismatch issue with DGEMM TT + 
LocalReadVectorWidth=2 +* Fix mismatch issue with PrefetchGlobalRead=2 +* Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size +* Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1 +* Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case +* Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical +* Fix for failing CI tests due to CpuThreads=0 +* Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2 +* Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only) +* Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only) ------------------- @@ -1455,7 +1456,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. 
#include "hip/hip_runtime.h" @@ -1463,10 +1464,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -1474,7 +1475,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -1487,7 +1488,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -1615,7 +1616,7 @@ The following new HIP API is introduced in the ROCm v5.4.1 release. > > This is a pre-official version (beta) release of the new APIs. -```h +```cpp hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData); ``` @@ -1710,7 +1711,7 @@ The ROCm v5.4 release consists of the following HIP enhancements: A new timer function wall_clock64() is supported, which returns wall clock count at a constant frequency on the device. -```h +```cpp long long int wall_clock64(); ``` @@ -1718,7 +1719,7 @@ It returns wall clock count at a constant frequency on the device, which can be Example: -```h +```cpp int wallClkRate = 0; //in kilohertz +HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId)); ``` @@ -1751,13 +1752,13 @@ The following new HIP APIs are available in the ROCm v5.4 release. 
##### Error Handling -```h +```cpp hipError_t hipDrvGetErrorName(hipError_t hipError, const char** errorString); ``` This returns HIP errors in the text string format. -```h +```cpp hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString); ``` @@ -1853,7 +1854,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. #include "hip/hip_runtime.h" @@ -1872,7 +1873,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -1885,7 +1886,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -2416,7 +2417,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. #include "hip/hip_runtime.h" @@ -2435,7 +2436,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. 
For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -2448,7 +2449,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -3000,7 +3001,7 @@ The new device management HIP APIs are as follows: - Gets a UUID for the device. This API returns a UUID for the device. - ```h + ```cpp hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device); ``` @@ -3008,25 +3009,25 @@ The new device management HIP APIs are as follows: > > This new API corresponds to the following CUDA API: > - > ```h + > ```cpp > CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev); > ``` - Gets default memory pool of the specified device - ```h + ```cpp hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device); ``` - Sets the current memory pool of a device - ```h + ```cpp hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool); ``` - Gets the current memory pool for the specified device - ```h + ```cpp hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device); ``` @@ -3036,67 +3037,67 @@ The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory - Allocates memory with stream ordered semantics - ```h + ```cpp hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream); ``` - Frees memory with stream ordered semantics - ```h + ```cpp hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream); ``` - Releases freed memory back to the OS - ```h + ```cpp hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold); ``` - Sets attributes of a memory pool - ```h + ```cpp hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value); ``` - Gets attributes of a memory pool - ```h + ```cpp hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value); ``` - Controls visibility of the specified pool between devices - 
```h + ```cpp hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count); ``` - Returns the accessibility of a pool from a device - ```h + ```cpp hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location); ``` - Creates a memory pool - ```h + ```cpp hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props); ``` - Destroys the specified memory pool - ```h + ```cpp hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool); ``` - Allocates memory from a specified pool with stream ordered semantics - ```h + ```cpp hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream); ``` - Exports a memory pool to the requested handle type - ```h + ```cpp hipError_t hipMemPoolExportToShareableHandle( void* shared_handle, hipMemPool_t mem_pool, @@ -3106,7 +3107,7 @@ The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory - Imports a memory pool from a shared handle - ```h + ```cpp hipError_t hipMemPoolImportFromShareableHandle( hipMemPool_t* mem_pool, void* shared_handle, @@ -3116,7 +3117,7 @@ The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory - Exports data to share a memory pool allocation between processes - ```h + ```cpp hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr); Import a memory pool allocation from another process.t hipError_t hipMemPoolImportPointer( @@ -3131,25 +3132,25 @@ The new HIP Graph Management APIs are as follows: - Enqueues a host function call in a stream - ```h + ```cpp hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData); ``` - Swaps the stream capture mode of a thread - ```h + ```cpp hipError_t hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode); ``` - Sets a node attribute - ```h + ```cpp hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, 
hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value); ``` - Gets a node attribute - ```h + ```cpp hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value); ``` @@ -3159,85 +3160,85 @@ The new APIs for virtual memory management are as follows: - Frees an address range reservation made via hipMemAddressReserve - ```h + ```cpp hipError_t hipMemAddressFree(void* devPtr, size_t size); ``` - Reserves an address range - ```h + ```cpp hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags); ``` - Creates a memory allocation described by the properties and size - ```h + ```cpp hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags); ``` - Exports an allocation to a requested shareable handle type - ```h + ```cpp hipError_t hipMemExportToShareableHandle(void* shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags); ``` - Gets the access flags set for the given location and ptr - ```h + ```cpp hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr); ``` - Calculates either the minimal or recommended granularity - ```h + ```cpp hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option); ``` - Retrieves the property structure of the given handle - ```h + ```cpp hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle); ``` - Imports an allocation from a requested shareable handle type - ```h + ```cpp hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType); ``` - Maps an allocation handle to a reserved virtual address range - ```h + ```cpp 
hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags); ``` - Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays - ```h + ```cpp hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream); ``` - Release a memory handle representing a memory allocation, that was previously allocated through hipMemCreate - ```h + ```cpp hipError_t hipMemRelease(hipMemGenericAllocationHandle_t handle); ``` - Returns the allocation handle of the backing memory allocation given the address - ```h + ```cpp hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr); ``` - Sets the access flags for each location specified in desc for the given virtual address range - ```h + ```cpp hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count); ``` - Unmaps memory allocation of a given address range - ```h + ```cpp hipError_t hipMemUnmap(void* ptr, size_t size); ``` @@ -3259,7 +3260,7 @@ This release introduces a new ROCm C++ library for accelerating mixed precision rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed. For more information, refer to -[Communication Libraries](../../../../docs/reference/gpu_libraries/communication.md). +[Communication Libraries](./reference/libraries/gpu-libraries/communication.md). 
#### OpenMP Enhancements in This Release @@ -3334,7 +3335,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. #include "hip/hip_runtime.h" @@ -3353,7 +3354,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -3366,7 +3367,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -3449,13 +3450,13 @@ So, the function argument that is potentially undef (because it is not intialize ##### Workaround -- Skip adding `noundef` attribute to functions tagged with convergent attribute. Refer to for more information. +* Skip adding `noundef` attribute to functions tagged with convergent attribute. Refer to for more information. -- Introduce shuffle attribute and add it to `__shfl` like APIs at hip headers. Clang can skip adding noundef attribute, if it finds that argument is tagged with shuffle attribute. Refer to for more information. +* Introduce shuffle attribute and add it to `__shfl` like APIs at hip headers. Clang can skip adding noundef attribute, if it finds that argument is tagged with shuffle attribute. Refer to for more information. -- Introduce clang builtin for `__shfl` to identify it and skip adding `noundef` attribute. +* Introduce clang builtin for `__shfl` to identify it and skip adding `noundef` attribute. -- Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header need to insert freezes on the relevant inputs. 
+* Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header need to insert freezes on the relevant inputs. #### Issue with Applications Triggering Oversubscription @@ -3490,13 +3491,13 @@ hipBLAS 0.51.0 for ROCm 5.2.0 ##### Added -- Packages for test and benchmark executables on all supported OSes using CPack. -- Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions -- Added user-specified initialization option to hipblas-bench +* Packages for test and benchmark executables on all supported OSes using CPack. +* Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions +* Added user-specified initialization option to hipblas-bench ##### Fixed -- Fixed version gathering in performance measuring script +* Fixed version gathering in performance measuring script #### hipCUB 2.11.1 @@ -3504,7 +3505,7 @@ hipCUB 2.11.1 for ROCm 5.2.0 ##### Added -- Packages for tests and benchmark executable on all supported OSes using CPack. +* Packages for tests and benchmark executable on all supported OSes using CPack. #### hipFFT 1.0.8 @@ -3512,8 +3513,8 @@ hipFFT 1.0.8 for ROCm 5.2.0 ##### Added -- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. -- Packages for test and benchmark executables on all supported OSes using CPack. +* Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. +* Packages for test and benchmark executables on all supported OSes using CPack. #### hipSOLVER 1.4.0 @@ -3521,13 +3522,13 @@ hipSOLVER 1.4.0 for ROCm 5.2.0 ##### Added -- Package generation for test and benchmark executables on all supported OSes using CPack. -- File/Folder Reorg - - Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. 
+* Package generation for test and benchmark executables on all supported OSes using CPack. +* File/Folder Reorg + * Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. ##### Fixed -- Fixed the ReadTheDocs documentation generation. +* Fixed the ReadTheDocs documentation generation. #### hipSPARSE 2.2.0 @@ -3535,7 +3536,7 @@ hipSPARSE 2.2.0 for ROCm 5.2.0 ##### Added -- Packages for test and benchmark executables on all supported OSes using CPack. +* Packages for test and benchmark executables on all supported OSes using CPack. #### rocALUTION 2.0.3 @@ -3543,7 +3544,7 @@ rocALUTION 2.0.3 for ROCm 5.2.0 ##### Added -- Packages for test and benchmark executables on all supported OSes using CPack. +* Packages for test and benchmark executables on all supported OSes using CPack. #### rocBLAS 2.44.0 @@ -3551,41 +3552,41 @@ rocBLAS 2.44.0 for ROCm 5.2.0 ##### Added -- Packages for test and benchmark executables on all supported OSes using CPack. -- Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions. -- Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions. -- Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes. -- Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests. +* Packages for test and benchmark executables on all supported OSes using CPack. +* Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions. 
+* Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions. +* Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes. +* Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests. ##### Optimizations -- Improved performance of non-batched and batched her2 for all sizes and data types. -- Improved performance of non-batched and batched amin for all data types using shuffle reductions. -- Improved performance of non-batched and batched amax for all data types using shuffle reductions. -- Improved performance of trsv for all sizes and data types. +* Improved performance of non-batched and batched her2 for all sizes and data types. +* Improved performance of non-batched and batched amin for all data types using shuffle reductions. +* Improved performance of non-batched and batched amax for all data types using shuffle reductions. +* Improved performance of trsv for all sizes and data types. ##### Changed -- Modifying gemm_ex for HBH (High-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16. -- Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions. -- For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset. -- For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset. -- AMD copyright year for all rocBLAS files. -- For gemv (transpose-case), typecasted the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions. +* Modifying gemm_ex for HBH (High-precision F16). 
The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16. +* Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions. +* For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset. +* For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset. +* AMD copyright year for all rocBLAS files. +* For gemv (transpose-case), typecasted the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions. ##### Fixed -- For function her2 avoid overflow in offset calculation. -- For trsm when alpha == 0 and on host, allow A to be nullptr. -- Fixed memory access issue in trsv. -- Fixed git pre-commit script to update only AMD copyright year. -- Fixed dgmm, geam test functions to set correct stride values. -- For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose. -- Fixed compilation error for clients-only build. +* For function her2 avoid overflow in offset calculation. +* For trsm when alpha == 0 and on host, allow A to be nullptr. +* Fixed memory access issue in trsv. +* Fixed git pre-commit script to update only AMD copyright year. +* Fixed dgmm, geam test functions to set correct stride values. +* For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose. +* Fixed compilation error for clients-only build. ##### Removed -- Remove Navi12 (gfx1011) from fat binary. +* Remove Navi12 (gfx1011) from fat binary. #### rocFFT 1.0.17 @@ -3593,23 +3594,23 @@ rocFFT 1.0.17 for ROCm 5.2.0 ##### Added -- Packages for test and benchmark executables on all supported OSes using CPack. -- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. 
+* Packages for test and benchmark executables on all supported OSes using CPack. +* Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. ##### Changed -- Improved reuse of twiddle memory between plans. -- Set a default load/store callback when only one callback +* Improved reuse of twiddle memory between plans. +* Set a default load/store callback when only one callback type is set via the API for improved performance. ##### Optimizations -- Introduced a new access pattern of lds (non-linear) and applied it on +* Introduced a new access pattern of lds (non-linear) and applied it on sbcc kernels len 64 to get performance improvement. ##### Fixed -- Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers. +* Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers. #### rocPRIM 2.10.14 @@ -3617,8 +3618,8 @@ rocPRIM 2.10.14 for ROCm 5.2.0 ##### Added -- Packages for tests and benchmark executable on all supported OSes using CPack. -- Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers. +* Packages for tests and benchmark executable on all supported OSes using CPack. +* Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers. #### rocRAND 2.10.14 @@ -3626,8 +3627,8 @@ rocRAND 2.10.14 for ROCm 5.2.0 ##### Added -- Backward compatibility for deprecated `#include <rocrand.h>` using wrapper header files. -- Packages for test and benchmark executables on all supported OSes using CPack. +* Backward compatibility for deprecated `#include <rocrand.h>` using wrapper header files. +* Packages for test and benchmark executables on all supported OSes using CPack. 
#### rocSOLVER 3.18.0 @@ -3635,18 +3636,18 @@ rocSOLVER 3.18.0 for ROCm 5.2.0 ##### Added -- Partial eigenvalue decomposition routines: - - STEBZ - - STEIN -- Package generation for test and benchmark executables on all supported OSes using CPack. -- Added tests for multi-level logging -- Added tests for rocsolver-bench client -- File/Folder Reorg - - Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. +* Partial eigenvalue decomposition routines: + * STEBZ + * STEIN +* Package generation for test and benchmark executables on all supported OSes using CPack. +* Added tests for multi-level logging +* Added tests for rocsolver-bench client +* File/Folder Reorg + * Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. ##### Fixed -- Fixed compatibility with libfmt 8.1 +* Fixed compatibility with libfmt 8.1 #### rocSPARSE 2.2.0 @@ -3654,24 +3655,24 @@ rocSPARSE 2.2.0 for ROCm 5.2.0 ##### Added -- batched SpMM for CSR, COO and Blocked ELL formats. -- Packages for test and benchmark executables on all supported OSes using CPack. -- Clients file importers and exporters. +* batched SpMM for CSR, COO and Blocked ELL formats. +* Packages for test and benchmark executables on all supported OSes using CPack. +* Clients file importers and exporters. ##### Improved -- Clients code size reduction. -- Clients error handling. -- Clients benchmarking for performance tracking. +* Clients code size reduction. +* Clients error handling. +* Clients benchmarking for performance tracking. ##### Changed -- Test adjustments due to roundoff errors. -- Fixing API calls compatiblity with rocPRIM. +* Test adjustments due to roundoff errors. +* Fixing API calls compatiblity with rocPRIM. ##### Known Issues -- none +* none #### rocThrust 2.15.0 @@ -3679,7 +3680,7 @@ rocThrust 2.15.0 for ROCm 5.2.0 ##### Added -- Packages for tests and benchmark executable on all supported OSes using CPack. 
+* Packages for tests and benchmark executable on all supported OSes using CPack. #### rocWMMA 0.7 @@ -3687,29 +3688,29 @@ rocWMMA 0.7 for ROCm 5.2.0 ##### Added -- Added unit tests for DLRM kernels -- Added GEMM sample -- Added DLRM sample -- Added SGEMV sample -- Added unit tests for cooperative wmma load and stores -- Added unit tests for IOBarrier.h -- Added wmma load/ store tests for different matrix types (A, B and Accumulator) -- Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest -- Added block sizes 4, 8 to test MmaSynMultiLdsTest -- Added support for wmma load / store layouts with block dimension greater than 64 -- Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types -- Added CI testing for rocWMMA +* Added unit tests for DLRM kernels +* Added GEMM sample +* Added DLRM sample +* Added SGEMV sample +* Added unit tests for cooperative wmma load and stores +* Added unit tests for IOBarrier.h +* Added wmma load/ store tests for different matrix types (A, B and Accumulator) +* Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest +* Added block sizes 4, 8 to test MmaSynMultiLdsTest +* Added support for wmma load / store layouts with block dimension greater than 64 +* Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types +* Added CI testing for rocWMMA ##### Changed -- Renamed wmma to rocwmma in cmake, header files and documentation -- Renamed library files -- Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset) -- Opaque load/store continue to use incrementatl offsets as they fill the entire block -- Cooperative load/store use cumulative offsets as they fill only small portions for the entire block -- Increased Max split counts to 64 for cooperative load/store -- Moved all the wmma definitions, API headers to rocwmma namespace -- Modified wmma fill unit tests to validate all matrix types (A, B, 
Accumulator) +* Renamed wmma to rocwmma in cmake, header files and documentation +* Renamed library files +* Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset) +* Opaque load/store continue to use incremental offsets as they fill the entire block +* Cooperative load/store use cumulative offsets as they fill only small portions for the entire block +* Increased Max split counts to 64 for cooperative load/store +* Moved all the wmma definitions, API headers to rocwmma namespace +* Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator) #### Tensile 4.33.0 @@ -3717,26 +3718,26 @@ Tensile 4.33.0 for ROCm 5.2.0 ##### Added -- TensileUpdateLibrary for updating old library logic files -- Support for TensileRetuneLibrary to use sizes from separate file -- ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support -- Tests for denorm correctness -- Option to write different architectures to different TensileLibrary files +* TensileUpdateLibrary for updating old library logic files +* Support for TensileRetuneLibrary to use sizes from separate file +* ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support +* Tests for denorm correctness +* Option to write different architectures to different TensileLibrary files ##### Optimizations -- Optimize MessagePackLoadLibraryFile by switching to fread -- DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr +* Optimize MessagePackLoadLibraryFile by switching to fread +* DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr ##### Changed -- Alpha/beta datatype remains as F32 for HPA HGEMM -- Force assembly kernels to not flush denorms -- Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount +* Alpha/beta datatype remains as F32 for HPA HGEMM +* Force assembly kernels to not flush denorms +* Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount
##### Fixed -- Fix segmentation fault when run i8 datatype with TENSILE_DB=0x80 +* Fix segmentation fault when run i8 datatype with TENSILE_DB=0x80 ------------------- @@ -3837,21 +3838,21 @@ This enhancement enables ROCDebugger users to interact with the HIP source-level ROCDebugger Machine Interface (MI) extends support to lanes. The following enhancements are made: -- Added a new -lane-info command, listing the current thread's lanes. +* Added a new -lane-info command, listing the current thread's lanes. -- The -thread-select command now supports a lane switch to switch to a specific lane of a thread: +* The -thread-select command now supports a lane switch to switch to a specific lane of a thread: ```sh -thread-select -l LANE THREAD ``` -- The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected. +* The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected. -- The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop. +* The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop. -- MI commands now accept a global --lane option, similar to the global --thread and --frame options. +* MI commands now accept a global --lane option, similar to the global --thread and --frame options. -- MI varobjs are now lane-aware. +* MI varobjs are now lane-aware. For more information, refer to the ROC Debugger User Guide at {doc}`ROCgdb `. 
@@ -3864,17 +3865,12 @@ The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECI This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below: -- MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498) - -- Fixed a correctness issue with ImplicitGemm algorithm - -- Updated the performance data for new kernel versions - -- Improved MIOpen build time by splitting large kernel header files - -- Fixed an issue in reduction kernels for padded tensors - -- Various other bug fixes and performance improvements +* MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498) +* Fixed a correctness issue with ImplicitGemm algorithm +* Updated the performance data for new kernel versions +* Improved MIOpen build time by splitting large kernel header files +* Fixed an issue in reduction kernels for padded tensors +* Various other bug fixes and performance improvements For more information, see {doc}`Documentation `. @@ -3886,17 +3882,12 @@ CRIU is a userspace tool to Checkpoint and Restore an application. CRIU lacked the support for checkpoint restore applications that used device files such as a GPU. 
With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes: -- Single and Multi GPU systems (Gfx9) - -- Checkpoint / Restore on a different system - -- Checkpoint / Restore inside a docker container - -- PyTorch - -- Tensorflow - -- Using CRIU Image Streamer +* Single and Multi GPU systems (Gfx9) +* Checkpoint / Restore on a different system +* Checkpoint / Restore inside a docker container +* PyTorch +* Tensorflow +* Using CRIU Image Streamer For more information, refer to @@ -3910,9 +3901,8 @@ For more information, refer to - -- +* +* ### Fixed Defects @@ -4033,24 +4023,24 @@ hipBLAS 0.50.0 for ROCm 5.1.0 ##### Added -- Added library version and device information to hipblas-test output -- Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as +* Added library version and device information to hipblas-test output +* Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as absolute or relative path -- Added --cmake_install command line option to update cmake to minimum version if required -- Added cmake-arg parameter to pass in cmake arguments while building -- Added infrastructure to support readthedocs hipBLAS documentation. +* Added --cmake_install command line option to update cmake to minimum version if required +* Added cmake-arg parameter to pass in cmake arguments while building +* Added infrastructure to support readthedocs hipBLAS documentation. ##### Fixed -- Added hipblasVersionMinor define. hipblaseVersionMinor remains defined +* Added hipblasVersionMinor define. hipblaseVersionMinor remains defined for backwards compatibility. -- Doxygen warnings in hipblas.h header file. +* Doxygen warnings in hipblas.h header file. 
##### Changed -- rocblas-path command line option can be specified as either absolute or relative path -- Help message improvements in install.sh and rmake.py -- Updated googletest dependency from 1.10.0 to 1.11.0 +* rocblas-path command line option can be specified as either absolute or relative path +* Help message improvements in install.sh and rmake.py +* Updated googletest dependency from 1.10.0 to 1.11.0 #### hipCUB 2.11.0 @@ -4058,16 +4048,16 @@ hipCUB 2.11.0 for ROCm 5.1.0 ##### Added -- Device segmented sort -- Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB -- Device three way partition +* Device segmented sort +* Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB +* Device three way partition ##### Changed -- Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type. - - This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output). - - And low-res input with high-res output (e.g. float input, double output) - - Block merge sort no longer supports non power of two blocksizes +* Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type. + * This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output). + * And low-res input with high-res output (e.g. float input, double output) + * Block merge sort no longer supports non power of two blocksizes #### hipFFT 1.0.7 @@ -4075,7 +4065,7 @@ hipFFT 1.0.7 for ROCm 5.1.0 ##### Changed -- Use fft_params struct for accuracy and benchmark clients. +* Use fft_params struct for accuracy and benchmark clients. 
#### hipSOLVER 1.3.0 @@ -4083,38 +4073,38 @@ hipSOLVER 1.3.0 for ROCm 5.1.0 ##### Added -- Added functions - - gels - - hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize - - hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels -- Added library version and device information to hipsolver-test output. -- Added compatibility API with hipsolverDn prefix. -- Added compatibility-only functions - - gesvdj - - hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize - - hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj - - gesvdjBatched - - hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize - - hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched - - syevj - - hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize - - hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj - - syevjBatched - - hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize - - hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched - - sygvj - - hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize - - hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj +* Added functions + * gels + * hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize + * hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels +* Added library version and device information to hipsolver-test output. 
+* Added compatibility API with hipsolverDn prefix. +* Added compatibility-only functions + * gesvdj + * hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize + * hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj + * gesvdjBatched + * hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize + * hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched + * syevj + * hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize + * hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj + * syevjBatched + * hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize + * hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched + * sygvj + * hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize + * hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj ##### Changed -- The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place when B == X. -- The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd. +* The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place when B == X. +* The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd. ##### Fixed -- bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null. 
-- Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER. +* bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null. +* Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER. #### hipSPARSE 2.1.0 @@ -4122,21 +4112,21 @@ hipSPARSE 2.1.0 for ROCm 5.1.0 ##### Added -- Added gtsv_interleaved_batch and gpsv_interleaved_batch routines -- Add SpGEMM_reuse +* Added gtsv_interleaved_batch and gpsv_interleaved_batch routines +* Add SpGEMM_reuse ##### Changed -- Changed BUILD_CUDA with USE_CUDA in install script and cmake files -- Update googletest to 11.1 +* Changed BUILD_CUDA with USE_CUDA in install script and cmake files +* Update googletest to 11.1 ##### Improved -- Fixed a bug in SpMM Alg versioning +* Fixed a bug in SpMM Alg versioning ##### Known Issues -- none +* none #### rccl 2.11.4 @@ -4144,11 +4134,11 @@ RCCL 2.11.4 for ROCm 5.1.0 ##### Added -- Compatibility with NCCL 2.11.4 +* Compatibility with NCCL 2.11.4 ##### Known Issues -- Managed memory is not currently supported for clique-based kernels +* Managed memory is not currently supported for clique-based kernels #### rocALUTION 2.0.2 @@ -4156,8 +4146,8 @@ rocALUTION 2.0.2 for ROCm 5.1.0 ##### Added -- Added out-of-place matrix transpose functionality -- Added LocalVector<bool> +* Added out-of-place matrix transpose functionality +* Added LocalVector<bool> #### rocBLAS 2.43.0 @@ -4165,32 +4155,32 @@ rocBLAS 2.43.0 for ROCm 5.1.0 ##### Added -- Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs) -- Option to install script to build clients without using any Fortran (--clients_no_fortran) -- rocblas_client_initialize function, to perform rocBLAS initialize for clients(benchmark/test) and report the execution time. 
-- Added tests for output of reduction functions when given bad input -- Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench +* Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs) +* Option to install script to build clients without using any Fortran (--clients_no_fortran) +* rocblas_client_initialize function, to perform rocBLAS initialize for clients (benchmark/test) and report the execution time. +* Added tests for output of reduction functions when given bad input +* Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench ##### Optimizations -- Improved performance of trsm with side == left and n == 1 -- Improved perforamnce of trsm with side == left and m <= 32 along with side == right and n <= 32 +* Improved performance of trsm with side == left and n == 1 +* Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32 ##### Changed -- For syrkx and trmm internal API use rocblas_stride datatype for offset -- For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match -- Test client dependencies updated to GTest 1.11 -- non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives. -- Help menu messages in install.sh -- For ger function, typecast the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
-- Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench +* For syrkx and trmm internal API use rocblas_stride datatype for offset +* For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match +* Test client dependencies updated to GTest 1.11 +* non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives. +* Help menu messages in install.sh +* For ger function, typecast the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions. +* Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench ##### Fixed -- For function trmv (non-transposed cases) avoid overflow in offset calculation -- Fixed cppcheck errors/warnings -- Fixed doxygen warnings +* For function trmv (non-transposed cases) avoid overflow in offset calculation +* Fixed cppcheck errors/warnings +* Fixed doxygen warnings #### rocFFT 1.0.16 @@ -4198,23 +4188,23 @@ rocFFT 1.0.16 for ROCm 5.1.0 ##### Changed -- Supported unaligned tile dimension for SBRC_2D kernels. -- Improved (more RAII) test and benchmark infrastructure. -- Enabled runtime compilation of length-2304 FFT kernel during plan creation. +* Supported unaligned tile dimension for SBRC_2D kernels. +* Improved (more RAII) test and benchmark infrastructure. +* Enabled runtime compilation of length-2304 FFT kernel during plan creation. ##### Optimizations -- Optimized more large 1D cases by using L1D_CC plan. -- Optimized 3D 200^3 C2R case. -- Optimized 1D 2^30 double precision on MI200. +* Optimized more large 1D cases by using L1D_CC plan. +* Optimized 3D 200^3 C2R case. +* Optimized 1D 2^30 double precision on MI200. 
##### Fixed -- Fixed correctness of some R2C transforms with unusual strides. +* Fixed correctness of some R2C transforms with unusual strides. ##### Removed -- The hipFFT API (header) has been removed from after a long deprecation period. Please use the [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package/repository to obtain the hipFFT API. +* The hipFFT API (header) has been removed from rocFFT after a long deprecation period. Please use the [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package/repository to obtain the hipFFT API. #### rocPRIM 2.10.13 @@ -4222,20 +4212,20 @@ rocPRIM 2.10.13 for ROCm 5.1.0 ##### Fixed -- Fixed radix sort int64_t bug introduced in [2.10.11] +* Fixed radix sort int64_t bug introduced in \[2.10.11] ##### Added -- Future value -- Added device partition_three_way to partition input to three output iterators based on two predicates +* Future value +* Added device partition_three_way to partition input to three output iterators based on two predicates ##### Changed -- The reduce/scan algorithm precision issues in the tests has been resolved for half types. +* The reduce/scan algorithm precision issues in the tests have been resolved for half types. ##### Known Issues -- device_segmented_radix_sort unit test failing for HIP on Windows +* device_segmented_radix_sort unit test failing for HIP on Windows #### rocRAND 2.10.13 @@ -4243,30 +4233,30 @@ rocRAND 2.10.13 for ROCm 5.1.0 ##### Added -- Generating a random sequence different sizes now produces the same sequence without gaps +* Generating a random sequence of different sizes now produces the same sequence without gaps indepent of how many values are generated per call.
- - Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64 - - This only holds true if the size in each call is a divisor of the distributions + * Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64 + * This only holds true if the size in each call is a divisor of the distribution's `output_width` due to performance - - Similarly the output pointer has to be aligned to `output_width * sizeof(output_type)` + * Similarly the output pointer has to be aligned to `output_width * sizeof(output_type)` ##### Changed -- [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND.git) split into a separate package -- Header file installation location changed to match other libraries. - - Using the `rocrand.h` header file should now use `#include <rocrand/rocrand.h>`, rather than `#include <rocrand/rocrand.h>` -- rocRAND still includes hipRAND using a submodule - - The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it. +* [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND.git) split into a separate package +* Header file installation location changed to match other libraries. + * Using the `rocrand.h` header file should now use `#include <rocrand/rocrand.h>`, rather than `#include <rocrand.h>` +* rocRAND still includes hipRAND using a submodule + * The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it. ##### Fixed -- Fix offset behaviour for XORWOW, MRG32K3A and PHILOX4X32_10 generator, setting offset now +* Fix offset behaviour for XORWOW, MRG32K3A and PHILOX4X32_10 generator, setting offset now correctly generates the same sequence starting from the offset.
- - Only uniform int and float will work as these can be generated with a single call to the generator + * Only uniform int and float will work as these can be generated with a single call to the generator ##### Known Issues -- kernel_xorwow unit test is failing for certain GPU architectures. +* kernel_xorwow unit test is failing for certain GPU architectures. #### rocSOLVER 3.17.0 @@ -4274,16 +4264,16 @@ rocSOLVER 3.17.0 for ROCm 5.1.0 ##### Optimized -- Optimized non-pivoting and batch cases of the LU factorization +* Optimized non-pivoting and batch cases of the LU factorization ##### Fixed -- Fixed missing synchronization in SYTRF with `rocblas_fill_lower` that could potentially +* Fixed missing synchronization in SYTRF with `rocblas_fill_lower` that could potentially result in incorrect pivot values. -- Fixed multi-level logging output to file with the `ROCSOLVER_LOG_PATH`, +* Fixed multi-level logging output to file with the `ROCSOLVER_LOG_PATH`, `ROCSOLVER_LOG_TRACE_PATH`, `ROCSOLVER_LOG_BENCH_PATH` and `ROCSOLVER_LOG_PROFILE_PATH` environment variables. 
-- Fixed performance regression in the batched LU factorization of tiny matrices +* Fixed performance regression in the batched LU factorization of tiny matrices #### rocSPARSE 2.1.0 @@ -4291,19 +4281,19 @@ rocSPARSE 2.1.0 for ROCm 5.1.0 ##### Added -- gtsv_interleaved_batch -- gpsv_interleaved_batch -- SpGEMM_reuse -- Allow copying of mat info struct +* gtsv_interleaved_batch +* gpsv_interleaved_batch +* SpGEMM_reuse +* Allow copying of mat info struct ##### Improved -- Optimization for SDDMM -- Allow unsorted matrices in csrgemm multipass algorithm +* Optimization for SDDMM +* Allow unsorted matrices in csrgemm multipass algorithm ##### Known Issues -- none +* none #### rocThrust 2.14.0 @@ -4311,11 +4301,11 @@ rocThrust 2.14.0 for ROCm 5.1.0 ##### Added -- Updated to match upstream Thrust 1.15.0 +* Updated to match upstream Thrust 1.15.0 ##### Known Issues -- async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows. +* async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows. 
#### Tensile 4.32.0 @@ -4323,24 +4313,24 @@ Tensile 4.32.0 for ROCm 5.1.0 ##### Added -- Better control of parallelism to control memory usage -- Support for multiprocessing on Windows for TensileCreateLibrary -- New JSD metric and metric selection functionality -- Initial changes to support two-tier solution selection +* Better control of parallelism to control memory usage +* Support for multiprocessing on Windows for TensileCreateLibrary +* New JSD metric and metric selection functionality +* Initial changes to support two-tier solution selection ##### Optimized -- Optimized runtime of TensileCreateLibraries by reducing max RAM usage -- StoreCInUnroll additional optimizations plus adaptive K support -- DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support +* Optimized runtime of TensileCreateLibraries by reducing max RAM usage +* StoreCInUnroll additional optimizations plus adaptive K support +* DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support ##### Changed -- Update Googletest to 1.11.0 +* Update Googletest to 1.11.0 ##### Removed -- Remove no longer supported benchmarking steps +* Remove no longer supported benchmarking steps ------------------- @@ -4360,9 +4350,9 @@ The resolution includes a compiler change, which emits the required metadata by Note: This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release. -Compatibility Matrix Updates to ROCm Deep Learning Guide +Compatibility Matrix Updates to the [Deep-learning guide](./how-to/deep-learning-rocm.md). -The compatibility matrix in the AMD Deep Learning Guide is updated for ROCm v5.0.2. +The compatibility matrix in the [Deep-learning guide](./how-to/deep-learning-rocm.md) is updated for ROCm v5.0.2. 
### Library Changes in ROCM 5.0.2 @@ -4697,13 +4687,10 @@ As a workaround, users can run amdgpu.runpm=0, which temporarily disables the ru Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel: -- Dispatch - -- Start - -- End - -- Complete +* Dispatch +* Start +* End +* Complete ##### Issue @@ -4756,14 +4743,14 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV #### ROCm Libraries Changes – Deprecations and Deprecation Removal -- The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too. +* The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too. -- The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class instead. +* The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class instead. -- The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. Signature diff for rocsparse_spmm +* The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. 
Signature diff for rocsparse_spmm rocsparse_spmm in 5.0 - ```h + ```cpp rocsparse_status rocsparse_spmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, @@ -4781,7 +4768,7 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV rocSPARSE_spmm in 4.0 - ```h + ```cpp rocsparse_status rocsparse_spmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, @@ -4802,9 +4789,9 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV In this release, arithmetic operators of HIP complex and vector types are deprecated. -- As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of `std::complex` types. +* As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of `std::complex` types. -- As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types. +* As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types. During the deprecation, two macros `_HIP_ENABLE_COMPLEX_OPERATORS` and `_HIP_ENABLE_VECTOR_OPERATORS` are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types. 
@@ -4850,19 +4837,19 @@ hipBLAS 0.49.0 for ROCm 5.0.0 ##### Added -- Added rocSOLVER functions to hipblas-bench -- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex -- Added compilation warning for future trmm changes -- Added documentation to hipblas.h -- Added option to forgo pivoting for getrf and getri when ipiv is nullptr -- Added code coverage option +* Added rocSOLVER functions to hipblas-bench +* Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex +* Added compilation warning for future trmm changes +* Added documentation to hipblas.h +* Added option to forgo pivoting for getrf and getri when ipiv is nullptr +* Added code coverage option ##### Fixed -- Fixed use of incorrect 'HIP_PATH' when building from source. -- Fixed windows packaging -- Allowing negative increments in hipblas-bench -- Removed boost dependency +* Fixed use of incorrect 'HIP_PATH' when building from source. +* Fixed windows packaging +* Allowing negative increments in hipblas-bench +* Removed boost dependency #### hipCUB 2.10.13 @@ -4870,18 +4857,18 @@ hipCUB 2.10.13 for ROCm 5.0.0 ##### Fixed -- Added missing includes to hipcub.hpp +* Added missing includes to hipcub.hpp ##### Added -- Bfloat16 support to test cases (device_reduce & device_radix_sort) -- Device merge sort -- Block merge sort -- API update to CUB 1.14.0 +* Bfloat16 support to test cases (device_reduce & device_radix_sort) +* Device merge sort +* Block merge sort +* API update to CUB 1.14.0 ##### Changed -- The SetupNVCC.cmake automatic target selector select all of the capabalities of all available card for NVIDIA backend. +* The SetupNVCC.cmake automatic target selector selects all of the capabilities of all available cards for the NVIDIA backend. #### hipFFT 1.0.4 @@ -4889,12 +4876,12 @@ hipFFT 1.0.4 for ROCm 5.0.0 ##### Fixed -- Add calls to rocFFT setup/cleanup. -- Cmake fixes for clients and backend support.
+* Add calls to rocFFT setup/cleanup. +* Cmake fixes for clients and backend support. ##### Added -- Added support for Windows 10 as a build target. +* Added support for Windows 10 as a build target. #### hipSOLVER 1.2.0 @@ -4902,14 +4889,14 @@ hipSOLVER 1.2.0 for ROCm 5.0.0 ##### Added -- Added functions - - sytrf - - hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize - - hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf +* Added functions + * sytrf + * hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize + * hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf ##### Fixed -- Fixed use of incorrect `HIP_PATH` when building from source (#40). +* Fixed use of incorrect `HIP_PATH` when building from source (#40). Thanks [@jakub329homola](https://github.com/jakub329homola)! #### hipSPARSE 2.0.0 @@ -4918,7 +4905,7 @@ hipSPARSE 2.0.0 for ROCm 5.0.0 ##### Added -- Added (conjugate) transpose support for csrmv, hybmv and spmv routines +* Added (conjugate) transpose support for csrmv, hybmv and spmv routines #### rccl 2.10.3 @@ -4926,11 +4913,11 @@ RCCL 2.10.3 for ROCm 5.0.0 ##### Added -- Compatibility with NCCL 2.10.3 +* Compatibility with NCCL 2.10.3 ##### Known Issues -- Managed memory is not currently supported for clique-based kernels +* Managed memory is not currently supported for clique-based kernels #### rocALUTION 2.0.1 @@ -4938,13 +4925,13 @@ rocALUTION 2.0.1 for ROCm 5.0.0 ##### Changed -- Removed deprecated GlobalPairwiseAMG class, please use PairwiseAMG instead. -- Changed to C++ 14 Standard +* Removed deprecated GlobalPairwiseAMG class, please use PairwiseAMG instead. 
+* Changed to C++ 14 Standard ##### Improved -- Added sanitizer option -- Improved documentation +* Added sanitizer option +* Improved documentation #### rocBLAS 2.42.0 @@ -4952,30 +4939,30 @@ rocBLAS 2.42.0 for ROCm 5.0.0 ##### Added -- Added rocblas_get_version_string_size convenience function -- Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm -- Added hpl and trig initialization for gemm_ex to rocblas-bench -- Added source code gemm. It can be used as an alternative to Tensile for debugging and development -- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex +* Added rocblas_get_version_string_size convenience function +* Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm +* Added hpl and trig initialization for gemm_ex to rocblas-bench +* Added source code gemm. It can be used as an alternative to Tensile for debugging and development +* Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex ##### Optimizations -- Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU. -- Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU. +* Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU. +* Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU. 
##### Changed -- Instantiate templated rocBLAS functions to reduce size of librocblas.so -- Removed static library dependency on msgpack -- Removed boost dependencies for clients +* Instantiate templated rocBLAS functions to reduce size of librocblas.so +* Removed static library dependency on msgpack +* Removed boost dependencies for clients ##### Fixed -- Option to install script to build only rocBLAS clients with a pre-built rocBLAS library -- Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input -- Fix for dgmm with side == rocblas_side_left and a negative incx -- Fixed out-of-bounds read for small trsm -- Fixed numerical checking for tbmv_strided_batched +* Option to install script to build only rocBLAS clients with a pre-built rocBLAS library +* Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input +* Fix for dgmm with side == rocblas_side_left and a negative incx +* Fixed out-of-bounds read for small trsm +* Fixed numerical checking for tbmv_strided_batched #### rocFFT 1.0.13 @@ -4983,24 +4970,24 @@ rocFFT 1.0.13 for ROCm 5.0.0 ##### Optimizations -- Improved many plans by removing unnecessary transpose steps. -- Optimized scheme selection for 3D problems. - - Imposed less restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and +* Improved many plans by removing unnecessary transpose steps. +* Optimized scheme selection for 3D problems. + * Imposed less restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and have some performance gain. - - Enabled 3D_RC. Some 3D problems with SBCC-supported z-dim can use less kernels and get benefit. - - Force --length 336 336 56 (dp) use faster 3D_RC to avoid it from being skipped by conservative + * Enabled 3D_RC. Some 3D problems with SBCC-supported z-dim can use less kernels and get benefit. + * Force --length 336 336 56 (dp) use faster 3D_RC to avoid it from being skipped by conservative threshold test. 
-- Optimized some even-length R2C/C2R cases by doing more operations +* Optimized some even-length R2C/C2R cases by doing more operations in-place and combining pre/post processing into Stockham kernels. -- Added radix-17. +* Added radix-17. ##### Added -- Added new kernel generator for select fused-2D transforms. +* Added new kernel generator for select fused-2D transforms. ##### Fixed -- Improved large 1D transform decompositions. +* Improved large 1D transform decompositions. #### rocPRIM 2.10.12 @@ -5008,37 +4995,37 @@ rocPRIM 2.10.12 for ROCm 5.0.0 ##### Fixed -- Enable bfloat16 tests and reduce threshold for bfloat16 -- Fix device scan limit_size feature -- Non-optimized builds no longer trigger local memory limit errors +* Enable bfloat16 tests and reduce threshold for bfloat16 +* Fix device scan limit_size feature +* Non-optimized builds no longer trigger local memory limit errors ##### Added -- Added scan size limit feature -- Added reduce size limit feature -- Added transform size limit feature -- Add block_load_striped and block_store_striped -- Add gather_to_blocked to gather values from other threads into a blocked arrangement -- The block sizes for device merge sorts initial block sort and its merge steps are now separate in its kernel config - - the block sort step supports multiple items per thread +* Added scan size limit feature +* Added reduce size limit feature +* Added transform size limit feature +* Add block_load_striped and block_store_striped +* Add gather_to_blocked to gather values from other threads into a blocked arrangement +* The block sizes for device merge sorts initial block sort and its merge steps are now separate in its kernel config + * the block sort step supports multiple items per thread ##### Changed -- size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter -- Device_scan and device_segmented_scan: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` 
uses initial-value-type. - - This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output). - - And low-res input with high-res output (e.g. `float` input, `double` output) -- Revert old Fiji workaround, because they solved the issue at compiler side -- Update README cmake minimum version number -- Block sort support multiple items per thread - - currently only powers of two block sizes, and items per threads are supported and only for full blocks -- Bumped the minimum required version of CMake to 3.16 +* size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter +* Device_scan and device_segmented_scan: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` uses initial-value-type. + * This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output). + * And low-res input with high-res output (e.g. `float` input, `double` output) +* Revert old Fiji workaround, because they solved the issue at compiler side +* Update README cmake minimum version number +* Block sort support multiple items per thread + * currently only powers of two block sizes, and items per threads are supported and only for full blocks +* Bumped the minimum required version of CMake to 3.16 ##### Known Issues -- Unit tests may soft hang on MI200 when running in hipMallocManaged mode. -- device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows -- ReduceEmptyInput cause random faulire with bfloat16 +* Unit tests may soft hang on MI200 when running in hipMallocManaged mode. +* device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows +* ReduceEmptyInput cause random faulire with bfloat16 #### rocRAND 2.10.12 @@ -5046,7 +5033,7 @@ rocRAND 2.10.12 for ROCm 5.0.0 ##### Changed -- No updates or changes for ROCm 5.0.0. +* No updates or changes for ROCm 5.0.0. 
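The accumulator-type change listed above for the rocPRIM 2.10.12 `inclusive_scan` can be pictured with a small host-side sketch. The function `inclusive_scan_with_acc` is hypothetical and is not the rocPRIM implementation; it only makes the accumulator type an explicit template parameter so that the effect of a narrow accumulator is visible. Unsigned types are used so that wraparound is well defined in standard C++.

```cpp
#include <cstddef>
#include <vector>

// Illustrative host-side inclusive scan (an assumption for demonstration,
// not library code): the running sum is held in Acc and then stored into
// the output element type Out.
template <typename Acc, typename In, typename Out>
std::vector<Out> inclusive_scan_with_acc(const std::vector<In>& input) {
    std::vector<Out> output;
    output.reserve(input.size());
    Acc running{};  // value-initialized accumulator (zero for arithmetic types)
    for (const In& value : input) {
        // The sum is narrowed back to Acc on every step, so a small Acc
        // can wrap even when Out is wide enough to hold the true sum.
        running = static_cast<Acc>(running + value);
        output.push_back(static_cast<Out>(running));
    }
    return output;
}
```

Assuming a 16-bit `unsigned short`, the input `{40000, 40000}` accumulated in `unsigned int` ends with `80000`, whereas accumulating in `unsigned short` wraps to `14464` (80000 - 65536) even though the output type is wider.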
#### rocSOLVER 3.16.0 @@ -5054,27 +5041,27 @@ rocSOLVER 3.16.0 for ROCm 5.0.0 ##### Added -- Symmetric matrix factorizations: - - LASYF - - SYTF2, SYTRF (with batched and strided\_batched versions) -- Added `rocsolver_get_version_string_size` to help with version string queries -- Added `rocblas_layer_mode_ex` and the ability to print kernel calls in the trace and profile logs -- Expanded batched and strided\_batched sample programs. +* Symmetric matrix factorizations: + * LASYF + * SYTF2, SYTRF (with batched and strided\_batched versions) +* Added `rocsolver_get_version_string_size` to help with version string queries +* Added `rocblas_layer_mode_ex` and the ability to print kernel calls in the trace and profile logs +* Expanded batched and strided\_batched sample programs. ##### Optimized -- Improved general performance of LU factorization -- Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems. +* Improved general performance of LU factorization +* Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems. 
##### Changed -- The rocsolver-test client now prints the rocSOLVER version used to run the tests, +* The rocsolver-test client now prints the rocSOLVER version used to run the tests, rather than the version used to build them -- The rocsolver-bench client now prints the rocSOLVER version used in the benchmark +* The rocsolver-bench client now prints the rocSOLVER version used in the benchmark ##### Fixed -- Added missing stdint.h include to rocsolver.h +* Added missing stdint.h include to rocsolver.h #### rocSPARSE 2.0.0 @@ -5082,16 +5069,16 @@ rocSPARSE 2.0.0 for ROCm 5.0.0 ##### Added -- csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices -- csrmv for symmetric matrices +* csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices +* csrmv for symmetric matrices ##### Changed -- spmm\_ex is now deprecated and will be removed in the next major release +* spmm\_ex is now deprecated and will be removed in the next major release ##### Improved -- Optimization for gtsv +* Optimization for gtsv #### rocThrust 2.13.0 @@ -5099,15 +5086,15 @@ rocThrust 2.13.0 for ROCm 5.0.0 ##### Added -- Updated to match upstream Thrust 1.13.0 -- Updated to match upstream Thrust 1.14.0 -- Added async scan +* Updated to match upstream Thrust 1.13.0 +* Updated to match upstream Thrust 1.14.0 +* Added async scan ##### Changed -- Scan algorithms: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` uses initial-value-type. - - This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output). - - And low-res input with high-res output (e.g. `float` input, `double` output) +* Scan algorithms: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` uses initial-value-type. + * This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output). + * And low-res input with high-res output (e.g. 
`float` input, `double` output) #### Tensile 4.31.0 @@ -5115,28 +5102,28 @@ Tensile 4.31.0 for ROCm 5.0.0 ##### Added -- DirectToLds support (x2/x4) -- DirectToVgpr support for DGEMM -- Parameter to control number of files kernels are merged into to better parallelize kernel compilation -- FP16 alternate implementation for HPA HGEMM on aldebaran +* DirectToLds support (x2/x4) +* DirectToVgpr support for DGEMM +* Parameter to control number of files kernels are merged into to better parallelize kernel compilation +* FP16 alternate implementation for HPA HGEMM on aldebaran ##### Optimized -- Add DGEMM NN custom kernel for HPL on aldebaran +* Add DGEMM NN custom kernel for HPL on aldebaran ##### Changed -- Update tensile_client executable to std=c++14 +* Update tensile_client executable to std=c++14 ##### Removed -- Remove unused old Tensile client code +* Remove unused old Tensile client code ##### Fixed -- Fix hipErrorInvalidHandle during benchmarks -- Fix addrVgpr for atomic GSU -- Fix for Python 3.8: add case for Constant nodeType -- Fix architecture mapping for gfx1011 and gfx1012 -- Fix PrintSolutionRejectionReason verbiage in KernelWriter.py -- Fix vgpr alignment problem when enabling flat buffer load +* Fix hipErrorInvalidHandle during benchmarks +* Fix addrVgpr for atomic GSU +* Fix for Python 3.8: add case for Constant nodeType +* Fix architecture mapping for gfx1011 and gfx1012 +* Fix PrintSolutionRejectionReason verbiage in KernelWriter.py +* Fix vgpr alignment problem when enabling flat buffer load diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index c5e65aa28..84426f8ae 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -35,16 +35,16 @@ guide on writing and formatting on GitHub as a starting point. ROCm documentation adds additional requirements to Markdown and RST based files as follows: -- Level one headers are only used for page titles. There must be only one level +* Level one headers are only used for page titles. 
There must be only one level 1 header per file for both Markdown and Restructured Text. -- Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via +* Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via our automated GitHub action on a Pull Request (PR). See the {doc}`rocm-docs-core linting user guide ` for more details. ## Filenames and folder structure -Please use snake case (all lower case letters and underscores instead of spaces) -for file names. For example, `example_file_name.md`. +Please use kebab-case (all lower case letters and dashes instead of spaces) +for file names. For example, `example-file-name.md`. Our documentation follows Pitchfork for folder structure. All documentation is in `/docs` except for special files like the contributing guide in the `/` folder. All images used in the documentation are diff --git a/RELEASE.md b/RELEASE.md index 3ce53501f..a3803f1da 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -27,7 +27,7 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime. ### Fixed Defects -- *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host -- Enabled xnack+ check in HIP catch2 tests hang when executing tests -- Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs -- Using *hipGraphAddMemFreeNode* no longer results in a crash +* *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host +* Enabled xnack+ check in HIP catch2 tests hang when executing tests +* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs +* Using *hipGraphAddMemFreeNode* no longer results in a crash diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md new file mode 100644 index 000000000..8d9685d04 --- /dev/null +++ b/docs/CHANGELOG.md @@ -0,0 +1,5129 @@ +# Release Notes + + + + + + + + + + + + +The release notes for the ROCm platform. 
+ +------------------- + +## ROCm 5.6.1 + + + +### What's New in This Release + +ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime. + +## HIP 5.6.1 (for ROCm 5.6.1) + +### Fixed Defects + +* *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host +* Enabled xnack+ check in HIP catch2 tests hang when executing tests +* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs +* Using *hipGraphAddMemFreeNode* no longer results in a crash + +------------------- + +## ROCm 5.6.0 + + + + +### Release Highlights + +ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base. A few examples include: + +* New documentation portal at https://rocm.docs.amd.com +* Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite +* OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements +* Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers +* New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER. + +### OS and GPU Support Changes + +* SLES15 SP5 support was added this release. SLES15 SP3 support was dropped. +* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date. 
+ * No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7 + * Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance \[EOM])(will be aligned with the closest ROCm release) + * Bug fixes during the maintenance will be made to the next ROCm point release + * Bug fixes will not be back ported to older ROCm releases for this SKU + * Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM. + +### AMDSMI CLI 23.0.0.4 + +#### Added + +* AMDSMI CLI tool enabled for Linux Bare Metal & Guest + +* Package: amd-smi-lib + +#### Known Issues + +* not all Error Correction Code (ECC) fields are currently supported + +* RHEL 8 & SLES 15 have extra install steps + +### Kernel Modules (DKMS) + +#### Fixes + +* Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198). + +#### HIP 5.6 (For ROCm 5.6) + +##### Optimizations + +* Consolidation of hipamd, rocclr and OpenCL projects in clr +* Optimized lock for graph global capture mode + +##### Added + +* Added hipRTC support for amd_hip_fp16 +* Added hipStreamGetDevice implementation to get the device associated with the stream +* Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats +* hipArrayGetInfo for getting information about the specified array +* hipArrayGetDescriptor for getting 1D or 2D array descriptor +* hipArray3DGetDescriptor to get 3D array descriptor + +##### Changed + +* hipMallocAsync to return success for zero size allocation to match hipMalloc +* Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package +* Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. 
Instructions are updated to build HIP from sources in the HIP Installation guide +* Removed hipBusBandwidth and hipCommander samples from hip-tests + +##### Fixed + +* Fixed regression in hipMemCpyParam3D when offset is applied + +##### Known Issues + +* Limited testing on xnack+ configuration + * Multiple HIP tests failures (gpuvm fault or hangs) +* hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU +* Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release + +##### Upcoming changes in future release + +* Removal of gcnarch from hipDeviceProp_t structure +* Addition of new fields in hipDeviceProp_t structure + * maxTexture1D + * maxTexture2D + * maxTexture1DLayered + * maxTexture2DLayered + * sharedMemPerMultiprocessor + * deviceOverlap + * asyncEngineCount + * surfaceAlignment + * unifiedAddressing + * computePreemptionSupported + * uuid +* Removal of deprecated code + * hip-hcc codes from hip code tree +* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA +* HIPMEMCPY_3D fields correction (unsigned int -> size_t) +* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' + +#### ROCgdb-13 (For ROCm 5.6.0) + +##### Optimized + +* Improved performances when handling the end of a process with a large number of threads. + +Known Issues + +* On certain configurations, ROCgdb can show the following warning message: + + `warning: Probes-based dynamic linker interface failed. Reverting to original interface.` + + This does not affect ROCgdb's functionalities. + +#### ROCprofiler (For ROCm 5.6.0) + +In ROCm 5.6 the `rocprofilerv1` and `rocprofilerv2` include and library files of +ROCm 5.5 are split into separate files. The `rocmtools` files that were +deprecated in ROCm 5.5 have been removed. 
+ + | ROCm 5.6 | rocprofilerv1 | rocprofilerv2 | + |-----------------|-------------------------------------|----------------------------------------| + | **Tool script** | `bin/rocprof` | `bin/rocprofv2` | + | **API include** | `include/rocprofiler/rocprofiler.h` | `include/rocprofiler/v2/rocprofiler.h` | + | **API library** | `lib/librocprofiler.so.1` | `lib/librocprofiler.so.2` | + +The ROCm Profiler Tool that uses `rocprofilerV1` can be invoked using the +following command: + +```sh +$ rocprof … +``` + +To write a custom tool based on the `rocprofilerV1` API do the following: + +```C +main.c: +#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API +int main() { + // Use the rocprofilerV1 API + return 0; +} +``` + +This can be built in the following manner: + +```sh +$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64 +``` + +The resulting `a.out` will depend on +`/opt/rocm-5.6.0/lib/librocprofiler64.so.1`. + +The ROCm Profiler that uses `rocprofilerV2` API can be invoked using the +following command: + +```sh +$ rocprofv2 … +``` + +To write a custom tool based on the `rocprofilerV2` API do the following: + +```C +main.c: +#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API +int main() { + // Use the rocprofilerV2 API + return 0; +} +``` + +This can be built in the following manner: + +```sh +$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2 +``` + +The resulting `a.out` will depend on +`/opt/rocm-5.6.0/lib/librocprofiler64.so.2`. + +##### Optimized + +* Improved Test Suite + +##### Added + +* 'end_time' need to be disabled in roctx_trace.txt + +##### Fixed + +* rocprof in ROcm/5.4.0 gpu selector broken. +* rocprof in ROCm/5.4.1 fails to generate kernel info. +* rocprof clobbers LD_PRELOAD.
+ +### Library Changes in ROCM 5.6.0 + +| Library | Version | +|---------|---------| +| hipBLAS | ⇒ [1.0.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.6.0) | +| hipCUB | ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.6.0) | +| hipFFT | ⇒ [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.6.0) | +| hipSOLVER | ⇒ [1.8.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.6.0) | +| hipSPARSE | ⇒ [2.3.6](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.6.0) | +| MIOpen | ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.6.0) | +| rccl | ⇒ [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.6.0) | +| rocALUTION | ⇒ [2.1.9](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.6.0) | +| rocBLAS | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.6.0) | +| rocFFT | ⇒ [1.0.23](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.6.0) | +| rocm-cmake | ⇒ [0.9.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.6.0) | +| rocPRIM | ⇒ [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.6.0) | +| rocRAND | ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.6.0) | +| rocSOLVER | ⇒ [3.22.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.6.0) | +| rocSPARSE | ⇒ [2.5.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.6.0) | +| rocThrust | ⇒ [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.6.0) | +| rocWMMA | ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.6.0) | +| Tensile | ⇒ [4.37.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.6.0) | + +#### hipBLAS 1.0.0 + +hipBLAS 1.0.0 for ROCm 5.6.0 + +##### Changed + +* added const qualifier to hipBLAS functions 
(swap, sbmv, spmv, symv, trsm) where missing + +##### Removed + +* removed support for deprecated hipblasInt8Datatype_t enum +* removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions + +##### Deprecated + +* in-place trmm is deprecated. It will be replaced by trmm which includes both in-place and + out-of-place functionality + +#### hipCUB 2.13.1 + +hipCUB 2.13.1 for ROCm 5.6.0 + +##### Added + +* Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. + +##### Changed + +* CUB backend references CUB and Thrust version 1.17.2. +* Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. +* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). + +##### Known Issues + +* `BlockRadixRankMatch` is currently broken under the rocPRIM backend. +* `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. + +#### hipFFT 1.0.12 + +hipFFT 1.0.12 for ROCm 5.6.0 + +##### Added + +* Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms. + +##### Changed + +* Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. 
+ +#### hipSOLVER 1.8.0 + +hipSOLVER 1.8.0 for ROCm 5.6.0 + +##### Added + +* Added compatibility API with hipsolverRf prefix + +#### hipSPARSE 2.3.6 + +hipSPARSE 2.3.6 for ROCm 5.6.0 + +##### Added + +* Added SpGEMM algorithms + +##### Changed + +* For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE + +#### MIOpen 2.19.0 + +MIOpen 2.19.0 for ROCm 5.6.0 + +##### Added + +* ROCm 5.5 support for gfx1101 (Navi32) + +##### Changed + +* Tuning results for MLIR on ROCm 5.5 +* Bumping MLIR commit to 5.5.0 release tag + +##### Fixed + +* Fix 3d convolution Host API bug +* \[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. + +#### rccl 2.15.5 + +RCCL 2.15.5 for ROCm 5.6.0 + +##### Changed + +* Compatibility with NCCL 2.15.5 +* Unit test executable renamed to rccl-UnitTests + +##### Added + +* HW-topology aware binary tree implementation +* Experimental support for MSCCL +* New unit tests for hipGraph support +* NPKit integration + +##### Fixed + +* rocm-smi ID conversion +* Support for HIP_VISIBLE_DEVICES for unit tests +* Support for p2p transfers to non (HIP) visible devices + +##### Removed + +* Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench + +#### rocALUTION 2.1.9 + +rocALUTION 2.1.9 for ROCm 5.6.0 + +##### Improved + +* Fixed synchronization issues in level 1 routines + +#### rocBLAS 3.0.0 + +rocBLAS 3.0.0 for ROCm 5.6.0 + +##### Optimizations + +* Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256. +* Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a. 
+ +##### Added + +* Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex. + +##### Deprecated + +* trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality +* rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release +* rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release +* rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size() +* rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release + +##### Removed + +* is_complex helper was deprecated and now removed. Use rocblas_is_complex instead. +* The enum truncate_t and the value truncate was deprecated and now removed from. It was replaced by rocblas_truncate_t and rocblas_truncate, respectively. +* rocblas_set_int8_type_for_hipblas was deprecated and is now removed. +* rocblas_get_int8_type_for_hipblas was deprecated and is now removed. + +##### Dependencies + +* build only dependency on python joblib added as used by Tensile build +* fix for cmake install on some OS when performed by install.sh -d --cmake_install + +##### Fixed + +* make trsm offset calculations 64 bit safe + +##### Changed + +* refactor rotg test code + +#### rocFFT 1.0.23 + +rocFFT 1.0.23 for ROCm 5.6.0 + +##### Added + +* Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create. + Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used. +* Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems. + +##### Changed + +* Replaced std::complex with hipComplex data types for data generator. 
+* FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example). +* Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. + +##### Fixed + +* Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure. + +#### rocm-cmake 0.9.0 + +rocm-cmake 0.9.0 for ROCm 5.6.0 + +##### Added + +* Added the option ROCM_HEADER_WRAPPER_WERROR + * Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings. + * Configure-time CMake option sets the default for the C macro. + +#### rocPRIM 2.13.0 + +rocPRIM 2.13.0 for ROCm 5.6.0 + +##### Added + +* New block level `radix_rank` primitive. +* New block level `radix_rank_match` primitive. +* Added a stable block sorting implementation. This be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm. + +##### Changed + +* Improved the performance of `block_radix_sort` and `device_radix_sort`. +* Improved the performance of `device_merge_sort`. +* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). Contributed by: [v01dXYZ](https://github.com/v01dXYZ). + +##### Known Issues + +* Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows. +* When `ROCPRIM_DISABLE_LOOKBACK_SCAN` is set, `device_scan` fails for input sizes bigger than `scan_config::size_limit`, which defaults to `std::numeric_limits<unsigned int>::max()`. + +#### rocRAND 2.10.17 + +rocRAND 2.10.17 for ROCm 5.6.0 + +##### Added + +* MT19937 pseudo random number generator based on M. Matsumoto and T. 
Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. +* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. +* experimental HIP-CPU feature +* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". + +##### Changed + +* Python 2.7 is no longer officially supported. + +#### rocSOLVER 3.22.0 + +rocSOLVER 3.22.0 for ROCm 5.6.0 + +##### Added + +* LU refactorization for sparse matrices + * CSRRF_ANALYSIS + * CSRRF_SUMLU + * CSRRF_SPLITLU + * CSRRF_REFACTLU +* Linear system solver for sparse matrices + * CSRRF_SOLVE +* Added type `rocsolver_rfinfo` for use with sparse matrix routines + +##### Optimized + +* Improved the performance of BDSQR and GESVD when singular vectors are requested + +##### Fixed + +* BDSQR and GESVD should no longer hang when the input contains `NaN` or `Inf` + +#### rocSPARSE 2.5.2 + +rocSPARSE 2.5.2 for ROCm 5.6.0 + +##### Improved + +* Fixed a memory leak in csritsv +* Fixed a bug in csrsm and bsrsm + +#### rocThrust 2.18.0 + +rocThrust 2.18.0 for ROCm 5.6.0 + +##### Fixed + +* `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types. + +##### Changed + +* Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). 
+ +#### rocWMMA 1.1.0 + +rocWMMA 1.1.0 for ROCm 5.6.0 + +##### Added + +* Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp) +* Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation) +* Added performance gemm samples for half, single and double precision +* Added rocWMMA cmake versioning +* Added vectorized support in coordinate transforms +* Included ROCm smi for runtime clock rate detection +* Added fragment transforms for transpose and change data layout + +##### Changed + +* Default to GPU rocBLAS validation against rocWMMA +* Re-enabled int8 gemm tests on gfx9 +* Upgraded to C++17 +* Restructured unit test folder for consistency +* Consolidated rocWMMA samples common code + +#### Tensile 4.37.0 + +Tensile 4.37.0 for ROCm 5.6.0 + +##### Added + +* Added user driven tuning API +* Added decision tree fallback feature +* Added SingleBuffer + AtomicAdd option for GlobalSplitU +* DirectToVgpr support for fp16 and Int8 with TN orientation +* Added new test cases for various functions +* Added SingleBuffer algorithm for ZGEMM/CGEMM +* Added joblib for parallel map calls +* Added support for MFMA + LocalSplitU + DirectToVgprA+B +* Added asmcap check for MIArchVgpr +* Added support for MFMA + LocalSplitU +* Added frequency, power, and temperature data to the output + +##### Optimizations + +* Improved the performance of GlobalSplitU with SingleBuffer algorithm +* Reduced the running time of the extended and pre_checkin tests +* Optimized the Tailloop section of the assembly kernel +* Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch) +* Improved the performance of the second kernel of MultipleBuffer algorithm + +##### Changed + +* Updated custom kernels with 64-bit offsets +* Adapted 64-bit offset arguments for assembly kernels +* Improved temporary register re-use to reduce max sgpr usage +* Removed some restrictions on VectorWidth and DirectToVgpr +* Updated 
the dependency requirements for Tensile +* Changed the range of AssertSummationElementMultiple +* Modified the error messages for more clarity +* Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used +* Removed dummy vgpr for vectorStaticRemainder +* Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder +* Removed qReg parameter from vectorStaticRemainder + +##### Fixed + +* Fixed tmp sgpr allocation to avoid over-writing values (alpha) +* 64-bit offset parameters for post kernels +* Fixed gfx908 CI test failures +* Fixed offset calculation to prevent overflow for large offsets +* Fixed issues when BufferLoad and BufferStore are equal to zero +* Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch +* Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch +* Fixed the memory access error related to StaggerU + large stride +* Fixed ZGEMM 4x4 MatrixInst mismatch +* Fixed DGEMM 4x4 MatrixInst mismatch +* Fixed ASEM + GSU + NoTailLoop opt mismatch +* Fixed AssertSummationElementMultiple + GlobalSplitU issues +* Fixed ASEM + GSU + TailLoop inner unroll + +------------------- + +## ROCm 5.5.1 + + +### What's New in This Release + +#### HIP SDK for Windows + +AMD is pleased to announce the availability of the HIP SDK for Windows as part +of the ROCm platform. The +[HIP SDK OS and GPU support page](https://rocm.docs.amd.com/en/docs-5.5.1/release/windows_support.html) +lists the versions of Windows and GPUs validated by AMD. HIP SDK features on +Windows are described in detail in our +[What is ROCm?](https://rocm.docs.amd.com/en/docs-5.5.1/rocm.html#rocm-on-windows) +page and differ from the Linux feature set. Visit the +[Quick Start](https://rocm.docs.amd.com/en/docs-5.5.1/deploy/windows/quick_start.html#) +page to get started. Known issues are tracked on +[GitHub](https://github.com/RadeonOpenCompute/ROCm/issues?q=is%3Aopen+label%3A5.5.1+label%3A%22Verified+Issue%22+label%3AWindows).
+ +#### HIP API Change + +The following HIP API is updated in the ROCm 5.5.1 release: + +##### `hipDeviceSetCacheConfig` + +* The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to `hipSuccess` + +### Library Changes in ROCM 5.5.1 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.54.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.5.1) | +| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.5.1) | +| hipFFT | [1.0.11](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.5.1) | +| hipSOLVER | [1.7.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.5.1) | +| hipSPARSE | [2.3.5](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.5.1) | +| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.5.1) | +| rccl | [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.5.1) | +| rocALUTION | [2.1.8](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.5.1) | +| rocBLAS | [2.47.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.5.1) | +| rocFFT | [1.0.22](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.5.1) | +| rocm-cmake | [0.8.1](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.5.1) | +| rocPRIM | [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.5.1) | +| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.5.1) | +| rocSOLVER | [3.21.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.5.1) | +| rocSPARSE | [2.5.1](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.5.1) | +| rocThrust | [2.17.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.5.1) | +| rocWMMA | [1.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.5.1) | +| Tensile | 
[4.36.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.5.1) | + +------------------- + +## ROCm 5.5.0 + + +### What's New in This Release + +#### HIP Enhancements + +The ROCm v5.5 release consists of the following HIP enhancements: + +##### Enhanced Stack Size Limit + +In this release, the stack size limit is increased from 16k to 131056 bytes (or 128K - 16). +Applications that need to update the stack size can use the `hipDeviceSetLimit` API. + +##### `hipcc` Changes + +The following hipcc changes are implemented in this release: + +* `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link-time dependency for HIP programs. Applications that depend on these libraries must explicitly link to them. +* `-use-staticlib` and `-use-sharedlib` options are deprecated. + +##### Future Changes + +* Separation of `hipcc` binaries (Perl scripts) from HIP to `hipcc` project. Users will access a separate `hipcc` package for installing `hipcc` binaries in future ROCm releases. +* In a future ROCm release, the following samples will be removed from the `hip-tests` project. + * `hipBusBandwidth` at + * `hipCommander` at + + Note that the samples will continue to be available in previous release branches.
+* Removal of gcnarch from hipDeviceProp_t structure +* Addition of new fields in hipDeviceProp_t structure + * maxTexture1D + * maxTexture2D + * maxTexture1DLayered + * maxTexture2DLayered + * sharedMemPerMultiprocessor + * deviceOverlap + * asyncEngineCount + * surfaceAlignment + * unifiedAddressing + * computePreemptionSupported + * hostRegisterSupported + * uuid +* Removal of deprecated code + * hip-hcc codes from hip code tree +* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA +* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D() +* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' +* Correct hipGetLastError to return the last error instead of last API call's return code +* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved\[16]" +* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess +* Remove hiparray* and make it opaque with hipArray_t + +##### New HIP APIs in This Release + +> **Note** +> +> This is a pre-official version (beta) release of the new APIs and may contain unresolved issues. + +###### Memory Management HIP APIs + +The new memory management HIP API is as follows: + +* Sets information on the specified pointer \[BETA]. + + ```cpp + hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr); + ``` + +###### Module Management HIP APIs + +The new module management HIP APIs are as follows: + +* Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed to `kernelParams`, where thread blocks can cooperate and synchronize as they execute. 
+ + ```cpp + hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams); + ``` + +* Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute. + + ```cpp + hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags); + ``` + +###### HIP Graph Management APIs + +The new HIP Graph Management APIs are as follows: + +* Creates a memory allocation node and adds it to a graph \[BETA] + + ```cpp + hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams); + ``` + +* Returns parameters for memory allocation node \[BETA] + + ```cpp + hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams); + ``` + +* Creates a memory free node and adds it to a graph \[BETA] + + ```cpp + hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr); + ``` + +* Returns parameters for memory free node \[BETA]. + + ```cpp + hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr); + ``` + +* Writes a DOT file describing graph structure \[BETA]. + + ```cpp + hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags); + ``` + +* Copies attributes from source node to destination node \[BETA].
+ + ```cpp + hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst); + ``` + +* Enables or disables the specified node in the given graphExec \[BETA] + + ```cpp + hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled); + ``` + +* Query whether a node in the given graphExec is enabled \[BETA] + + ```cpp + hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled); + ``` + +##### OpenMP Enhancements +This release consists of the following OpenMP enhancements: + +* Additional support for OMPT functions `get_device_time` and `get_record_type`. +* Add support for min/max fast fp atomics on AMD GPUs. +* Fix the use of the abs function in C device regions. + +### Deprecations and Warnings + +#### HIP Deprecation + +The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts. + +> **Note** +> +> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option. + +##### Linux Filesystem Hierarchy Standard for ROCm + +ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. 
See below for a detailed explanation of the new filesystem hierarchy and backward compatibility. + +##### New Filesystem Hierarchy + +The following is the new filesystem hierarchy: + +```text +/opt/rocm- + | --bin + | --All externally exposed Binaries + | --libexec + | -- + | -- Component specific private non-ISA executables (architecture independent) + | --include + | -- + | --
+ | --lib + | --lib.so -> lib.so.major -> lib.so.major.minor.patch + (public libraries linked with application) + | -- (component specific private library, executable data) + | -- + | --components + | --.config.cmake + | --share + | --html//*.html + | --info//*.[pdf, md, txt] + | --man + | --doc + | -- + | -- + | -- + | -- (arch independent non-executable) + | --samples + +``` + +> **Note** +> +> ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release. + +For more information, refer to . + +##### Backward Compatibility with Older Filesystems + +ROCm has moved header files and libraries to their new locations as indicated in the above structure and has included symbolic links and wrapper header files in the old locations for backward compatibility. + +> **Note** +> +> ROCm will continue supporting backward compatibility until the next major release. + +##### Wrapper header files + +Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: + +```cpp +// Code snippet from hip_runtime.h +#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip.") +#include "hip/hip_runtime.h" +``` + +The wrapper header files' backward compatibility deprecation is as follows: + +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release + +##### Library files + +Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx//lib`) has a soft link to the library at the new location.
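As a rough illustration of the soft-link scheme described above, the following sketch rebuilds the layout in a scratch directory and checks that the old location resolves to the new one. The scratch directory and the empty placeholder file are assumptions made for the demonstration; they stand in for `/opt/rocm-xxx` and a real library, and nothing here touches an actual ROCm install. It also assumes a `readlink` that supports `-f` (GNU coreutils).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory standing in for /opt/rocm-xxx (hypothetical layout only).
# pwd -P canonicalizes the path so it can be compared with readlink -f output.
root="$(cd "$(mktemp -d)" && pwd -P)"
trap 'rm -rf "$root"' EXIT

# New location: the library lives directly under lib/.
mkdir -p "$root/lib"
: > "$root/lib/libamdhip64.so"   # empty placeholder, not a real library

# Old location: a relative soft link pointing back to the new location.
mkdir -p "$root/hip/lib"
ln -s ../../lib/libamdhip64.so "$root/hip/lib/libamdhip64.so"

# Resolve the old path; it should land on the file in the new location.
resolved="$(readlink -f "$root/hip/lib/libamdhip64.so")"
echo "old path resolves to: ${resolved#"$root"/}"
```

Run as written, the script prints `old path resolves to: lib/libamdhip64.so`. The relative link target is the same `../../lib/libamdhip64.so` form that appears in the directory listing for a real install.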
+ +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/ +total 4 +drwxr-xr-x 4 root root 4096 May 12 10:45 cmake +lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so +``` + +##### CMake Config files + +All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/` folder. +For backward compatibility, the old CMake locations (`/opt/rocm-xxx//lib/cmake`) contain a soft link to the new CMake config. + +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/cmake/hip/ +total 0 +lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake +``` + +#### ROCm Support For Code Object V3 Deprecated + +Support for Code Object v3 is deprecated and will be removed in a future release. + +#### Comgr V3.0 Changes + +The following APIs and macros have been marked as deprecated. They are expected to be removed in a future ROCm release that coincides with the release of Comgr v3.0. + +##### API Changes + +* `amd_comgr_action_info_set_options()` +* `amd_comgr_action_info_get_options()` + +##### Actions and Data Types + +* `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES` +* `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN` + +For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST` APIs, and the `AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros.
+ +#### Deprecated Environment Variables + +The following environment variables are removed in this ROCm release: + +* `GPU_MAX_COMMAND_QUEUES` +* `GPU_MAX_WORKGROUP_SIZE_2D_X` +* `GPU_MAX_WORKGROUP_SIZE_2D_Y` +* `GPU_MAX_WORKGROUP_SIZE_3D_X` +* `GPU_MAX_WORKGROUP_SIZE_3D_Y` +* `GPU_MAX_WORKGROUP_SIZE_3D_Z` +* `GPU_BLIT_ENGINE_TYPE` +* `GPU_USE_SYNC_OBJECTS` +* `AMD_OCL_SC_LIB` +* `AMD_OCL_ENABLE_MESSAGE_BOX` +* `GPU_FORCE_64BIT_PTR` +* `GPU_FORCE_OCL20_32BIT` +* `GPU_RAW_TIMESTAMP` +* `GPU_SELECT_COMPUTE_RINGS_ID` +* `GPU_USE_SINGLE_SCRATCH` +* `GPU_ENABLE_LARGE_ALLOCATION` +* `HSA_LOCAL_MEMORY_ENABLE` +* `HSA_ENABLE_COARSE_GRAIN_SVM` +* `GPU_IFH_MODE` +* `OCL_SYSMEM_REQUIREMENT` +* `OCL_CODE_CACHE_ENABLE` +* `OCL_CODE_CACHE_RESET` + +### Known Issues In This Release + +The following are the known issues in this release. + +#### `DISTRIBUTED`/`TEST_DISTRIBUTED_SPAWN` Fails + +When user applications call `ncclCommAbort` to destruct communicators and then create new +communicators repeatedly, subsequent communicators may fail to initialize. + +This issue is under investigation and will be resolved in a future release. + +#### Failures In HIP Directed Tests + +Multiple HIP directed tests fail. 
+ +### Library Changes in ROCM 5.5.0 + +| Library | Version | +|---------|---------| +| hipBLAS | 0.53.0 ⇒ [0.54.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.5.0) | +| hipCUB | 2.13.0 ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.5.0) | +| hipFFT | 1.0.10 ⇒ [1.0.11](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.5.0) | +| hipSOLVER | 1.6.0 ⇒ [1.7.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.5.0) | +| hipSPARSE | 2.3.3 ⇒ [2.3.5](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.5.0) | +| MIOpen | ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.5.0) | +| rccl | 2.13.4 ⇒ [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.5.0) | +| rocALUTION | 2.1.3 ⇒ [2.1.8](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.5.0) | +| rocBLAS | 2.46.0 ⇒ [2.47.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.5.0) | +| rocFFT | 1.0.21 ⇒ [1.0.22](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.5.0) | +| rocm-cmake | 0.8.0 ⇒ [0.8.1](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.5.0) | +| rocPRIM | 2.12.0 ⇒ [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.5.0) | +| rocRAND | 2.10.16 ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.5.0) | +| rocSOLVER | 3.20.0 ⇒ [3.21.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.5.0) | +| rocSPARSE | 2.4.0 ⇒ [2.5.1](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.5.0) | +| rocThrust | [2.17.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.5.0) | +| rocWMMA | 0.9 ⇒ [1.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.5.0) | +| Tensile | 4.35.0 ⇒ [4.36.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.5.0) | + +#### hipBLAS 
0.54.0 + +hipBLAS 0.54.0 for ROCm 5.5.0 + +##### Added + +* added option to opt-in to use __half for hipblasHalf type in the API for c++ users who define HIPBLAS_USE_HIP_HALF +* added scripts to plot performance for multiple functions +* data driven hipblas-bench and hipblas-test execution via external yaml format data files +* client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml + +##### Fixed + +* fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes +* fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0 +* fixed getrs to return successfully when given nullptrs with corresponding size = 0 +* fixed getrs to give info = -1 when transpose is not an expected type +* fixed gels to return successfully when given nullptrs with corresponding size = 0 +* fixed gels to give info = -1 when transpose is not in ('N', 'T') for real cases or not in ('N', 'C') for complex cases + +##### Changed + +* changed reference code for Windows to OpenBLAS +* hipblas client executables all now begin with hipblas- prefix + +#### hipCUB 2.13.1 + +hipCUB 2.13.1 for ROCm 5.5.0 + +##### Added + +* Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. + +##### Changed + +* CUB backend references CUB and Thrust version 1.17.2. +* Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. + +##### Fixed + +* Windows HIP SDK support + +##### Known Issues + +* `BlockRadixRankMatch` is currently broken under the rocPRIM backend. +* `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. + +#### hipFFT 1.0.11 + +hipFFT 1.0.11 for ROCm 5.5.0 + +##### Fixed + +* Fixed old version rocm include/lib folders not removed on upgrade. 
+ +#### hipSOLVER 1.7.0 + +hipSOLVER 1.7.0 for ROCm 5.5.0 + +##### Added + +* Added functions + * gesvdj + * hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize + * hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj + * gesvdjBatched + * hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize + * hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched + +#### hipSPARSE 2.3.5 + +hipSPARSE 2.3.5 for ROCm 5.5.0 + +##### Improved + +* Fixed an issue, where the rocm folder was not removed on upgrade of meta packages +* Fixed a compilation issue with cusparse backend +* Added more detailed messages on unit test failures due to missing input data +* Improved documentation +* Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis) + +#### MIOpen 2.19.0 + +MIOpen 2.19.0 for ROCm 5.5.0 + +##### Added + +* ROCm 5.5 support for gfx1101 (Navi32) + +##### Changed + +* Tuning results for MLIR on ROCm 5.5 +* Bumping MLIR commit to 5.5.0 release tag + +##### Fixed + +* Fix 3d convolution Host API bug +* \[HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. + +#### rccl 2.15.5 + +RCCL 2.15.5 for ROCm 5.5.0 + +##### Changed + +* Compatibility with NCCL 2.15.5 +* Unit test executable renamed to rccl-UnitTests + +##### Added + +* HW-topology aware binary tree implementation +* Experimental support for MSCCL +* New unit tests for hipGraph support +* NPKit integration + +##### Fixed + +* rocm-smi ID conversion +* Support for HIP_VISIBLE_DEVICES for unit tests +* Support for p2p transfers to non (HIP) visible devices + +##### Removed + +* Removed TransferBench from tools. 
Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench + +#### rocALUTION 2.1.8 + +rocALUTION 2.1.8 for ROCm 5.5.0 + +##### Added + +* Added build support for Navi32 + +##### Improved + +* Fixed a typo in MPI backend +* Fixed a bug with the backend when HIP support is disabled +* Fixed a bug in SAAMG hierarchy building on HIP backend +* Improved SAAMG hierarchy build performance on HIP backend + +##### Changed + +* LocalVector::GetIndexValues(ValueType\*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector\*) instead +* LocalVector::SetIndexValues(const ValueType\*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead +* LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*) instead +* LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix\*) instead +* LocalMatrix::RugeStueben() is deprecated +* LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, int) instead +* LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*) instead + +#### rocBLAS 2.47.0 + +rocBLAS 2.47.0 for ROCm 5.5.0 + +##### Added + +* added functionality rocblas_geam_ex for matrix-matrix minimum operations +* added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, and Level 3(pointer mode host) functions +* added beta features API. 
Exposed using compiler define ROCBLAS_BETA_FEATURES_API +* added support for vector initialization in the rocBLAS test framework with negative increments +* added windows build documentation for forthcoming support using ROCm HIP SDK +* added scripts to plot performance for multiple functions + +##### Optimizations + +* improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU. +* improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU. +* improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs. + +##### Fixed + +* fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench +* fixed deprecated API compatibility with Visual Studio compiler +* fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory + +##### Changed + +* install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help) +* rocblas client executables all now begin with rocblas- prefix + +##### Removed + +* install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default + +#### rocFFT 1.0.22 + +rocFFT 1.0.22 for ROCm 5.5.0 + +##### Optimizations + +* Improved performance of 1D lengths < 2048 that use Bluestein's algorithm. +* Reduced time for generating code during plan creation. +* Optimized 3D R2C/C2R lengths 32, 84, 128. +* Optimized batched small 1D R2C/C2R cases. + +##### Added + +* Added gfx1101 to default AMDGPU_TARGETS. 
+ +##### Changed + +* Moved client programs to C++17. +* Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled. +* Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache. + +##### Fixed + +* Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged. +* Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present. + +#### rocm-cmake 0.8.1 + +rocm-cmake 0.8.1 for ROCm 5.5.0 + +##### Fixed + +* ROCMInstallTargets: Added compatibility symlinks for included cmake files in `<ROCM>/lib/cmake/<PACKAGE>`. + +##### Changed + +* ROCMHeaderWrapper: The wrapper header deprecation message is now a deprecation warning. + +#### rocPRIM 2.13.0 + +rocPRIM 2.13.0 for ROCm 5.5.0 + +##### Added + +* New block level `radix_rank` primitive. +* New block level `radix_rank_match` primitive. + +##### Changed + +* Improved the performance of `block_radix_sort` and `device_radix_sort`. + +##### Known Issues + +* Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows. + +##### Fixed + +* Fixed benchmark build on Windows + +#### rocRAND 2.10.17 + +rocRAND 2.10.17 for ROCm 5.5.0 + +##### Added + +* MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. +* New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. 
+* experimental HIP-CPU feature +* ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". + +##### Changed + +* Python 2.7 is no longer officially supported. + +##### Fixed + +* Windows HIP SDK support + +#### rocSOLVER 3.21.0 + +rocSOLVER 3.21.0 for ROCm 5.5.0 + +##### Added + +* SVD for general matrices using Jacobi algorithm: + * GESVDJ (with batched and strided\_batched versions) +* LU factorization without pivoting for block tridiagonal matrices: + * GEBLTTRF_NPVT (with batched and strided\_batched versions) +* Linear system solver without pivoting for block tridiagonal matrices: + * GEBLTTRS_NPVT (with batched and strided\_batched versions) +* Product of triangular matrices + * LAUUM +* Added experimental hipGraph support for rocSOLVER functions + +##### Optimized + +* Improved the performance of SYEVJ/HEEVJ. + +##### Changed + +* STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use a fully implemented Divide and Conquer approach. + +##### Fixed + +* SYEVJ/HEEVJ should now be invariant under matrix scaling. +* SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed. +* Fixed GETF2\_NPVT and GETRF\_NPVT input data initialization in tests and benchmarks. +* Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages.
+ +#### rocSPARSE 2.5.1 + +rocSPARSE 2.5.1 for ROCm 5.5.0 + +##### Added + +* Added bsrgemm and spgemm for BSR format +* Added bsrgeam +* Added build support for Navi32 +* Added experimental hipGraph support for some rocSPARSE routines +* Added csritsv, spitsv csr iterative triangular solve +* Added mixed precisions for SpMV +* Added batched SpMM for transpose A in COO format with atomic algorithm + +##### Improved + +* Optimization to csr2bsr +* Optimization to csr2csr_compress +* Optimization to csr2coo +* Optimization to gebsr2csr +* Optimization to csr2gebsr +* Fixes to documentation +* Fixes a bug in COO SpMV gridsize +* Fixes a bug in SpMM gridsize when using very large matrices + +##### Known Issues + +* In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split. + +#### rocWMMA 1.0 + +rocWMMA 1.0 for ROCm 5.5.0 + +##### Added + +* Added support for wave32 on gfx11+ +* Added infrastructure changes to support hipRTC +* Added performance tracking system + +##### Changed + +* Modified the assignment of hardware information +* Modified the data access for unsigned datatypes +* Added library config to support multiple architectures + +#### Tensile 4.36.0 + +Tensile 4.36.0 for ROCm 5.5.0 + +##### Added + +* Add functions for user-driven tuning +* Add GFX11 support: HostLibraryTests yamls, rearrange FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimization, separate vscnt and vmcnt, dual mac +* Add binary search for Grid-Based algorithm +* Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2) +* Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4) +* Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1) +* Add GSU SingleBuffer algorithm
for HSS/BSS +* Add gfx900:xnack-, gfx1032, gfx1034, gfx1035 +* Enable gfx1031 support + +##### Optimizations + +* Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1 +* Improve InitAccVgprOpt + +##### Changed + +* Use global_atomic for GSU instead of flat and global_store for debug code +* Replace flat_load/store with global_load/store +* Use global_load/store for BufferLoad/Store=0 and enable scheduling +* LocalSplitU support for HGEMM+HPA when MFMA disabled +* Update Code Object Version +* Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss +* Update asm cap cache arguments +* Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead +* Change checks, error messages, assembly syntax, and coverage for DirectToLds +* Remove unused cmake file +* Clean up the LLVM dependency code +* Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2 +* Update sgemm/hgemm test cases for DirectToLds and ThreadSepareteGlobalRead + +##### Fixed + +* Add build-id to header of compiled source kernels +* Fix solution index collisions +* Fix h beta vectorwidth4 correctness issue for WMMA +* Fix an error with BufferStore=0 +* Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2) +* Fix MoveMIoutToArch bug +* Fix flat load correctness issue on I8 and flat store correctness issue +* Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes +* Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop +* Fix issues with DirectToVgpr + ScheduleIterAlg<3 +* Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2 +* Fix mismatch issue with PrefetchGlobalRead=2 +* Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size +* Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1 +* Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case +* Remove duplicate GSU kernels: for GSU = 
1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical +* Fix for failing CI tests due to CpuThreads=0 +* Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2 +* Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only) +* Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only) + +------------------- + +## ROCm 5.4.3 + +### Deprecations and Warnings + +#### HIP Perl Scripts Deprecation + +The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts. + +> **Note** +> +> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option. + +##### Linux Filesystem Hierarchy Standard for ROCm + +ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility. + +##### New Filesystem Hierarchy + +The following is the new filesystem hierarchy: + +```text +/opt/rocm- + | --bin + | --All externally exposed Binaries + | --libexec + | -- + | -- Component specific private non-ISA executables (architecture independent) + | --include + | -- + | --
+ | --lib + | --lib.so -> lib.so.major -> lib.so.major.minor.patch + (public libraries linked with application) + | -- (component specific private library, executable data) + | -- + | --components + | --.config.cmake + | --share + | --html//*.html + | --info//*.[pdf, md, txt] + | --man + | --doc + | -- + | -- + | -- + | -- (arch independent non-executable) + | --samples + +``` + +> **Note** +> +> ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release. + +For more information, refer to . + +##### Backward Compatibility with Older Filesystems + +ROCm has moved header files and libraries to their new locations as indicated in the above structure and included symbolic links and wrapper header files in the old locations for backward compatibility. + +> **Note** +> +> ROCm will continue supporting backward compatibility until the next major release. + +##### Wrapper header files + +Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: + +```cpp +// Code snippet from hip_runtime.h +#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip." +#include "hip/hip_runtime.h" +``` + +The deprecation schedule for the backward-compatibility wrapper header files is as follows: + +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release + +##### Library files + +Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx//lib`) has a soft link to the library at the new location.
+ +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/ +total 4 +drwxr-xr-x 4 root root 4096 May 12 10:45 cmake +lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so +``` + +##### CMake Config files + +All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx//lib/cmake`) consist of a soft link to the new CMake config. + +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/cmake/hip/ +total 0 +lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake +``` + +### Fixed Defects + +#### Compiler Improvements + +In ROCm v5.4.3, improvements to the compiler address errors with the following signatures: + +- "error: unhandled SGPR spill to memory" +- "cannot scavenge register without an emergency spill slot!" +- "error: ran out of registers during register allocation" + +### Known Issues + +#### Compiler Option Error at Runtime + +Some users may encounter a “Cannot find Symbol” error at runtime when using -save-temps. While most -save-temps use cases work correctly, this error may appear occasionally. + +This issue is under investigation, and the known workaround is not to use -save-temps when the error appears. 
+ +### Library Changes in ROCM 5.4.3 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.4.3) | +| hipCUB | [2.13.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.4.3) | +| hipFFT | [1.0.10](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.4.3) | +| hipSOLVER | [1.6.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.4.3) | +| hipSPARSE | [2.3.3](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.4.3) | +| rccl | [2.13.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.4.3) | +| rocALUTION | [2.1.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.4.3) | +| rocBLAS | [2.46.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.4.3) | +| rocFFT | 1.0.20 ⇒ [1.0.21](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.4.3) | +| rocm-cmake | [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.4.3) | +| rocPRIM | [2.12.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.4.3) | +| rocRAND | [2.10.16](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.4.3) | +| rocSOLVER | [3.20.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.4.3) | +| rocSPARSE | [2.4.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.4.3) | +| rocThrust | [2.17.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.4.3) | +| rocWMMA | [0.9](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.4.3) | +| Tensile | [4.35.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.4.3) | + +#### rocFFT 1.0.21 + +rocFFT 1.0.21 for ROCm 5.4.3 + +##### Fixed + +- Removed source directory from rocm_install_targets call to prevent installation of rocfft.h in an unintended location. 
+ +------------------- + +## ROCm 5.4.2 + +### Deprecations and Warnings + +#### HIP Perl Scripts Deprecation + +The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts. + +> **Note** +> +> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option. + +#### `hipcc` Options Deprecation + +The following `hipcc` options are being deprecated and will be removed in a future release: + +- The `--amdgpu-target` option is being deprecated, and users must use the `--offload-arch` option to specify the GPU architecture. +- The `--amdhsa-code-object-version` option is being deprecated. Users can use the Clang/LLVM option `-mllvm -mcode-object-version` to debug issues related to code object versions. +- The `--hipcc-func-supp`/`--hipcc-no-func-supp` options are being deprecated, as function calls are already supported in production on AMD GPUs. + +### Known Issues + +Under certain circumstances typified by high register pressure, users may encounter a compiler abort with one of the following error messages: + +- > `error: unhandled SGPR spill to memory` + +- > `cannot scavenge register without an emergency spill slot!` + +- > `error: ran out of registers during register allocation` + +This is a known issue and will be fixed in a future release.
+ +### Library Changes in ROCM 5.4.2 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.4.2) | +| hipCUB | [2.13.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.4.2) | +| hipFFT | [1.0.10](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.4.2) | +| hipSOLVER | [1.6.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.4.2) | +| hipSPARSE | [2.3.3](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.4.2) | +| rccl | [2.13.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.4.2) | +| rocALUTION | [2.1.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.4.2) | +| rocBLAS | [2.46.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.4.2) | +| rocFFT | [1.0.20](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.4.2) | +| rocm-cmake | [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.4.2) | +| rocPRIM | [2.12.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.4.2) | +| rocRAND | [2.10.16](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.4.2) | +| rocSOLVER | [3.20.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.4.2) | +| rocSPARSE | [2.4.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.4.2) | +| rocThrust | [2.17.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.4.2) | +| rocWMMA | [0.9](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.4.2) | +| Tensile | [4.35.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.4.2) | + +------------------- + +## ROCm 5.4.1 + +### What's New in This Release + +#### HIP Enhancements + +The ROCm v5.4.1 release consists of the following new HIP API: + +##### New HIP API - hipLaunchHostFunc + +The following new HIP API is introduced in the 
ROCm v5.4.1 release. + +> **Note** +> +> This is a pre-official version (beta) release of the new API. + +```cpp +hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData); +``` + +This enqueues a host function call in a stream. + +```text +@param [in] stream - Stream to enqueue the work to +@param [in] fn - Host function to call once the preceding operations in the stream are complete +@param [in] userData - User-specified data passed to the function +``` + +This function returns `#hipSuccess`, `#hipErrorInvalidValue`. + +For more information, refer to the HIP API documentation at /bundle/HIP_API_Guide/page/modules.html. + +### Deprecations and Warnings + +#### HIP Perl Scripts Deprecation + +The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts. + +> **Note** +> +> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option. + +### IFWI Fixes + +These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release. +AMD Instinct™ MI200 Firmware IFWI Maintenance Update #3 + +This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators. + +After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded way resulting in application failures. + +In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware installation tool – AMD FW FLASH 1.2.
+ +| GPU | Production Part Number | SKU | IFWI Name | +|-------|------------|--------|---------------| +| MI210 | 113-D673XX | D67302 | D6730200V.110 | +| MI210 | 113-D673XX | D67301 | D6730100V.073 | +| MI250 | 113-D652XX | D65209 | D6520900.073 | +| MI250 | 113-D652XX | D65210 | D6521000.073 | + +Instructions on how to download and apply MI200 maintenance updates are available at: + + + +#### AMD Instinct™ MI200 SRIOV Virtualization Support + +Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all AMD Instinct™ MI200 devices. + +### Library Changes in ROCM 5.4.1 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.4.1) | +| hipCUB | [2.13.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.4.1) | +| hipFFT | [1.0.10](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.4.1) | +| hipSOLVER | [1.6.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.4.1) | +| hipSPARSE | [2.3.3](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.4.1) | +| rccl | [2.13.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.4.1) | +| rocALUTION | [2.1.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.4.1) | +| rocBLAS | [2.46.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.4.1) | +| rocFFT | 1.0.19 ⇒ [1.0.20](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.4.1) | +| rocm-cmake | [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.4.1) | +| rocPRIM | [2.12.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.4.1) | +| rocRAND | [2.10.16](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.4.1) | +| rocSOLVER | [3.20.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.4.1) | +| rocSPARSE | 
[2.4.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.4.1) | +| rocThrust | [2.17.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.4.1) | +| rocWMMA | [0.9](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.4.1) | +| Tensile | [4.35.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.4.1) | + +#### rocFFT 1.0.20 + +rocFFT 1.0.20 for ROCm 5.4.1 + +##### Fixed + +- Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride. + +------------------- + +## ROCm 5.4.0 + + +### What's New in This Release + +#### HIP Enhancements + +The ROCm v5.4 release consists of the following HIP enhancements: + +##### Support for Wall Clock64 + +A new timer function, `wall_clock64()`, is supported. It returns the wall clock count at a constant frequency on the device. + +```cpp +long long int wall_clock64(); +``` + +The wall clock frequency can be queried in HIP application code through the HIP API, using the `hipDeviceAttributeWallClockRate` attribute of the device. + +Example: + +```cpp +int wallClkRate = 0; // in kilohertz +HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId)); +``` + +Where `hipDeviceAttributeWallClockRate` is a device attribute. + +> **Note** +> +> The wall clock frequency is a per-device attribute. + +##### New Registry Added for GPU_MAX_HW_QUEUES + +The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues allocated per process per device. + +The environment variable controls how many independent hardware queues the HIP runtime can create per process, per device. If the application allocates more HIP streams than this number, then the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner.
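The round-robin reuse described above can be pictured with a small host-side model. This is only a sketch under stated assumptions: `queueForStream` and `assignQueues` are hypothetical helper names, streams are assumed to be numbered in creation order, and the code does not call or reproduce the HIP runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the behaviour described above; this is NOT the HIP
// runtime's implementation. Streams are numbered in creation order and each
// one is mapped to one of `maxHwQueues` hardware queues.
std::size_t queueForStream(std::size_t streamIndex, std::size_t maxHwQueues) {
    assert(maxHwQueues > 0);
    return streamIndex % maxHwQueues;  // wrap around once every queue is in use
}

// Map the first `numStreams` streams to hardware queue indices.
std::vector<std::size_t> assignQueues(std::size_t numStreams, std::size_t maxHwQueues) {
    std::vector<std::size_t> queues;
    queues.reserve(numStreams);
    for (std::size_t s = 0; s < numStreams; ++s) {
        queues.push_back(queueForStream(s, maxHwQueues));
    }
    return queues;
}
```

In this model, with a limit of four queues, the fifth and sixth streams share queues with the first and second, so work submitted to them may serialize behind work in those earlier streams.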
+ +> **Note** +> +> This maximum number does not apply to hardware queues created for CU-masked HIP streams or cooperative queues for HIP Cooperative Groups (there is only one queue per device). + +For more details, refer to the HIP Programming Guide. + +#### New HIP APIs in This Release + +The following new HIP APIs are available in the ROCm v5.4 release. + +> **Note** +> +> This is a pre-official version (beta) release of the new APIs. + +##### Error Handling + +```cpp +hipError_t hipDrvGetErrorName(hipError_t hipError, const char** errorString); +``` + +This returns HIP errors in the text string format. + +```cpp +hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString); +``` + +This returns text string messages with more details about the error. + +For more information, refer to the HIP API Guide. + +##### HIP Tests Source Separation + +With ROCm v5.4, a separate GitHub project is created at + + + +This contains HIP catch2 tests and samples, and new tests will continue to develop. + +In future ROCm releases, catch2 tests and samples will be removed from the HIP project. + +### OpenMP Enhancements + +This release consists of the following OpenMP enhancements: + +- Enable new device RTL in libomptarget as default. +- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`. +- Support for the collapse clause and non-unit stride in cases where the No-Loop specialized kernel is generated. +- Initial implementation of optimized cross-team sum reduction for float and double type scalars. +- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer. + +### Deprecations and Warnings + +#### HIP Perl Scripts Deprecation + +The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts. 
+ +> **Note** +> +> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option. + +(5_4_0_filesystem_reorg_deprecation_notice)= + +##### Linux Filesystem Hierarchy Standard for ROCm + +ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility. + +##### New Filesystem Hierarchy + +The following is the new filesystem hierarchy: + +```text +/opt/rocm- + | --bin + | --All externally exposed Binaries + | --libexec + | -- + | -- Component specific private non-ISA executables (architecture independent) + | --include + | -- + | --
+ | --lib + | --lib.so -> lib.so.major -> lib.so.major.minor.patch + (public libraries linked with application) + | -- (component specific private library, executable data) + | -- + | --components + | --.config.cmake + | --share + | --html//*.html + | --info//*.[pdf, md, txt] + | --man + | --doc + | -- + | -- + | -- + | -- (arch independent non-executable) + | --samples + +``` + +> **Note** +> +> ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release. + +For more information, refer to . + +##### Backward Compatibility with Older Filesystems + +ROCm has moved header files and libraries to their new locations as indicated in the above structure and included symbolic links and wrapper header files in the old locations for backward compatibility. + +> **Note** +> +> ROCm will continue supporting backward compatibility until the next major release. + +##### Wrapper header files + +Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: + +```cpp +// Code snippet from hip_runtime.h +#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip." +#include "hip/hip_runtime.h" +``` + +The deprecation schedule for the backward-compatibility wrapper header files is as follows: + +- `#pragma` message announcing deprecation -- ROCm v5.2 release +- `#pragma` message changed to `#warning` -- Future release +- `#warning` changed to `#error` -- Future release +- Backward compatibility wrappers removed -- Future release + +##### Library files + +Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx//lib`) has a soft link to the library at the new location.
+ +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/ +total 4 +drwxr-xr-x 4 root root 4096 May 12 10:45 cmake +lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so +``` + +##### CMake Config files + +All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx//lib/cmake`) consist of a soft link to the new CMake config. + +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/cmake/hip/ +total 0 +lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake +``` + +### Fixed Defects + +The following defects are fixed in this release. + +These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release. + +#### Memory Allocated Using hipHostMalloc() with Flags Did Not Exhibit Fine-Grain Behavior + +##### Issue + +The test was incorrectly using the `hipDeviceAttributePageableMemoryAccess` device attribute to determine coherent support. + +##### Fix + +`hipHostMalloc()` allocates memory with fine-grained access by default when the environment variable `HIP_HOST_COHERENT=1` is used. + +For more information, refer to {doc}`hip:.doxygen/docBin/html/index`. + + +#### SoftHang with `hipStreamWithCUMask` test on AMD Instinct™ + +##### Issue + +On GFX10 GPUs, kernel execution hangs when it is launched on streams created using `hipStreamWithCUMask`. + +##### Fix + +On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units must be enabled as a pair. The `hipStreamWithCUMask` API unit test cases are updated to set compute unit mask (cuMask) in pairs for GFX10 GPUs. + +#### ROCm Tools GPU IDs + +The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are globally unique and guaranteed to be consistent across APIs and processes. 
+ +GPU IDs reported by ROCTracer and ROCProfiler or ROCm Tools are HSA Driver Node ID of that GPU, as it is a unique ID for that device in that particular node. + +### Library Changes in ROCM 5.4.0 + +| Library | Version | +|---------|---------| +| hipBLAS | 0.52.0 ⇒ [0.53.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.4.0) | +| hipCUB | 2.12.0 ⇒ [2.13.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.4.0) | +| hipFFT | 1.0.9 ⇒ [1.0.10](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.4.0) | +| hipSOLVER | 1.5.0 ⇒ [1.6.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.4.0) | +| hipSPARSE | 2.3.1 ⇒ [2.3.3](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.4.0) | +| rccl | 2.12.10 ⇒ [2.13.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.4.0) | +| rocALUTION | 2.1.0 ⇒ [2.1.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.4.0) | +| rocBLAS | 2.45.0 ⇒ [2.46.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.4.0) | +| rocFFT | 1.0.18 ⇒ [1.0.19](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.4.0) | +| rocm-cmake | [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.4.0) | +| rocPRIM | 2.11.0 ⇒ [2.12.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.4.0) | +| rocRAND | 2.10.15 ⇒ [2.10.16](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.4.0) | +| rocSOLVER | 3.19.0 ⇒ [3.20.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.4.0) | +| rocSPARSE | 2.2.0 ⇒ [2.4.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.4.0) | +| rocThrust | 2.16.0 ⇒ [2.17.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.4.0) | +| rocWMMA | 0.8 ⇒ [0.9](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.4.0) | +| Tensile | 4.34.0 ⇒ 
[4.35.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.4.0) | + +#### hipBLAS 0.53.0 + +hipBLAS 0.53.0 for ROCm 5.4.0 + +##### Added + +- Allow for selection of int8 datatype +- Added support for hipblasXgels and hipblasXgelsStridedBatched operations (with s,d,c,z precisions), + only supported with rocBLAS backend +- Added support for hipblasXgelsBatched operations (with s,d,c,z precisions) + +#### hipCUB 2.13.0 + +hipCUB 2.13.0 for ROCm 5.4.0 + +##### Added + +- CMake functionality to improve build parallelism of the test suite that splits compilation units by +function or by parameters. +- New overload for `BlockAdjacentDifference::SubtractLeftPartialTile` that takes a predecessor item. + +##### Changed + +- Improved build parallelism of the test suite by splitting up large compilation units for `DeviceRadixSort`, +`DeviceSegmentedRadixSort` and `DeviceSegmentedSort`. +- CUB backend references CUB and thrust version 1.17.1. + +#### hipFFT 1.0.10 + +hipFFT 1.0.10 for ROCm 5.4.0 + +##### Added + +- Added hipfftExtPlanScaleFactor API to efficiently multiply each output element of a FFT by a given scaling factor. Result scaling must be supported in the backend FFT library. + +##### Changed + +- When hipFFT is built against the rocFFT backend, rocFFT 1.0.19 or higher is now required. 
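To make the idea of result scaling in the hipFFT entry above concrete, the following host-only sketch multiplies each output element of a transform by a scale factor as it is written. It is an illustration under assumptions, not hipFFT code: `scaledDft` is a hypothetical helper built on a naive O(n^2) DFT, and it does not call `hipfftExtPlanScaleFactor` or any hipFFT function.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) DFT, used only to illustrate result scaling on the host.
// `sign` is -1 for a forward transform and +1 for an inverse transform;
// every output element is multiplied by `scale` as it is written.
std::vector<cplx> scaledDft(const std::vector<cplx>& in, int sign, double scale) {
    assert(sign == 1 || sign == -1);
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                sign * 2.0 * pi * static_cast<double>(j * k) / static_cast<double>(n);
            sum += in[j] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = scale * sum;  // scaling fused into the output write
    }
    return out;
}
```

A common use of such a factor is normalization: running the forward transform with a scale of 1 and the inverse with a scale of 1/n recovers the input without a separate pass over the output.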
+ +#### hipSOLVER 1.6.0 + +hipSOLVER 1.6.0 for ROCm 5.4.0 + +##### Added + +- Added compatibility-only functions + - gesvdaStridedBatched + - hipsolverDnSgesvdaStridedBatched_bufferSize, hipsolverDnDgesvdaStridedBatched_bufferSize, hipsolverDnCgesvdaStridedBatched_bufferSize, hipsolverDnZgesvdaStridedBatched_bufferSize + - hipsolverDnSgesvdaStridedBatched, hipsolverDnDgesvdaStridedBatched, hipsolverDnCgesvdaStridedBatched, hipsolverDnZgesvdaStridedBatched + +#### hipSPARSE 2.3.3 + +hipSPARSE 2.3.3 for ROCm 5.4.0 + +##### Added + +- Added hipsparseCsr2cscEx2_bufferSize and hipsparseCsr2cscEx2 routines + +##### Changed + +- HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cusparse + +#### rccl 2.13.4 + +RCCL 2.13.4 for ROCm 5.4.0 + +##### Changed + +- Compatibility with NCCL 2.13.4 +- Improvements to RCCL when running with hipGraphs +- RCCL_ENABLE_HIPGRAPH environment variable is no longer necessary to enable hipGraph support +- Minor latency improvements + +##### Fixed + +- Resolved potential memory access error due to asynchronous memset + +#### rocALUTION 2.1.3 + +rocALUTION 2.1.3 for ROCm 5.4.0 + +##### Added + +- Added build support for Navi31 and Navi33 +- Added support for non-squared global matrices + +##### Improved + +- Fixed a memory leak in MatrixMult on HIP backend +- Global structures can now be used with a single process + +##### Changed + +- Switched GTest death test style to 'threadsafe' +- GlobalVector::GetGhostSize() is deprecated and will be removed +- ParallelManager::GetGlobalSize(), ParallelManager::GetLocalSize(), ParallelManager::SetGlobalSize() and ParallelManager::SetLocalSize() are deprecated and will be removed +- Vector::GetGhostSize() is deprecated and will be removed +- Multigrid::SetOperatorFormat(unsigned int) is deprecated and will be removed, use Multigrid::SetOperatorFormat(unsigned int, int) instead +- RugeStuebenAMG::SetCouplingStrength(ValueType) is deprecated and will be removed, use 
SetStrengthThreshold(float) instead + +#### rocBLAS 2.46.0 + +rocBLAS 2.46.0 for ROCm 5.4.0 + +##### Added + +- client smoke test dataset added for quick validation using command rocblas-test --yaml rocblas_smoke.yaml +- Added stream order device memory allocation as a non-default beta option. + +##### Optimized + +- Improved trsm performance for small sizes by using a substitution method technique +- Improved syr2k and her2k performance significantly by using a block-recursive algorithm + +##### Changed + +- Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour. +- Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm +- Allow hipBLAS to select int8 datatype +- Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace + +##### Fixed + +- FORTRAN interfaces generalized for FORTRAN compilers other than gfortran +- fix for trsm_strided_batched rocblas-bench performance gathering +- Fix for rocm-smi path in commandrunner.py script to match ROCm 5.2 and above + +#### rocFFT 1.0.19 + +rocFFT 1.0.19 for ROCm 5.4.0 + +##### Optimizations + +- Optimized some strided large 1D plans. + +##### Added + +- Added rocfft_plan_description_set_scale_factor API to efficiently multiply each output element of a FFT by a given scaling factor. +- Created a rocfft_kernel_cache.db file next to the installed library. SBCC kernels are moved to this file when built with the library, and are runtime-compiled for new GPU architectures. +- Added gfx1100 and gfx1102 to default AMDGPU_TARGETS. 
+ +##### Changed + +- Moved runtime compilation cache to in-memory by default. A default on-disk cache can encounter contention problems +on multi-node clusters with a shared filesystem. rocFFT can still be told to use an on-disk cache by setting the +ROCFFT_RTC_CACHE_PATH environment variable. + +#### rocPRIM 2.12.0 + +rocPRIM 2.12.0 for ROCm 5.4.0 + +##### Changed + +- `device_partition`, `device_unique`, and `device_reduce_by_key` now support problem + sizes larger than 2^32 items. + +##### Removed + +- `block_sort::sort()` overload for keys and values with a dynamic size. This overload was documented but the + implementation is missing. To avoid further confusion the documentation is removed until a decision is made on + implementing the function. + +##### Fixed + +- Fixed the compilation failure in `device_merge` if the two key iterators don't match. + +#### rocRAND 2.10.16 + +rocRAND 2.10.16 for ROCm 5.4.0 + +##### Added + +- MRG31K3P pseudorandom number generator based on L'Ecuyer and Touzin, 2000, "Fast combined multiple recursive generators with multipliers of the form a = ±2q ±2r". +- LFSR113 pseudorandom number generator based on L'Ecuyer, 1999, "Tables of maximally equidistributed combined LFSR generators". +- SCRAMBLED_SOBOL32 and SCRAMBLED_SOBOL64 quasirandom number generators. The Scrambled Sobol sequences are generated by scrambling the output of a Sobol sequence. + +##### Changed + +- The `mrg_<distribution>_distribution` structures, which provided numbers based on MRG32K3A, are now replaced by `mrg_engine_<distribution>_distribution`, where `<distribution>` is `log_normal`, `normal`, `poisson`, or `uniform`. These structures provide numbers for MRG31K3P (with template type `rocrand_state_mrg31k3p`) and MRG32K3A (with template type `rocrand_state_mrg32k3a`). + +##### Fixed + +- Sobol64 now returns 64 bits random numbers, instead of 32 bits random numbers. As a result, the performance of this generator has regressed. 
+- Fixed a bug that prevented compiling code in C++ mode (with a host compiler) when it included the rocRAND headers on Windows. + +#### rocSOLVER 3.20.0 + +rocSOLVER 3.20.0 for ROCm 5.4.0 + +##### Added + +- Partial SVD for bidiagonal matrices: + - BDSVDX +- Partial SVD for general matrices: + - GESVDX (with batched and strided\_batched versions) + +##### Changed + +- Changed `ROCSOLVER_EMBED_FMT` default to `ON` for users building directly with CMake. + This matches the existing default when building with install.sh or rmake.py. + +#### rocSPARSE 2.4.0 + +rocSPARSE 2.4.0 for ROCm 5.4.0 + +##### Added + +- Added rocsparse_spmv_ex routine +- Added rocsparse_bsrmv_ex_analysis and rocsparse_bsrmv_ex routines +- Added csritilu0 routine +- Added build support for Navi31 and Navi 33 + +##### Improved + +- Optimization to segmented algorithm for COO SpMV by performing analysis +- Improve performance when generating random matrices. +- Fixed bug in ellmv +- Optimized bsr2csr routine +- Fixed integer overflow bugs + +#### rocThrust 2.17.0 + +rocThrust 2.17.0 for ROCm 5.4.0 + +##### Added + +- Updated to match upstream Thrust 1.17.0 + +#### rocWMMA 0.9 + +rocWMMA 0.9 for ROCm 5.4.0 + +##### Added + +- Added gemm driver APIs for flow control builtins +- Added benchmark logging systems +- Restructured tests to follow naming convention. 
Added macros for test generation + +##### Changed + +- Changed CMake to accommodate the modified test infrastructure +- Fine tuned the multi-block kernels with and without lds +- Adjusted Maximum Vector Width to dWordx4 Width +- Updated Efficiencies to display as whole number percentages +- Updated throughput from GFlops/s to TFlops/s +- Reset the ad-hoc tests to use smaller sizes +- Modified the output validation to use CPU-based implementation against rocWMMA +- Modified the extended vector test to return error codes for memory allocation failures + +#### Tensile 4.35.0 + +Tensile 4.35.0 for ROCm 5.4.0 + +##### Added + +- Async DMA support for Transpose Data Layout (ThreadSeparateGlobalReadA/B) +- Option to output library logic in dictionary format +- No solution found error message for benchmarking client +- Exact K check for StoreCInUnrollExact +- Support for CGEMM + MIArchVgpr +- client-path parameter for using prebuilt client +- CleanUpBuildFiles global parameter +- Debug flag for printing library logic index of winning solution +- NumWarmups global parameter for benchmarking +- Windows support for benchmarking client +- DirectToVgpr support for CGEMM +- TensileLibLogicToYaml for creating tuning configs from library logic solutions + +##### Optimizations + +- Put beta code and store separately if StoreCInUnroll = x4 store +- Improved performance for StoreCInUnroll + b128 store + +##### Changed + +- Re-enable HardwareMonitor for gfx90a +- Decision trees use MLFeatures instead of Properties + +##### Fixed + +- Reject DirectToVgpr + MatrixInstBM/BN > 1 +- Fix benchmark timings when using warmups and/or validation +- Fix mismatch issue with DirectToVgprB + VectorWidth > 1 +- Fix mismatch issue with DirectToLds + NumLoadsCoalesced > 1 + TailLoop +- Fix incorrect reject condition for DirectToVgpr +- Fix reject condition for DirectToVgpr + MIWaveTile < VectorWidth +- Fix incorrect instruction generation with StoreCInUnroll + +------------------- + +## ROCm 5.3.3 + 
+### Fixed Defects + +#### Issue with rocTHRUST and rocPRIM Libraries + +There was a known issue with rocTHRUST and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases. + +- `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and `keys_input2`. +- `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and `keys_input2`. + +This issue is resolved with the following fixes to compilation failures: + +- rocPRIM: in device_merge if the two key iterators do not match. +- rocTHRUST: in thrust::merge if the two key iterators do not match. + +### Library Changes in ROCM 5.3.3 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.52.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.3.3) | +| hipCUB | [2.12.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.3.3) | +| hipFFT | [1.0.9](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.3.3) | +| hipSOLVER | [1.5.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.3.3) | +| hipSPARSE | [2.3.1](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.3.3) | +| rccl | [2.12.10](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.3.3) | +| rocALUTION | [2.1.0](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.3.3) | +| rocBLAS | [2.45.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.3.3) | +| rocFFT | [1.0.18](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.3.3) | +| rocm-cmake | [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.3.3) | +| rocPRIM | [2.11.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.3.3) | +| rocRAND | [2.10.15](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.3.3) | +| rocSOLVER | [3.19.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.3.3) | +| rocSPARSE | 
[2.2.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.3.3) | +| rocThrust | [2.16.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.3.3) | +| rocWMMA | [0.8](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.3.3) | +| Tensile | [4.34.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.3.3) | + +------------------- + +## ROCm 5.3.2 + +### Fixed Defects + +The following known issues in ROCm v5.3.2 are fixed in this release. + +#### Peer-to-Peer DMA Mapping Errors with SLES and RHEL + +Peer-to-Peer Direct Memory Access (DMA) mapping errors on Dell systems (R7525 and R750XA) with SLES 15 SP3/SP4 and RHEL 9.0 are fixed in this release. + +Previously, running rocminfo resulted in Peer-to-Peer DMA mapping errors. + +#### RCCL Tuning Table + +The RCCL tuning table is updated for supported platforms. + +#### SGEMM (F32 GEMM) Routines in rocBLAS + +Functional correctness failures in SGEMM (F32 GEMM) routines in rocBLAS for certain problem sizes and ranges are fixed in this release. + +### Known Issues + +This section consists of known issues in this release. + +#### AMD Instinct™ MI200 SRIOV Virtualization Issue + +There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads but does not impact Discrete Device Assignment (DDA) or bare metal. + +Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads. + +#### AMD Instinct™ MI200 Firmware Updates + +Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators. + +An updated firmware maintenance bundle consisting of an installation tool and images specific to AMD Instinct™ MI200 accelerators is under planning and will be available soon. 
+ +#### Known Issue with rocThrust and rocPRIM Libraries + +There is a known issue with rocThrust and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases. + +- thrust::merge no longer correctly supports different iterator types for `keys_input1` and `keys_input2`. + +- rocprim::device_merge no longer correctly supports using different types for `keys_input1` and `keys_input2`. + +This issue is currently under investigation and will be resolved in a future release. + +### Library Changes in ROCM 5.3.2 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.52.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.3.2) | +| hipCUB | [2.12.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.3.2) | +| hipFFT | [1.0.9](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.3.2) | +| hipSOLVER | [1.5.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.3.2) | +| hipSPARSE | [2.3.1](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.3.2) | +| rccl | [2.12.10](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.3.2) | +| rocALUTION | [2.1.0](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.3.2) | +| rocBLAS | [2.45.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.3.2) | +| rocFFT | [1.0.18](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.3.2) | +| rocm-cmake | [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.3.2) | +| rocPRIM | [2.11.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.3.2) | +| rocRAND | [2.10.15](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.3.2) | +| rocSOLVER | [3.19.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.3.2) | +| rocSPARSE | [2.2.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.3.2) | +| rocThrust | 
[2.16.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.3.2) | +| rocWMMA | [0.8](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.3.2) | +| Tensile | [4.34.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.3.2) | + +------------------- + +## ROCm 5.3.0 + +### Deprecations and Warnings + +#### HIP Perl Scripts Deprecation + +The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts. + +> **Note** +> +> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option. + +#### Linux Filesystem Hierarchy Standard for ROCm + +ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility. + +##### New Filesystem Hierarchy + +The following is the new filesystem hierarchy: + +```text +/opt/rocm- + | --bin + | --All externally exposed Binaries + | --libexec + | -- + | -- Component specific private non-ISA executables (architecture independent) + | --include + | -- + | --
+ | --lib + | --lib.so -> lib.so.major -> lib.so.major.minor.patch + (public libraries linked with application) + | -- (component specific private library, executable data) + | -- + | --components + | --.config.cmake + | --share + | --html//*.html + | --info//*.[pdf, md, txt] + | --man + | --doc + | -- + | -- + | -- + | -- (arch independent non-executable) + | --samples + +``` + +> **Note** +> +> ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release. + +For more information, refer to . + +##### Backward Compatibility with Older Filesystems + +ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility. + +> **Note** +> +> ROCm will continue supporting backward compatibility until the next major release. + +##### Wrapper header files + +Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: + +```cpp +// Code snippet from hip_runtime.h +#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip" +#include "hip/hip_runtime.h" +``` + +The wrapper header files’ backward compatibility deprecation is as follows: + +- `#pragma` message announcing deprecation -- ROCm v5.2 release +- `#pragma` message changed to `#warning` -- Future release +- `#warning` changed to `#error` -- Future release +- Backward compatibility wrappers removed -- Future release + +##### Library files + +Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx//lib`) has a soft link to the library at the new location. 
+ +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/ +total 4 +drwxr-xr-x 4 root root 4096 May 12 10:45 cmake +lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so +``` + +##### CMake Config files + +All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx//lib/cmake`) consist of a soft link to the new CMake config. + +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/cmake/hip/ +total 0 +lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake +``` + +### Fixed Defects + +The following defects are fixed in this release. + +These defects were identified and documented as known issues in previous ROCm releases and are fixed in the ROCm v5.3 release. + +#### Kernel produces incorrect results with ROCm 5.2 + +User code did not initialize certain data constructs, leading to a correctness issue. A strict reading of the C++ standard suggests that failing to initialize these data constructs is undefined behavior. However, a special case was added for a specific compiler builtin to handle the uninitialized data in a defined manner. + +The compiler fix consists of the following patches: + +- A new `noundef` attribute is added. This attribute denotes when a function call argument or return val may never contain uninitialized bits. +For more information, see +- The application of this attribute was refined such that it was not added to a specific compiler builtin where the compiler knows that inactive lanes do not impact program execution. + +For more information, see . + +### Known Issues + +This section consists of known issues in this release. + +#### Issue with OpenMP-Extras Package Upgrade + +The `openmp-extras` package has been split into runtime (`openmp-extras-runtime`) and dev (`openmp-extras-devel`) packages. This change has broken the upgrade support for the `openmp-extras` package in RHEL/SLES. 
+An available workaround in RHEL is to use the following command for upgrades: + +```sh +sudo yum upgrade rocm-language-runtime --allowerasing + +``` + +An available workaround in SLES is to use the following command for upgrades: + +```sh +zypper update --force-resolution +``` + +#### AMD Instinct™ MI200 SRIOV Virtualization Issue + +There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads, but does not impact Discrete Device Assignment (DDA) or Bare Metal. + +Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads. + +#### System Crash when IOMMU is Enabled + +If IOMMU is enabled in SBIOS and ROCm is installed, the system may report the following failure or errors when running workloads such as bandwidth test, clinfo, and HelloWord.cl and cause a system crash. + +- IO PAGE FAULT +- IRQ remapping does not support X2APIC mode +- NMI error + +Workaround: To avoid the system crash, add `amd_iommu=on iommu=pt` as the kernel bootparam, as indicated in the warning message. 
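Before running workloads, it can help to confirm that the two boot parameters from the workaround above are active. The following is a minimal sketch, not a ROCm tool: the helper name `missing_iommu_params` is hypothetical, and it assumes a Linux system where the running kernel's command line can be read from `/proc/cmdline`.

```python
# Kernel boot parameters named in the workaround above.
REQUIRED_PARAMS = ("amd_iommu=on", "iommu=pt")


def missing_iommu_params(cmdline):
    """Return the required parameters that are absent from a kernel command line.

    `cmdline` is the whitespace-separated parameter string, in the form
    found in /proc/cmdline. Hypothetical helper for illustration only.
    """
    present = set(cmdline.split())
    return [param for param in REQUIRED_PARAMS if param not in present]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError:
        # Not on Linux, or /proc is unavailable: treat as an empty command line.
        current = ""
    missing = missing_iommu_params(current)
    if missing:
        print("Missing kernel parameters:", " ".join(missing))
    else:
        print("Both IOMMU workaround parameters are set.")
```

The check compares whole tokens rather than substrings, so a value such as `amd_iommu=off` is reported as missing `amd_iommu=on`. How the parameters are added to the boot configuration depends on the distribution's bootloader and is not covered here.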
+ +### Library Changes in ROCM 5.3.0 + +| Library | Version | +|---------|---------| +| hipBLAS | 0.51.0 ⇒ [0.52.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.3.0) | +| hipCUB | 2.11.1 ⇒ [2.12.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.3.0) | +| hipFFT | 1.0.8 ⇒ [1.0.9](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.3.0) | +| hipSOLVER | 1.4.0 ⇒ [1.5.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.3.0) | +| hipSPARSE | 2.2.0 ⇒ [2.3.1](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.3.0) | +| rccl | [2.12.10](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.3.0) | +| rocALUTION | 2.0.3 ⇒ [2.1.0](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.3.0) | +| rocBLAS | 2.44.0 ⇒ [2.45.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.3.0) | +| rocFFT | 1.0.17 ⇒ [1.0.18](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.3.0) | +| rocm-cmake | ⇒ [0.8.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.3.0) | +| rocPRIM | 2.10.14 ⇒ [2.11.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.3.0) | +| rocRAND | 2.10.14 ⇒ [2.10.15](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.3.0) | +| rocSOLVER | 3.18.0 ⇒ [3.19.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.3.0) | +| rocSPARSE | [2.2.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.3.0) | +| rocThrust | 2.15.0 ⇒ [2.16.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.3.0) | +| rocWMMA | 0.7 ⇒ [0.8](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.3.0) | +| Tensile | 4.33.0 ⇒ [4.34.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.3.0) | + +#### hipBLAS 0.52.0 + +hipBLAS 0.52.0 for ROCm 5.3.0 + +##### Added + +- Added --cudapath option to install.sh to allow user to 
specify which cuda build they would like to use. +- Added --installcuda option to install.sh to install cuda via a package manager. Can be used with new --installcudaversion + option to specify which version of cuda to install. + +##### Fixed + +- Fixed #includes to support a compiler version. +- Fixed client dependency support in install.sh + +#### hipCUB 2.12.0 + +hipCUB 2.12.0 for ROCm 5.3.0 + +##### Added + +- UniqueByKey device algorithm +- SubtractLeft, SubtractLeftPartialTile, SubtractRight, SubtractRightPartialTile overloads in BlockAdjacentDifference. + - The old overloads (FlagHeads, FlagTails, FlagHeadsAndTails) are deprecated. +- DeviceAdjacentDifference algorithm. +- Extended benchmark suite of `DeviceHistogram`, `DeviceScan`, `DevicePartition`, `DeviceReduce`, +`DeviceSegmentedReduce`, `DeviceSegmentedRadixSort`, `DeviceRadixSort`, `DeviceSpmv`, `DeviceMergeSort`, +`DeviceSegmentedSort` + +##### Changed + +- Obsoleted type traits defined in util_type.hpp. Use the standard library equivalents instead. +- CUB backend references CUB and thrust version 1.16.0. +- DeviceRadixSort's num_items parameter's type is now templated instead of being an int. + - If an integral type with a size at most 4 bytes is passed (e.g. an int), the former logic applies. + - Otherwise the algorithm uses a larger indexing type that makes it possible to sort input data over 2**32 elements. +- Improved build parallelism of the test suite by splitting up large compilation units + +#### hipFFT 1.0.9 + +hipFFT 1.0.9 for ROCm 5.3.0 + +##### Changed + +- Clean up build warnings. +- GNUInstall Dir enhancements. +- Requires gtest 1.11. 
+ +#### hipSOLVER 1.5.0 + +hipSOLVER 1.5.0 for ROCm 5.3.0 + +##### Added + +- Added functions + - syevj + - hipsolverSsyevj_bufferSize, hipsolverDsyevj_bufferSize, hipsolverCheevj_bufferSize, hipsolverZheevj_bufferSize + - hipsolverSsyevj, hipsolverDsyevj, hipsolverCheevj, hipsolverZheevj + - syevjBatched + - hipsolverSsyevjBatched_bufferSize, hipsolverDsyevjBatched_bufferSize, hipsolverCheevjBatched_bufferSize, hipsolverZheevjBatched_bufferSize + - hipsolverSsyevjBatched, hipsolverDsyevjBatched, hipsolverCheevjBatched, hipsolverZheevjBatched + - sygvj + - hipsolverSsygvj_bufferSize, hipsolverDsygvj_bufferSize, hipsolverChegvj_bufferSize, hipsolverZhegvj_bufferSize + - hipsolverSsygvj, hipsolverDsygvj, hipsolverChegvj, hipsolverZhegvj +- Added compatibility-only functions + - syevdx/heevdx + - hipsolverDnSsyevdx_bufferSize, hipsolverDnDsyevdx_bufferSize, hipsolverDnCheevdx_bufferSize, hipsolverDnZheevdx_bufferSize + - hipsolverDnSsyevdx, hipsolverDnDsyevdx, hipsolverDnCheevdx, hipsolverDnZheevdx + - sygvdx/hegvdx + - hipsolverDnSsygvdx_bufferSize, hipsolverDnDsygvdx_bufferSize, hipsolverDnChegvdx_bufferSize, hipsolverDnZhegvdx_bufferSize + - hipsolverDnSsygvdx, hipsolverDnDsygvdx, hipsolverDnChegvdx, hipsolverDnZhegvdx +- Added --mem_query option to hipsolver-bench, which will print the amount of device memory workspace required by the function. + +##### Changed + +- The rocSOLVER backend will now set `info` to zero if rocSOLVER does not reference `info`. (Applies to orgbr/ungbr, orgqr/ungqr, orgtr/ungtr, ormqr/unmqr, ormtr/unmtr, gebrd, geqrf, getrs, potrs, and sytrd/hetrd). +- gesvdj will no longer require extra workspace to transpose `V` when `jobz` is `HIPSOLVER_EIG_MODE_VECTOR` and `econ` is 1. 
+ +##### Fixed + +- Fixed Fortran return value declarations within hipsolver_module.f90 +- Fixed gesvdj_bufferSize returning `HIPSOLVER_STATUS_INVALID_VALUE` when `jobz` is `HIPSOLVER_EIG_MODE_NOVECTOR` and 1 <= `ldv` < `n` +- Fixed gesvdj returning `HIPSOLVER_STATUS_INVALID_VALUE` when `jobz` is `HIPSOLVER_EIG_MODE_VECTOR`, `econ` is 1, and `m` < `n` + +#### hipSPARSE 2.3.1 + +hipSPARSE 2.3.1 for ROCm 5.3.0 + +##### Added + +- Add SpMM and SpMM batched for CSC format + +#### rocALUTION 2.1.0 + +rocALUTION 2.1.0 for ROCm 5.3.0 + +##### Added + +- Benchmarking tool +- Ext+I Interpolation with sparsify strategies added for RS-AMG + +##### Improved + +- ParallelManager + +#### rocBLAS 2.45.0 + +rocBLAS 2.45.0 for ROCm 5.3.0 + +##### Added + +- install.sh option --upgrade_tensile_venv_pip to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is TENSILE_VENV_UPGRADE_PIP. +- install.sh option --relocatable or -r adds rpath and removes ldconf entry on rocBLAS build. +- install.sh option --lazy-library-loading to enable on-demand loading of tensile library files at runtime to speedup rocBLAS initialization. +- Support for RHEL9 and CS9. +- Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value. + +##### Optimizations + +- trmm_outofplace performance improvements for all sizes and data types using block-recursive algorithm. +- herkx performance improvements for all sizes and data types using block-recursive algorithm. +- syrk/herk performance improvements by utilising optimised syrkx/herkx code. +- symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm. + +##### Changed + +- Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now. 
+- Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour. +- Level 1, 2, and 3 function argument checking for enums now matches legacy BLAS more rigorously and returns rocblas_status_invalid_value if arguments do not match the accepted subset. +- Add quick-return for internal trmm and gemm template functions. +- Moved function block sizes to a shared header file. +- Level 1, 2, and 3 functions use rocblas_stride datatype for offset. +- Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions. +- Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure. +- Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions. + +##### Fixed + +- Improved logic to #include <filesystem> vs <experimental/filesystem>. +- install.sh -s option to build rocblas as a static library. +- dot function now sets the device results asynchronously for N <= 0 + +##### Deprecated + +- is_complex helper is now deprecated. Use rocblas_is_complex instead. +- The enum truncate_t and the value truncate are now deprecated and will be removed in the ROCm 6.0 release. They are replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate can be used starting with this ROCm release for an easy transition. + +##### Removed + +- install.sh options --hip-clang, --no-hip-clang, --merge-files, --no-merge-files are removed. 
+ +#### rocFFT 1.0.18 + +rocFFT 1.0.18 for ROCm 5.3.0 + +##### Changed + +- Runtime compilation cache now looks for environment variables XDG_CACHE_HOME (on Linux) and LOCALAPPDATA (on + Windows) before falling back to HOME. + +##### Optimizations + +- Optimized 2D R2C/C2R to use 2-kernel plans where possible. +- Improved performance of the Bluestein algorithm. +- Optimized sbcc-168 and 100 by using half-lds. + +##### Fixed + +- Fixed occasional failures to parallelize runtime compilation of kernels. + Failures would be retried serially and ultimately succeed, but this would take extra time. +- Fixed failures of some R2C 3D transforms that use the unsupported TILE_UNALGNED SBRC kernels. + An example is 98^3 R2C out-of-place. +- Fixed bugs in SBRC_ERC type. + +#### rocm-cmake 0.8.0 + +rocm-cmake 0.8.0 for ROCm 5.3.0 + +##### Fixed + +- Fixed error in prerm scripts created by `rocm_create_package` that could break uninstall for packages using the `PTH` option. + +##### Changed + +- `ROCM_USE_DEV_COMPONENT` set to on by default for all platforms. This means that Windows will now generate runtime and devel packages by default +- ROCMInstallTargets now defaults `CMAKE_INSTALL_LIBDIR` to `lib` if not otherwise specified. +- Changed default Debian compression type to xz and enabled multi-threaded package compression. +- `rocm_create_package` will no longer warn upon failure to determine version of program rpmbuild. + +#### rocPRIM 2.11.0 + +rocPRIM 2.11.0 for ROCm 5.3.0 + +##### Added + +- New functions `subtract_left` and `subtract_right` in `block_adjacent_difference` to apply functions + on pairs of adjacent items distributed between threads in a block. +- New device level `adjacent_difference` primitives. 
+- Added experimental tooling for automatic kernel configuration tuning for various architectures +- Benchmarks collect and output more detailed system information +- CMake functionality to improve build parallelism of the test suite that splits compilation units by +function or by parameters. +- Reverse iterator. + +#### rocRAND 2.10.15 + +rocRAND 2.10.15 for ROCm 5.3.0 + +##### Changed + +- Increased number of warmup iterations for rocrand_benchmark_generate from 5 to 15 to eliminate corner cases that would generate artificially high benchmark scores. + +#### rocSOLVER 3.19.0 + +rocSOLVER 3.19.0 for ROCm 5.3.0 + +##### Added + +- Partial eigensolver routines for symmetric/hermitian matrices: + - SYEVX (with batched and strided\_batched versions) + - HEEVX (with batched and strided\_batched versions) +- Generalized symmetric- and hermitian-definite partial eigensolvers: + - SYGVX (with batched and strided\_batched versions) + - HEGVX (with batched and strided\_batched versions) +- Eigensolver routines for symmetric/hermitian matrices using Jacobi algorithm: + - SYEVJ (with batched and strided\_batched versions) + - HEEVJ (with batched and strided\_batched versions) +- Generalized symmetric- and hermitian-definite eigensolvers using Jacobi algorithm: + - SYGVJ (with batched and strided\_batched versions) + - HEGVJ (with batched and strided\_batched versions) +- Added --profile_kernels option to rocsolver-bench, which will include kernel calls in the + profile log (if profile logging is enabled with --profile). + +##### Changed + +- Changed rocsolver-bench result labels `cpu_time` and `gpu_time` to + `cpu_time_us` and `gpu_time_us`, respectively. + +##### Removed + +- Removed dependency on cblas from the rocsolver test and benchmark clients. + +##### Fixed + +- Fixed incorrect SYGS2/HEGS2, SYGST/HEGST, SYGV/HEGV, and SYGVD/HEGVD results for batch counts + larger than 32. +- Fixed STEIN memory access fault when nev is 0. 
+- Fixed incorrect STEBZ results for close eigenvalues when range = index. +- Fixed git unsafe repository error when building with `./install.sh -cd` as a non-root user. + +#### rocThrust 2.16.0 + +rocThrust 2.16.0 for ROCm 5.3.0 + +##### Changed + +- rocThrust functionality that depends on device malloc is functional again because ROCm 5.2 re-enabled device malloc. Device-launched `thrust::sort` and `thrust::sort_by_key` are available for use. + +#### rocWMMA 0.8 + +rocWMMA 0.8 for ROCm 5.3.0 + +#### Tensile 4.34.0 + +Tensile 4.34.0 for ROCm 5.3.0 + +##### Added + +- Lazy loading of solution libraries and code object files +- Support for dictionary style logic files +- Support for decision tree based logic files using dictionary format +- DecisionTreeLibrary for solution selection +- DirectToLDS support for HGEMM +- DirectToVgpr support for SGEMM +- Grid based distance metric for solution selection +- Support for gfx11xx +- Support for DirectToVgprA/B + TLU=False +- ForkParameters Groups as a way of specifying solution parameters +- Support for a new Tensile yaml config format +- TensileClientConfig for generating Tensile client config files +- Options for TensileCreateLibrary to build client and create client config file + +##### Optimizations + +- Solution generation is now cached and is not repeated if solution parameters are unchanged + +##### Changed + +- Default MACInstruction to FMA + +##### Fixed + +- Accept StaggerUStride=0 as valid +- Reject invalid data types for UnrollLoopEfficiencyEnable +- Fix invalid code generation issues related to DirectToVgpr +- Return hipErrorNotFound if no modules are loaded +- Fix performance drop for NN ZGEMM with 96x64 macro tile +- Fix memory violation for general batched kernels when alpha/beta/K = 0 + +------------------- + +## ROCm 5.2.3 + +### Changes in This Release + +#### Ubuntu 18.04 End of Life Announcement + +Support for Ubuntu 18.04 ends in this release. 
Future releases of ROCm will not provide prebuilt packages for Ubuntu 18.04. +HIP and Other Runtimes + +#### HIP Runtime + +##### Fixes + +- A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel is called twice (with different argument values) in a graph capture, the implementation only kept the argument values for the second kernel call. + +- A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the `hiprtcGetLoweredName` call to fail for named expressions that contain whitespace. + +Example: + +The named expression `my_sqrt>` passed but `my_sqrt>` failed. +ROCm Libraries + +#### RCCL + +##### Added + +- Compatibility with NCCL 2.12.10 + +- Packages for test and benchmark executables on all supported OSes using CPack + +- Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1 + + - Additional details provided if Binary File Descriptor library (BFD) is pre-installed. + +- Added experimental support for using multiple ranks per device + + - Requires using a new interface to create communicator (ncclCommInitRankMulti), refer to the interface documentation for details. + + - To avoid potential deadlocks, users might have to set an environment variable that increases the number of hardware queues. 
For example, + +```sh +export GPU_MAX_HW_QUEUES=16 +``` + +- Added support for reusing ports in NET/IB channels + + - Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1 + + - When "Call to bind failed: Address already in use" error happens in large-scale AlltoAll(for example, >=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the massive port usage issue + + - Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1 + +##### Removed + +- Removed experimental clique-based kernels + +#### Development Tools + +No notable changes in this release for development tools, including the compiler, profiler, and debugger +Deployment and Management Tools + +No notable changes in this release for deployment and management tools. +Older ROCm Releases + +For release information for older ROCm releases, refer to + +### Library Changes in ROCM 5.2.3 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.51.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.2.3) | +| hipCUB | [2.11.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.2.3) | +| hipFFT | [1.0.8](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.2.3) | +| hipSOLVER | [1.4.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.2.3) | +| hipSPARSE | [2.2.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.2.3) | +| rccl | 2.11.4 ⇒ [2.12.10](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.2.3) | +| rocALUTION | [2.0.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.2.3) | +| rocBLAS | [2.44.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.2.3) | +| rocFFT | [1.0.17](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.2.3) | +| rocPRIM | [2.10.14](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.2.3) | +| rocRAND | 
[2.10.14](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.2.3) | +| rocSOLVER | [3.18.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.2.3) | +| rocSPARSE | [2.2.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.2.3) | +| rocThrust | [2.15.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.2.3) | +| rocWMMA | [0.7](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.2.3) | +| Tensile | [4.33.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.2.3) | + +#### rccl 2.12.10 + +RCCL 2.12.10 for ROCm 5.2.3 + +##### Added + +- Compatibility with NCCL 2.12.10 +- Packages for test and benchmark executables on all supported OSes using CPack. +- Adding custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1 + - Additional details provided if Binary File Descriptor library (BFD) is pre-installed +- Adding support for reusing ports in NET/IB channels + - Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1 + - When "Call to bind failed : Address already in use" error happens in large-scale AlltoAll + (e.g., >=64 MI200 nodes), users are suggested to opt-in either one or both of the options + to resolve the massive port usage issue + - Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1 + +##### Removed + +- Removed experimental clique-based kernels + +------------------- + +## ROCm 5.2.1 + + +### Library Changes in ROCM 5.2.1 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.51.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.2.1) | +| hipCUB | [2.11.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.2.1) | +| hipFFT | [1.0.8](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.2.1) | +| hipSOLVER | [1.4.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.2.1) | +| hipSPARSE | 
[2.2.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.2.1) | +| rccl | [2.11.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.2.1) | +| rocALUTION | [2.0.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.2.1) | +| rocBLAS | [2.44.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.2.1) | +| rocFFT | [1.0.17](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.2.1) | +| rocPRIM | [2.10.14](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.2.1) | +| rocRAND | [2.10.14](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.2.1) | +| rocSOLVER | [3.18.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.2.1) | +| rocSPARSE | [2.2.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.2.1) | +| rocThrust | [2.15.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.2.1) | +| rocWMMA | [0.7](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.2.1) | +| Tensile | [4.33.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.2.1) | + +------------------- + +## ROCm 5.2.0 + + +### What's New in This Release + +#### HIP Enhancements + +The ROCm v5.2 release consists of the following HIP enhancements: + +##### HIP Installation Guide Updates + +The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms. + +For more details, refer to the HIP Installation Guide v5.2. + +##### Support for device-side malloc on HIP-Clang + +HIP-Clang now supports device-side malloc. This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallocHeapSize,value)` nor respect any setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed. 
+ +The test codes at the following link show how to implement applications using malloc and free functions in device kernels: + + + +##### New HIP APIs in This Release + +The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs: + +###### Device management HIP APIs + +The new device management HIP APIs are as follows: + +- Gets a UUID for the device. This API returns a UUID for the device. + + ```cpp + hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device); + ``` + + > **Note** + > + > This new API corresponds to the following CUDA API: + > + > ```cpp + > CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev); + > ``` + +- Gets default memory pool of the specified device + + ```cpp + hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device); + ``` + +- Sets the current memory pool of a device + + ```cpp + hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool); + ``` + +- Gets the current memory pool for the specified device + + ```cpp + hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device); + ``` + +###### New HIP Runtime APIs in Memory Management + +The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are as follows: + +- Allocates memory with stream ordered semantics + + ```cpp + hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream); + ``` + +- Frees memory with stream ordered semantics + + ```cpp + hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream); + ``` + +- Releases freed memory back to the OS + + ```cpp + hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold); + ``` + +- Sets attributes of a memory pool + + ```cpp + hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value); + ``` + +- Gets attributes of a memory pool + + ```cpp + hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* 
value); + ``` + +- Controls visibility of the specified pool between devices + + ```cpp + hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count); + ``` + +- Returns the accessibility of a pool from a device + + ```cpp + hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location); + ``` + +- Creates a memory pool + + ```cpp + hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props); + ``` + +- Destroys the specified memory pool + + ```cpp + hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool); + ``` + +- Allocates memory from a specified pool with stream ordered semantics + + ```cpp + hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream); + ``` + +- Exports a memory pool to the requested handle type + + ```cpp + hipError_t hipMemPoolExportToShareableHandle( + void* shared_handle, + hipMemPool_t mem_pool, + hipMemAllocationHandleType handle_type, + unsigned int flags); + ``` + +- Imports a memory pool from a shared handle + + ```cpp + hipError_t hipMemPoolImportFromShareableHandle( + hipMemPool_t* mem_pool, + void* shared_handle, + hipMemAllocationHandleType handle_type, + unsigned int flags); + ``` + +- Exports data to share a memory pool allocation between processes + + ```cpp + hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr); + // Imports a memory pool allocation from another process. + hipError_t hipMemPoolImportPointer( + void** dev_ptr, + hipMemPool_t mem_pool, + hipMemPoolPtrExportData* export_data); + ``` + +###### HIP Graph Management APIs + +The new HIP Graph Management APIs are as follows: + +- Enqueues a host function call in a stream + + ```cpp + hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData); + ``` + +- Swaps the stream capture mode of a thread + + ```cpp + hipError_t 
hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode); + ``` + +- Sets a node attribute + + ```cpp + hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value); + ``` + +- Gets a node attribute + + ```cpp + hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value); + ``` + +###### Support for Virtual Memory Management APIs + +The new APIs for virtual memory management are as follows: + +- Frees an address range reservation made via hipMemAddressReserve + + ```cpp + hipError_t hipMemAddressFree(void* devPtr, size_t size); + ``` + +- Reserves an address range + + ```cpp + hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags); + ``` + +- Creates a memory allocation described by the properties and size + + ```cpp + hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags); + ``` + +- Exports an allocation to a requested shareable handle type + + ```cpp + hipError_t hipMemExportToShareableHandle(void* shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags); + ``` + +- Gets the access flags set for the given location and ptr + + ```cpp + hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr); + ``` + +- Calculates either the minimal or recommended granularity + + ```cpp + hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option); + ``` + +- Retrieves the property structure of the given handle + + ```cpp + hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle); + ``` + +- Imports an allocation from a requested shareable handle type + + ```cpp + 
hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType); + ``` + +- Maps an allocation handle to a reserved virtual address range + + ```cpp + hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags); + ``` + +- Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays + + ```cpp + hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream); + ``` + +- Releases a memory handle representing a memory allocation that was previously allocated through hipMemCreate + + ```cpp + hipError_t hipMemRelease(hipMemGenericAllocationHandle_t handle); + ``` + +- Returns the allocation handle of the backing memory allocation given the address + + ```cpp + hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr); + ``` + +- Sets the access flags for each location specified in desc for the given virtual address range + + ```cpp + hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count); + ``` + +- Unmaps the memory allocation of a given address range + + ```cpp + hipError_t hipMemUnmap(void* ptr, size_t size); + ``` + +For more information, refer to the HIP API documentation at +{doc}`hip:.doxygen/docBin/html/modules`. + +##### Planned HIP Changes in Future Releases + +Changes to `hipDeviceProp_t`, `HIPMEMCPY_3D`, and `hipArray` structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility. + +Refer to the Release Notes document in subsequent releases for more information. 
+ +#### ROCm Math and Communication Libraries + +In this release, ROCm Math and Communication Libraries include the following enhancements and fixes: + +##### New rocWMMA for Matrix Multiplication and Accumulation Operations Acceleration + +This release introduces a new ROCm C++ library for accelerating mixed precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels. + +rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed. + +For more information, refer to +[Communication Libraries](./reference/libraries/gpu-libraries/communication.md). + +#### OpenMP Enhancements in This Release + +##### OMPT Target Support + +The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. 
If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification. + +Following is an example demonstrating how a tool would use the OMPT target APIs supported. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the provided example as indicated below: + +```sh +cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing +make run +``` + +The file `veccopy-ompt-target-tracing.c` simulates how a tool would initiate device activity tracing. The file `callbacks.h` shows the callbacks that may be registered and implemented by the tool. + +### Deprecations and Warnings + +#### Linux Filesystem Hierarchy Standard for ROCm + +ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility. + +##### New Filesystem Hierarchy + +The following is the new filesystem hierarchy: + +```text +/opt/rocm- + | --bin + | --All externally exposed Binaries + | --libexec + | -- + | -- Component specific private non-ISA executables (architecture independent) + | --include + | -- + | --
+ | --lib + | --lib.so -> lib.so.major -> lib.so.major.minor.patch + (public libraries linked with application) + | -- (component specific private library, executable data) + | -- + | --components + | --.config.cmake + | --share + | --html//*.html + | --info//*.[pdf, md, txt] + | --man + | --doc + | -- + | -- + | -- + | -- (arch independent non-executable) + | --samples + +``` + +> **Note** +> +> ROCm will not support backward compatibility with the v5.1 (old) filesystem hierarchy in its next major release. + +For more information, refer to . + +##### Backward Compatibility with Older Filesystems + +ROCm has moved header files and libraries to their new locations as indicated in the above structure and included symbolic links and wrapper header files in the old locations for backward compatibility. + +> **Note** +> +> ROCm will continue supporting backward compatibility until the next major release. + +##### Wrapper header files + +Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: + +```cpp +// Code snippet from hip_runtime.h +#pragma message("This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip") +#include "hip/hip_runtime.h" +``` + +Backward compatibility through the wrapper header files is deprecated in the following stages: + +- `#pragma` message announcing deprecation -- ROCm v5.2 release +- `#pragma` message changed to `#warning` -- Future release +- `#warning` changed to `#error` -- Future release +- Backward compatibility wrappers removed -- Future release + +##### Library files + +Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx//lib`) has a soft link to the library at the new location. 
+ +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/ +total 4 +drwxr-xr-x 4 root root 4096 May 12 10:45 cmake +lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so +``` + +##### CMake Config files + +All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx//lib/cmake`) contain a soft link to the new CMake config. + +Example: + +```bash +$ ls -l /opt/rocm/hip/lib/cmake/hip/ +total 0 +lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake +``` + +#### Planned deprecation of hip-rocclr and hip-base packages + +In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously. + +Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are supported. + +#### Deprecation of Integrated HIP Directed Tests + +The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in a future release. + +### Fixed Defects + +| Fixed Defect | Fix | +|------------------------------------------------------------------------------|----------| +| ROCmInfo does not list GPUs | Code fix | +| Hang observed while restoring cooperative group samples | Code fix | +| ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | Code fix | + +### Known Issues + +This section lists the known issues in this release. 
+ +#### Compiler Error on gfx1030 When Compiling at -O0 + +##### Issue + +A compiler error occurs when using the -O0 flag to compile code for gfx1030 that calls atomicAddNoRet, which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030. + +##### Workaround + +The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction. + +#### System Freeze Observed During CUDA Memtest Checkpoint + +##### Issue + +Checkpoint/Restore in Userspace (CRIU) requires approximately 20 MB of VRAM to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications. + +##### Workaround + +To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available. + +#### HPC test fails with the “HSA_STATUS_ERROR_MEMORY_FAULT” error + +##### Issue + +The compiler may incorrectly compile a program that uses the `__shfl_sync(mask, value, srcLane)` function when the "value" parameter to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior, but the definition for `__shfl_sync` should allow for undefined values. + +##### Workaround + +The workaround is to initialize the parameters to `__shfl_sync`. + +> **Note** +> +> When the `-Wall` compilation flag is used, the compiler generates a warning indicating the variable is uninitialized along some path. + +Example: + +```cpp +double res = 0.0; // Initialize the input to __shfl_sync. +if (lane == 0) { + res = +} +res = __shfl_sync(mask, res, 0); +``` + +#### Kernel produces incorrect result + +##### Issue + +In recent changes to Clang, insertion of the noundef attribute to all the function arguments has been enabled by default. + +In the HIP kernel, variable var in shfl_sync may not be initialized, so LLVM IR treats it as undef. 
+ +So, a function argument that is potentially undef (because it is not initialized) is always assumed to be noundef by LLVM IR (since Clang has inserted the noundef attribute). This leads to ambiguous kernel execution. + +##### Workaround + +* Skip adding the `noundef` attribute to functions tagged with the convergent attribute. Refer to for more information. + +* Introduce a shuffle attribute and add it to `__shfl`-like APIs in the HIP headers. Clang can skip adding the noundef attribute if it finds that the argument is tagged with the shuffle attribute. Refer to for more information. + +* Introduce a Clang builtin for `__shfl` to identify it and skip adding the `noundef` attribute. + +* Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header needs to insert freezes on the relevant inputs. + +#### Issue with Applications Triggering Oversubscription + +There is a known issue with applications that trigger oversubscription. A hardware hang occurs when ROCgdb is used on AMD Instinct™ MI50 and MI100 systems. + +This issue is under investigation and will be fixed in a future release. 
+ +### Library Changes in ROCM 5.2.0 + +| Library | Version | +|---------|---------| +| hipBLAS | 0.50.0 ⇒ [0.51.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.2.0) | +| hipCUB | 2.11.0 ⇒ [2.11.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.2.0) | +| hipFFT | 1.0.7 ⇒ [1.0.8](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.2.0) | +| hipSOLVER | 1.3.0 ⇒ [1.4.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.2.0) | +| hipSPARSE | 2.1.0 ⇒ [2.2.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.2.0) | +| rccl | [2.11.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.2.0) | +| rocALUTION | 2.0.2 ⇒ [2.0.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.2.0) | +| rocBLAS | 2.43.0 ⇒ [2.44.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.2.0) | +| rocFFT | 1.0.16 ⇒ [1.0.17](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.2.0) | +| rocPRIM | 2.10.13 ⇒ [2.10.14](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.2.0) | +| rocRAND | 2.10.13 ⇒ [2.10.14](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.2.0) | +| rocSOLVER | 3.17.0 ⇒ [3.18.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.2.0) | +| rocSPARSE | 2.1.0 ⇒ [2.2.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.2.0) | +| rocThrust | 2.14.0 ⇒ [2.15.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.2.0) | +| rocWMMA | ⇒ [0.7](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.2.0) | +| Tensile | 4.32.0 ⇒ [4.33.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.2.0) | + +#### hipBLAS 0.51.0 + +hipBLAS 0.51.0 for ROCm 5.2.0 + +##### Added + +* Packages for test and benchmark executables on all supported OSes using CPack. 
+* Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions +* Added user-specified initialization option to hipblas-bench + +##### Fixed + +* Fixed version gathering in performance measuring script + +#### hipCUB 2.11.1 + +hipCUB 2.11.1 for ROCm 5.2.0 + +##### Added + +* Packages for tests and benchmark executable on all supported OSes using CPack. + +#### hipFFT 1.0.8 + +hipFFT 1.0.8 for ROCm 5.2.0 + +##### Added + +* Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. +* Packages for test and benchmark executables on all supported OSes using CPack. + +#### hipSOLVER 1.4.0 + +hipSOLVER 1.4.0 for ROCm 5.2.0 + +##### Added + +* Package generation for test and benchmark executables on all supported OSes using CPack. +* File/Folder Reorg + * Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. + +##### Fixed + +* Fixed the ReadTheDocs documentation generation. + +#### hipSPARSE 2.2.0 + +hipSPARSE 2.2.0 for ROCm 5.2.0 + +##### Added + +* Packages for test and benchmark executables on all supported OSes using CPack. + +#### rocALUTION 2.0.3 + +rocALUTION 2.0.3 for ROCm 5.2.0 + +##### Added + +* Packages for test and benchmark executables on all supported OSes using CPack. + +#### rocBLAS 2.44.0 + +rocBLAS 2.44.0 for ROCm 5.2.0 + +##### Added + +* Packages for test and benchmark executables on all supported OSes using CPack. +* Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions. +* Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions. 
+* Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes. +* Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests. + +##### Optimizations + +* Improved performance of non-batched and batched her2 for all sizes and data types. +* Improved performance of non-batched and batched amin for all data types using shuffle reductions. +* Improved performance of non-batched and batched amax for all data types using shuffle reductions. +* Improved performance of trsv for all sizes and data types. + +##### Changed + +* Modifying gemm_ex for HBH (High-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16. +* Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions. +* For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset. +* For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset. +* AMD copyright year for all rocBLAS files. +* For gemv (transpose-case), typecasted the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions. + +##### Fixed + +* For function her2 avoid overflow in offset calculation. +* For trsm when alpha == 0 and on host, allow A to be nullptr. +* Fixed memory access issue in trsv. +* Fixed git pre-commit script to update only AMD copyright year. +* Fixed dgmm, geam test functions to set correct stride values. +* For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose. +* Fixed compilation error for clients-only build. + +##### Removed + +* Remove Navi12 (gfx1011) from fat binary. 
+ +#### rocFFT 1.0.17 + +rocFFT 1.0.17 for ROCm 5.2.0 + +##### Added + +* Packages for test and benchmark executables on all supported OSes using CPack. +* Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. + +##### Changed + +* Improved reuse of twiddle memory between plans. +* Set a default load/store callback when only one callback + type is set via the API for improved performance. + +##### Optimizations + +* Introduced a new access pattern of lds (non-linear) and applied it on + sbcc kernels len 64 to get performance improvement. + +##### Fixed + +* Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers. + +#### rocPRIM 2.10.14 + +rocPRIM 2.10.14 for ROCm 5.2.0 + +##### Added + +* Packages for tests and benchmark executable on all supported OSes using CPack. +* Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers. + +#### rocRAND 2.10.14 + +rocRAND 2.10.14 for ROCm 5.2.0 + +##### Added + +* Backward compatibility for deprecated `#include <rocrand.h>` using wrapper header files. +* Packages for test and benchmark executables on all supported OSes using CPack. + +#### rocSOLVER 3.18.0 + +rocSOLVER 3.18.0 for ROCm 5.2.0 + +##### Added + +* Partial eigenvalue decomposition routines: + * STEBZ + * STEIN +* Package generation for test and benchmark executables on all supported OSes using CPack. +* Added tests for multi-level logging +* Added tests for rocsolver-bench client +* File/Folder Reorg + * Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. + +##### Fixed + +* Fixed compatibility with libfmt 8.1 + +#### rocSPARSE 2.2.0 + +rocSPARSE 2.2.0 for ROCm 5.2.0 + +##### Added + +* batched SpMM for CSR, COO and Blocked ELL formats. +* Packages for test and benchmark executables on all supported OSes using CPack. +* Clients file importers and exporters. 
+ +##### Improved + +* Clients code size reduction. +* Clients error handling. +* Clients benchmarking for performance tracking. + +##### Changed + +* Test adjustments due to roundoff errors. +* Fixed API call compatibility with rocPRIM. + +##### Known Issues + +* none + +#### rocThrust 2.15.0 + +rocThrust 2.15.0 for ROCm 5.2.0 + +##### Added + +* Packages for test and benchmark executables on all supported OSes using CPack. + +#### rocWMMA 0.7 + +rocWMMA 0.7 for ROCm 5.2.0 + +##### Added + +* Added unit tests for DLRM kernels +* Added GEMM sample +* Added DLRM sample +* Added SGEMV sample +* Added unit tests for cooperative wmma load and stores +* Added unit tests for IOBarrier.h +* Added wmma load/store tests for different matrix types (A, B and Accumulator) +* Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest +* Added block sizes 4, 8 to test MmaSynMultiLdsTest +* Added support for wmma load/store layouts with block dimension greater than 64 +* Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types +* Added CI testing for rocWMMA + +##### Changed + +* Renamed wmma to rocwmma in cmake, header files and documentation +* Renamed library files +* Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset) +* Opaque load/store continue to use incremental offsets as they fill the entire block +* Cooperative load/store use cumulative offsets as they fill only small portions of the entire block +* Increased Max split counts to 64 for cooperative load/store +* Moved all the wmma definitions, API headers to rocwmma namespace +* Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator) + +#### Tensile 4.33.0 + +Tensile 4.33.0 for ROCm 5.2.0 + +##### Added + +* TensileUpdateLibrary for updating old library logic files +* Support for TensileRetuneLibrary to use sizes from separate file +* ZGEMM
DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support +* Tests for denorm correctness +* Option to write different architectures to different TensileLibrary files + +##### Optimizations + +* Optimize MessagePackLoadLibraryFile by switching to fread +* DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr + +##### Changed + +* Alpha/beta datatype remains as F32 for HPA HGEMM +* Force assembly kernels to not flush denorms +* Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount + +##### Fixed + +* Fix segmentation fault when run i8 datatype with TENSILE_DB=0x80 + +------------------- + +## ROCm 5.1.3 + + +### Library Changes in ROCM 5.1.3 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.50.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.1.3) | +| hipCUB | [2.11.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.1.3) | +| hipFFT | [1.0.7](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.1.3) | +| hipSOLVER | [1.3.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.1.3) | +| hipSPARSE | [2.1.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.1.3) | +| rccl | [2.11.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.1.3) | +| rocALUTION | [2.0.2](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.1.3) | +| rocBLAS | [2.43.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.1.3) | +| rocFFT | [1.0.16](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.1.3) | +| rocPRIM | [2.10.13](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.1.3) | +| rocRAND | [2.10.13](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.1.3) | +| rocSOLVER | [3.17.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.1.3) | +| rocSPARSE | 
[2.1.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.1.3) | +| rocThrust | [2.14.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.1.3) | +| Tensile | [4.32.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.1.3) | + +------------------- + +## ROCm 5.1.1 + + +### Library Changes in ROCM 5.1.1 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.50.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.1.1) | +| hipCUB | [2.11.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.1.1) | +| hipFFT | [1.0.7](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.1.1) | +| hipSOLVER | [1.3.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.1.1) | +| hipSPARSE | [2.1.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.1.1) | +| rccl | [2.11.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.1.1) | +| rocALUTION | [2.0.2](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.1.1) | +| rocBLAS | [2.43.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.1.1) | +| rocFFT | [1.0.16](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.1.1) | +| rocPRIM | [2.10.13](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.1.1) | +| rocRAND | [2.10.13](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.1.1) | +| rocSOLVER | [3.17.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.1.1) | +| rocSPARSE | [2.1.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.1.1) | +| rocThrust | [2.14.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.1.1) | +| Tensile | [4.32.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.1.1) | + +------------------- + +## ROCm 5.1.0 + + +### What's New in This Release + +#### HIP Enhancements + +The ROCm v5.1 
release consists of the following HIP enhancements. + +##### HIP Installation Guide Updates + +The HIP Installation Guide is updated to include installation and building HIP from source on the AMD and NVIDIA platforms. + +Refer to the HIP Installation Guide v5.1 for more details. + +##### Support for HIP Graph + +ROCm v5.1 extends support for HIP Graph. + +##### Planned Changes for HIP in Future Releases + +###### Separation of hiprtc (libhiprtc) library from hip runtime (amdhip64) + +On ROCm/Linux, to maintain backward compatibility, the hipruntime library (amdhip64) will continue to include hiprtc symbols in future releases. The backward compatible support may be discontinued by removing hiprtc symbols from the hipruntime library (amdhip64) in the next major release. + +###### hipDeviceProp_t Structure Enhancements + +Changes to the hipDeviceProp_t structure in the next major release may result in backward incompatibility. More details on these changes will be provided in subsequent releases. + +#### ROCDebugger Enhancements + +##### Multi-language Source Level Debugger + +The compiler now generates a source-level variable and function argument debug information. + +The accuracy is guaranteed if the compiler options `-g -O0` are used and apply only to HIP. + +This enhancement enables ROCDebugger users to interact with the HIP source-level variables and function arguments. + +> **Note** +> +> The newly-suggested compiler -g option must be used instead of the previously-suggested `-ggdb` option. Although the effect of these two options is currently equivalent, this is not guaranteed for the future and might get changed by the upstream LLVM community. + +##### Machine Interface Lanes Support + +ROCDebugger Machine Interface (MI) extends support to lanes. The following enhancements are made: + +* Added a new -lane-info command, listing the current thread's lanes. 
+ +* The -thread-select command now supports a lane switch to switch to a specific lane of a thread: + + ```sh + -thread-select -l LANE THREAD + ``` + +* The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected. + +* The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop. + +* MI commands now accept a global --lane option, similar to the global --thread and --frame options. + +* MI varobjs are now lane-aware. + +For more information, refer to the ROC Debugger User Guide at +{doc}`ROCgdb `. + +##### Enhanced - clone-inferior Command + +The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY settings are copied from the original inferior to the new one. All modifications to the environment variables done using the 'set environment' or 'unset environment' commands are also copied to the new inferior. + +#### MIOpen Support for RDNA GPUs + +This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below: + +* MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498) +* Fixed a correctness issue with ImplicitGemm algorithm +* Updated the performance data for new kernel versions +* Improved MIOpen build time by splitting large kernel header files +* Fixed an issue in reduction kernels for padded tensors +* Various other bug fixes and performance improvements + +For more information, see {doc}`Documentation `. + +#### Checkpoint Restore Support With CRIU + +The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU and ROCm applications. + +CRIU is a userspace tool to Checkpoint and Restore an application. + +CRIU lacked the support for checkpoint restore applications that used device files such as a GPU. 
With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes: + +* Single and Multi GPU systems (Gfx9) +* Checkpoint / Restore on a different system +* Checkpoint / Restore inside a docker container +* PyTorch +* Tensorflow +* Using CRIU Image Streamer + +For more information, refer to + +> **Note** +> +> The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1 release branch. + +> **Note** +> +> This is a Beta release of the Checkpoint and Restore functionality, and some features are not available in this release. + +For more information, refer to the following websites: + +* +* + +### Fixed Defects + +The following defects are fixed in this release. + +#### Driver Fails To Load after Installation + +The issue with the driver failing to load after ROCm installation is now fixed. + +The driver installs successfully, and the server reboots with working rocminfo and clinfo. + +#### ROCDebugger Fixed Defects + +##### Breakpoints in GPU kernel code Before Kernel Is Loaded + +Previously, setting a breakpoint in device code by line number before the device code was loaded into the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that contains host code. + +Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a location in the kernel. + +##### Registers Invalidated After Write + +Previously, the stale just-written value was presented as a current value. + +ROCgdb now invalidates the cached values of registers whose content might differ after being written. For example, registers with read-only bits. + +ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing VCC invalidates the content of STATUS as STATUS.VCCZ may change. 
+ +##### Scheduler-locking and GPU Wavefronts + +When scheduler-locking is in effect (for example, through the "set scheduler-locking" command), new wavefronts created by a resumed thread, whether a CPU thread or a GPU wavefront, are held in the halt state. + +##### ROCDebugger Fails Before Completion of Kernel Execution + +It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the device, start executing a kernel on the device, and dispose of the original code before the kernel had finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error while trying to access the debug information. + +This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and debug information. + +### Known Issues + +#### Random Memory Access Fault Errors Observed While Running Math Libraries Unit Tests + +**Issue:** Random memory access fault issues are observed while running Math libraries unit tests. This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2. + +Note, the faults only occur in the SRIOV environment. + +**Workaround:** Use SDMA to update the page table. The guest setup steps are as follows: + +```sh +sudo modprobe amdgpu vm_update_mode=0 +``` + +To verify, use + +**Guest:** + +```sh +cat /sys/module/amdgpu/parameters/vm_update_mode +``` + +The expected output is 0. + +#### CU Masking Causes Application to Freeze + +Using CU Masking causes an application to freeze or run exceptionally slowly. This issue is observed only in the GFX10 suite of products. + +This issue is under active investigation at this time. + +#### Failed Checkpoint in Docker Containers + +A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS results in incorrect reporting of the mount ID. + +This issue with Ubuntu causes CRIU checkpointing to fail in Docker containers.
+ +As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-46-generic. + +#### Issue with Restoring Workloads Using Cooperative Groups Feature + +Workloads that use the cooperative groups function to ensure all waves can be resident at the same time may fail to restore correctly. +This issue is under investigation and will be fixed in a future release. + +#### Radeon Pro V620 and W6800 Workstation GPUs + +##### No Support for ROCDebugger on SRIOV + +ROCDebugger is not supported in the SRIOV environment on any GPU. + +This is a known issue and will be fixed in a future release. + +#### Random Error Messages in ROCm SMI for SR-IOV + +Random error messages are generated by unsupported functions or commands. + +This is a known issue and will be fixed in a future release. + +### Library Changes in ROCM 5.1.0 + +| Library | Version | +|---------|---------| +| hipBLAS | 0.49.0 ⇒ [0.50.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.1.0) | +| hipCUB | 2.10.13 ⇒ [2.11.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.1.0) | +| hipFFT | 1.0.4 ⇒ [1.0.7](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.1.0) | +| hipSOLVER | 1.2.0 ⇒ [1.3.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.1.0) | +| hipSPARSE | 2.0.0 ⇒ [2.1.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.1.0) | +| rccl | 2.10.3 ⇒ [2.11.4](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.1.0) | +| rocALUTION | 2.0.1 ⇒ [2.0.2](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.1.0) | +| rocBLAS | 2.42.0 ⇒ [2.43.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.1.0) | +| rocFFT | 1.0.13 ⇒ [1.0.16](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.1.0) | +| rocPRIM | 2.10.12 ⇒ [2.10.13](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.1.0) | +| rocRAND | 2.10.12 ⇒ 
[2.10.13](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.1.0) | +| rocSOLVER | 3.16.0 ⇒ [3.17.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.1.0) | +| rocSPARSE | 2.0.0 ⇒ [2.1.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.1.0) | +| rocThrust | 2.13.0 ⇒ [2.14.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.1.0) | +| Tensile | 4.31.0 ⇒ [4.32.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.1.0) | + +#### hipBLAS 0.50.0 + +hipBLAS 0.50.0 for ROCm 5.1.0 + +##### Added + +* Added library version and device information to hipblas-test output +* Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as + absolute or relative path +* Added --cmake_install command line option to update cmake to minimum version if required +* Added cmake-arg parameter to pass in cmake arguments while building +* Added infrastructure to support readthedocs hipBLAS documentation. + +##### Fixed + +* Added hipblasVersionMinor define. hipblaseVersionMinor remains defined + for backwards compatibility. +* Doxygen warnings in hipblas.h header file. + +##### Changed + +* rocblas-path command line option can be specified as either absolute or relative path +* Help message improvements in install.sh and rmake.py +* Updated googletest dependency from 1.10.0 to 1.11.0 + +#### hipCUB 2.11.0 + +hipCUB 2.11.0 for ROCm 5.1.0 + +##### Added + +* Device segmented sort +* Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB +* Device three way partition + +##### Changed + +* Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type. + * This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output). + * And low-res input with high-res output (e.g. 
float input, double output) + * Block merge sort no longer supports non power of two blocksizes + +#### hipFFT 1.0.7 + +hipFFT 1.0.7 for ROCm 5.1.0 + +##### Changed + +* Use fft_params struct for accuracy and benchmark clients. + +#### hipSOLVER 1.3.0 + +hipSOLVER 1.3.0 for ROCm 5.1.0 + +##### Added + +* Added functions + * gels + * hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize + * hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels +* Added library version and device information to hipsolver-test output. +* Added compatibility API with hipsolverDn prefix. +* Added compatibility-only functions + * gesvdj + * hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize + * hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj + * gesvdjBatched + * hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize + * hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched + * syevj + * hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize + * hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj + * syevjBatched + * hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize + * hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched + * sygvj + * hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize + * hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj + +##### Changed + +* The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place 
when B == X. +* The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd. + +##### Fixed + +* bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null. +* Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER. + +#### hipSPARSE 2.1.0 + +hipSPARSE 2.1.0 for ROCm 5.1.0 + +##### Added + +* Added gtsv_interleaved_batch and gpsv_interleaved_batch routines +* Add SpGEMM_reuse + +##### Changed + +* Changed BUILD_CUDA with USE_CUDA in install script and cmake files +* Update googletest to 11.1 + +##### Improved + +* Fixed a bug in SpMM Alg versioning + +##### Known Issues + +* none + +#### rccl 2.11.4 + +RCCL 2.11.4 for ROCm 5.1.0 + +##### Added + +* Compatibility with NCCL 2.11.4 + +##### Known Issues + +* Managed memory is not currently supported for clique-based kernels + +#### rocALUTION 2.0.2 + +rocALUTION 2.0.2 for ROCm 5.1.0 + +##### Added + +* Added out-of-place matrix transpose functionality +* Added LocalVector<bool> + +#### rocBLAS 2.43.0 + +rocBLAS 2.43.0 for ROCm 5.1.0 + +##### Added + +* Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs) +* Option to install script to build clients without using any Fortran (--clients_no_fortran) +* rocblas_client_initialize function, to perform rocBLAS initialize for clients(benchmark/test) and report the execution time. 
+* Added tests for output of reduction functions when given bad input +* Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench + +##### Optimizations + +* Improved performance of trsm with side == left and n == 1 +* Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32 + +##### Changed + +* For syrkx and trmm internal API use rocblas_stride datatype for offset +* For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match +* Test client dependencies updated to GTest 1.11 +* Moved non-global false positives reported by cppcheck from file-based suppression to inline suppression. File-based suppression will only be used for global false positives. +* Help menu messages in install.sh +* For ger function, typecast the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions. +* Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench + +##### Fixed + +* For function trmv (non-transposed cases) avoid overflow in offset calculation +* Fixed cppcheck errors/warnings +* Fixed doxygen warnings + +#### rocFFT 1.0.16 + +rocFFT 1.0.16 for ROCm 5.1.0 + +##### Changed + +* Supported unaligned tile dimension for SBRC_2D kernels. +* Improved (more RAII) test and benchmark infrastructure. +* Enabled runtime compilation of length-2304 FFT kernel during plan creation. + +##### Optimizations + +* Optimized more large 1D cases by using L1D_CC plan. +* Optimized 3D 200^3 C2R case. +* Optimized 1D 2^30 double precision on MI200. + +##### Fixed + +* Fixed correctness of some R2C transforms with unusual strides. + +##### Removed + +* The hipFFT API (header) has been removed after a long deprecation period.
Please use the [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package/repository to obtain the hipFFT API. + +#### rocPRIM 2.10.13 + +rocPRIM 2.10.13 for ROCm 5.1.0 + +##### Fixed + +* Fixed radix sort int64_t bug introduced in \[2.10.11] + +##### Added + +* Future value +* Added device partition_three_way to partition input to three output iterators based on two predicates + +##### Changed + +* The reduce/scan algorithm precision issues in the tests have been resolved for half types. + +##### Known Issues + +* device_segmented_radix_sort unit test failing for HIP on Windows + +#### rocRAND 2.10.13 + +rocRAND 2.10.13 for ROCm 5.1.0 + +##### Added + +* Generating a random sequence in calls of different sizes now produces the same sequence without gaps, + independent of how many values are generated per call. + * Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64 + * This only holds true if the size in each call is a divisor of the distribution's + `output_width` due to performance + * Similarly the output pointer has to be aligned to `output_width * sizeof(output_type)` + +##### Changed + +* [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND.git) split into a separate package +* Header file installation location changed to match other libraries. + * Using the `rocrand.h` header file should now use `#include <rocrand/rocrand.h>`, rather than `#include <rocrand.h>` +* rocRAND still includes hipRAND using a submodule + * The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it. + +##### Fixed + +* Fix offset behaviour for XORWOW, MRG32K3A and PHILOX4X32_10 generator, setting offset now + correctly generates the same sequence starting from the offset. + * Only uniform int and float will work as these can be generated with a single call to the generator + +##### Known Issues + +* kernel_xorwow unit test is failing for certain GPU architectures.
+ +#### rocSOLVER 3.17.0 + +rocSOLVER 3.17.0 for ROCm 5.1.0 + +##### Optimized + +* Optimized non-pivoting and batch cases of the LU factorization + +##### Fixed + +* Fixed missing synchronization in SYTRF with `rocblas_fill_lower` that could potentially + result in incorrect pivot values. +* Fixed multi-level logging output to file with the `ROCSOLVER_LOG_PATH`, + `ROCSOLVER_LOG_TRACE_PATH`, `ROCSOLVER_LOG_BENCH_PATH` and `ROCSOLVER_LOG_PROFILE_PATH` + environment variables. +* Fixed performance regression in the batched LU factorization of tiny matrices + +#### rocSPARSE 2.1.0 + +rocSPARSE 2.1.0 for ROCm 5.1.0 + +##### Added + +* gtsv_interleaved_batch +* gpsv_interleaved_batch +* SpGEMM_reuse +* Allow copying of mat info struct + +##### Improved + +* Optimization for SDDMM +* Allow unsorted matrices in csrgemm multipass algorithm + +##### Known Issues + +* none + +#### rocThrust 2.14.0 + +rocThrust 2.14.0 for ROCm 5.1.0 + +##### Added + +* Updated to match upstream Thrust 1.15.0 + +##### Known Issues + +* async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows. + +#### Tensile 4.32.0 + +Tensile 4.32.0 for ROCm 5.1.0 + +##### Added + +* Better control of parallelism to control memory usage +* Support for multiprocessing on Windows for TensileCreateLibrary +* New JSD metric and metric selection functionality +* Initial changes to support two-tier solution selection + +##### Optimized + +* Optimized runtime of TensileCreateLibraries by reducing max RAM usage +* StoreCInUnroll additional optimizations plus adaptive K support +* DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support + +##### Changed + +* Update Googletest to 1.11.0 + +##### Removed + +* Remove no longer supported benchmarking steps + +------------------- + +## ROCm 5.0.2 + +### Fixed Defects + +The following defects are fixed in the ROCm v5.0.2 release. 
+ +#### Issue with hostcall Facility in HIP Runtime + +In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the “assert()” call. + +The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release. + +The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the “assert()” call never fails. + +Note: +This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release. +Compatibility Matrix Updates to the [Deep-learning guide](./how-to/deep-learning-rocm.md). + +The compatibility matrix in the [Deep-learning guide](./how-to/deep-learning-rocm.md) is updated for ROCm v5.0.2. 
+ +### Library Changes in ROCM 5.0.2 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.49.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.0.2) | +| hipCUB | [2.10.13](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.0.2) | +| hipFFT | [1.0.4](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.0.2) | +| hipSOLVER | [1.2.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.0.2) | +| hipSPARSE | [2.0.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.0.2) | +| rccl | [2.10.3](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.0.2) | +| rocALUTION | [2.0.1](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.0.2) | +| rocBLAS | [2.42.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.0.2) | +| rocFFT | [1.0.13](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.0.2) | +| rocPRIM | [2.10.12](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.0.2) | +| rocRAND | [2.10.12](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.0.2) | +| rocSOLVER | [3.16.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.0.2) | +| rocSPARSE | [2.0.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.0.2) | +| rocThrust | [2.13.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.0.2) | +| Tensile | [4.31.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.0.2) | + +------------------- + +## ROCm 5.0.1 + +### Deprecations and Warnings + +#### Refactor of HIPCC/HIPCONFIG + +In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target compiler options, target platform, compiler, and runtime appropriately. + +In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as the compiled binary implementations of the hipcc and hipconfig. 
These new binaries are currently a work in progress and are marked as experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to hipcc.pl and hipconfig.pl respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the Perl script and the compiled binary based on the environment variable HIPCC_USE_PERL_SCRIPT. + +In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl scripts. + +The Perl scripts will no longer be available in a future ROCm release. + +### Library Changes in ROCM 5.0.1 + +| Library | Version | +|---------|---------| +| hipBLAS | [0.49.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.0.1) | +| hipCUB | [2.10.13](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.0.1) | +| hipFFT | [1.0.4](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.0.1) | +| hipSOLVER | [1.2.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.0.1) | +| hipSPARSE | [2.0.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.0.1) | +| rccl | [2.10.3](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.0.1) | +| rocALUTION | [2.0.1](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.0.1) | +| rocBLAS | [2.42.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.0.1) | +| rocFFT | [1.0.13](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.0.1) | +| rocPRIM | [2.10.12](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.0.1) | +| rocRAND | [2.10.12](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.0.1) | +| rocSOLVER | [3.16.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.0.1) | +| rocSPARSE |
[2.0.0](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.0.1) | +| rocThrust | [2.13.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.0.1) | +| Tensile | [4.31.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.0.1) | + +------------------- + +## ROCm 5.0.0 + + +### What's New in This Release + +#### HIP Enhancements + +The ROCm v5.0 release consists of the following HIP enhancements. + +##### HIP Installation Guide Updates + +The HIP Installation Guide is updated to include building HIP from source on the NVIDIA platform. + +Refer to the HIP Installation Guide v5.0 for more details. + +##### Managed Memory Allocation + +Managed memory, including the `__managed__` keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed. + +> **Note** +> +> In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example, +> +> ```cpp +> int managed_memory = 0; +> HIPCHECK(hipDeviceGetAttribute(&managed_memory, +> hipDeviceAttributeManagedMemory,p_gpuDevice)); +> if (!managed_memory ) { +> printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice); +> } +> else { +> HIPCHECK(hipSetDevice(p_gpuDevice)); +> HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T))); +> . . . +> } +> ``` + +> **Note** +> +> The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. 
Other managed memory API calls will, then, have + +Refer to the HIP API documentation for more details on managed memory APIs. + +For the application, see + + + +#### New Environment Variable + +The following new environment variable is added in this release: + +| Environment Variable | Value | Description | +|----------------------|-----------------------|-------------| +| HSA_COOP_CU_COUNT | 0 or 1 (default is 0) | Some processors support more CUs than can reliably be used in a cooperative dispatch. Setting the environment variable HSA_COOP_CU_COUNT to 1 will cause ROCr to return the correct CU count for cooperative groups through the HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT attribute of hsa_agent_get_info(). Setting HSA_COOP_CU_COUNT to other values, or leaving it unset, will cause ROCr to return the same CU count for the attributes HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT and HSA_AMD_AGENT_INFO_COMPUTE_UNIT_COUNT. Future ROCm releases will make HSA_COOP_CU_COUNT=1 the default. | + +### Breaking Changes + +#### Runtime Breaking Change + +Re-ordering of the enumerated type in hip_runtime_api.h to better match NV. See below for the difference in enumerated types. + +ROCm software will be affected if any of the defined enums listed below are used in the code. Applications built with ROCm v5.0 enumerated types will work with a ROCm 4.5.2 driver. However, an undefined behavior error will occur with a ROCm v4.5.2 application that uses these enumerated types with a ROCm 5.0 runtime. + +```diff +typedef enum hipDeviceAttribute_t { +- hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block. +- hipDeviceAttributeMaxBlockDimX, ///< Maximum x-dimension of a block. +- hipDeviceAttributeMaxBlockDimY, ///< Maximum y-dimension of a block. +- hipDeviceAttributeMaxBlockDimZ, ///< Maximum z-dimension of a block. +- hipDeviceAttributeMaxGridDimX, ///< Maximum x-dimension of a grid. 
+- hipDeviceAttributeMaxGridDimY, ///< Maximum y-dimension of a grid. +- hipDeviceAttributeMaxGridDimZ, ///< Maximum z-dimension of a grid. +- hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in +- ///< bytes. +- hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes. +- hipDeviceAttributeWarpSize, ///< Warp size in threads. +- hipDeviceAttributeMaxRegistersPerBlock, ///< Maximum number of 32-bit registers available to a +- ///< thread block. This number is shared by all thread +- ///< blocks simultaneously resident on a +- ///< multiprocessor. +- hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz. +- hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz. +- hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits. +- hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device. +- hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in. +- hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 +- ///< cache. +- hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per +- ///< multiprocessor. +- hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number. +- hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number. +- hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels +- ///< concurrently. +- hipDeviceAttributePciBusId, ///< PCI Bus ID. +- hipDeviceAttributePciDeviceId, ///< PCI Device ID. +- hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, ///< Maximum Shared Memory Per +- ///< Multiprocessor. +- hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices. 
+- hipDeviceAttributeIntegrated, ///< iGPU +- hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch +- hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices +- hipDeviceAttributeMaxTexture1DWidth, ///< Maximum number of elements in 1D images +- hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D images in image elements +- hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension height of 2D images in image elements +- hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D images in image elements +- hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimensions height of 3D images in image elements +- hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimensions depth of 3D images in image elements ++ hipDeviceAttributeCudaCompatibleBegin = 0, + +- hipDeviceAttributeHdpMemFlushCntl, ///< Address of the HDP_MEM_COHERENCY_FLUSH_CNTL register +- hipDeviceAttributeHdpRegFlushCntl, ///< Address of the HDP_REG_COHERENCY_FLUSH_CNTL register ++ hipDeviceAttributeEccEnabled = hipDeviceAttributeCudaCompatibleBegin, ///< Whether ECC support is enabled. ++ hipDeviceAttributeAccessPolicyMaxWindowSize, ///< Cuda only. The maximum size of the window policy in bytes. ++ hipDeviceAttributeAsyncEngineCount, ///< Cuda only. Asynchronous engines number. ++ hipDeviceAttributeCanMapHostMemory, ///< Whether host memory can be mapped into device address space ++ hipDeviceAttributeCanUseHostPointerForRegisteredMem,///< Cuda only. Device can access host registered memory ++ ///< at the same virtual address as the CPU ++ hipDeviceAttributeClockRate, ///< Peak clock frequency in kilohertz. ++ hipDeviceAttributeComputeMode, ///< Compute mode that device is currently in. ++ hipDeviceAttributeComputePreemptionSupported, ///< Cuda only. Device supports Compute Preemption. ++ hipDeviceAttributeConcurrentKernels, ///< Device can possibly execute multiple kernels concurrently. 
++ hipDeviceAttributeConcurrentManagedAccess, ///< Device can coherently access managed memory concurrently with the CPU ++ hipDeviceAttributeCooperativeLaunch, ///< Support cooperative launch ++ hipDeviceAttributeCooperativeMultiDeviceLaunch, ///< Support cooperative launch on multiple devices ++ hipDeviceAttributeDeviceOverlap, ///< Cuda only. Device can concurrently copy memory and execute a kernel. ++ ///< Deprecated. Use instead asyncEngineCount. ++ hipDeviceAttributeDirectManagedMemAccessFromHost, ///< Host can directly access managed memory on ++ ///< the device without migration ++ hipDeviceAttributeGlobalL1CacheSupported, ///< Cuda only. Device supports caching globals in L1 ++ hipDeviceAttributeHostNativeAtomicSupported, ///< Cuda only. Link between the device and the host supports native atomic operations ++ hipDeviceAttributeIntegrated, ///< Device is integrated GPU ++ hipDeviceAttributeIsMultiGpuBoard, ///< Multiple GPU devices. ++ hipDeviceAttributeKernelExecTimeout, ///< Run time limit for kernels executed on the device ++ hipDeviceAttributeL2CacheSize, ///< Size of L2 cache in bytes. 0 if the device doesn't have L2 cache. ++ hipDeviceAttributeLocalL1CacheSupported, ///< caching locals in L1 is supported ++ hipDeviceAttributeLuid, ///< Cuda only. 8-byte locally unique identifier in 8 bytes. Undefined on TCC and non-Windows platforms ++ hipDeviceAttributeLuidDeviceNodeMask, ///< Cuda only. Luid device node mask. Undefined on TCC and non-Windows platforms ++ hipDeviceAttributeComputeCapabilityMajor, ///< Major compute capability version number. ++ hipDeviceAttributeManagedMemory, ///< Device supports allocating managed memory on this system ++ hipDeviceAttributeMaxBlocksPerMultiProcessor, ///< Cuda only. Max block size per multiprocessor ++ hipDeviceAttributeMaxBlockDimX, ///< Max block size in width. ++ hipDeviceAttributeMaxBlockDimY, ///< Max block size in height. ++ hipDeviceAttributeMaxBlockDimZ, ///< Max block size in depth. 
++ hipDeviceAttributeMaxGridDimX, ///< Max grid size in width. ++ hipDeviceAttributeMaxGridDimY, ///< Max grid size in height. ++ hipDeviceAttributeMaxGridDimZ, ///< Max grid size in depth. ++ hipDeviceAttributeMaxSurface1D, ///< Maximum size of 1D surface. ++ hipDeviceAttributeMaxSurface1DLayered, ///< Cuda only. Maximum dimensions of 1D layered surface. ++ hipDeviceAttributeMaxSurface2D, ///< Maximum dimension (width, height) of 2D surface. ++ hipDeviceAttributeMaxSurface2DLayered, ///< Cuda only. Maximum dimensions of 2D layered surface. ++ hipDeviceAttributeMaxSurface3D, ///< Maximum dimension (width, height, depth) of 3D surface. ++ hipDeviceAttributeMaxSurfaceCubemap, ///< Cuda only. Maximum dimensions of Cubemap surface. ++ hipDeviceAttributeMaxSurfaceCubemapLayered, ///< Cuda only. Maximum dimension of Cubemap layered surface. ++ hipDeviceAttributeMaxTexture1DWidth, ///< Maximum size of 1D texture. ++ hipDeviceAttributeMaxTexture1DLayered, ///< Cuda only. Maximum dimensions of 1D layered texture. ++ hipDeviceAttributeMaxTexture1DLinear, ///< Maximum number of elements allocatable in a 1D linear texture. ++ ///< Use cudaDeviceGetTexture1DLinearMaxWidth() instead on Cuda. ++ hipDeviceAttributeMaxTexture1DMipmap, ///< Cuda only. Maximum size of 1D mipmapped texture. ++ hipDeviceAttributeMaxTexture2DWidth, ///< Maximum dimension width of 2D texture. ++ hipDeviceAttributeMaxTexture2DHeight, ///< Maximum dimension hight of 2D texture. ++ hipDeviceAttributeMaxTexture2DGather, ///< Cuda only. Maximum dimensions of 2D texture if gather operations performed. ++ hipDeviceAttributeMaxTexture2DLayered, ///< Cuda only. Maximum dimensions of 2D layered texture. ++ hipDeviceAttributeMaxTexture2DLinear, ///< Cuda only. Maximum dimensions (width, height, pitch) of 2D textures bound to pitched memory. ++ hipDeviceAttributeMaxTexture2DMipmap, ///< Cuda only. Maximum dimensions of 2D mipmapped texture. 
++ hipDeviceAttributeMaxTexture3DWidth, ///< Maximum dimension width of 3D texture. ++ hipDeviceAttributeMaxTexture3DHeight, ///< Maximum dimension height of 3D texture. ++ hipDeviceAttributeMaxTexture3DDepth, ///< Maximum dimension depth of 3D texture. ++ hipDeviceAttributeMaxTexture3DAlt, ///< Cuda only. Maximum dimensions of alternate 3D texture. ++ hipDeviceAttributeMaxTextureCubemap, ///< Cuda only. Maximum dimensions of Cubemap texture ++ hipDeviceAttributeMaxTextureCubemapLayered, ///< Cuda only. Maximum dimensions of Cubemap layered texture. ++ hipDeviceAttributeMaxThreadsDim, ///< Maximum dimension of a block ++ hipDeviceAttributeMaxThreadsPerBlock, ///< Maximum number of threads per block. ++ hipDeviceAttributeMaxThreadsPerMultiProcessor, ///< Maximum resident threads per multiprocessor. ++ hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies ++ hipDeviceAttributeMemoryBusWidth, ///< Global memory bus width in bits. ++ hipDeviceAttributeMemoryClockRate, ///< Peak memory clock frequency in kilohertz. ++ hipDeviceAttributeComputeCapabilityMinor, ///< Minor compute capability version number. ++ hipDeviceAttributeMultiGpuBoardGroupID, ///< Cuda only. Unique ID of device group on the same multi-GPU board ++ hipDeviceAttributeMultiprocessorCount, ///< Number of multiprocessors on the device. ++ hipDeviceAttributeName, ///< Device name. ++ hipDeviceAttributePageableMemoryAccess, ///< Device supports coherently accessing pageable memory ++ ///< without calling hipHostRegister on it ++ hipDeviceAttributePageableMemoryAccessUsesHostPageTables, ///< Device accesses pageable memory via the host's page tables ++ hipDeviceAttributePciBusId, ///< PCI Bus ID. ++ hipDeviceAttributePciDeviceId, ///< PCI Device ID. ++ hipDeviceAttributePciDomainID, ///< PCI Domain ID. ++ hipDeviceAttributePersistingL2CacheMaxSize, ///< Cuda11 only. 
Maximum l2 persisting lines capacity in bytes ++ hipDeviceAttributeMaxRegistersPerBlock, ///< 32-bit registers available to a thread block. This number is shared ++ ///< by all thread blocks simultaneously resident on a multiprocessor. ++ hipDeviceAttributeMaxRegistersPerMultiprocessor, ///< 32-bit registers available per block. ++ hipDeviceAttributeReservedSharedMemPerBlock, ///< Cuda11 only. Shared memory reserved by CUDA driver per block. ++ hipDeviceAttributeMaxSharedMemoryPerBlock, ///< Maximum shared memory available per block in bytes. ++ hipDeviceAttributeSharedMemPerBlockOptin, ///< Cuda only. Maximum shared memory per block usable by special opt in. ++ hipDeviceAttributeSharedMemPerMultiprocessor, ///< Cuda only. Shared memory available per multiprocessor. ++ hipDeviceAttributeSingleToDoublePrecisionPerfRatio, ///< Cuda only. Performance ratio of single precision to double precision. ++ hipDeviceAttributeStreamPrioritiesSupported, ///< Cuda only. Whether to support stream priorities. ++ hipDeviceAttributeSurfaceAlignment, ///< Cuda only. Alignment requirement for surfaces ++ hipDeviceAttributeTccDriver, ///< Cuda only. Whether device is a Tesla device using TCC driver ++ hipDeviceAttributeTextureAlignment, ///< Alignment requirement for textures ++ hipDeviceAttributeTexturePitchAlignment, ///< Pitch alignment requirement for 2D texture references bound to pitched memory; ++ hipDeviceAttributeTotalConstantMemory, ///< Constant memory size in bytes. ++ hipDeviceAttributeTotalGlobalMem, ///< Global memory available on devicice. ++ hipDeviceAttributeUnifiedAddressing, ///< Cuda only. An unified address space shared with the host. ++ hipDeviceAttributeUuid, ///< Cuda only. Unique ID in 16 byte. ++ hipDeviceAttributeWarpSize, ///< Warp size in threads. + +- hipDeviceAttributeMaxPitch, ///< Maximum pitch in bytes allowed by memory copies +- hipDeviceAttributeTextureAlignment, /// + + + + + + + + + + + +The release notes for the ROCm platform. 
+ +------------------- + +## ROCm 5.6.1 + + + +### What's New in This Release + +ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime. + +## HIP 5.6.1 (for ROCm 5.6.1) + +### Fixed Defects + +* *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host +* Enabled xnack+ check in HIP catch2 tests hang when executing tests +* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs +* Using *hipGraphAddMemFreeNode* no longer results in a crash diff --git a/docs/about/release/release_notes.md b/docs/about/release/release_notes.md deleted file mode 100644 index 0830b04d1..000000000 --- a/docs/about/release/release_notes.md +++ /dev/null @@ -1,583 +0,0 @@ -# Release Notes - - - - - - - - - - - - - -The release notes for the ROCm platform. - -------------------- - -## ROCm 5.6.0 - - - -#### Release Highlights - -ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base. A few examples include: - -- New documentation portal at https://rocm.docs.amd.com -- Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite -- OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements -- Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers -- New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER. - -#### OS and GPU Support Changes - -- SLES15 SP5 support was added this release. SLES15 SP3 support was dropped. -- AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date. 
- - No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7 - - Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance [EOM])(will be aligned with the closest ROCm release) - - Bug fixes during the maintenance will be made to the next ROCm point release - - Bug fixes will not be back ported to older ROCm releases for this SKU - - Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM. - -#### AMDSMI CLI 23.0.0.4 - -##### Added - -- AMDSMI CLI tool enabled for Linux Bare Metal & Guest - -- Package: amd-smi-lib - -##### Known Issues - -- not all Error Correction Code (ECC) fields are currently supported - -- RHEL 8 & SLES 15 have extra install steps - -#### Kernel Modules (DKMS) - -##### Fixes - -- Stability fix for multi GPU system reproducilble via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198). - -#### HIP 5.6 (For ROCm 5.6) - -##### Optimizations - -- Consolidation of hipamd, rocclr and OpenCL projects in clr -- Optimized lock for graph global capture mode - -##### Added - -- Added hipRTC support for amd_hip_fp16 -- Added hipStreamGetDevice implementation to get the device associated with the stream -- Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats -- hipArrayGetInfo for getting information about the specified array -- hipArrayGetDescriptor for getting 1D or 2D array descriptor -- hipArray3DGetDescriptor to get 3D array descriptor - -##### Changed - -- hipMallocAsync to return success for zero size allocation to match hipMalloc -- Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package -- Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. 
Instructions are updated to build HIP from sources in the HIP Installation guide -- Removed hipBusBandwidth and hipCommander samples from hip-tests - -##### Fixed - -- Fixed regression in hipMemCpyParam3D when offset is applied - -##### Known Issues - -- Limited testing on xnack+ configuration - - Multiple HIP tests failures (gpuvm fault or hangs) -- hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU -- Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release - -##### Upcoming changes in future release - -- Removal of gcnarch from hipDeviceProp_t structure -- Addition of new fields in hipDeviceProp_t structure - - maxTexture1D - - maxTexture2D - - maxTexture1DLayered - - maxTexture2DLayered - - sharedMemPerMultiprocessor - - deviceOverlap - - asyncEngineCount - - surfaceAlignment - - unifiedAddressing - - computePreemptionSupported - - uuid -- Removal of deprecated code - - hip-hcc codes from hip code tree -- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA -- HIPMEMCPY_3D fields correction (unsigned int -> size_t) -- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' - -#### ROCgdb-13 (For ROCm 5.6.0) - -##### Optimized - -- Improved performances when handling the end of a process with a large number of threads. - -Known Issues - -- On certain configurations, ROCgdb can show the following warning message: - - `warning: Probes-based dynamic linker interface failed. Reverting to original interface.` - - This does not affect ROCgdb's functionalities. - -#### ROCprofiler (For ROCm 5.6.0) - -In ROCm 5.6 the `rocprofilerv1` and `rocprofilerv2` include and library files of -ROCm 5.5 are split into separate files. The `rocmtools` files that were -deprecated in ROCm 5.5 have been removed. 
- - | ROCm 5.6 | rocprofilerv1 | rocprofilerv2 | - |-----------------|-------------------------------------|----------------------------------------| - | **Tool script** | `bin/rocprof` | `bin/rocprofv2` | - | **API include** | `include/rocprofiler/rocprofiler.h` | `include/rocprofiler/v2/rocprofiler.h` | - | **API library** | `lib/librocprofiler.so.1` | `lib/librocprofiler.so.2` | - -The ROCm Profiler Tool that uses `rocprofilerV1` can be invoked using the -following command: - -```sh -$ rocprof … -``` - -To write a custom tool based on the `rocprofilerV1` API do the following: - -```C -main.c: -#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API -int main() { - // Use the rocprofilerV1 API - return 0; -} -``` - -This can be built in the following manner: - -```sh -$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64 -``` - -The resulting `a.out` will depend on -`/opt/rocm-5.6.0/lib/librocprofiler64.so.1`. - -The ROCm Profiler that uses `rocprofilerV2` API can be invoked using the -following command: - -```sh -$ rocprofv2 … -``` - -To write a custom tool based on the `rocprofilerV2` API do the following: - -```C -main.c: -#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API -int main() { - // Use the rocprofilerV2 API - return 0; -} -``` - -This can be built in the following manner: - -```sh -$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2 -``` - -The resulting `a.out` will depend on -`/opt/rocm-5.6.0/lib/librocprofiler64.so.2`. - -##### Optimized - -- Improved Test Suite - -##### Added - -- 'end_time' need to be disabled in roctx_trace.txt - -##### Fixed - -- rocprof in ROcm/5.4.0 gpu selector broken. -- rocprof in ROCm/5.4.1 fails to generate kernel info. -- rocprof clobbers LD_PRELOAD. 
- -### Library Changes in ROCM 5.6.0 - -| Library | Version | -|---------|---------| -| hipBLAS | ⇒ [1.0.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.6.0) | -| hipCUB | ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.6.0) | -| hipFFT | ⇒ [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.6.0) | -| hipSOLVER | ⇒ [1.8.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.6.0) | -| hipSPARSE | ⇒ [2.3.6](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.6.0) | -| MIOpen | ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.6.0) | -| rccl | ⇒ [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.6.0) | -| rocALUTION | ⇒ [2.1.9](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.6.0) | -| rocBLAS | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.6.0) | -| rocFFT | ⇒ [1.0.23](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.6.0) | -| rocm-cmake | ⇒ [0.9.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.6.0) | -| rocPRIM | ⇒ [2.13.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.6.0) | -| rocRAND | ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.6.0) | -| rocSOLVER | ⇒ [3.22.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.6.0) | -| rocSPARSE | ⇒ [2.5.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.6.0) | -| rocThrust | ⇒ [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.6.0) | -| rocWMMA | ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.6.0) | -| Tensile | ⇒ [4.37.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.6.0) | - -#### hipBLAS 1.0.0 - -hipBLAS 1.0.0 for ROCm 5.6.0 - -##### Changed - -- added const qualifier to hipBLAS functions 
(swap, sbmv, spmv, symv, trsm) where missing - -##### Removed - -- removed support for deprecated hipblasInt8Datatype_t enum -- removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions - -##### Deprecated - -- in-place trmm is deprecated. It will be replaced by trmm which includes both in-place and - out-of-place functionality - -#### hipCUB 2.13.1 - -hipCUB 2.13.1 for ROCm 5.6.0 - -##### Added - -- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. - -##### Changed - -- CUB backend references CUB and Thrust version 1.17.2. -- Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. -- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). - -##### Known Issues - -- `BlockRadixRankMatch` is currently broken under the rocPRIM backend. -- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. - -#### hipFFT 1.0.12 - -hipFFT 1.0.12 for ROCm 5.6.0 - -##### Added - -- Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms. - -##### Changed - -- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. 
- -#### hipSOLVER 1.8.0 - -hipSOLVER 1.8.0 for ROCm 5.6.0 - -##### Added - -- Added compatibility API with hipsolverRf prefix - -#### hipSPARSE 2.3.6 - -hipSPARSE 2.3.6 for ROCm 5.6.0 - -##### Added - -- Added SpGEMM algorithms - -##### Changed - -- For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE - -#### MIOpen 2.19.0 - -MIOpen 2.19.0 for ROCm 5.6.0 - -##### Added - -- ROCm 5.5 support for gfx1101 (Navi32) - -##### Changed - -- Tuning results for MLIR on ROCm 5.5 -- Bumping MLIR commit to 5.5.0 release tag - -##### Fixed - -- Fix 3d convolution Host API bug -- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. - -#### rccl 2.15.5 - -RCCL 2.15.5 for ROCm 5.6.0 - -##### Changed - -- Compatibility with NCCL 2.15.5 -- Unit test executable renamed to rccl-UnitTests - -##### Added - -- HW-topology aware binary tree implementation -- Experimental support for MSCCL -- New unit tests for hipGraph support -- NPKit integration - -##### Fixed - -- rocm-smi ID conversion -- Support for HIP_VISIBLE_DEVICES for unit tests -- Support for p2p transfers to non (HIP) visible devices - -##### Removed - -- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench - -#### rocALUTION 2.1.9 - -rocALUTION 2.1.9 for ROCm 5.6.0 - -##### Improved - -- Fixed synchronization issues in level 1 routines - -#### rocBLAS 3.0.0 - -rocBLAS 3.0.0 for ROCm 5.6.0 - -##### Optimizations - -- Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256. -- Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a. 
- -##### Added - -- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex. - -##### Deprecated - -- trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality -- rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release -- rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release -- rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size() -- rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release - -##### Removed - -- is_complex helper was deprecated and now removed. Use rocblas_is_complex instead. -- The enum truncate_t and the value truncate was deprecated and now removed from. It was replaced by rocblas_truncate_t and rocblas_truncate, respectively. -- rocblas_set_int8_type_for_hipblas was deprecated and is now removed. -- rocblas_get_int8_type_for_hipblas was deprecated and is now removed. - -##### Dependencies - -- build only dependency on python joblib added as used by Tensile build -- fix for cmake install on some OS when performed by install.sh -d --cmake_install - -##### Fixed - -- make trsm offset calculations 64 bit safe - -##### Changed - -- refactor rotg test code - -#### rocFFT 1.0.23 - -rocFFT 1.0.23 for ROCm 5.6.0 - -##### Added - -- Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create. -- Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used. -- Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems. - -##### Changed - -- Replaced std::complex with hipComplex data types for data generator. 
-- FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example). -- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. - -##### Fixed - -- Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure. - -#### rocm-cmake 0.9.0 - -rocm-cmake 0.9.0 for ROCm 5.6.0 - -##### Added - -- Added the option ROCM_HEADER_WRAPPER_WERROR - - Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings. - - Configure-time CMake option sets the default for the C macro. - -#### rocPRIM 2.13.0 - -rocPRIM 2.13.0 for ROCm 5.6.0 - -##### Added - -- New block level `radix_rank` primitive. -- New block level `radix_rank_match` primitive. -- Added a stable block sorting implementation. This be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm. - -##### Changed - -- Improved the performance of `block_radix_sort` and `device_radix_sort`. -- Improved the performance of `device_merge_sort`. -- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). Contributed by: [v01dXYZ](https://github.com/v01dXYZ). - -##### Known Issues - -- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows. -- When `ROCPRIM_DISABLE_LOOKBACK_SCAN` is set, `device_scan` fails for input sizes bigger than `scan_config::size_limit`, which defaults to `std::numeric_limits<unsigned int>::max()`. - -#### rocRAND 2.10.17 - -rocRAND 2.10.17 for ROCm 5.6.0 - -##### Added - -- MT19937 pseudo random number generator based on M. Matsumoto and T. 
Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. -- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. -- experimental HIP-CPU feature -- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". - -##### Changed - -- Python 2.7 is no longer officially supported. - -#### rocSOLVER 3.22.0 - -rocSOLVER 3.22.0 for ROCm 5.6.0 - -##### Added - -- LU refactorization for sparse matrices - - CSRRF_ANALYSIS - - CSRRF_SUMLU - - CSRRF_SPLITLU - - CSRRF_REFACTLU -- Linear system solver for sparse matrices - - CSRRF_SOLVE -- Added type `rocsolver_rfinfo` for use with sparse matrix routines - -##### Optimized - -- Improved the performance of BDSQR and GESVD when singular vectors are requested - -##### Fixed - -- BDSQR and GESVD should no longer hang when the input contains `NaN` or `Inf` - -#### rocSPARSE 2.5.2 - -rocSPARSE 2.5.2 for ROCm 5.6.0 - -##### Improved - -- Fixed a memory leak in csritsv -- Fixed a bug in csrsm and bsrsm - -#### rocThrust 2.18.0 - -rocThrust 2.18.0 for ROCm 5.6.0 - -##### Fixed - -- `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types. - -##### Changed - -- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). 
- -#### rocWMMA 1.1.0 - -rocWMMA 1.1.0 for ROCm 5.6.0 - -##### Added - -- Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp) -- Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation) -- Added performance gemm samples for half, single and double precision -- Added rocWMMA cmake versioning -- Added vectorized support in coordinate transforms -- Included ROCm smi for runtime clock rate detection -- Added fragment transforms for transpose and change data layout - -##### Changed - -- Default to GPU rocBLAS validation against rocWMMA -- Re-enabled int8 gemm tests on gfx9 -- Upgraded to C++17 -- Restructured unit test folder for consistency -- Consolidated rocWMMA samples common code - -#### Tensile 4.37.0 - -Tensile 4.37.0 for ROCm 5.6.0 - -##### Added - -- Added user driven tuning API -- Added decision tree fallback feature -- Added SingleBuffer + AtomicAdd option for GlobalSplitU -- DirectToVgpr support for fp16 and Int8 with TN orientation -- Added new test cases for various functions -- Added SingleBuffer algorithm for ZGEMM/CGEMM -- Added joblib for parallel map calls -- Added support for MFMA + LocalSplitU + DirectToVgprA+B -- Added asmcap check for MIArchVgpr -- Added support for MFMA + LocalSplitU -- Added frequency, power, and temperature data to the output - -##### Optimizations - -- Improved the performance of GlobalSplitU with SingleBuffer algorithm -- Reduced the running time of the extended and pre_checkin tests -- Optimized the Tailloop section of the assembly kernel -- Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch) -- Improved the performance of the second kernel of MultipleBuffer algorithm - -##### Changed - -- Updated custom kernels with 64-bit offsets -- Adapted 64-bit offset arguments for assembly kernels -- Improved temporary register re-use to reduce max sgpr usage -- Removed some restrictions on VectorWidth and DirectToVgpr -- Updated 
the dependency requirements for Tensile -- Changed the range of AssertSummationElementMultiple -- Modified the error messages for more clarity -- Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used -- Removed dummy vgpr for vectorStaticRemainder -- Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder -- Removed qReg parameter from vectorStaticRemainder - -##### Fixed - -- Fixed tmp sgpr allocation to avoid over-writing values (alpha) -- 64-bit offset parameters for post kernels -- Fixed gfx908 CI test failures -- Fixed offset calculation to prevent overflow for large offsets -- Fixed issues when BufferLoad and BufferStore are equal to zero -- Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch -- Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch -- Fixed the memory access error related to StaggerU + large stride -- Fixed ZGEMM 4x4 MatrixInst mismatch -- Fixed DGEMM 4x4 MatrixInst mismatch -- Fixed ASEM + GSU + NoTailLoop opt mismatch -- Fixed AssertSummationElementMultiple + GlobalSplitU issues -- Fixed ASEM + GSU + TailLoop inner unroll diff --git a/docs/whats_new/whats_new.md b/docs/about/whats-new/whats-new.md similarity index 100% rename from docs/whats_new/whats_new.md rename to docs/about/whats-new/whats-new.md diff --git a/docs/rocm_ai/migraphx_optimization.md b/docs/conceptual/ai-migraphx-optimization.md similarity index 94% rename from docs/rocm_ai/migraphx_optimization.md rename to docs/conceptual/ai-migraphx-optimization.md index 8b2777b41..4c75f778f 100644 --- a/docs/rocm_ai/migraphx_optimization.md +++ b/docs/conceptual/ai-migraphx-optimization.md @@ -4,33 +4,27 @@ The following sections cover inferencing and introduces MIGraphX. ## Inference -The inference is where capabilities learned during Deep Learning training are put to work. 
It refers to using a fully trained neural network to make conclusions (predictions) on unseen data that the model has never interacted with before. Deep Learning inferencing is achieved by feeding new data, such as new images, to the network, giving the Deep Neural Network a chance to classify the image. +The inference is where capabilities learned during deep-learning training are put to work. It refers to using a fully trained neural network to make conclusions (predictions) on unseen data that the model has never interacted with before. Deep-learning inferencing is achieved by feeding new data, such as new images, to the network, giving the Deep Neural Network a chance to classify the image. Taking our previous example of MNIST, the DNN can be fed new images of handwritten digit images, allowing the neural network to classify digits. A fully trained DNN should make accurate predictions about what an image represents, and inference cannot happen without training. ## MIGraphX Introduction -MIGraphX is a graph compiler focused on accelerating the Machine Learning inference that can target AMD GPUs and CPUs. MIGraphX accelerates the Machine Learning models by leveraging several graph-level transformations and optimizations. These optimizations include: +MIGraphX is a graph compiler focused on accelerating the machine-learning inference that can target AMD GPUs and CPUs. MIGraphX accelerates the machine-learning models by leveraging several graph-level transformations and optimizations. These optimizations include: -- Operator fusion - -- Arithmetic simplifications - -- Dead-code elimination - -- Common subexpression elimination (CSE) - -- Constant propagation +* Operator fusion +* Arithmetic simplifications +* Dead-code elimination +* Common subexpression elimination (CSE) +* Constant propagation After doing all these transformations, MIGraphX emits code for the AMD GPU by calling to MIOpen or rocBLAS or creating HIP kernels for a particular operator. 
MIGraphX can also target CPUs using DNNL or ZenDNN libraries. MIGraphX provides easy-to-use APIs in C++ and Python to import machine models in ONNX or TensorFlow. Users can compile, save, load, and run these models using MIGraphX's C++ and Python APIs. Internally, MIGraphX parses ONNX or TensorFlow models into internal graph representation where each operator in the model gets mapped to an operator within MIGraphX. Each of these operators defines various attributes such as: -- Number of arguments - -- Type of arguments - -- Shape of arguments +* Number of arguments +* Type of arguments +* Shape of arguments After optimization passes, all these operators get mapped to different kernels on GPUs or CPUs. @@ -54,11 +48,11 @@ The header files and libraries are installed under `/opt/rocm-\`, wher There are two ways to build the MIGraphX sources. -- [Use the ROCm build tool](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-the-rocm-build-tool-rbuild) - This approach uses [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install the prerequisites and build the libraries with just one command. +* [Use the ROCm build tool](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-the-rocm-build-tool-rbuild) - This approach uses [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install the prerequisites and build the libraries with just one command. or -- [Use CMake](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-cmake-to-build-migraphx) - This approach uses a script to install the prerequisites, then uses CMake to build the source. +* [Use CMake](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-cmake-to-build-migraphx) - This approach uses a script to install the prerequisites, then uses CMake to build the source. 
For detailed steps on building from source and installing dependencies, refer to the following `README` file: @@ -329,7 +323,7 @@ To run generated `.mxr` files through `migraphx-driver`, use the following: Alternatively, you can use MIGraphX's C++ or Python API to generate `.mxr` file. -```{figure} ../data/rocm_ai/image018.png +```{figure} ../data/rocm-ai/image018.png :name: image018 :align: center diff --git a/docs/rocm_ai/pytorch_inception.md b/docs/conceptual/ai-pytorch-inception.md similarity index 94% rename from docs/rocm_ai/pytorch_inception.md rename to docs/conceptual/ai-pytorch-inception.md index 6aeb72e7f..ab752c884 100644 --- a/docs/rocm_ai/pytorch_inception.md +++ b/docs/conceptual/ai-pytorch-inception.md @@ -1,8 +1,8 @@ # Inception V3 with PyTorch -## Deep Learning Training +## Deep learning training -Deep Learning models are designed to capture the complexity of the problem and the underlying data. These models are "deep," comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective. +Deep-learning models are designed to capture the complexity of the problem and the underlying data. These models are "deep," comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective. The training data consists of input features in supervised learning, similar to what the learned model is expected to see during the evaluation or inference phase. The target output is also included, which serves to teach the model. A loss metric is defined as part of training that evaluates the model's performance during the training process. @@ -56,7 +56,7 @@ This example is adapted from the PyTorch research hub page on Inception v3[^torc Follow these steps: -1. Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](../tutorials/install/pytorch_install) for setting up a PyTorch environment on ROCm. +1. 
Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](../tutorials/install/pytorch-install) for setting up a PyTorch environment on ROCm. ```dockerfile docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest @@ -146,7 +146,7 @@ The previous section focused on downloading and using the Inception v3 model for Follow these steps: -1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](../tutorials/install/pytorch_install) for setting up a PyTorch environment on ROCm. +1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](../tutorials/install/pytorch-install) for setting up a PyTorch environment on ROCm. ```dockerfile docker pull rocm/pytorch:latest @@ -463,7 +463,7 @@ torch.save(model.state_dict(), "trained_inception_v3.pt") Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in the following image. -```{figure} ../data/rocm_ai/inception_v3.png +```{figure} ../data/rocm-ai/inception-v3.png :name: inception-v3 --- align: center @@ -741,7 +741,7 @@ To understand the code step by step, follow these steps: plt.show() ``` - ```{figure} ../data/rocm_ai/mnist_1.png + ```{figure} ../data/rocm-ai/mnist-1.png --- align: center --- @@ -769,13 +769,13 @@ To understand the code step by step, follow these steps: plt.show() ``` - ```{figure} ../data/rocm_ai/mnist_2.png + ```{figure} ../data/rocm-ai/mnist-2.png --- align: center --- ``` - The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep Learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training. + The basic building block of a neural network is the layer. 
Layers extract representations from the data fed into them. Deep learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training. ```py model = tf.keras.Sequential([ @@ -785,9 +785,9 @@ To understand the code step by step, follow these steps: ]) ``` - - The first layer in this network `tf.keras.layers.Flatten` transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data. + * The first layer in this network `tf.keras.layers.Flatten` transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data. - - After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes. + * After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes. 12. You must add the Loss function, Metrics, and Optimizer at the time of model compilation. 
@@ -797,11 +797,11 @@ To understand the code step by step, follow these steps: metrics=['accuracy']) ``` - - Loss function —This measures how accurate the model is during training when you are looking to minimize this function to "steer" the model in the right direction. + * Loss function —This measures how accurate the model is during training when you are looking to minimize this function to "steer" the model in the right direction. - - Optimizer —This is how the model is updated based on the data it sees and its loss function. + * Optimizer —This is how the model is updated based on the data it sees and its loss function. - - Metrics —This is used to monitor the training and testing steps. + * Metrics —This is used to monitor the training and testing steps. The following example uses accuracy, the fraction of the correctly classified images. @@ -895,7 +895,7 @@ To understand the code step by step, follow these steps: plt.show() ``` - ```{figure} ../data/rocm_ai/mnist_3.png + ```{figure} ../data/rocm-ai/mnist-3.png --- align: center --- @@ -911,7 +911,7 @@ To understand the code step by step, follow these steps: plt.show() ``` - ```{figure} ../data/rocm_ai/mnist_4.png + ```{figure} ../data/rocm-ai/mnist-4.png --- align: center --- @@ -946,7 +946,7 @@ To understand the code step by step, follow these steps: plt.show() ``` - ```{figure} ../data/rocm_ai/mnist_5.png + ```{figure} ../data/rocm-ai/mnist-5.png --- align: center --- @@ -988,7 +988,7 @@ Follow these steps: cache_subdir='') ``` - ```py + ```bash Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz 84131840/84125825 [==============================] – 1s 0us/step 84149932/84125825 [==============================] – 1s 0us/step @@ -1115,7 +1115,7 @@ To prepare the data for training, follow these steps: print("Vectorized review", vectorize_text(first_review, first_label)) ``` - ```{figure} ../data/rocm_ai/TextClassification_3.png + ```{figure} 
../data/rocm-ai/TextClassification-3.png --- align: center --- @@ -1158,7 +1158,7 @@ To prepare the data for training, follow these steps: model.summary() ``` - ```{figure} ../data/rocm_ai/TextClassification_4.png + ```{figure} ../data/rocm-ai/TextClassification-4.png --- align: center --- @@ -1178,7 +1178,7 @@ To prepare the data for training, follow these steps: history = model.fit(train_ds,validation_data=val_ds,epochs=epochs) ``` - ```{figure} ../data/rocm_ai/TextClassification_5.png + ```{figure} ../data/rocm-ai/TextClassification-5.png --- align: center --- @@ -1226,7 +1226,7 @@ To prepare the data for training, follow these steps: The following images illustrate the training and validation loss and the training and validation accuracy. - ```{figure} ../data/rocm_ai/TextClassification_6.png + ```{figure} ../data/rocm-ai/TextClassification-6.png :name: TextClassification6 --- align: center @@ -1234,7 +1234,7 @@ To prepare the data for training, follow these steps: Training and Validation Loss ``` - ```{figure} ../data/rocm_ai/TextClassification_7.png + ```{figure} ../data/rocm-ai/TextClassification-7.png :name: TextClassification7 --- align: center diff --git a/docs/conceptual/cmake_packages.rst b/docs/conceptual/cmake-packages.rst similarity index 99% rename from docs/conceptual/cmake_packages.rst rename to docs/conceptual/cmake-packages.rst index 1dce30b34..29f640bfd 100644 --- a/docs/conceptual/cmake_packages.rst +++ b/docs/conceptual/cmake-packages.rst @@ -20,10 +20,10 @@ Finding Dependencies In short, CMake supports finding dependencies in two ways: -- In Module mode, it consults a file ``Find.cmake`` which tries to +* In Module mode, it consults a file ``Find.cmake`` which tries to find the component in typical install locations and layouts. CMake ships a few dozen such scripts, but users and projects may ship them as well. 
-- In Config mode, it locates a file named ``-config.cmake`` or +* In Config mode, it locates a file named ``-config.cmake`` or ``Config.cmake`` which describes the installed component in all regards needed to consume it. diff --git a/docs/conceptual/compiler_disambiguation.md b/docs/conceptual/compiler-disambiguation.md similarity index 100% rename from docs/conceptual/compiler_disambiguation.md rename to docs/conceptual/compiler-disambiguation.md diff --git a/docs/conceptual/file_reorg.md b/docs/conceptual/file-reorg.md similarity index 98% rename from docs/conceptual/file_reorg.md rename to docs/conceptual/file-reorg.md index 7565f530e..3016b7b35 100644 --- a/docs/conceptual/file_reorg.md +++ b/docs/conceptual/file-reorg.md @@ -88,8 +88,8 @@ from the new location (`/opt/rocm-/include`) as shown in the example below. #include ``` -- Starting at ROCm 5.2 release, the deprecation for backward compatibility wrapper header files is: `#pragma` message announcing `#warning`. -- Starting from ROCm 6.0 (tentatively) backward compatibility for wrapper header files will be removed, and the `#pragma` message will be announcing `#error`. +* Starting at ROCm 5.2 release, the deprecation for backward compatibility wrapper header files is: `#pragma` message announcing `#warning`. +* Starting from ROCm 6.0 (tentatively) backward compatibility for wrapper header files will be removed, and the `#pragma` message will be announcing `#error`. 
### Executable Files diff --git a/docs/conceptual/gpu_arch.md b/docs/conceptual/gpu-arch.md similarity index 54% rename from docs/conceptual/gpu_arch.md rename to docs/conceptual/gpu-arch.md index 812b83672..7a7750772 100644 --- a/docs/conceptual/gpu_arch.md +++ b/docs/conceptual/gpu-arch.md @@ -5,23 +5,25 @@ :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} AMD Instinct MI200 +:::{grid-item-card} +**[AMD Instinct MI250](./gpu-arch/mi250.md)** + Review hardware aspects of the AMD Instinct™ MI250 accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs. -- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) -- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf) -- [Guide](./gpu_arch/mi250.md) +* [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) +* [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf) ::: -:::{grid-item-card} AMD Instinct MI100 +:::{grid-item-card} +**[AMD Instinct MI100](./gpu-arch/mi100.md)** + Review hardware aspects of the AMD Instinct™ MI100 accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs. 
-- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) -- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) -- [Guide](./gpu_arch/mi100.md) +* [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) +* [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) ::: @@ -29,18 +31,18 @@ accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs ## ISA Documentation -- [AMD Instinct MI200/CDNA2 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) -- [AMD Instinct MI100/CDNA1 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) -- [AMD Instinct MI50/Vega 7nm Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/vega-7nm-shader-instruction-set-architecture.pdf) -- [AMD Instinct MI25/Vega Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/vega-shader-instruction-set-architecture.pdf) -- [AMD RDNA3 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf) -- [AMD RDNA2 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf) -- [AMD RDNA Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf) -- [AMD GCN3 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf) +* [AMD Instinct MI200/CDNA2 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) +* [AMD Instinct MI100/CDNA1 Instruction Set 
Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) +* [AMD Instinct MI50/Vega 7nm Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/vega-7nm-shader-instruction-set-architecture.pdf) +* [AMD Instinct MI25/Vega Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/vega-shader-instruction-set-architecture.pdf) +* [AMD RDNA3 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf) +* [AMD RDNA2 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf) +* [AMD RDNA Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf) +* [AMD GCN3 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf) ## White Papers -- [AMD CDNA™ 2 Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf) -- [AMD CDNA Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) -- [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf) -- [AMD RDNA Architecture White Paper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf) +* [AMD CDNA™ 2 Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf) +* [AMD CDNA Architecture White Paper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) +* [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf) +* [AMD RDNA Architecture White Paper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf) diff --git a/docs/conceptual/gpu_arch/mi100.md b/docs/conceptual/gpu-arch/mi100.md similarity index 97% rename from docs/conceptual/gpu_arch/mi100.md rename to 
docs/conceptual/gpu-arch/mi100.md index ddbfabb37..63a8ef08e 100644 --- a/docs/conceptual/gpu_arch/mi100.md +++ b/docs/conceptual/gpu-arch/mi100.md @@ -17,7 +17,7 @@ available to connect the processors plus one PCIe Gen 4 x16 link per processor can attach additional I/O devices such as the host adapters for the network fabric. -```{figure} ../../data/conceptual/gpu_arch/image004.png +```{figure} ../../data/conceptual/gpu-arch/image004.png :name: mi100-arch :alt: Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators. :align: center @@ -43,7 +43,7 @@ computing (HPC) and AI & machine learning (ML) that run on everything from individual servers to the world's largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance. -```{figure} ../../data/conceptual/gpu_arch/image005.png +```{figure} ../../data/conceptual/gpu-arch/image005.png :name: mi100-block :alt: Structure of the AMD Instinct accelerator (MI100 generation). :align: center @@ -72,7 +72,7 @@ instructions of 16 data elements per instruction. This enables the CU to process Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS (`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`). -```{figure} ../../data/conceptual/gpu_arch/image006.png +```{figure} ../../data/conceptual/gpu-arch/image006.png :name: mi100-gcd :alt: Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture. 
:align: center diff --git a/docs/conceptual/gpu_arch/mi200_performance_counters.md b/docs/conceptual/gpu-arch/mi200-performance-counters.md similarity index 100% rename from docs/conceptual/gpu_arch/mi200_performance_counters.md rename to docs/conceptual/gpu-arch/mi200-performance-counters.md diff --git a/docs/conceptual/gpu_arch/mi250.md b/docs/conceptual/gpu-arch/mi250.md similarity index 96% rename from docs/conceptual/gpu_arch/mi250.md rename to docs/conceptual/gpu-arch/mi250.md index 7f1bbeed4..5ba0e3b3f 100644 --- a/docs/conceptual/gpu_arch/mi250.md +++ b/docs/conceptual/gpu-arch/mi250.md @@ -7,7 +7,7 @@ accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs The micro-architecture of the AMD Instinct MI250 accelerators is based on the AMD CDNA 2 architecture that targets compute applications such as HPC, -artificial intelligence (AI), and Machine Learning (ML) and that run on +artificial intelligence (AI), and machine learning (ML) and that run on everything from individual servers to the world’s largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance. @@ -38,7 +38,7 @@ execution units (also called matrix cores), which are geared toward executing matrix operations like matrix-matrix multiplications. For FP64, the peak performance of these units amounts to 90.5 TFLOPS. -```{figure} ../../data/conceptual/gpu_arch/image001.png +```{figure} ../../data/conceptual/gpu-arch/image001.png :name: mi250-gcd :alt: Structure of a single GCD in the AMD Instinct MI250 accelerator. :align: center @@ -97,7 +97,7 @@ is being retired in each clock cycle. The third column lists the theoretical peak performance of the OAM module. The theoretical aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD). 
-```{figure} ../../data/conceptual/gpu_arch/image002.png +```{figure} ../../data/conceptual/gpu-arch/image002.png :name: mi250-perf :alt: Dual-GCD architecture of the AMD Instinct MI250 accelerators.. :align: center @@ -122,7 +122,7 @@ GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch . Note that some platforms may offer an x8 interface to the GCDs, which reduces the available host-to-GPU bandwidth. -```{figure} ../../data/conceptual/gpu_arch/image003.png +```{figure} ../../data/conceptual/gpu-arch/image003.png :name: mi250-block :alt: Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor. :align: center diff --git a/docs/conceptual/gpu_isolation.md b/docs/conceptual/gpu-isolation.md similarity index 100% rename from docs/conceptual/gpu_isolation.md rename to docs/conceptual/gpu-isolation.md diff --git a/docs/conceptual/index.md b/docs/conceptual/index.md index fdd124464..6669cc414 100644 --- a/docs/conceptual/index.md +++ b/docs/conceptual/index.md @@ -3,42 +3,42 @@ :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Compiler Nomencalture -:link: ./compiler_disambiguation -:link-type: doc +:::{grid-item-card} +**[Compiler Nomenclature](./compiler-disambiguation.md)** + ROCm ships multiple compilers of varying origins and purposes. This article disambiguates compiler naming used throughout the documentation. ::: -:::{grid-item-card} Using CMake -:link: ./cmake_packages -:link-type: doc +:::{grid-item-card} +**[Using CMake](./cmake-packages.rst)** + ROCm components ship with 1st party CMake support. This article details how that support works and how to use it. 
::: -:::{grid-item-card} Linux Folder Structure Reorganization -:link: ./file_reorg -:link-type: doc +:::{grid-item-card} +**[Linux Folder Structure Reorganization](./file-reorg.md)** + ROCm™ packages have adopted the Linux foundation file system hierarchy standard to ensure ROCm components follow open source conventions for Linux-based distributions. ::: -:::{grid-item-card} GPU Isolation Techniques -:link: ./gpu_isolation -:link-type: doc +:::{grid-item-card} +**[GPU Isolation Techniques](./gpu-isolation.md)** + Restricting the access of applications to a subset of GPUs, aka isolating GPUs allows users to hide GPU resources from programs. ::: -:::{grid-item-card} GPU Architectures -:link: ./gpu_arch -:link-type: doc +:::{grid-item-card} +**[GPU Architectures](./gpu-arch.md)** + AMD documentation around architectural details from both the CDNA and RDNA product lines. diff --git a/docs/conceptual/using_gpu_sanitizer.md b/docs/conceptual/using-gpu-sanitizer.md similarity index 91% rename from docs/conceptual/using_gpu_sanitizer.md rename to docs/conceptual/using-gpu-sanitizer.md index 83a74640f..7f8ed0dd4 100644 --- a/docs/conceptual/using_gpu_sanitizer.md +++ b/docs/conceptual/using-gpu-sanitizer.md @@ -13,12 +13,12 @@ The address sanitizer process begins by compiling the application of interest wi Recommendations for doing this are: -+ Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as `amdclang++`. -+ Add the following options to the existing compiler and linker options: - + `-fsanitize=address` - enables instrumentation - + `-shared-libsan` - use shared version of runtime - + `-g` - add debug info for improved reporting -+ Explicitly use `xnack+` in the offload architecture option. For example, `--offload-arch=gfx90a:xnack+` +* Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as `amdclang++`. 
+* Add the following options to the existing compiler and linker options: + * `-fsanitize=address` - enables instrumentation + * `-shared-libsan` - use shared version of runtime + * `-g` - add debug info for improved reporting +* Explicitly use `xnack+` in the offload architecture option. For example, `--offload-arch=gfx90a:xnack+` Other architectures are allowed, but their device code will not be instrumented and a warning will be emitted. It is not an error to compile some files without address sanitizer instrumentation, but doing so reduces the ability of the process to detect addressing errors. However, if the main program "`a.out`" does not directly depend on the Address Sanitizer runtime (`libclang_rt.asan-x86_64.so`) after the build completes (check by running `ldd` (List Dynamic Dependencies) or `readelf`), the application will immediately report an error at runtime as described in the next section. @@ -29,9 +29,9 @@ When `-fsanitize=address` is used, the LLVM compiler adds instrumentation code a There are a few options if the compile time becomes unacceptable: -+ Avoid instrumentation of the files which have the worst compile times. This will reduce the effectiveness of the address sanitizer process. -+ Add the option `-fsanitize-recover=address` to the compiles with the worst compile times. This option simplifies the added instrumentation resulting in faster compilation. See below for more information. -+ Disable instrumentation on a per-function basis by adding `__attribute__`((no_sanitize("address"))) to functions found to be responsible for the large compile time. Again, this will reduce the effectiveness of the process. +* Avoid instrumentation of the files which have the worst compile times. This will reduce the effectiveness of the address sanitizer process. +* Add the option `-fsanitize-recover=address` to the compiles with the worst compile times. This option simplifies the added instrumentation resulting in faster compilation. 
See below for more information. +* Disable instrumentation on a per-function basis by adding `__attribute__`((no_sanitize("address"))) to functions found to be responsible for the large compile time. Again, this will reduce the effectiveness of the process. ## Use AMD Supplied Address Sanitizer Instrumented Libraries @@ -47,33 +47,33 @@ When adjusting an application build to add instrumentation, linking against thes Here are a few recommendations to consider before running an address sanitizer instrumented heterogeneous application. -+ Ensure the Linux kernel running on the system has Heterogeneous Memory Management (HMM) support. A kernel version of 5.6 or higher should be sufficient. -+ Ensure XNACK is enabled - + For `gfx90a` (MI-2X0) or `gfx940` (MI-3X0) use environment `HSA_XNACK = 1`. - + For `gfx906` (MI-50) or `gfx908` (MI-100) use environment `HSA_XNACK = 1` but also ensure the amdgpu kernel module is loaded with module argument `noretry=0`. +* Ensure the Linux kernel running on the system has Heterogeneous Memory Management (HMM) support. A kernel version of 5.6 or higher should be sufficient. +* Ensure XNACK is enabled + * For `gfx90a` (MI-2X0) or `gfx940` (MI-3X0) use environment `HSA_XNACK = 1`. + * For `gfx906` (MI-50) or `gfx908` (MI-100) use environment `HSA_XNACK = 1` but also ensure the amdgpu kernel module is loaded with module argument `noretry=0`. This requirement is due to the fact that the XNACK setting for these GPUs is system-wide. -+ Ensure that the application will use the instrumented libraries when it runs. The output from the shell command `ldd ` can be used to see which libraries will be used. +* Ensure that the application will use the instrumented libraries when it runs. The output from the shell command `ldd ` can be used to see which libraries will be used. 
If the instrumented libraries are not listed by `ldd`, the environment variable `LD_LIBRARY_PATH` may need to be adjusted, or in some cases an `RPATH` compiled into the application may need to be changed and the application recompiled. -+ Ensure that the application depends on the address sanitizer runtime. This can be checked by running the command `readelf -d | grep NEEDED` and verifying that shared library: `libclang_rt.asan-x86_64.so` appears in the output. +* Ensure that the application depends on the address sanitizer runtime. This can be checked by running the command `readelf -d | grep NEEDED` and verifying that shared library: `libclang_rt.asan-x86_64.so` appears in the output. If it does not appear, when executed the application will quickly output an address sanitizer error that looks like: ```bash ==3210==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD. ``` -+ Ensure that the application `llvm-symbolizer` can be executed, and that it is located in `/opt/rocm-/llvm/bin`. This executable is not strictly required, but if found is used to translate ("symbolize") a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information). +* Ensure that the application `llvm-symbolizer` can be executed, and that it is located in `/opt/rocm-/llvm/bin`. This executable is not strictly required, but if found is used to translate ("symbolize") a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information). There is an environment variable, `ASAN_OPTIONS` which can be used to adjust the runtime behavior of the ASAN runtime itself. 
There are more than a hundred "flags" that can be adjusted (see an old list at [flags](https://github.com/google/sanitizers/wiki/AddressSanitizerFlags)) but the default settings are correct and should be used in most cases. It must be noted that these options only affect the host ASAN runtime. The device runtime only currently supports the default settings for the few relevant options. There are two `ASAN_OPTION` flags of particular note. -+ `halt_on_error=0/1 default 1`. +* `halt_on_error=0/1 default 1`. This tells the ASAN runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option `-fsanitize-recover=address`. Note that the ROCm optional address sanitizer instrumented libraries are not compiled with this option and if an error is detected within one of them, but halt_on_error is set to 0, more undefined behavior will occur. -+ `detect_leaks=0/1 default 1`. +* `detect_leaks=0/1 default 1`. This option directs the address sanitizer runtime to enable the [Leak Sanitizer](https://clang.llvm.org/docs/LeakSanitizer.html) (LSAN). Unfortunately, for heterogeneous applications, this default will result in significant output from the leak sanitizer when the application exits due to allocations made by the language runtime which are not considered to be to be leaks. This output can be avoided by adding `detect_leaks=0` to the `ASAN_OPTIONS`, or alternatively by producing an LSAN suppression file (syntax described [here](https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer)) and activating it with environment variable `LSAN_OPTIONS=suppressions=/path/to/suppression/file`. When using a suppression file, a suppression report is printed by default. 
The suppression report can be disabled by using the `LSAN_OPTIONS` flag `print_suppressions=0`. ## Runtime Overhead @@ -216,10 +216,10 @@ $ rocgdb ## Known Issues with Using GPU Sanitizer -+ Red zones must have limited size and it is possible for an invalid access to completely miss a red zone and not be detected. +* Red zones must have limited size and it is possible for an invalid access to completely miss a red zone and not be detected. -+ Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows. +* Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows. -+ Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation accesses to "private" or "stack" variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as "local data store" or "LDS") are also not instrumented. +* Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation accesses to "private" or "stack" variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as "local data store" or "LDS") are also not instrumented. -+ It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime. +* It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. 
This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime. diff --git a/docs/conf.py b/docs/conf.py index 108aede40..932d40486 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -9,8 +9,8 @@ import shutil from rocm_docs import ROCmDocs -shutil.copy2('../CONTRIBUTING.md','./contributing.md') -shutil.copy2('../RELEASE.md','./release.md') +shutil.copy2('../CONTRIBUTING.md','./contribute/index.md') +shutil.copy2('../RELEASE.md','./about/release-notes.md') # Keep capitalization due to similar linking on GitHub's markdown preview. shutil.copy2('../CHANGELOG.md','./CHANGELOG.md') @@ -35,37 +35,40 @@ article_pages = [ "date":"2023-07-27" }, - {"file":"tutorials/quick_start/windows", "os":["windows"]}, - {"file":"tutorials/quick_start/linux", "os":["linux"]}, + {"file":"tutorials/quick-start/windows", "os":["windows"]}, + {"file":"tutorials/quick-start/linux", "os":["linux"]}, {"file":"tutorials/install/linux/index", "os":["linux"]}, - {"file":"tutorials/install/linux/install_overview", "os":["linux"]}, + {"file":"tutorials/install/linux/install-options", "os":["linux"]}, {"file":"tutorials/install/linux/prerequisites", "os":["linux"]}, {"file":"tutorials/install/docker", "os":["linux"]}, - {"file":"tutorials/install/magma_install", "os":["linux"]}, - {"file":"tutorials/install/pytorch_install", "os":["linux"]}, - {"file":"tutorials/install/tensorflow_install", "os":["linux"]}, + {"file":"tutorials/install/magma-install", "os":["linux"]}, + {"file":"tutorials/install/pytorch-install", "os":["linux"]}, + {"file":"tutorials/install/tensorflow-install", "os":["linux"]}, {"file":"tutorials/install/windows/index", "os":["windows"]}, {"file":"tutorials/install/windows/prerequisites", "os":["windows"]}, 
{"file":"tutorials/install/windows/cli/index", "os":["windows"]}, {"file":"tutorials/install/windows/gui/index", "os":["windows"]}, - {"file":"about/release/linux_support", "os":["linux"]}, - {"file":"about/release/windows_support", "os":["windows"]}, + {"file":"about/compatibility/linux-support", "os":["linux"]}, + {"file":"about/compatibility/windows-support", "os":["windows"]}, - {"file":"about/compatibility/docker_image_support_matrix", "os":["linux"]}, + {"file":"about/compatibility/docker-image-support-matrix", "os":["linux"]}, - {"file":"reference/libraries/gpu_libraries/communication", "os":["linux"]}, - {"file":"reference/compilers_tools/index", "os":["linux"]}, - {"file":"reference/computer_vision", "os":["linux"]}, + {"file":"reference/libraries/gpu-libraries/index", "os":["linux"]}, + {"file":"reference/compilers-tools/index", "os":["linux"]}, + {"file":"reference/index", "os":["linux"]}, - {"file":"how_to/deep_learning_rocm", "os":["linux"]}, - {"file":"how_to/gpu_aware_mpi", "os":["linux"]}, - {"file":"how_to/system_debugging", "os":["linux"]}, + {"file":"how-to/deep-learning-rocm", "os":["linux"]}, + {"file":"how-to/gpu-aware-mpi", "os":["linux"]}, + {"file":"how-to/system-debugging", "os":["linux"]}, + {"file":"how-to/index", "os":["linux", "windows"]}, + + {"file":"rocm-ai", "os":["linux", "windows"]}, + {"file":"rocm-a-z", "os":["linux", "windows"]}, - {"file":"rocm_ai/rocm_ai", "os":["linux"]}, ] external_toc_path = "./sphinx/_toc.yml" diff --git a/docs/contribute/building.md b/docs/contribute/building.md index 2de08bba0..5a1ea2ea6 100644 --- a/docs/contribute/building.md +++ b/docs/contribute/building.md @@ -10,12 +10,12 @@ When opening a PR to the `develop` branch on GitHub, the page corresponding to the PR (`https://github.com/RadeonOpenCompute/ROCm/pull/`) will have a summary at the bottom. This requires the user be logged in to GitHub. -- There, click `Show all checks` and `Details` of the Read the Docs pipeline. 
It +* There, click `Show all checks` and `Details` of the Read the Docs pipeline. It will take you to a URL of the form `https://readthedocs.com/projects/advanced-micro-devices-rocm/builds//` - - The list of commands shown are the exact ones used by CI to produce a render + * The list of commands shown are the exact ones used by CI to produce a render of the documentation. -- There, click on the small blue link `View docs` (which is not the same as the +* There, click on the small blue link `View docs` (which is not the same as the bigger button with the same text). It will take you to the built HTML site with a URL of the form `https://advanced-micro-devices-demo--.com.readthedocs.build/projects/alpha/en//`. @@ -24,7 +24,7 @@ a summary at the bottom. This requires the user be logged in to GitHub. Python versions known to build documentation: -- 3.8 +* 3.8 To build the docs locally using Python Virtual Environment (`venv`), execute the following commands from the project root: @@ -54,9 +54,8 @@ resulting website show up on a locally-served web server. ### Configuring VS Code 1. Install the following extensions: - - - Python `(ms-python.python)` - - Live Server `(ritwickdey.LiveServer)` + * Python `(ms-python.python)` + * Live Server `(ritwickdey.LiveServer)` 2. Add the following entries in `.vscode/settings.json` @@ -69,11 +68,11 @@ resulting website show up on a locally-served web server. ``` The settings above are used for the following reasons: - - `liveServer.settings.root`: Sets the root of the output website for live previews. Must be changed + * `liveServer.settings.root`: Sets the root of the output website for live previews. Must be changed alongside the `tasks.json` command. - - `liveServer.settings.wait`: Tells live server to wait with the update to give time for Sphinx to + * `liveServer.settings.wait`: Tells live server to wait with the update to give time for Sphinx to regenerate site contents and not refresh before all is done. 
(Empirical value) - - `python.terminal.activateEnvInCurrentTerminal`: Automatic virtual environment activation is a nice touch, + * `python.terminal.activateEnvInCurrentTerminal`: Automatic virtual environment activation is a nice touch, should you want to build the site from the integrated terminal. 3. Add the following tasks in `.vscode/tasks.json` @@ -145,21 +144,21 @@ resulting website show up on a locally-served web server. 4. Configure Python virtual environment (`venv`) - - From the Command Palette, run `Python: Create Environment` - - Select `venv` environment and the `docs/sphinx/requirements.txt` file. + * From the Command Palette, run `Python: Create Environment` + * Select `venv` environment and the `docs/sphinx/requirements.txt` file. _(Simply pressing enter while hovering over the file from the drop down is insufficient, one has to select the radio button with the 'Space' key if using the keyboard.)_ 5. Build the docs - - Launch the default build Task using either: - - a hotkey _(default is `Ctrl+Shift+B`)_ or - - by issuing the `Tasks: Run Build Task` from the Command Palette. + * Launch the default build Task using either: + * a hotkey _(default is `Ctrl+Shift+B`)_ or + * by issuing the `Tasks: Run Build Task` from the Command Palette. 6. Open the live preview - - Navigate to the output of the site within VS Code, right-click on + * Navigate to the output of the site within VS Code, right-click on `.vscode/build/html/index.html` and select `Open with Live Server`. The contents should update on every rebuild without having to refresh the browser. diff --git a/docs/contribute/index.md b/docs/contribute/index.md index 869dfb3bd..84426f8ae 100644 --- a/docs/contribute/index.md +++ b/docs/contribute/index.md @@ -35,16 +35,16 @@ guide on writing and formatting on GitHub as a starting point. ROCm documentation adds additional requirements to Markdown and RST based files as follows: -- Level one headers are only used for page titles. 
There must be only one level +* Level one headers are only used for page titles. There must be only one level 1 header per file for both Markdown and Restructured Text. -- Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via +* Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via our automated GitHub action on a Pull Request (PR). See the {doc}`rocm-docs-core linting user guide ` for more details. ## Filenames and folder structure -Please use snake case (all lower case letters and underscores instead of spaces) -for file names. For example, `example_file_name.md`. +Please use kebab-case (all lower case letters and dashes instead of spaces) +for file names. For example, `example-file-name.md`. Our documentation follows Pitchfork for folder structure. All documentation is in `/docs` except for special files like the contributing guide in the `/` folder. All images used in the documentation are @@ -52,8 +52,8 @@ placed in the `/docs/data` folder. ## Language and Style -Adopt Microsoft C++ docs guidelines for -[Voice and tone](https://github.com/MicrosoftDocs/cpp-docs/blob/main/styleguide/voice-tone.md). +Adopt Microsoft CPP-Docs guidelines for +[Voice and Tone](https://github.com/MicrosoftDocs/cpp-docs/blob/main/styleguide/voice-tone.md). ROCm documentation templates to be made public shortly. ROCm templates dictate the recommended structure and flow of the documentation. Guidelines on how to @@ -69,5 +69,3 @@ Raise issues in `rocm-docs-core` for any formatting concerns and changes request For more topics, such as submitting feedback and ways to build documentation, see the [Contributing Section](https://rocm.docs.amd.com/en/latest/contributing.html) at [rocm.docs.amd.com](https://rocm.docs.amd.com) - -To learn more about how our documentation is built, refer to the [ROCm toolchain](toolchain.md). 
diff --git a/docs/contribute/toolchain.md b/docs/contribute/toolchain.md index e3a7c1660..5ec08e4a2 100644 --- a/docs/contribute/toolchain.md +++ b/docs/contribute/toolchain.md @@ -9,7 +9,7 @@ project that applies customization for our documentation. This project is the tool most ROCm repositories use as part of the documentation build. It is also available as a [pip package on PyPI](https://pypi.org/project/rocm-docs-core/). -See the user and developer guides for rocm-docs-core at {doc}`rocm-docs-core documentation `. +See the user and developer guides for rocm-docs-core at {doc}`rocm-docs-core documentation`. ## Sphinx diff --git a/docs/data/amd_logo.png b/docs/data/amd-logo.png similarity index 100% rename from docs/data/amd_logo.png rename to docs/data/amd-logo.png diff --git a/docs/data/conceptual/gpu_arch/image001.png b/docs/data/conceptual/gpu-arch/image001.png similarity index 100% rename from docs/data/conceptual/gpu_arch/image001.png rename to docs/data/conceptual/gpu-arch/image001.png diff --git a/docs/data/conceptual/gpu_arch/image002.png b/docs/data/conceptual/gpu-arch/image002.png similarity index 100% rename from docs/data/conceptual/gpu_arch/image002.png rename to docs/data/conceptual/gpu-arch/image002.png diff --git a/docs/data/conceptual/gpu_arch/image003.png b/docs/data/conceptual/gpu-arch/image003.png similarity index 100% rename from docs/data/conceptual/gpu_arch/image003.png rename to docs/data/conceptual/gpu-arch/image003.png diff --git a/docs/data/conceptual/gpu_arch/image004.png b/docs/data/conceptual/gpu-arch/image004.png similarity index 100% rename from docs/data/conceptual/gpu_arch/image004.png rename to docs/data/conceptual/gpu-arch/image004.png diff --git a/docs/data/conceptual/gpu_arch/image005.png b/docs/data/conceptual/gpu-arch/image005.png similarity index 100% rename from docs/data/conceptual/gpu_arch/image005.png rename to docs/data/conceptual/gpu-arch/image005.png diff --git a/docs/data/conceptual/gpu_arch/image006.png 
b/docs/data/conceptual/gpu-arch/image006.png similarity index 100% rename from docs/data/conceptual/gpu_arch/image006.png rename to docs/data/conceptual/gpu-arch/image006.png diff --git a/docs/data/how_to/gpu_enabled_mpi_1.png b/docs/data/how-to/gpu-enabled-mpi-1.png similarity index 100% rename from docs/data/how_to/gpu_enabled_mpi_1.png rename to docs/data/how-to/gpu-enabled-mpi-1.png diff --git a/docs/data/how_to/tuning_guides/image010.png b/docs/data/how-to/tuning-guides/image010.png similarity index 100% rename from docs/data/how_to/tuning_guides/image010.png rename to docs/data/how-to/tuning-guides/image010.png diff --git a/docs/data/how_to/tuning_guides/image011.png b/docs/data/how-to/tuning-guides/image011.png similarity index 100% rename from docs/data/how_to/tuning_guides/image011.png rename to docs/data/how-to/tuning-guides/image011.png diff --git a/docs/data/how_to/tuning_guides/image012.png b/docs/data/how-to/tuning-guides/image012.png similarity index 100% rename from docs/data/how_to/tuning_guides/image012.png rename to docs/data/how-to/tuning-guides/image012.png diff --git a/docs/data/how_to/tuning_guides/image013.png b/docs/data/how-to/tuning-guides/image013.png similarity index 100% rename from docs/data/how_to/tuning_guides/image013.png rename to docs/data/how-to/tuning-guides/image013.png diff --git a/docs/data/how_to/tuning_guides/image014.png b/docs/data/how-to/tuning-guides/image014.png similarity index 100% rename from docs/data/how_to/tuning_guides/image014.png rename to docs/data/how-to/tuning-guides/image014.png diff --git a/docs/data/how_to/tuning_guides/image015.png b/docs/data/how-to/tuning-guides/image015.png similarity index 100% rename from docs/data/how_to/tuning_guides/image015.png rename to docs/data/how-to/tuning-guides/image015.png diff --git a/docs/data/how_to/tuning_guides/image016.png b/docs/data/how-to/tuning-guides/image016.png similarity index 100% rename from docs/data/how_to/tuning_guides/image016.png rename to 
docs/data/how-to/tuning-guides/image016.png diff --git a/docs/data/how_to/tuning_guides/tuning001.png b/docs/data/how-to/tuning-guides/tuning001.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning001.png rename to docs/data/how-to/tuning-guides/tuning001.png diff --git a/docs/data/how_to/tuning_guides/tuning002.png b/docs/data/how-to/tuning-guides/tuning002.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning002.png rename to docs/data/how-to/tuning-guides/tuning002.png diff --git a/docs/data/how_to/tuning_guides/tuning003.png b/docs/data/how-to/tuning-guides/tuning003.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning003.png rename to docs/data/how-to/tuning-guides/tuning003.png diff --git a/docs/data/how_to/tuning_guides/tuning004.png b/docs/data/how-to/tuning-guides/tuning004.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning004.png rename to docs/data/how-to/tuning-guides/tuning004.png diff --git a/docs/data/how_to/tuning_guides/tuning005.png b/docs/data/how-to/tuning-guides/tuning005.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning005.png rename to docs/data/how-to/tuning-guides/tuning005.png diff --git a/docs/data/how_to/tuning_guides/tuning006.png b/docs/data/how-to/tuning-guides/tuning006.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning006.png rename to docs/data/how-to/tuning-guides/tuning006.png diff --git a/docs/data/how_to/tuning_guides/tuning008.png b/docs/data/how-to/tuning-guides/tuning008.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning008.png rename to docs/data/how-to/tuning-guides/tuning008.png diff --git a/docs/data/how_to/tuning_guides/tuning009.png b/docs/data/how-to/tuning-guides/tuning009.png similarity index 100% rename from docs/data/how_to/tuning_guides/tuning009.png rename to docs/data/how-to/tuning-guides/tuning009.png diff --git 
a/docs/data/reference/openmp/openmp_toolchain.svg b/docs/data/reference/openmp/openmp-toolchain.svg similarity index 100% rename from docs/data/reference/openmp/openmp_toolchain.svg rename to docs/data/reference/openmp/openmp-toolchain.svg diff --git a/docs/data/rocm_ai/TextClassification_3.png b/docs/data/rocm-ai/TextClassification-3.png similarity index 100% rename from docs/data/rocm_ai/TextClassification_3.png rename to docs/data/rocm-ai/TextClassification-3.png diff --git a/docs/data/rocm_ai/TextClassification_4.png b/docs/data/rocm-ai/TextClassification-4.png similarity index 100% rename from docs/data/rocm_ai/TextClassification_4.png rename to docs/data/rocm-ai/TextClassification-4.png diff --git a/docs/data/rocm_ai/TextClassification_5.png b/docs/data/rocm-ai/TextClassification-5.png similarity index 100% rename from docs/data/rocm_ai/TextClassification_5.png rename to docs/data/rocm-ai/TextClassification-5.png diff --git a/docs/data/rocm_ai/TextClassification_6.png b/docs/data/rocm-ai/TextClassification-6.png similarity index 100% rename from docs/data/rocm_ai/TextClassification_6.png rename to docs/data/rocm-ai/TextClassification-6.png diff --git a/docs/data/rocm_ai/TextClassification_7.png b/docs/data/rocm-ai/TextClassification-7.png similarity index 100% rename from docs/data/rocm_ai/TextClassification_7.png rename to docs/data/rocm-ai/TextClassification-7.png diff --git a/docs/data/rocm_ai/image018.png b/docs/data/rocm-ai/image018.png similarity index 100% rename from docs/data/rocm_ai/image018.png rename to docs/data/rocm-ai/image018.png diff --git a/docs/data/rocm_ai/inception_v3.png b/docs/data/rocm-ai/inception-v3.png similarity index 100% rename from docs/data/rocm_ai/inception_v3.png rename to docs/data/rocm-ai/inception-v3.png diff --git a/docs/data/rocm_ai/mnist_1.png b/docs/data/rocm-ai/mnist-1.png similarity index 100% rename from docs/data/rocm_ai/mnist_1.png rename to docs/data/rocm-ai/mnist-1.png diff --git a/docs/data/rocm_ai/mnist_2.png 
b/docs/data/rocm-ai/mnist-2.png similarity index 100% rename from docs/data/rocm_ai/mnist_2.png rename to docs/data/rocm-ai/mnist-2.png diff --git a/docs/data/rocm_ai/mnist_3.png b/docs/data/rocm-ai/mnist-3.png similarity index 100% rename from docs/data/rocm_ai/mnist_3.png rename to docs/data/rocm-ai/mnist-3.png diff --git a/docs/data/rocm_ai/mnist_4.png b/docs/data/rocm-ai/mnist-4.png similarity index 100% rename from docs/data/rocm_ai/mnist_4.png rename to docs/data/rocm-ai/mnist-4.png diff --git a/docs/data/rocm_ai/mnist_5.png b/docs/data/rocm-ai/mnist-5.png similarity index 100% rename from docs/data/rocm_ai/mnist_5.png rename to docs/data/rocm-ai/mnist-5.png diff --git a/docs/data/tutorials/install/magma_install/magma005.png b/docs/data/tutorials/install/magma-install/magma005.png similarity index 100% rename from docs/data/tutorials/install/magma_install/magma005.png rename to docs/data/tutorials/install/magma-install/magma005.png diff --git a/docs/data/tutorials/install/magma_install/magma006.png b/docs/data/tutorials/install/magma-install/magma006.png similarity index 100% rename from docs/data/tutorials/install/magma_install/magma006.png rename to docs/data/tutorials/install/magma-install/magma006.png diff --git a/docs/data/unused_images/_005-deselect-all-windows.png b/docs/data/unused-images/_005-deselect-all-windows.png similarity index 100% rename from docs/data/unused_images/_005-deselect-all-windows.png rename to docs/data/unused-images/_005-deselect-all-windows.png diff --git a/docs/data/unused_images/_006-component-options-sdk-core-windows.png b/docs/data/unused-images/_006-component-options-sdk-core-windows.png similarity index 100% rename from docs/data/unused_images/_006-component-options-sdk-core-windows.png rename to docs/data/unused-images/_006-component-options-sdk-core-windows.png diff --git a/docs/data/unused_images/_007-component-options-libraries-windows.png b/docs/data/unused-images/_007-component-options-libraries-windows.png 
similarity index 100% rename from docs/data/unused_images/_007-component-options-libraries-windows.png rename to docs/data/unused-images/_007-component-options-libraries-windows.png diff --git a/docs/data/unused_images/_008-component-options-rtc-windows.png b/docs/data/unused-images/_008-component-options-rtc-windows.png similarity index 100% rename from docs/data/unused_images/_008-component-options-rtc-windows.png rename to docs/data/unused-images/_008-component-options-rtc-windows.png diff --git a/docs/data/unused_images/_009-component-options-rt-windows.png b/docs/data/unused-images/_009-component-options-rt-windows.png similarity index 100% rename from docs/data/unused_images/_009-component-options-rt-windows.png rename to docs/data/unused-images/_009-component-options-rt-windows.png diff --git a/docs/data/unused_images/_010-component-options-vs-plugin-windows.png b/docs/data/unused-images/_010-component-options-vs-plugin-windows.png similarity index 100% rename from docs/data/unused_images/_010-component-options-vs-plugin-windows.png rename to docs/data/unused-images/_010-component-options-vs-plugin-windows.png diff --git a/docs/data/unused_images/_011-component-options-radeon-software-windows.png b/docs/data/unused-images/_011-component-options-radeon-software-windows.png similarity index 100% rename from docs/data/unused_images/_011-component-options-radeon-software-windows.png rename to docs/data/unused-images/_011-component-options-radeon-software-windows.png diff --git a/docs/data/unused_images/_Deep Learning Image 1.png b/docs/data/unused-images/_Deep Learning Image 1.png similarity index 100% rename from docs/data/unused_images/_Deep Learning Image 1.png rename to docs/data/unused-images/_Deep Learning Image 1.png diff --git a/docs/data/unused_images/_Install PyTorch using wheels Package.png b/docs/data/unused-images/_Install PyTorch using wheels Package.png similarity index 100% rename from docs/data/unused_images/_Install PyTorch using wheels 
Package.png rename to docs/data/unused-images/_Install PyTorch using wheels Package.png diff --git a/docs/data/unused_images/_Machine Learning.png b/docs/data/unused-images/_Machine Learning.png similarity index 100% rename from docs/data/unused_images/_Machine Learning.png rename to docs/data/unused-images/_Machine Learning.png diff --git a/docs/data/unused_images/_Matrix-1.png b/docs/data/unused-images/_Matrix-1.png similarity index 100% rename from docs/data/unused_images/_Matrix-1.png rename to docs/data/unused-images/_Matrix-1.png diff --git a/docs/data/unused_images/_Matrix-2.png b/docs/data/unused-images/_Matrix-2.png similarity index 100% rename from docs/data/unused_images/_Matrix-2.png rename to docs/data/unused-images/_Matrix-2.png diff --git a/docs/data/unused_images/_Matrix-3.png b/docs/data/unused-images/_Matrix-3.png similarity index 100% rename from docs/data/unused_images/_Matrix-3.png rename to docs/data/unused-images/_Matrix-3.png diff --git a/docs/data/unused_images/_Model In.png b/docs/data/unused-images/_Model In.png similarity index 100% rename from docs/data/unused_images/_Model In.png rename to docs/data/unused-images/_Model In.png diff --git a/docs/data/unused_images/_Pytorch 11.png b/docs/data/unused-images/_Pytorch 11.png similarity index 100% rename from docs/data/unused_images/_Pytorch 11.png rename to docs/data/unused-images/_Pytorch 11.png diff --git a/docs/data/unused_images/_Text Classification 1.png b/docs/data/unused-images/_Text Classification 1.png similarity index 100% rename from docs/data/unused_images/_Text Classification 1.png rename to docs/data/unused-images/_Text Classification 1.png diff --git a/docs/data/unused_images/_Text Classification 2.png b/docs/data/unused-images/_Text Classification 2.png similarity index 100% rename from docs/data/unused_images/_Text Classification 2.png rename to docs/data/unused-images/_Text Classification 2.png diff --git a/docs/data/unused_images/_Text Classification 3.png 
b/docs/data/unused-images/_Text Classification 3.png similarity index 100% rename from docs/data/unused_images/_Text Classification 3.png rename to docs/data/unused-images/_Text Classification 3.png diff --git a/docs/data/unused_images/_Text Classification 4.png b/docs/data/unused-images/_Text Classification 4.png similarity index 100% rename from docs/data/unused_images/_Text Classification 4.png rename to docs/data/unused-images/_Text Classification 4.png diff --git a/docs/data/unused_images/_Text Classification 5.png b/docs/data/unused-images/_Text Classification 5.png similarity index 100% rename from docs/data/unused_images/_Text Classification 5.png rename to docs/data/unused-images/_Text Classification 5.png diff --git a/docs/data/unused_images/_image.007-tuning.png b/docs/data/unused-images/_image.007-tuning.png similarity index 100% rename from docs/data/unused_images/_image.007-tuning.png rename to docs/data/unused-images/_image.007-tuning.png diff --git a/docs/data/unused_images/_mnist 4.png b/docs/data/unused-images/_mnist 4.png similarity index 100% rename from docs/data/unused_images/_mnist 4.png rename to docs/data/unused-images/_mnist 4.png diff --git a/docs/data/unused_images/_mnist 5.png b/docs/data/unused-images/_mnist 5.png similarity index 100% rename from docs/data/unused_images/_mnist 5.png rename to docs/data/unused-images/_mnist 5.png diff --git a/docs/data/unused_images/_with_pytorch.png b/docs/data/unused-images/_with_pytorch.png similarity index 100% rename from docs/data/unused_images/_with_pytorch.png rename to docs/data/unused-images/_with_pytorch.png diff --git a/docs/data/unused_images/_with_tensorflow.png b/docs/data/unused-images/_with_tensorflow.png similarity index 100% rename from docs/data/unused_images/_with_tensorflow.png rename to docs/data/unused-images/_with_tensorflow.png diff --git a/docs/how-to/deep-learning-rocm.md b/docs/how-to/deep-learning-rocm.md new file mode 100644 index 000000000..9d965f795 --- /dev/null +++ 
b/docs/how-to/deep-learning-rocm.md @@ -0,0 +1,20 @@ +# Deep learning guide + +The following sections cover the different framework installations for ROCm and +deep-learning applications. The following image provides +the sequential flow for the use of each framework. Refer to the ROCm Compatible +Frameworks Release Notes for each framework's most current release notes at +[Third party support](../about/compatibility/3rd-party-support-matrix.md). + +```{figure} ../data/tutorials/install/magma-install/magma005.png +:name: rocm-compat-frameworks-chart +:align: center + +ROCm Compatible Frameworks Flowchart +``` + +## Frameworks Installation + +* [How to Install PyTorch?](../tutorials/install/pytorch-install) +* [How to Install Tensorflow?](../tutorials/install/tensorflow-install) +* [How to Install Magma?](../tutorials/install/magma-install) diff --git a/docs/how_to/gpu_aware_mpi.md b/docs/how-to/gpu-aware-mpi.md similarity index 97% rename from docs/how_to/gpu_aware_mpi.md rename to docs/how-to/gpu-aware-mpi.md index 0eecc358c..1bcdd02e5 100644 --- a/docs/how_to/gpu_aware_mpi.md +++ b/docs/how-to/gpu-aware-mpi.md @@ -72,7 +72,7 @@ make -j $(nproc) make -j $(nproc) install ``` -The [communication libraries tables](#communication_libraries) +The [communication libraries tables](#communication-libraries) documents the compatibility of UCX versions with ROCm versions. ## Install Open MPI @@ -148,7 +148,7 @@ larger than 67MB, an effective utilization of about 150GB/sec is achieved, which corresponds to 75% of the peak transfer bandwidth of 200GB/sec for that connection: -:::{figure} ../data/how_to/gpu_enabled_mpi_1.png +:::{figure} ../data/how-to/gpu-enabled-mpi-1.png :name: mpi-bandwidth :alt: OSU execution showing transfer bandwidth increasing alongside payload inc. Inter-GPU bandwidth with various payload sizes. @@ -161,7 +161,7 @@ Unified Collective Communication Library (UCC) component in Open MPI. 
For this, the UCC library has to be configured and compiled with ROCm support. -Please note the compatibility [tables](#communication_libraries) +Please note the compatibility [tables](#communication-libraries) for UCC versions with the various ROCm versions. An example for configuring UCC and Open MPI with ROCm support diff --git a/docs/how-to/index.md b/docs/how-to/index.md new file mode 100644 index 000000000..aa347f67d --- /dev/null +++ b/docs/how-to/index.md @@ -0,0 +1,34 @@ +# All How-To Material + +:::::{grid} 1 1 2 2 +:gutter: 1 + +:::{grid-item-card} +**[Tuning Guides](./tuning-guides/index.md)** + +Use case-specific system setup and tuning guides. + +::: + +:::{grid-item-card} +**[Deep-learning guide](./deep-learning-rocm.md)** + +Installation of various deep learning frameworks and applications. + +::: + +:::{grid-item-card} +**[GPU-Enabled MPI](./gpu-aware-mpi.md)** + +This chapter exemplifies how to set up Open MPI with the ROCm platform. + +::: + +:::{grid-item-card} +**[System Debugging Guide](./system-debugging.md)** + +Useful commands to debug misbehaving ROCm installations. 
+ +::: + +::::: diff --git a/docs/how_to/system_debugging.md b/docs/how-to/system-debugging.md similarity index 71% rename from docs/how_to/system_debugging.md rename to docs/how-to/system-debugging.md index 322639263..071ddd41e 100644 --- a/docs/how_to/system_debugging.md +++ b/docs/how-to/system-debugging.md @@ -6,19 +6,13 @@ Kernel options to avoid: the Ethernet port getting renamed every time you change ## ROCr Error Code -- 2 Invalid Dimension - -- 4 Invalid Group Memory - -- 8 Invalid (or Null) Code - -- 32 Invalid Format - -- 64 Group is too large - -- 128 Out of VGPRs - -- 0x80000000 Debug Options +* 2 Invalid Dimension +* 4 Invalid Group Memory +* 8 Invalid (or Null) Code +* 32 Invalid Format +* 64 Group is too large +* 128 Out of VGPRs +* 0x80000000 Debug Options ## Command to Dump Firmware Version and Get Linux Kernel Version @@ -30,15 +24,15 @@ Kernel options to avoid: the Ethernet port getting renamed every time you change Debug messages when developing/debugging base ROCm driver. You could enable the printing from `libhsakmt.so` by setting an environment variable, `HSAKMT_DEBUG_LEVEL`. Available debug levels are 3-7. The higher level you set, the more messages will print. -- `export HSAKMT_DEBUG_LEVEL=3` : Only pr_err() prints. +* `export HSAKMT_DEBUG_LEVEL=3` : Only pr_err() prints. -- `export HSAKMT_DEBUG_LEVEL=4` : pr_err() and pr_warn() print. +* `export HSAKMT_DEBUG_LEVEL=4` : pr_err() and pr_warn() print. -- `export HSAKMT_DEBUG_LEVEL=5` : We currently do not implement “notice”. Setting to 5 is same as setting to 4. +* `export HSAKMT_DEBUG_LEVEL=5` : We currently do not implement “notice”. Setting to 5 is same as setting to 4. -- `export HSAKMT_DEBUG_LEVEL=6` : pr_err(), pr_warn(), and pr_info print. +* `export HSAKMT_DEBUG_LEVEL=6` : pr_err(), pr_warn(), and pr_info print. -- `export HSAKMT_DEBUG_LEVEL=7` : Everything including pr_debug prints. +* `export HSAKMT_DEBUG_LEVEL=7` : Everything including pr_debug prints. 
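The debug-level list above can be sketched as a small lookup, which is handy when deciding how verbose a run should be. This is only an illustrative sketch: `describe_hsakmt_level` is a hypothetical helper (not part of ROCm), and `./my_hip_app` is a placeholder for your own binary. The level-to-message mapping is taken directly from the list above.

```shell
# Hypothetical helper (not shipped with ROCm): report which message
# classes libhsakmt prints at a given HSAKMT_DEBUG_LEVEL, per the list above.
describe_hsakmt_level() {
  case "$1" in
    3)   echo "pr_err" ;;
    4|5) echo "pr_err pr_warn" ;;   # 5 ("notice") behaves the same as 4
    6)   echo "pr_err pr_warn pr_info" ;;
    7)   echo "pr_err pr_warn pr_info pr_debug" ;;
    *)   echo "unsupported level: $1" >&2; return 1 ;;
  esac
}

# Set the level for the current shell session and show what it enables.
export HSAKMT_DEBUG_LEVEL=6
echo "HSAKMT_DEBUG_LEVEL=$HSAKMT_DEBUG_LEVEL enables: $(describe_hsakmt_level "$HSAKMT_DEBUG_LEVEL")"

# To raise verbosity for a single run only (placeholder binary name):
# HSAKMT_DEBUG_LEVEL=7 ./my_hip_app
```

Setting the variable inline for one command, as in the last commented line, avoids leaving verbose logging enabled for the rest of the session.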
## ROCr Level Environment Variables for Debug diff --git a/docs/how_to/tuning_guides/index.md b/docs/how-to/tuning-guides/index.md similarity index 70% rename from docs/how_to/tuning_guides/index.md rename to docs/how-to/tuning-guides/index.md index af3666418..370d418ed 100644 --- a/docs/how_to/tuning_guides/index.md +++ b/docs/how-to/tuning-guides/index.md @@ -9,10 +9,10 @@ hardware and BIOS configurations for OEM platforms may not provide optimal performance for HPC workloads. To enable optimal HPC settings on a per-platform and per-workload level, this guide calls out: -- BIOS settings that can impact performance -- Hardware configuration best practices -- Supported versions of operating systems -- Workload-specific recommendations for optimal BIOS and operating system +* BIOS settings that can impact performance +* Hardware configuration best practices +* Supported versions of operating systems +* Workload-specific recommendations for optimal BIOS and operating system settings There is also a discussion on the AMD Instinct™ software development @@ -23,11 +23,11 @@ not exhaustively tested across all compilers. 
Prerequisites to understanding this document and to performing tuning of HPC applications include: -- Experience in configuring servers -- Administrative access to the server's Management Interface (BMC) -- Administrative access to the operating system -- Familiarity with the OEM server's BMC (strongly recommended) -- Familiarity with the OS specific tools for configuration, monitoring, and +* Experience in configuring servers +* Administrative access to the server's Management Interface (BMC) +* Administrative access to the operating system +* Familiarity with the OEM server's BMC (strongly recommended) +* Familiarity with the OS specific tools for configuration, monitoring, and troubleshooting (strongly recommended) This document provides guidance on tuning systems with various AMD Instinct™ @@ -46,23 +46,25 @@ their own performance testing for additional tuning. :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} AMD Instinct™ MI200 +:::{grid-item-card} +**[AMD Instinct™ MI200](./mi200)** + This chapter goes through how to configure your AMD Instinct™ MI200 accelerated compute nodes to get the best performance out of them. -- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) -- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf) -- [Guide](./mi200) +* [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) +* [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf) ::: -:::{grid-item-card} AMD Instinct™ MI100 +:::{grid-item-card} +**[AMD Instinct™ MI100](./mi100)** + This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100 accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs. 
-- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) -- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) -- [Guide](./mi100) +* [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) +* [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf) ::: @@ -75,7 +77,7 @@ requirements, a blend of both graphics and compute, certification, stability and the list continues. The document covers specific software requirements and processes needed to use -these GPUs for Single Root I/O Virtualization (SR-IOV) and Machine Learning +these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning (ML). The main purpose of this document is to help users utilize the RDNA 2 GPUs to @@ -84,13 +86,14 @@ their full potential. :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} AMD Radeon™ PRO W6000 and V620 +:::{grid-item-card} +**[AMD Radeon™ PRO W6000 and V620](./w6000-v620)** + This chapter describes the AMD GPUs with RDNA™ 2 architecture, namely AMD Radeon PRO W6800 and AMD Radeon PRO V620 -- [AMD RDNA2 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf) -- [Whitepaper](https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf) -- [Guide](./w6000_v620) +* [AMD RDNA2 Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf) +* [Whitepaper](https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf) ::: diff --git a/docs/how_to/tuning_guides/mi100.md b/docs/how-to/tuning-guides/mi100.md similarity index 90% rename from docs/how_to/tuning_guides/mi100.md rename to docs/how-to/tuning-guides/mi100.md index efab47ed9..8cb141fa8 100644 --- a/docs/how_to/tuning_guides/mi100.md +++ 
b/docs/how-to/tuning-guides/mi100.md @@ -11,14 +11,14 @@ AMD EPYC™ 7003 Series Processors" depending on the processor generation of the system. In addition to the BIOS settings listed below the following settings -({ref}`mi100_bios_settings`) will also have to be enacted via the command line (see -{ref}`mi100_os_settings`): +({ref}`mi100-bios-settings`) will also have to be enacted via the command line (see +{ref}`mi100-os-settings`): -- Core C states -- AMD-PCI-UTIL (on AMD EPYC™ 7002 series processors) -- IOMMU (if needed) +* Core C states +* AMD-PCI-UTIL (on AMD EPYC™ 7002 series processors) +* IOMMU (if needed) -(mi100_bios_settings)= +(mi100-bios-settings)= ### System BIOS Settings @@ -230,7 +230,7 @@ processor (NPS4). For memory bandwidth sensitive applications using MPI, NPS4 is recommended. For applications that are not optimized for NUMA locality, NPS1 is the recommended setting. -(mi100_os_settings)= +(mi100-os-settings)= ### Operating System Settings @@ -238,9 +238,9 @@ NPS1 is the recommended setting. There are several Core-States, or C-states that an AMD EPYC CPU can idle within: -- C0: active. This is the active state while running an application. -- C1: idle -- C2: idle and power gated. This is a deeper sleep state and will have a +* C0: active. This is the active state while running an application. +* C1: idle +* C2: idle and power gated. This is a deeper sleep state and will have a greater latency when moving back to the C0 state, compared to when the CPU is coming out of C1. @@ -336,17 +336,17 @@ If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to "enable", the following steps can be applied to the system to enable all (logical) cores of the system: -- In the server BIOS, set IOMMU to "Enabled". -- When configuring the Grub boot loader, add the following arguments for the +* In the server BIOS, set IOMMU to "Enabled". 
+* When configuring the Grub boot loader, add the following arguments for the Linux kernel: `amd_iommu=on iommu=pt` -- Update Grub to use the modified configuration: +* Update Grub to use the modified configuration: ```shell sudo grub2-mkconfig -o /boot/grub2/grub.cfg ``` -- Reboot the system. -- Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`: +* Reboot the system. +* Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`: ```none [...] @@ -363,8 +363,8 @@ For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to [Installing ROCm on Linux](../../tutorials/install/linux/index). For verifying that the installation was successful, refer to {ref}`verifying-kernel-mode-driver-installation` and -[Validation Tools](../../reference/compilers_tools/validation_tools). Should verification -fail, consult the [System Debugging Guide](../system_debugging). +[Validation Tools](../../reference/compilers-tools/validation-tools). Should verification +fail, consult the [System Debugging Guide](../system-debugging). (mi100-hw-verification)= @@ -375,7 +375,7 @@ the GPU hardware, the `rocm-smi` command is available. It can show available GPUs in the system with their device ID and their respective firmware (or VBIOS) versions: -```{figure} ../../data/how_to/tuning_guides/tuning001.png +```{figure} ../../data/how-to/tuning-guides/tuning001.png :name: mi100-smi-showhw :alt: rocm-smi --showhw output on an 8*MI100 system @@ -385,7 +385,7 @@ versions: Another important query is to show the system structure, the localization of the GPUs in the system, and the fabric connections between the system components: -```{figure} ../../data/how_to/tuning_guides/tuning002.png +```{figure} ../../data/how-to/tuning-guides/tuning002.png :name: mi100-smi-showtopo :alt: rocm-smi --showtopo output on an 8*MI100 system. 
@@ -394,19 +394,19 @@ GPUs in the system, and the fabric connections between the system components: The previous command shows the system structure in four blocks: -- The first block of the output shows the distance between the GPUs similar to +* The first block of the output shows the distance between the GPUs similar to what the `numactl` command outputs for the NUMA domains of a system. The weight is a qualitative measure for the "distance" data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU. -- The second block has a matrix for the number of hops required to send data +* The second block has a matrix for the number of hops required to send data from one GPU to another. For the GPUs in the local hive, this number is one, while for the others it is three (one hop to leave the hive, one hop across the processors, and one hop within the destination hive). -- The third block outputs the link types between the GPUs. This can either be +* The third block outputs the link types between the GPUs. This can either be "XGMI" for AMD Infinity Fabric™ links or "PCIE" for PCIe Gen4 links. -- The fourth block reveals the localization of a GPU with respect to the NUMA +* The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC™ processors. To query the compute capabilities of the GPU devices, the `rocminfo` command is @@ -414,7 +414,7 @@ available with the AMD ROCm™ platform. It lists specific details about the GPU devices, including but not limited to the number of compute units, width of the SIMD pipelines, memory information, and instruction set architecture: -```{figure} ../../data/how_to/tuning_guides/tuning003.png +```{figure} ../../data/how-to/tuning-guides/tuning003.png :name: mi100-rocminfo :alt: rocminfo output fragment on an 8*MI100 system. 
@@ -422,7 +422,7 @@ SIMD pipelines, memory information, and instruction set architecture: ``` For a complete list of architecture (LLVM target) names, refer to -[Linux Support](../../about/release/linux_support.md) and [Windows Support](../../about/release/windows_support.md). +[Linux Support](../../about/compatibility/linux-support.md) and [Windows Support](../../about/compatibility/windows-support.md). ### Testing Inter-device Bandwidth @@ -468,7 +468,7 @@ Alternatively, the source code can be downloaded and built from The output will list the available compute devices (CPUs and GPUs): -```{figure} ../../data/how_to/tuning_guides/tuning004.png +```{figure} ../../data/how-to/tuning-guides/tuning004.png :name: mi100-bandwidth-test-1 :alt: rocm-bandwidth-test output fragment on an 8*MI100 system listing devices. @@ -479,14 +479,14 @@ The output will also show a matrix that contains a "1" if a device can communicate to another device (CPU and GPU) of the system and it will show the NUMA distance (similar to `rocm-smi`): -```{figure} ../../data/how_to/tuning_guides/tuning005.png +```{figure} ../../data/how-to/tuning-guides/tuning005.png :name: mi100-bandwidth-test-2 :alt: rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device access matrix. `rocm-bandwidth-test` output fragment on an 8*MI100 system showing inter-device access matrix. ``` -```{figure} ../../data/how_to/tuning_guides/tuning006.png +```{figure} ../../data/how-to/tuning-guides/tuning006.png :name: mi100-bandwidth-test-3 :alt: rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device NUMA distance. 
@@ -496,7 +496,7 @@ NUMA distance (similar to `rocm-smi`): The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU): -```{figure} ../../data/how_to/tuning_guides/tuning004.png +```{figure} ../../data/how-to/tuning-guides/tuning004.png :name: mi100-bandwidth-test-4 :alt: rocm-bandwidth-test output fragment on an 8*MI100 system showing uni- and bidirectional bandwidths. diff --git a/docs/how_to/tuning_guides/mi200.md b/docs/how-to/tuning-guides/mi200.md similarity index 90% rename from docs/how_to/tuning_guides/mi200.md rename to docs/how-to/tuning-guides/mi200.md index 4fe1fe47a..0e2c60f3f 100644 --- a/docs/how_to/tuning_guides/mi200.md +++ b/docs/how-to/tuning-guides/mi200.md @@ -8,14 +8,14 @@ is advised to configure the system for the best possible host configuration according to the "High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors." -Configure the system BIOS settings as explained in {ref}`mi200_bios_settings` and +Configure the system BIOS settings as explained in {ref}`mi200-bios-settings` and enact the below given settings via the command line as explained in -{ref}`mi200_os_settings`: +{ref}`mi200-os-settings`: -- Core C states -- IOMMU (if needed) +* Core C states +* IOMMU (if needed) -(mi200_bios_settings)= +(mi200-bios-settings)= ### System BIOS Settings @@ -213,7 +213,7 @@ the number of NUMA nodes per socket/processor (NPS), follow the guidance of the Processors" to provide the optimal configuration for host side computation. For most HPC workloads, NPS=4 is the recommended value. -(mi200_os_settings)= +(mi200-os-settings)= ### Operating System Settings @@ -221,9 +221,9 @@ most HPC workloads, NPS=4 is the recommended value. There are several Core-States, or C-states that an AMD EPYC CPU can idle within: -- C0: active. This is the active state while running an application. -- C1: idle -- C2: idle and power gated. 
This is a deeper sleep state and will have a +* C0: active. This is the active state while running an application. +* C1: idle +* C2: idle and power gated. This is a deeper sleep state and will have a greater latency when moving back to the C0 state, compared to when the CPU is coming out of C1. @@ -319,17 +319,17 @@ If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to "enable", the following steps can be applied to the system to enable all (logical) cores of the system: -- In the server BIOS, set IOMMU to "Enabled". -- When configuring the Grub boot loader, add the following arguments for the +* In the server BIOS, set IOMMU to "Enabled". +* When configuring the Grub boot loader, add the following arguments for the Linux kernel: `amd_iommu=on iommu=pt` -- Update Grub to use the modified configuration: +* Update Grub to use the modified configuration: ```shell sudo grub2-mkconfig -o /boot/grub2/grub.cfg ``` -- Reboot the system. -- Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`: +* Reboot the system. +* Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`: ```none [...] @@ -346,8 +346,8 @@ For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to [Installing ROCm on Linux](../../tutorials/install/linux/index). For verifying that the installation was successful, refer to {ref}`verifying-kernel-mode-driver-installation` and -[Validation Tools](../../reference/compilers_tools/validation_tools). Should verification -fail, consult the [System Debugging Guide](../system_debugging). +[Validation Tools](../../reference/compilers-tools/validation-tools). Should verification +fail, consult the [System Debugging Guide](../system-debugging). (mi200-hw-verification)= @@ -358,7 +358,7 @@ the GPU hardware, the `rocm-smi` command is available. 
It can show available GPUs in the system with their device ID and their respective firmware (or VBIOS) versions: -```{figure} ../../data/how_to/tuning_guides/tuning008.png +```{figure} ../../data/how-to/tuning-guides/tuning008.png :name: mi200-smi-showhw :alt: rocm-smi --showhw output on an 8*MI200 system. @@ -368,28 +368,28 @@ versions: To see the system structure, the localization of the GPUs in the system, and the fabric connections between the system components, use: -```{figure} ../../data/how_to/tuning_guides/tuning009.png +```{figure} ../../data/how-to/tuning-guides/tuning009.png :name: mi200-smi-showtopo :alt: rocm-smi --showtopo output on an 8*MI200 system. `rocm-smi --showtopo` output on an 8*MI200 system. ``` -- The first block of the output shows the distance between the GPUs similar to +* The first block of the output shows the distance between the GPUs similar to what the `numactl` command outputs for the NUMA domains of a system. The weight is a qualitative measure for the "distance" data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU. -- The second block has a matrix named "Hops between two GPUs", where 1 means the +* The second block has a matrix named "Hops between two GPUs", where 1 means the two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and 3 means both GPUs are linked to different CPU sockets so communications will go through both CPU sockets. This number is one for all GPUs in this case since they are all connected to each other through the Infinity Fabric links. -- The third block outputs the link types between the GPUs. This can either be +* The third block outputs the link types between the GPUs. This can either be "XGMI" for AMD Infinity Fabric links or "PCIE" for PCIe Gen4 links. 
-- The fourth block reveals the localization of a GPU with respect to the NUMA +* The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC processors. To query the compute capabilities of the GPU devices, use `rocminfo` command. It @@ -397,7 +397,7 @@ lists specific details about the GPU devices, including but not limited to the number of compute units, width of the SIMD pipelines, memory information, and instruction set architecture: -```{figure} ../../data/how_to/tuning_guides/image010.png +```{figure} ../../data/how-to/tuning-guides/image010.png :name: mi200-rocminfo :alt: rocminfo output fragment on an 8*MI200 system. @@ -405,7 +405,7 @@ instruction set architecture: ``` For a complete list of architecture (LLVM target) names, refer to GPU OS Support for -[Linux](../../about/release/linux_support) and [Windows](../../about/release/windows_support). +[Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md). ### Testing Inter-device Bandwidth @@ -452,7 +452,7 @@ Alternatively, the source code can be downloaded and built from The output will list the available compute devices (CPUs and GPUs), including their device ID and PCIe ID: -```{figure} ../../data/how_to/tuning_guides/image011.png +```{figure} ../../data/how-to/tuning-guides/image011.png :name: mi200-bandwidth-test-1 :alt: rocm-bandwidth-test output fragment on an 8*MI200 system listing devices. @@ -463,7 +463,7 @@ The output will also show a matrix that contains a "1" if a device can communicate to another device (CPU and GPU) of the system and it will show the NUMA distance (similar to `rocm-smi`): -```{figure} ../../data/how_to/tuning_guides/image012.png +```{figure} ../../data/how-to/tuning-guides/image012.png :name: mi200-bandwidth-test-2 :alt: rocm-bandwidth-test output fragment on an 8*MI200 system showing inter-device access matrix and NUMA distances. 
@@ -473,7 +473,7 @@ NUMA distance (similar to `rocm-smi`): The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU): -```{figure} ../../data/how_to/tuning_guides/image013.png +```{figure} ../../data/how-to/tuning-guides/image013.png :name: mi200-bandwidth-test-3 :alt: rocm-bandwidth-test output fragment on an 8*MI200 system showing uni- and bidirectional bandwidths. diff --git a/docs/how_to/tuning_guides/w6000_v620.md b/docs/how-to/tuning-guides/w6000-v620.md similarity index 93% rename from docs/how_to/tuning_guides/w6000_v620.md rename to docs/how-to/tuning-guides/w6000-v620.md index c4d53f550..a02080a40 100644 --- a/docs/how_to/tuning_guides/w6000_v620.md +++ b/docs/how-to/tuning-guides/w6000-v620.md @@ -9,14 +9,14 @@ Bare Metal follows the routine ROCm To enable ROCm virtualization on V620, one has to setup Single Root I/O Virtualization (SR-IOV) in the BIOS via setting found in the following -({ref}`bios_settings`). A tested configuration can be followed in -({ref}`os_settings`). +({ref}`bios-settings`). A tested configuration can be followed in +({ref}`os-settings`). ```{attention} SR-IOV is supported on V620 and unsupported on W6800. ``` -(bios_settings)= +(bios-settings)= ### System BIOS Settings @@ -48,7 +48,7 @@ SR-IOV is supported on V620 and unsupported on W6800. To set up the host, update SBIOS to version 1.2a. -(os_settings)= +(os-settings)= ### Operating System Settings @@ -147,7 +147,7 @@ First, assign GPU virtual function (VF) to VM using the following steps. 3. In the **Virtual Machine Manager** GUI, select the **VM** and click **Open**. - ```{figure} ../../data/how_to/tuning_guides/image014.png + ```{figure} ../../data/how-to/tuning-guides/image014.png :name: virtmgr-machine-mgr :alt: Virtual Machine Manager @@ -157,7 +157,7 @@ First, assign GPU virtual function (VF) to VM using the following steps. 4. 
In the VM GUI, go to **Show Virtual Hardware Details > Add Hardware** to configure hardware. - ```{figure} ../../data/how_to/tuning_guides/image015.png + ```{figure} ../../data/how-to/tuning-guides/image015.png :name: virtmgr-hw-details :alt: Show Virtual Hardware Details @@ -166,7 +166,7 @@ First, assign GPU virtual function (VF) to VM using the following steps. 5. Go to **Add Hardware > PCI Host Device > VF** and click **Finish**. - ```{figure} ../../data/how_to/tuning_guides/image016.png + ```{figure} ../../data/how-to/tuning-guides/image016.png :name: virtmgr-vf-select :alt: VF Selection diff --git a/docs/how_to/deep_learning_rocm.md b/docs/how_to/deep_learning_rocm.md deleted file mode 100644 index 9767b75f0..000000000 --- a/docs/how_to/deep_learning_rocm.md +++ /dev/null @@ -1,20 +0,0 @@ -# Deep Learning Guide - -The following sections cover the different framework installations for ROCm and -Deep Learning applications. The following image provides -the sequential flow for the use of each framework. Refer to the ROCm Compatible -Frameworks Release Notes for each framework's most current release notes at -[Third party support](../about/compatibility/3rd_party_support_matrix). - -```{figure} ../data/tutorials/install/magma_install/magma005.png -:name: rocm-compat-frameworks-chart -:align: center - -ROCm Compatible Frameworks Flowchart -``` - -## Frameworks Installation - -- [How to Install PyTorch?](../tutorials/install/pytorch_install) -- [How to Install Tensorflow?](../tutorials/install/tensorflow_install) -- [How to Install Magma?](../tutorials/install/magma_install) diff --git a/docs/how_to/index.md b/docs/how_to/index.md deleted file mode 100644 index 9b1290e57..000000000 --- a/docs/how_to/index.md +++ /dev/null @@ -1,34 +0,0 @@ -# All How-To Material - -:::::{grid} 1 1 2 2 -:gutter: 1 - -:::{grid-item-card} Tuning Guides -:link: tuning_guides/index -:link-type: doc -Use case-specific system setup and tuning guides. 
- -::: - -:::{grid-item-card} Deep Learning Guide -:link: deep_learning_rocm -:link-type: doc -Installation of various Deep Learning frameworks and applications. - -::: - -:::{grid-item-card} GPU-Enabled MPI -:link: gpu_aware_mpi -:link-type: doc -This chapter exemplifies how to set up Open MPI with the ROCm platform. - -::: - -:::{grid-item-card} System Debugging Guide -:link: system_debugging -:link-type: doc -Useful commands to debug misbehaving ROCm installations. - -::: - -::::: diff --git a/docs/index.md b/docs/index.md index 831e120c3..2f6433ad1 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,5 +1,13 @@ # AMD ROCm™ documentation +Welcome to the ROCm docs home page! If you're new to ROCm, you can review the following +resources to learn more about our products and what we support: + +* [What is ROCm?](./what-is-rocm.md) +* [What's new?](about/whats-new/whats-new) +* [Compatibility & support](./about/compatibility/index.md) +* [Release notes](./about/release-notes.md) + Our documentation is divided into four main categories: ::::{grid} 1 2 2 2 @@ -7,67 +15,73 @@ Our documentation is divided into four main categories: :::{grid-item-card} :padding: 2 -**[Tutorials](tutorials/index)** +**[Tutorials](./tutorials/index.md)** Instructional material ^^^ -- [Installing ROCm](tutorials/install/index) -- [Installing Magma](tutorials/install/magma_install) -- [Installing PyTorch](tutorials/install/pytorch_install) -- [Installing TensorFlow](tutorials/install/tensorflow_install) -- [GitHub examples](https://github.com/amd/rocm-examples) -- [Artificial intelligence](rocm_ai/rocm_ai) +* [Installing ROCm](./tutorials/install/index.md) +* [Installing Magma](./tutorials/install/magma-install.md) +* [Installing PyTorch](./tutorials/install/pytorch-install.md) +* [Installing TensorFlow](./tutorials/install/tensorflow-install.md) +* [GitHub examples](https://github.com/amd/rocm-examples) +* [Artificial intelligence](./rocm-ai.md) ::: :::{grid-item-card} :padding: 2 
-**[How-to](how_to/index)** +**[How-to](./how-to/index.md)** Task-oriented walkthroughs ^^^ -- [System Tuning for Various Architectures](how_to/tuning_guides/index) -- [GPU Aware MPI](how_to/gpu_aware_mpi) -- [Setting up for Deep Learning with ROCm](how_to/deep_learning_rocm) -- [System Level Debugging](how_to/system_debugging) +* [System Tuning for Various Architectures](./how-to/tuning-guides/index.md) +* [GPU Aware MPI](./how-to/gpu-aware-mpi.md) +* [Setting up for deep learning with ROCm](./how-to/deep-learning-rocm.md) +* [System Level Debugging](./how-to/system-debugging.md) ::: :::{grid-item-card} :padding: 2 -**[Reference](reference/index)** +**[Reference](./reference/index.md)** Collated information ^^^ -- [Libraries](reference/libraries/index) - - [Math libraries](reference/libraries/gpu_libraries/math) - - [C++ Primitives libraries](reference/libraries/gpu_libraries/c++_primitives) - - [Communication libraries](reference/libraries/gpu_libraries/communication) -- [Compilers & tools](reference/compilers_tools/index) - - [Computer Vision](reference/computer_vision) - - [Management Tools](reference/compilers_tools/management_tools) - - [Validation Tools](reference/compilers_tools/validation_tools) -- [HIP](reference/hip) -- [OpenMP](reference/openmp/openmp) +* [Libraries](./reference/libraries/index.md) + * [Math libraries](./reference/libraries/gpu-libraries/math.md) + * [C++ Primitives libraries](./reference/libraries/gpu-libraries/c++primitives.md) + * [Communication libraries](./reference/libraries/gpu-libraries/communication.md) +* [Compilers & tools](./reference/compilers-tools/index.md) + * [Management Tools](./reference/compilers-tools/management-tools.md) + * [Validation Tools](./reference/compilers-tools/validation-tools.md) +* [HIP](./reference/hip.md) +* [OpenMP](./reference/openmp/openmp.md) ::: :::{grid-item-card} :padding: 2 -**[Conceptual](conceptual/index)** +**[Conceptual](./conceptual/index.md)** -Topic overviews and background information 
+Topic overviews & background information ^^^ -- [Compiler Disambiguation](conceptual/compiler_disambiguation) -- [Using CMake](conceptual/cmake_packages) -- [Linux Folder Structure Reorganization](conceptual/file_reorg) -- [GPU Isolation Techniques](conceptual/gpu_isolation) -- [GPU Architecture](conceptual/gpu_arch) +* [Compiler Disambiguation](./conceptual/compiler-disambiguation.md) +* [Using CMake](./conceptual/cmake-packages.rst) +* [Linux Folder Structure Reorganization](./conceptual/file-reorg.md) +* [GPU Isolation Techniques](./conceptual/gpu-isolation.md) +* [GPU Architecture](./conceptual/gpu-arch.md) +* [ROCm & AI](./rocm-ai.md) ::: :::: + +We welcome collaboration! If you'd like to contribute to our documentation, you can find instructions +on our [Contributing to ROCm](./contribute/index.md) page. Known issues are listed on +[GitHub](https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue). + +Licensing information for all ROCm components is listed on our [Licensing](./about/license.md) page. diff --git a/docs/reference/compilers_tools/dev_tools.md b/docs/reference/compilers-tools/dev-tools.md similarity index 100% rename from docs/reference/compilers_tools/dev_tools.md rename to docs/reference/compilers-tools/dev-tools.md diff --git a/docs/reference/compilers_tools/compilers.md b/docs/reference/compilers-tools/index.md similarity index 64% rename from docs/reference/compilers_tools/compilers.md rename to docs/reference/compilers-tools/index.md index ed53de8a4..9c1a823a4 100644 --- a/docs/reference/compilers_tools/compilers.md +++ b/docs/reference/compilers-tools/index.md @@ -1,53 +1,59 @@ -# Compilers and tools - -:::::{grid} 1 1 2 2 -:gutter: 1 - -:::{grid-item-card} {doc}`ROCdbgapi ` -The AMD Debugger API is a library that provides all the support necessary for a -debugger and other tools to perform low level control of the execution and -inspection of execution state of AMD's commercially available GPU architectures. 
- -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/ROCdbgapi/) - -::: - -:::{grid-item-card} [ROCmCC](../rocmcc/rocmcc) -ROCmCC is a Clang/LLVM-based compiler. It is optimized for high-performance -computing on AMD GPUs and CPUs and supports various heterogeneous programming -models such as HIP, OpenMP, and OpenCL. - -- [Documentation](../rocmcc/rocmcc) - -::: - -:::{grid-item-card} {doc}`ROCgdb ` -This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger. - -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/ROCgdb/) - -::: - -:::{grid-item-card} {doc}`ROCProfiler ` -ROC profiler library. Profiling with performance counters and derived metrics. Library supports GFX8/GFX9. Hardware specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes hardware performance counters with complex performance metrics. - -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/rocprofiler/) - -::: - -:::{grid-item-card} {doc}`ROCTracer ` -Callback/Activity Library for Performance tracing AMD GPUs - -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/roctracer) - -::: - -::::: - -## See Also - -- [Compiler Disambiguation](../../conceptual/compiler_disambiguation.md) +# ROCm compilers and tools + +:::::{grid} 1 1 2 2 +:gutter: 1 + +:::{grid-item-card} {doc}`ROCdbgapi ` + +The AMD Debugger API is a library that provides all the support necessary for a +debugger and other tools to perform low level control of the execution and +inspection of execution state of AMD's commercially available GPU architectures. + +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/ROCdbgapi/) + +::: + +:::{grid-item-card} +**[ROCmCC](../rocmcc/rocmcc.md)** + +ROCmCC is a Clang/LLVM-based compiler. 
It is optimized for high-performance +computing on AMD GPUs and CPUs and supports various heterogeneous programming +models such as HIP, OpenMP, and OpenCL. + +* [Documentation](../rocmcc/rocmcc.md) + +::: + +:::{grid-item-card} {doc}`ROCgdb ` + +This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger. + +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/ROCgdb/) + +::: + +:::{grid-item-card} {doc}`ROCProfiler ` + +ROC profiler library. Profiling with performance counters and derived metrics. Library supports GFX8/GFX9. Hardware specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes hardware performance counters with complex performance metrics. + +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/rocprofiler/) + +::: + +:::{grid-item-card} {doc}`ROCTracer ` + +Callback/Activity Library for Performance tracing AMD GPUs + +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/roctracer) + +::: + +::::: + +## See Also + +* [Compiler Disambiguation](../../conceptual/compiler-disambiguation.md) diff --git a/docs/reference/compilers_tools/management_tools.md b/docs/reference/compilers-tools/management-tools.md similarity index 63% rename from docs/reference/compilers_tools/management_tools.md rename to docs/reference/compilers-tools/management-tools.md index 783c0fd7e..f6709a25a 100644 --- a/docs/reference/compilers_tools/management_tools.md +++ b/docs/reference/compilers-tools/management-tools.md @@ -4,29 +4,31 @@ :gutter: 1 :::{grid-item-card} {doc}`AMD SMI ` + The AMD System Management Interface Library, or AMD SMI library, is a C library for Linux that provides a user space interface for applications to monitor and control AMD devices. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/RadeonOpenCompute/amdsmi) -- [Examples](https://github.com/amd/go_amd_smi#example) +* [GitHub](https://github.com/RadeonOpenCompute/amdsmi) +* [Examples](https://github.com/amd/go_amd_smi#example) ::: :::{grid-item-card} {doc}`ROCm SMI LIB ` + This tool acts as a command line interface for manipulating and monitoring the AMD GPU kernel, and is intended to replace and deprecate the existing `rocm_smi.py` CLI tool. It uses `ctypes` to call the `rocm_smi_lib` API. -- {doc}`Documentation ` -- [GitHub](https://github.com/RadeonOpenCompute/rocm_smi_lib) -- [Examples](https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools) +* {doc}`Documentation ` +* [GitHub](https://github.com/RadeonOpenCompute/rocm_smi_lib) +* [Examples](https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools) ::: :::{grid-item-card} {doc}`ROCm Data Center Tool ` + The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and data center environments. 
-- [GitHub](https://github.com/RadeonOpenCompute/rdc) -- [Changelog](https://github.com/RadeonOpenCompute/rdc/blob/master/CHANGELOG.md) -- [Examples](https://github.com/RadeonOpenCompute/rdc/tree/master/example) +* [GitHub](https://github.com/RadeonOpenCompute/rdc) +* [Changelog](https://github.com/RadeonOpenCompute/rdc/blob/master/CHANGELOG.md) +* [Examples](https://github.com/RadeonOpenCompute/rdc/tree/master/example) ::: diff --git a/docs/reference/compilers_tools/validation_tools.md b/docs/reference/compilers-tools/validation-tools.md similarity index 63% rename from docs/reference/compilers_tools/validation_tools.md rename to docs/reference/compilers-tools/validation-tools.md index b89c69fee..aa99944d9 100644 --- a/docs/reference/compilers_tools/validation_tools.md +++ b/docs/reference/compilers-tools/validation-tools.md @@ -4,21 +4,23 @@ :gutter: 1 :::{grid-item-card} {doc}`RVS ` + The ROCm Validation Suite is a system administrator’s and cluster manager's tool for detecting and troubleshooting common problems affecting AMD GPU(s) running in a high-performance computing environment, enabled using the ROCm software stack on a compatible platform. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite) -- [Changelog](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite) +* [Changelog](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/CHANGELOG.md) ::: :::{grid-item-card} {doc}`TransferBench ` + TransferBench is a simple utility capable of benchmarking simultaneous transfers between user-specified devices (CPUs/GPUs). 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/TransferBench/) -- [Changelog](https://github.com/ROCmSoftwarePlatform/TransferBench/blob/develop/CHANGELOG.md) -- {doc}`transferbench:examples/index` +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/TransferBench/) +* [Changelog](https://github.com/ROCmSoftwarePlatform/TransferBench/blob/develop/CHANGELOG.md) +* {doc}`transferbench:examples/index` ::: diff --git a/docs/reference/compilers_tools/index.md b/docs/reference/compilers_tools/index.md deleted file mode 100644 index dde300468..000000000 --- a/docs/reference/compilers_tools/index.md +++ /dev/null @@ -1,3 +0,0 @@ -# ROCm compilers and tools - -add links... diff --git a/docs/reference/docker.md b/docs/reference/docker.md deleted file mode 100644 index c597eaa5e..000000000 --- a/docs/reference/docker.md +++ /dev/null @@ -1 +0,0 @@ -# Docker diff --git a/docs/reference/hip.md b/docs/reference/hip.md index 6f90dd24c..6d51aad7a 100644 --- a/docs/reference/hip.md +++ b/docs/reference/hip.md @@ -9,12 +9,13 @@ page introduces the HIP runtime and other HIP libraries and tools. :gutter: 1 :::{grid-item-card} {doc}`HIP Runtime ` + The HIP Runtime is used to enable GPU acceleration for all HIP language based products. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/HIP) -- [Examples](https://github.com/amd/rocm-examples/tree/develop/HIP-Basic) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/HIP) +* [Examples](https://github.com/amd/rocm-examples/tree/develop/HIP-Basic) ::: @@ -26,12 +27,13 @@ products. :gutter: 1 :::{grid-item-card} {doc}`HIPIFY ` + HIPIFY assists with porting applications from based on CUDA to the HIP Runtime. Supported CUDA APIs are documented here as well. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCm-Developer-Tools/HIPIFY/) -- [Changelog](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCm-Developer-Tools/HIPIFY/) +* [Changelog](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/CHANGELOG.md) ::: diff --git a/docs/reference/index.md b/docs/reference/index.md index 60348537f..473f93184 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -5,93 +5,98 @@ :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} [HIP](./hip) +:::{grid-item-card} +**[HIP](./hip.md)** + HIP is both AMD's GPU programming language extension and the GPU runtime. -- {doc}`HIP ` -- [HIP Examples](https://github.com/amd/rocm-examples/tree/develop/HIP-Basic) -- {doc}`HIPIFY ` +* {doc}`HIP ` +* [HIP Examples](https://github.com/amd/rocm-examples/tree/develop/HIP-Basic) +* {doc}`HIPIFY ` ::: -:::{grid-item-card} [Math Libraries](./libraries/gpu_libraries/math) +:::{grid-item-card} +**[Math Libraries](./libraries/gpu-libraries/math.md)** + HIP Math Libraries support the following domains: -- [Linear Algebra Libraries](./libraries/gpu_libraries/linear_algebra) -- [Fast Fourier Transforms](./libraries/gpu_libraries/fft) -- [Random Numbers](./libraries/gpu_libraries/rand) +* [Linear Algebra Libraries](./libraries/gpu-libraries/math-linear-algebra.md) +* [Fast Fourier Transforms](./libraries/gpu-libraries/math-fft.md) +* [Random Numbers](./libraries/gpu-libraries/rand.md) ::: -:::{grid-item-card} [C++ Primitive Libraries](./libraries/gpu_libraries/c++_primitives) +:::{grid-item-card} +**[C++ Primitive Libraries](./libraries/gpu-libraries/c++primitives.md)** + ROCm template libraries for C++ primitives and algorithms are as follows: -- {doc}`rocPRIM ` -- {doc}`rocThrust ` -- {doc}`hipCUB ` -- {doc}`hipTensor ` +* {doc}`rocPRIM ` +* {doc}`rocThrust ` +* {doc}`hipCUB ` +* {doc}`hipTensor ` ::: -:::{grid-item-card} [Communication 
Libraries](./libraries/gpu_libraries/communication) +:::{grid-item-card} [Communication Libraries](./libraries/gpu-libraries/communication.md) Inter and intra-node communication is supported by the following projects: -- {doc}`RCCL ` +* {doc}`RCCL ` ::: -:::{grid-item-card} [Artificial intelligence](../rocm_ai/rocm_ai) +:::{grid-item-card} +**[Artificial intelligence](../rocm-ai.md)** + Libraries related to AI. -- {doc}`MIOpen ` -- {doc}`Composable Kernel ` -- {doc}`MIGraphX ` +* {doc}`MIOpen ` +* {doc}`Composable Kernel ` +* {doc}`MIGraphX ` +* {doc}`MIVisionX ` +* {doc}`rocAL ` +::: + +:::{grid-item-card} +**[OpenMP](./openmp/openmp.md)** + +* [OpenMP Support Guide](./openmp/openmp.md) ::: -:::{grid-item-card} [Computer Vision](./computer_vision) -Computer vision related projects. +:::{grid-item-card} +**[Compilers and Tools](./compilers-tools/index.md)** -- {doc}`MIVisionX ` -- {doc}`rocAL ` +* [ROCmCC](./rocmcc/rocmcc.md) +* {doc}`ROCdbgapi ` +* {doc}`ROCgdb ` +* {doc}`ROCProfiler ` +* {doc}`ROCTracer ` ::: -:::{grid-item-card} [OpenMP](openmp/openmp) +:::{grid-item-card} +**[Management Tools](./compilers-tools/management-tools.md)** -- [OpenMP Support Guide](openmp/openmp) +* {doc}`AMD SMI ` +* {doc}`ROCm SMI ` +* {doc}`ROCm Data Center Tool ` ::: -:::{grid-item-card} [Compilers and Tools](compilers_tools/index) +:::{grid-item-card} +**[Validation Tools](./compilers-tools/validation-tools.md)** -- [ROCmCC](./rocmcc/rocmcc) -- {doc}`ROCdbgapi ` -- {doc}`ROCgdb ` -- {doc}`ROCProfiler ` -- {doc}`ROCTracer ` +* {doc}`ROCm Validation Suite ` +* {doc}`TransferBench ` ::: -:::{grid-item-card} [Management Tools](./compilers_tools/management_tools) +:::{grid-item-card} **GPU Architectures** -- {doc}`AMD SMI ` -- {doc}`ROCm SMI ` -- {doc}`ROCm Data Center Tool ` - -::: - -:::{grid-item-card} [Validation Tools](./compilers_tools/validation_tools) - -- {doc}`ROCm Validation Suite ` -- {doc}`TransferBench ` - -::: - -:::{grid-item-card} GPU Architectures - -- [AMD Instinct 
MI200](../conceptual/gpu_arch/mi250.md) -- [AMD Instinct MI100](../conceptual/gpu_arch/mi100.md) +* [AMD Instinct MI200](../conceptual/gpu-arch/mi250.md) +* [AMD Instinct MI100](../conceptual/gpu-arch/mi100.md) ::: diff --git a/docs/reference/compilers_tools/ai_tools.md b/docs/reference/libraries/ai-libraries.md similarity index 53% rename from docs/reference/compilers_tools/ai_tools.md rename to docs/reference/libraries/ai-libraries.md index 0ac60500e..91f4ec4da 100644 --- a/docs/reference/compilers_tools/ai_tools.md +++ b/docs/reference/libraries/ai-libraries.md @@ -4,29 +4,32 @@ :gutter: 1 :::{grid-item-card} {doc}`MIOpen ` + AMD's library for high performance machine learning primitives. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/MIOpen) -- [Changelog](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/MIOpen) +* [Changelog](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`Composable Kernel ` + Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/composable_kernel) -- [Changelog](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/composable_kernel) +* [Changelog](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`MIGraphX ` + AMD MIGraphX is AMD's graph inference engine that accelerates machine learning model inference. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX) -- [Changelog](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX) +* [Changelog](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/CHANGELOG.md) ::: diff --git a/docs/reference/libraries/gpu_libraries/c++_primitives.md b/docs/reference/libraries/gpu-libraries/c++primitives.md similarity index 58% rename from docs/reference/libraries/gpu_libraries/c++_primitives.md rename to docs/reference/libraries/gpu-libraries/c++primitives.md index 08cffc21b..f2ecf525b 100644 --- a/docs/reference/libraries/gpu_libraries/c++_primitives.md +++ b/docs/reference/libraries/gpu-libraries/c++primitives.md @@ -6,47 +6,51 @@ ROCm template libraries for algorithms are as follows: :gutter: 1 :::{grid-item-card} {doc}`rocPRIM ` + rocPRIM is an AMD GPU optimized template library of algorithm primitives, like transforms, reductions, scans, etc. It also serves as a common back-end for similar libraries found inside ROCm. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocPRIM/) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/CHANGELOG.md) -- [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocPRIM) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocPRIM/) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/CHANGELOG.md) +* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocPRIM) ::: :::{grid-item-card} {doc}`rocThrust ` + rocThrust is a template library of algorithm primitives with a Thrust-compatible interface. Their CPU back-ends are identical, while the GPU back-end calls into rocPRIM. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocThrust) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/CHANGELOG.md) -- [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocThrust) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocThrust) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/CHANGELOG.md) +* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocThrust) ::: :::{grid-item-card} {doc}`hipCUB ` + hipCUB is a template library of algorithm primitives with a CUB-compatible interface. It's back-end is rocPRIM. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipCUB) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/CHANGELOG.md) -- [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/hipCUB) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipCUB) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/CHANGELOG.md) +* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/hipCUB) ::: :::{grid-item-card} {doc}`hipTensor ` + hipTensor is AMD's C++ library for accelerating tensor primitives based on the composable kernel library, through general purpose kernel languages, like HIP C++. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipTensor) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipTensor) ::: diff --git a/docs/reference/libraries/gpu_libraries/communication.md b/docs/reference/libraries/gpu-libraries/communication.md similarity index 53% rename from docs/reference/libraries/gpu_libraries/communication.md rename to docs/reference/libraries/gpu-libraries/communication.md index aa36f60cd..f631caf40 100644 --- a/docs/reference/libraries/gpu_libraries/communication.md +++ b/docs/reference/libraries/gpu-libraries/communication.md @@ -4,15 +4,16 @@ :gutter: 1 :::{grid-item-card} {doc}`RCCL ` -RCCL (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs, + +RCCL (pronounced "Rickle") is a standalone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rccl) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md) -- [Examples](https://github.com/ROCmSoftwarePlatform/rccl/tree/develop/tools) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rccl) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md) +* [Examples](https://github.com/ROCmSoftwarePlatform/rccl/tree/develop/tools) ::: diff --git a/docs/reference/libraries/gpu_libraries/fft.md b/docs/reference/libraries/gpu-libraries/math-fft.md similarity index 59% rename from docs/reference/libraries/gpu_libraries/fft.md rename to docs/reference/libraries/gpu-libraries/math-fft.md index ccb081cf9..f0cc0fd50 100644 --- a/docs/reference/libraries/gpu_libraries/fft.md +++ b/docs/reference/libraries/gpu-libraries/math-fft.md @@ -6,22 +6,24 @@ ROCm libraries for FFT are as follows: :gutter: 1 :::{grid-item-card} {doc}`rocFFT ` + rocFFT is an AMD GPU optimized library for FFT. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocFFT) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`hipFFT ` + hipFFT is a compatibility layer for GPU accelerated FFT optimized for AMD GPUs using rocFFT. hipFFT allows for a common interface for other non AMD GPU FFT libraries. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipFFT) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/CHANGELOG.md) ::: diff --git a/docs/reference/libraries/gpu_libraries/linear_algebra.md b/docs/reference/libraries/gpu-libraries/math-linear-algebra.md similarity index 59% rename from docs/reference/libraries/gpu_libraries/linear_algebra.md rename to docs/reference/libraries/gpu-libraries/math-linear-algebra.md index ebec03878..2594c983d 100644 --- a/docs/reference/libraries/gpu_libraries/linear_algebra.md +++ b/docs/reference/libraries/gpu-libraries/math-linear-algebra.md @@ -6,103 +6,113 @@ ROCm libraries for linear algebra are as follows: :gutter: 1 :::{grid-item-card} {doc}`rocBLAS ` + `rocBLAS` is an AMD GPU optimized library for BLAS (Basic Linear Algebra Subprograms). -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocBLAS) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/CHANGELOG.md) -- [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocBLAS) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocBLAS) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/CHANGELOG.md) +* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocBLAS) ::: :::{grid-item-card} {doc}`hipBLAS ` + `hipBLAS` is a compatibility layer for GPU accelerated BLAS optimized for AMD GPUs via `rocBLAS` and `rocSOLVER`. `hipBLAS` allows for a common interface for other GPU BLAS libraries. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipBLAS) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipBLAS) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`hipBLASLt ` + `hipBLASLt` is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond traditional BLAS library. `hipBLASLt` is exposed APIs in HIP programming language with an underlying optimized generator as a back-end kernel provider. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipBLASLt) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLASLt/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipBLASLt) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipBLASLt/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`rocALUTION ` + `rocALUTION` is a sparse linear algebra library with focus on exploring fine-grained parallelism on top of AMD's ROCm runtime and toolchains, targeting modern CPU and GPU platforms. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocALUTION) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocALUTION) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`rocWMMA ` + `rocWMMA` provides an API to break down mixed precision matrix multiply-accumulate (MMA) problems into fragments and distributes these over GPU wavefronts. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocWMMA) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocWMMA) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`rocSOLVER ` + `rocSOLVER` provides a subset of LAPACK (Linear Algebra Package) functionality on the ROCm platform. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocSOLVER) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocSOLVER) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`hipSOLVER ` + `hipSOLVER` is a LAPACK marshalling library supporting both `rocSOLVER` and `cuSOLVER` as backends whilst exporting a unified interface. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipSOLVER) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipSOLVER) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`rocSPARSE ` + `rocSPARSE` is a library to provide BLAS for sparse computations. 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocSPARSE) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocSPARSE) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`hipSPARSE ` + `hipSPARSE` is a marshalling library to provide sparse BLAS functionality, supporting both `rocSPARSE` and `cuSPARSE` as backends. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSE) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSE) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/CHANGELOG.md) ::: :::{grid-item-card} {doc}`hipSPARSELt ` + `hipSPARSE` is a marshalling library to provide sparse BLAS functionality, supporting both `rocSPARSELt` and `cuSPARSELt` as backends. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSELt) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipSPARSELt) ::: diff --git a/docs/reference/libraries/gpu_libraries/math.md b/docs/reference/libraries/gpu-libraries/math.md similarity index 55% rename from docs/reference/libraries/gpu_libraries/math.md rename to docs/reference/libraries/gpu-libraries/math.md index 01f8d50d2..6dc7f2bf8 100644 --- a/docs/reference/libraries/gpu_libraries/math.md +++ b/docs/reference/libraries/gpu-libraries/math.md @@ -15,32 +15,35 @@ at compile-time of the hipLIB in question. 
For dynamic dispatch between vendor i ::::{grid} 1 2 3 3 :gutter: 1 -:::{grid-item-card} [Linear Algebra Libraries](linear_algebra) +:::{grid-item-card} +**[Linear Algebra Libraries](./math-linear-algebra.md)** -- {doc}`rocBLAS ` -- {doc}`hipBLAS ` -- {doc}`hipBLASLt ` -- {doc}`rocALUTION ` -- {doc}`rocWMMA ` -- {doc}`rocSOLVER ` -- {doc}`hipSOLVER ` -- {doc}`rocSPARSE ` -- {doc}`hipSPARSE ` -- {doc}`hipSPARSELt ` +* {doc}`rocBLAS ` +* {doc}`hipBLAS ` +* {doc}`hipBLASLt ` +* {doc}`rocALUTION ` +* {doc}`rocWMMA ` +* {doc}`rocSOLVER ` +* {doc}`hipSOLVER ` +* {doc}`rocSPARSE ` +* {doc}`hipSPARSE ` +* {doc}`hipSPARSELt ` ::: -:::{grid-item-card} [Fast Fourier Transforms](fft) +:::{grid-item-card} +**[Fast Fourier Transforms](./math-fft.md)** -- {doc}`rocFFT ` -- {doc}`hipFFT ` +* {doc}`rocFFT ` +* {doc}`hipFFT ` ::: -:::{grid-item-card} [Random Numbers](rand) +:::{grid-item-card} +**[Random Numbers](./rand.md)** -- {doc}`rocRAND ` -- {doc}`hipRAND ` +* {doc}`rocRAND ` +* {doc}`hipRAND ` ::: diff --git a/docs/reference/libraries/gpu_libraries/rand.md b/docs/reference/libraries/gpu-libraries/rand.md similarity index 57% rename from docs/reference/libraries/gpu_libraries/rand.md rename to docs/reference/libraries/gpu-libraries/rand.md index 1f9ed6f8b..dda353f50 100644 --- a/docs/reference/libraries/gpu_libraries/rand.md +++ b/docs/reference/libraries/gpu-libraries/rand.md @@ -4,23 +4,25 @@ :gutter: 1 :::{grid-item-card} {doc}`rocRAND ` + rocRAND is an AMD GPU optimized library for pseudo-random number generators (PRNG). 
-- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/rocRAND/) -- [Changelog](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/CHANGELOG.md) -- [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocRAND) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/rocRAND/) +* [Changelog](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/CHANGELOG.md) +* [Examples](https://github.com/amd/rocm-examples/tree/develop/Libraries/rocRAND) ::: :::{grid-item-card} {doc}`hipRAND ` + hipRAND is a compatibility layer for GPU accelerated pseudo-random number generation (PRNG) optimized for AMD GPUs using rocRAND. hipRAND allows for a common interface for other non AMD GPU PRNG libraries. -- {doc}`Documentation ` -- [GitHub](https://github.com/ROCmSoftwarePlatform/hipRAND/) -- [Changelog](https://github.com/ROCmSoftwarePlatform/hipRAND/blob/develop/CHANGELOG.md) +* {doc}`Documentation ` +* [GitHub](https://github.com/ROCmSoftwarePlatform/hipRAND/) +* [Changelog](https://github.com/ROCmSoftwarePlatform/hipRAND/blob/develop/CHANGELOG.md) ::: diff --git a/docs/reference/libraries/index.md b/docs/reference/libraries/index.md index ed01495e6..24562e118 100644 --- a/docs/reference/libraries/index.md +++ b/docs/reference/libraries/index.md @@ -1,8 +1,20 @@ # ROCm libraries -add links... 
+::::{grid} 1 1 2 2 +:gutter: 1 -* Math -* C++ primitive -* Communication -* Artificial intelligence +:::{grid-item-card} +**[AI libraries](./ai-libraries.md)** + +::: + +:::{grid-item-card} +**[Math libraries](./gpu-libraries/math.md)** + +::: + +:::{grid-item-card} +**[Communication libraries](./gpu-libraries/communication.md)** + +::: +:::: diff --git a/docs/reference/openmp/openmp.md b/docs/reference/openmp/openmp.md index 7037de4e4..529500399 100644 --- a/docs/reference/openmp/openmp.md +++ b/docs/reference/openmp/openmp.md @@ -9,12 +9,12 @@ Along with host APIs, the OpenMP compilers support offloading code and data onto GPU devices. This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and usage of `rocprof` with OpenMP applications. The GPUs supported are the same as those supported by -this ROCm release. See the list of supported GPUs in {doc}`../../about/release/linux_support`. +this ROCm release. See the list of supported GPUs for [Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md). The ROCm OpenMP compiler is implemented using LLVM compiler technology. The following image illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code. -```{figure} ../../data/reference/openmp/openmp_toolchain.svg +```{figure} ../../data/reference/openmp/openmp-toolchain.svg :name: openmp-toolchain ``` @@ -26,13 +26,10 @@ sub-directories are: bin: Compilers (`flang` and `clang`) and other binaries. -- examples: The usage section below shows how to compile and run these programs. - -- include: Header files. - -- lib: Libraries including those required for target offload. - -- lib-debug: Debug versions of the above libraries. 
+* examples: The usage section below shows how to compile and run these programs. +* include: Header files. +* lib: Libraries including those required for target offload. +* lib-debug: Debug versions of the above libraries. ## OpenMP: Usage @@ -127,10 +124,10 @@ program with: The following tracing options are widely used to generate useful information: -- **`--hsa-trace`**: This option is used to get a JSON output file with the HSA +* **`--hsa-trace`**: This option is used to get a JSON output file with the HSA API execution traces and a flat profile in a CSV file. -- **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls. +* **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls. Since this option results in loading ``libamdhip64.so``, follow the prerequisite as mentioned above. @@ -166,16 +163,16 @@ implemented in the past releases. ### Asynchronous Behavior in OpenMP Target Regions -- Controlling Asynchronous Behavior +* Controlling Asynchronous Behavior The OpenMP offloading runtime executes in an asynchronous fashion by default, allowing multiple data transfers to start concurrently. However, if the data to be transferred becomes larger than the default threshold of 1MB, the runtime falls back to a synchronous data transfer. The buffers that have been locked already are always executed asynchronously. You can overrule this default behavior by setting `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` and `OMPX_FORCE_SYNC_REGIONS`. See the [Environment Variables](#environment-variables) table for details. -- Multithreaded Offloading on the Same Device +* Multithreaded Offloading on the Same Device The `libomptarget` plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device. 
-- Parallel Memory Copy Invocations +* Parallel Memory Copy Invocations Implicit asynchronous execution of single target region enables parallel memory copy invocations. @@ -187,11 +184,9 @@ with Xnack capability. #### Prerequisites -- Linux Kernel versions above 5.14 - -- Latest KFD driver packaged in ROCm stack - -- Xnack, as USM support can only be tested with applications compiled with Xnack +* Linux Kernel versions above 5.14 +* Latest KFD driver packaged in ROCm stack +* Xnack, as USM support can only be tested with applications compiled with Xnack capability #### Xnack Capability @@ -220,13 +215,13 @@ HSA_XNACK=1 When Xnack support is not needed: -- Build the applications to maximize resource utilization using: +* Build the applications to maximize resource utilization using: ```bash --offload-arch=gfx908:xnack- ``` -- At runtime, set the `HSA_XNACK` environment variable to 0. +* At runtime, set the `HSA_XNACK` environment variable to 0. #### Unified Shared Memory Pragma @@ -376,27 +371,19 @@ GPUs with applications written in both HIP and OpenMP. **Features Supported on Host Platform (Target x86_64):** -- Use-after-free - -- Buffer overflows - -- Heap buffer overflow - -- Stack buffer overflow - -- Global buffer overflow - -- Use-after-return - -- Use-after-scope - -- Initialization order bugs +* Use-after-free +* Buffer overflows +* Heap buffer overflow +* Stack buffer overflow +* Global buffer overflow +* Use-after-return +* Use-after-scope +* Initialization order bugs **Features Supported on AMDGPU Platform (`amdgcn-amd-amdhsa`):** -- Heap buffer overflow - -- Global buffer overflow +* Heap buffer overflow +* Global buffer overflow **Software (Kernel/OS) Requirements:** Unified Shared Memory support with Xnack capability. See the section on [Unified Shared Memory](#unified-shared-memory) @@ -404,7 +391,7 @@ for prerequisites and details on Xnack. 
**Example:** -- Heap buffer overflow +* Heap buffer overflow ```bash void main() { @@ -424,7 +411,7 @@ void main() { See the complete sample code for heap buffer overflow [here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/heap_buffer_overflow/openmp/vecadd-HBO.cpp). -- Global buffer overflow +* Global buffer overflow ```bash #pragma omp declare target @@ -453,33 +440,31 @@ See the complete sample code for global buffer overflow You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options: -- `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including No-loop and Cross-team reductions. +* `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including No-loop and Cross-team reductions. -- `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution. +* `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution. -- `-fopenmp-assume-no-nested-parallelism`: It enables the compiler to assume that no thread in a parallel region encounters a parallel region, thus potentially reducing the device runtime code execution. +* `-fopenmp-assume-no-nested-parallelism`: It enables the compiler to assume that no thread in a parallel region encounters a parallel region, thus potentially reducing the device runtime code execution. -- `-O3` if no `-O*` is specified by the user. +* `-O3` if no `-O*` is specified by the user. 
### Specialized Kernels Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported: -- No-Loop - -- Big-Jump-Loop - -- Cross-Team (Xteam) Reductions +* No-Loop +* Big-Jump-Loop +* Cross-Team (Xteam) Reductions To enable the generation of specialized kernels, follow these guidelines: -- Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the No-Loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the Big-Jump-Loop kernel instead of the No-Loop kernel. +* Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the No-Loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the Big-Jump-Loop kernel instead of the No-Loop kernel. -- Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`. +* Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`. -- To automatically enable the specialized kernel generation, use `-Ofast` or `-fopenmp-target-fast` for compilation. +* To automatically enable the specialized kernel generation, use `-Ofast` or `-fopenmp-target-fast` for compilation. -- To disable specialized kernel generation, use `-fno-openmp-target-ignore-env-vars`. +* To disable specialized kernel generation, use `-fno-openmp-target-ignore-env-vars`. 
#### No-Loop Kernel Generation diff --git a/docs/reference/rocmcc/rocmcc.md b/docs/reference/rocmcc/rocmcc.md index 6467a539e..ea7215e29 100644 --- a/docs/reference/rocmcc/rocmcc.md +++ b/docs/reference/rocmcc/rocmcc.md @@ -19,15 +19,15 @@ The differences are listed in [the table below](rocm-llvm-vs-alt). For more details, see: -- AMD GPU usage: [llvm.org/docs/AMDGPUUsage.html](https://llvm.org/docs/AMDGPUUsage.html) -- Releases and source: +* AMD GPU usage: [llvm.org/docs/AMDGPUUsage.html](https://llvm.org/docs/AMDGPUUsage.html) +* Releases and source: ### ROCm Compiler Interfaces ROCm currently provides two compiler interfaces for compiling HIP programs: -- `/opt/rocm/bin/hipcc` -- `/opt/rocm/bin/amdclang++` +* `/opt/rocm/bin/hipcc` +* `/opt/rocm/bin/amdclang++` Both leverage the same LLVM compiler technology with the AMD GCN GPU support; however, they offer a slightly different user experience. The `hipcc` command-line @@ -237,8 +237,8 @@ minimized if the hoisted condition is executed more often. This heuristic prioritizes the conditions based on the number of times they are used within the loop. The heuristic can be controlled with the following options: -- `-unswitch-identical-branches-min-count=` - - Enables unswitching of a loop with respect to a branch conditional value +* `-unswitch-identical-branches-min-count=` + * Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at least `` compares in the loop. This option is enabled with `-aggressive-loop-unswitch`. The default value is 3. @@ -246,8 +246,8 @@ loop. The heuristic can be controlled with the following options: Where, `n` is a positive integer and lower value of `` facilitates more unswitching. 
-- `-unswitch-identical-branches-max-count=` - - Enables unswitching of a loop with respect to a branch conditional value +* `-unswitch-identical-branches-max-count=` + * Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at most `` compares in the loop. This option is enabled with `-aggressive-loop-unswitch`. The default value is 6. @@ -436,19 +436,19 @@ Inline assembly (ASM) statements allow a developer to include assembly instructions directly in either host or device code. While the ROCm compiler supports ASM statements, their use is not recommended for the following reasons: -- The compiler's ability to produce both correct code and to optimize +* The compiler's ability to produce both correct code and to optimize surrounding code is impeded. -- The compiler does not parse the content of the ASM statements and so +* The compiler does not parse the content of the ASM statements and so cannot "see" its contents. -- The compiler must make conservative assumptions in an effort to retain +* The compiler must make conservative assumptions in an effort to retain correctness. -- The conservative assumptions may yield code that, on the whole, is less +* The conservative assumptions may yield code that, on the whole, is less performant compared to code without ASM statements. It is possible that a syntactically correct ASM statement may cause incorrect runtime behavior. -- ASM statements are often ASIC-specific; code containing them is less portable +* ASM statements are often ASIC-specific; code containing them is less portable and adds a maintenance burden to the developer if different ASICs are targeted. -- Writing correct ASM statements is often difficult; we strongly recommend +* Writing correct ASM statements is often difficult; we strongly recommend thorough testing of any use of ASM statements. :::{note} @@ -608,9 +608,9 @@ architectures. 
The ROCmCC compiler is enhanced to generate binaries that can contain heterogenous images. This heterogeneity could be in terms of: -- Images of different architectures, like AMD GCN and NVPTX -- Images of same architectures but for different GPUs, like gfx906 and gfx908 -- Images of same architecture and same GPU but for different target features, +* Images of different architectures, like AMD GCN and NVPTX +* Images of same architectures but for different GPUs, like gfx906 and gfx908 +* Images of same architecture and same GPU but for different target features, like `gfx908:xnack+` and `gfx908:xnack-` An appropriate image is selected by the OpenMP device runtime for execution diff --git a/docs/rocm-a-z.md b/docs/rocm-a-z.md new file mode 100644 index 000000000..fc76e734e --- /dev/null +++ b/docs/rocm-a-z.md @@ -0,0 +1,46 @@ +# ROCm A-Z + +:::{table} +:name: rocm-a-z + +| ROCm product | Description | +| :---------------- | :------------ | +| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/) | A graph inference engine that accelerates machine learning model inference | +| [Composable Kernel](https://rocm.docs.amd.com/projects/composable_kernel/en/latest/) | A library that aims to provide a programming model for writing performance critical kernels for machine learning workloads across multiple architectures | +| {doc}`HIP ` | AMD’s GPU programming language extension and the GPU runtime | +| [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS/) | A BLAS-marshaling library that supports [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/) and cuBLAS backends | +| [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/) | A compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure | +| [hipCUB](https://rocm.docs.amd.com/projects/hipCUB/en/latest/) | A thin header-only wrapper library on top of 
[rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/) or CUB that allows project porting using the CUB library to the HIP layer | +| [hipFFT](https://rocm.docs.amd.com/projects/hipFFT/en/latest/) | An FFT-marshalling library that supports rocFFT or cuFFT backends | +| {doc}`HIPIFY ` | A set of tools for translating CUDA source code into portable HIP C++ | +| [hipify-clang](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-clang.html) | A Clang-based tool for translating CUDA sources into HIP sources | +| [hipify-perl](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/hipify-perl.html) | An autogenerated, perl-based script that translates CUDA source code into portable HIP C++ | +| [hipSOLVER](https://rocm.docs.amd.com/projects/hipSOLVER/en/latest/) | A LAPACK-marshalling library that supports [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) and cuSOLVER backends | +| [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/latest/) | A SPARSE-marshalling library that supports [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) and cuSPARSE backends | +| [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/) | A graph inference engine that accelerates machine learning model inference | +| [MIOpen](https://rocm.docs.amd.com/projects/MIOpen/en/latest/) | An open source deep-learning library | +| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM) | An OpenCL general matrix multiplication (GEMM) API and kernel generator | +| [MIOpenTensile](https://github.com/ROCmSoftwarePlatform/MIOpenTensile) | Provides host-callable interfaces to Tensile library | +| [MIVisionX](https://rocm.docs.amd.com/projects/MIVisionX/en/latest/doxygen/html/index.html) | A set of comprehensive computer vision and machine learning libraries, utilities, and applications | +| [Radeon Compute Profiler (RCP)](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | A performance analysis tool that gathers 
data from the API run-time and GPU for OpenCL™ and ROCm/HSA applications | +| [RCCL](https://rocm.docs.amd.com/projects/rccl/en/latest/) | A standalone library that provides multi-GPU and multi-node collective communication primitives | +| [rocAL](https://rocm.docs.amd.com/projects/rocAL/en/latest/doxygen/html/index.html) | An augmentation library designed to decode and process images and videos | +| [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/)| A BLAS implementation (in the HIP programming language) on ROCm's runtime and toolchains | +| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/) | An AMDGPU Driver with KFD that is used by ROCm | +| [ROCmCC](https://rocm.docs.amd.com/en/latest/reference/rocmcc/rocmcc.html) | A Clang/LLVM-based compiler | +| [ROCm cmake](https://github.com/RadeonOpenCompute/rocm-cmake) | A collection of CMake modules for common build and development tasks | +| [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/latest/) | Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments | +| [ROCm Debugger (ROCgdb)](https://rocm.docs.amd.com/projects/ROCgdb/en/latest/) | A source-level debugger for Linux, based on the GNU Debugger (GDB) | +| [ROCm Debugger API (ROCdbgapi)](https://rocm.docs.amd.com/projects/ROCdbgapi/en/latest/) | The ROCm debugger library | +| [ROCm SMI](https://github.com/RadeonOpenCompute/rocm_smi_lib/) | A C library for Linux that provides a user space interface for applications to monitor and control GPU applications | +| [ROCm Validation Suite](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/) | A tool for detecting and troubleshooting common problems affecting AMD GPUs running in a high-performance computing environment | +| [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/latest/) | A header-only library for HIP parallel primitives | +| 
[ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/profiler_home_page.html) | A profiling tool for HIP applications | +| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents | +| [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on the ROCm platform, implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs | +| [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) | Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language) | +| [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/latest/) | A parallel algorithm library | +| [ROCTracer](https://rocm.docs.amd.com/projects/roctracer/en/latest/) | Intercepts runtime API calls and traces asynchronous activity | +| [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/) | A utility to benchmark simultaneous transfers between user-specified devices (CPUs/GPUs) | + +::: diff --git a/docs/reference/computer_vision.md b/docs/rocm-ai.md similarity index 57% rename from docs/reference/computer_vision.md rename to docs/rocm-ai.md index 30bc90e41..bc6e85007 100644 --- a/docs/reference/computer_vision.md +++ b/docs/rocm-ai.md @@ -1,21 +1,34 @@ -# Computer Vision +# ROCm & artificial intelligence -::::{grid} 1 1 2 2 +:::::{grid} 1 1 2 2 :gutter: 1 +:::{grid-item-card} +**[Inception V3 with PyTorch](./conceptual/ai-pytorch-inception.md)** + +A collection of detailed and guided examples for working with Inception V3 with PyTorch on ROCm. + +::: + +:::{grid-item-card} +**[Optimizing Inference with MIGraphX](./conceptual/ai-migraphx-optimization.md)** + +Walkthroughs of optimizing inference using MIGraphX. 
+ +::: + :::{grid-item-card} {doc}`MIVisionX ` + MIVisionX toolkit is a set of comprehensive computer vision and machine intelligence libraries, utilities, and applications bundled into a single toolkit. AMD MIVisionX also delivers a highly optimized open-source implementation of the Khronos OpenVX™ and OpenVX™ Extensions. -- {doc}`Documentation ` -- [GitHub](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/) -- [Changelog](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/CHANGELOG.md) +* [GitHub](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/) +* [Changelog](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/CHANGELOG.md) ::: :::{grid-item-card} {doc}`rocAL ` -The AMD ROCm Augmentation Library (rocAL) is designed to efficiently decode and process images and videos from a variety of storage formats and modify them through a processing graph programmable by the user. rocAL currently provides C API. -- {doc}`Documentation ` +The AMD ROCm Augmentation Library (rocAL) is designed to efficiently decode and process images and videos from a variety of storage formats and modify them through a processing graph programmable by the user. rocAL currently provides C API. 
::: diff --git a/docs/rocm_a-z.md b/docs/rocm_a-z.md deleted file mode 100644 index 53b62475e..000000000 --- a/docs/rocm_a-z.md +++ /dev/null @@ -1,38 +0,0 @@ -# ROCm A-Z - -:::{table} -:name: rocm_a-z - -| ROCm product | Description | -| :---------------- | :------------ | -| AMD SMI | | -| Composable Kernel | | -| {doc}`HIP ` | AMD’s GPU programming language extension and the GPU runtime | -| hipBLAS | | -| hipCUB | | -| hipFFT | | -| {doc}`HIPIFY ` | Assists with porting applications from CUDA to HIP runtime | -| hipify-clang | A tool to translate CUDA source code into portable HIP C++ | -| hipify-perl | A tool to translate CUDA source code into portable HIP C++ | -| hipSOLVER | | -| hipSPARSE | | -| hipTensor | | -| MIGraphX | | -| MIOpen | | -| MIVisionX | | -| OpenMP | | -| RCCL | | -| rocAL | | -| ROCdbgapi | | -| ROCgdb | | -| ROCmCC | | -| ROCm Data Center Tool | | -| ROCm SMI | | -| ROCm Validation Suite | | -| rocPRIM | A header-only library for HIP parallel primitives | -| ROCProfiler | | -| rocThrust | A parallel algorithm library | -| ROCTracer | | -| TransferBench | | - -::: diff --git a/docs/rocm_ai/rocm_ai.md b/docs/rocm_ai/rocm_ai.md deleted file mode 100644 index 10f9e9179..000000000 --- a/docs/rocm_ai/rocm_ai.md +++ /dev/null @@ -1,20 +0,0 @@ -# ROCm & artificial intelligence - -:::::{grid} 1 1 2 2 -:gutter: 1 - -:::{grid-item-card} Inception V3 with PyTorch -:link: pytorch_inception -:link-type: doc -A collection of detailed and guided examples for working with Inception V3 with PyTorch on ROCm. - -::: - -:::{grid-item-card} Optimizing Inference with MIGraphX -:link: migraphx_optimization -:link-type: doc -Walkthroughs of optimizing inference using MIGraphX. - -::: - -::::: diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 1b0e05160..8c9da7736 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -2,52 +2,225 @@ # These comments will also be removed. 
defaults: numbered: False - maxdepth: 6 + maxdepth: 7 root: index subtrees: - entries: - - file: what_is_rocm + - file: what-is-rocm.md title: What is ROCm? subtrees: - entries: - - file: rocm_ai/rocm_ai.md + - file: rocm-ai.md title: ROCm & AI - - file: tutorials/quick_start/index.md + - file: about/whats-new/whats-new.md + title: What's new? + + - file: tutorials/quick-start/index.md title: Quick start subtrees: - entries: - - file: tutorials/quick_start/linux.md + - file: tutorials/quick-start/linux.md title: Linux - - file: tutorials/quick_start/windows.md + - file: tutorials/quick-start/windows.md title: Windows + - file: about/compatibility/index.md title: Compatibility & support - - file: CHANGELOG.md + subtrees: + - entries: + - file: about/compatibility/linux-support.md + title: Linux (GPU & OS) + - file: about/compatibility/windows-support.md + title: Windows (GPU & OS) + - file: about/compatibility/3rd-party-support-matrix.md + title: Third-party + - file: about/compatibility/user-kernel-space-compat-matrix.md + title: User/kernel space support + - file: about/compatibility/docker-image-support-matrix.md + title: Docker + + - file: about/release-notes.md title: Release information subtrees: - entries: - - file: whats_new/whats_new.md - title: What's new? 
- - file: about/release/release_history.md + - file: CHANGELOG.md + title: Changelog + - file: about/release-history.md title: Release history - url: https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue title: Known issues + - file: tutorials/index.md title: Tutorials subtrees: - entries: - - file: tutorials/install/linux/index.md - title: Install ROCm (Linux) - - file: tutorials/install/windows/index.md - title: Install ROCm (Windows) - - file: how_to/index.md - title: How-to guides + - file: tutorials/install/index.md + title: Installing ROCm + subtrees: + - entries: + - file: tutorials/install/linux/index.md + title: Linux + subtrees: + - entries: + - file: tutorials/install/linux/prerequisites.md + title: Prerequisites + - file: tutorials/install/linux/install-options.md + title: Installation options + subtrees: + - entries: + - file: tutorials/install/linux/installer/index.md + title: AMDGPU install script + subtrees: + - entries: + - file: tutorials/install/linux/installer/install.md + title: Install + - file: tutorials/install/linux/installer/upgrade.md + title: Upgrade + - file: tutorials/install/linux/installer/uninstall.md + title: Uninstall + - file: tutorials/install/linux/os-native/index.md + title: Package manager + subtrees: + - entries: + - file: tutorials/install/linux/os-native/package-manager-integration.md + title: Package manager integration + - file: tutorials/install/linux/os-native/install.md + title: Install + - file: tutorials/install/linux/os-native/upgrade.md + title: Upgrade + - file: tutorials/install/linux/os-native/uninstall.md + title: Uninstall + - file: tutorials/install/windows/index.md + title: Windows + subtrees: + - entries: + - file: tutorials/install/windows/prerequisites.md + title: Prerequisites + - file: tutorials/install/windows/gui/index.md + title: Graphical installation + subtrees: + - entries: + - file: tutorials/install/windows/gui/install.md + title: Install + - file: 
tutorials/install/windows/gui/upgrade.md + title: Upgrade + - file: tutorials/install/windows/gui/uninstall.md + title: Uninstall + - file: tutorials/install/windows/cli/index.md + title: Command line installation + subtrees: + - entries: + - file: tutorials/install/windows/cli/install.md + title: Install + - file: tutorials/install/windows/cli/upgrade.md + title: Upgrade + - file: tutorials/install/windows/cli/uninstall.md + title: Uninstall + - file: tutorials/install/docker.md + title: ROCm Docker containers + - file: tutorials/install/pytorch-install.md + title: PyTorch for ROCm + - file: tutorials/install/tensorflow-install.md + title: TensorFlow for ROCm + - file: tutorials/install/magma-install.md + title: Magma for ROCm + + - file: how-to/index.md + title: How-to + subtrees: + - entries: + - file: how-to/deep-learning-rocm.md + title: Deep learning + - file: how-to/gpu-aware-mpi.md + title: Using MPI + - file: how-to/system-debugging.md + title: Debugging + - file: how-to/tuning-guides/index.md + title: Tuning guides + subtrees: + - entries: + - file: how-to/tuning-guides/mi100.md + title: MI100 + - file: how-to/tuning-guides/mi200.md + title: MI200 + - file: how-to/tuning-guides/w6000-v620.md + title: RDNA2 + - file: reference/index.md title: Reference guides + subtrees: + - entries: + - file: reference/hip.md + title: HIP + - file: reference/compilers-tools/index.md + title: Compilers & tools + subtrees: + - entries: + - file: reference/compilers-tools/validation-tools.md + title: Validation tools + - file: reference/compilers-tools/management-tools.md + title: Management tools + - file: reference/libraries/index.md + title: Libraries + subtrees: + - entries: + - file: reference/libraries/ai-libraries.md + title: AI + - file: reference/libraries/gpu-libraries/communication.md + title: Communication + - file: reference/libraries/gpu-libraries/math.md + title: Math + subtrees: + - entries: + - file: reference/libraries/gpu-libraries/math-linear-algebra.md +
title: Linear algebra + - file: reference/libraries/gpu-libraries/math-fft.md + title: Fast Fourier transforms + - file: reference/libraries/gpu-libraries/rand.md + title: Random numbers + - file: reference/libraries/gpu-libraries/c++primitives.md + title: C++ primitives + - file: reference/openmp/openmp.md + title: OpenMP + - file: reference/rocmcc/rocmcc.md + title: ROCmCC + - file: conceptual/index.md title: Conceptual documentation - - file: rocm_a-z.md + subtrees: + - entries: + - file: conceptual/gpu-arch.md + title: GPU architectures + subtrees: + - entries: + - file: conceptual/gpu-arch/mi100.md + title: MI100 + - file: conceptual/gpu-arch/mi200-performance-counters.md + title: MI200 + - file: conceptual/gpu-arch/mi250.md + title: MI250 + - file: conceptual/compiler-disambiguation.md + title: Compiler disambiguation + - file: conceptual/file-reorg.md + title: File structure (Linux FHS) + - file: conceptual/windows-app-deployment-guidelines.md + title: Windows deployment guidelines + - file: conceptual/gpu-isolation.md + title: GPU isolation techniques + - file: conceptual/using-gpu-sanitizer.md + title: LLVM ASan + - file: conceptual/cmake-packages.rst + title: Using CMake + - file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst + title: ROCm & PCIe Atomics + - file: conceptual/ai-pytorch-inception.md + title: Inception V3 with PyTorch + - file: conceptual/ai-migraphx-optimization.md + title: Inference optimization with MIGraphX + + - file: rocm-a-z.md title: ROCm A-Z + - file: contribute/index.md title: Contributing subtrees: @@ -58,6 +231,7 @@ subtrees: title: Building documentation - file: contribute/feedback.md title: Providing feedback + - file: about/license.md title: ROCm licensing diff --git a/docs/temp/troubleshooting.md b/docs/temp/troubleshooting.md index bc34f4ff7..722995608 100644 --- a/docs/temp/troubleshooting.md +++ b/docs/temp/troubleshooting.md @@ -14,7 +14,7 @@ dependencies or libraries do not support the current GPU.
To implement a workaround, follow these steps: 1. Confirm that the hardware supports the ROCm stack. Refer to -{ref}`supported_gpus`. +{ref}`linux-support` and {ref}`windows-support`. 2. Determine the gfx target. @@ -44,7 +44,7 @@ described in the ROCm Installation Guide at {ref}`linux_group_permissions`. Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to Option 2: Install PyTorch Using Wheels Package in the section -{ref}`install_pytorch_using_wheels` of this guide for more information. +{ref}`install-pytorch-using-wheels` of this guide for more information. **Q: How do I profile PyTorch workloads?** diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index df26efb90..353a0fc4a 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -3,24 +3,27 @@ :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} [Installing ROCm](./install/index.md) +:::{grid-item-card} +**[Installing ROCm](./install/index.md)** Learn how to install ROCm on Linux and Windows. ::: -:::{grid-item-card} [ROCm examples](https://github.com/amd/rocm-examples) -:link-type: url +:::{grid-item-card} +**[ROCm examples](https://github.com/amd/rocm-examples)** + Sample code demonstrating the HIP API and ROCm-accelerated domain libraries. ::: -:::{grid-item-card} [Artificial intelligence](../rocm_ai/rocm_ai) +:::{grid-item-card} +**[Artificial intelligence](../rocm-ai.md)** Detailed walkthroughs of specific artificial intelligence use cases using ROCm acceleration. 
-- [Implementing Inception V3 on ROCm with PyTorch](../rocm_ai/pytorch_inception) -- [Optimizing Inference with MIGraphX](../rocm_ai/migraphx_optimization) +* [Implementing Inception V3 on ROCm with PyTorch](../conceptual/ai-pytorch-inception.md) +* [Optimizing Inference with MIGraphX](../conceptual/ai-migraphx-optimization.md) ::: diff --git a/docs/tutorials/install/docker.md b/docs/tutorials/install/docker.md index 2c2119c27..bbfdced8e 100644 --- a/docs/tutorials/install/docker.md +++ b/docs/tutorials/install/docker.md @@ -4,7 +4,7 @@ Docker containers share the kernel with the host operating system, therefore the ROCm kernel-mode driver must be installed on the host. Please refer to -{ref}`linux_install_methods` on installing `amdgpu-dkms`. The other +{ref}`linux-install-methods` on installing `amdgpu-dkms`. The other user-space parts (like the HIP-runtime or math libraries) of the ROCm stack will be loaded from the container image and don't need to be installed to the host. @@ -17,8 +17,8 @@ OpenMP offloading) explicit access to the GPUs must be granted. The ROCm runtimes make use of multiple device files: -- `/dev/kfd`: the main compute interface shared by all GPUs -- `/dev/dri/renderD`: direct rendering interface (DRI) devices for each +* `/dev/kfd`: the main compute interface shared by all GPUs +* `/dev/dri/renderD`: direct rendering interface (DRI) devices for each GPU. **``** is a number for each card in the system starting from 128. Exposing these devices to a container is done by using the diff --git a/docs/tutorials/install/index.md b/docs/tutorials/install/index.md index f74550b74..967d2a6c2 100644 --- a/docs/tutorials/install/index.md +++ b/docs/tutorials/install/index.md @@ -1,20 +1,20 @@ # Installing ROCm -Our installation guides are designed to walk you through a ROCm installation in detail. If you want to get up and running quickly, try our [quick-start guides](../quick_start/index). 
+Our installation guides are designed to walk you through a ROCm installation in detail. If you want to get up and running quickly, try our [quick-start guides](../quick-start/index.md). :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Linux installation guide -:link: ./linux/index -:link-type: doc +:::{grid-item-card} +**[Linux installation guide](./linux/index.md)** + Install ROCm on Linux. ::: -:::{grid-item-card} Windows installation guide -:link: ./windows/index -:link-type: doc +:::{grid-item-card} +**[Windows installation guide](./windows/index.md)** + Install ROCm on Linux. ::: diff --git a/docs/tutorials/install/linux/index.md b/docs/tutorials/install/linux/index.md index 181196573..257918b59 100644 --- a/docs/tutorials/install/linux/index.md +++ b/docs/tutorials/install/linux/index.md @@ -1,6 +1,6 @@ # Install ROCm on Linux -Start with {doc}`../../quick_start/linux` or follow the detailed +Start with {doc}`../../quick-start/linux` or follow the detailed instructions below. ## Prepare to Install @@ -8,16 +8,14 @@ instructions below. ::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Prerequisites -:link: prerequisites -:link-type: doc +:::{grid-item-card} +**[Prerequisites](./prerequisites.md)** The prerequisites page lists the required steps *before* installation. ::: -:::{grid-item-card} Install Choices -:link: install_overview -:link-type: doc +:::{grid-item-card} +**[Installation options](./install-options.md)** Package manager vs AMDGPU Installer @@ -26,23 +24,21 @@ Standard Packages vs Multi-Version Packages :::: -(linux_install_methods)= +(linux-install-methods)= ## Choose your install method ::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Package Manager -:link: os-native/index -:link-type: doc +:::{grid-item-card} +**[Package manager](./os-native/index.md)** Directly use your distribution's package manager to install ROCm. 
::: -:::{grid-item-card} AMDGPU Installer -:link: installer/index -:link-type: doc +:::{grid-item-card} +**[AMDGPU Installer](./installer/index.md)** Use an installer tool that orchestrates changes via the package manager. @@ -52,4 +48,4 @@ manager. ## See Also -- {doc}`../../../about/release/linux_support` +[Linux support](../../../about/compatibility/linux-support.md) diff --git a/docs/tutorials/install/linux/install_overview.md b/docs/tutorials/install/linux/install-options.md similarity index 85% rename from docs/tutorials/install/linux/install_overview.md rename to docs/tutorials/install/linux/install-options.md index ac1b25f60..6ba51a8c9 100644 --- a/docs/tutorials/install/linux/install_overview.md +++ b/docs/tutorials/install/linux/install-options.md @@ -1,14 +1,14 @@ # ROCm Installation Options (Linux) Users installing ROCm must choose between various installation options. A new -user should follow the [Quick Start guide](../../quick_start/linux). +user should follow the [Quick Start guide](../../quick-start/linux). ## Package Manager versus AMDGPU Installer? ROCm supports two methods for installation: -- Directly using the Linux distribution's package manager -- The `amdgpu-install` script +* Directly using the Linux distribution's package manager +* The `amdgpu-install` script There is no difference in the final installation state when choosing either option. @@ -38,17 +38,17 @@ specific and a ROCm release version. The single-version ROCm installation refers to the following: -- Installation of a single instance of the ROCm release on a system -- Use of non-versioned ROCm meta-packages +* Installation of a single instance of the ROCm release on a system +* Use of non-versioned ROCm meta-packages ### Multi-version Installation The multi-version installation refers to the following: -- Installation of multiple instances of the ROCm stack on a system. Extending +* Installation of multiple instances of the ROCm stack on a system. 
Extending the package name and its dependencies with the release version adds the ability to support multiple versions of packages simultaneously. -- Use of versioned ROCm meta-packages. +* Use of versioned ROCm meta-packages. ```{attention} ROCm packages that were previously installed from a single-version installation diff --git a/docs/tutorials/install/linux/installer/index.md b/docs/tutorials/install/linux/installer/index.md index 8bdc70d54..ebae61033 100644 --- a/docs/tutorials/install/linux/installer/index.md +++ b/docs/tutorials/install/linux/installer/index.md @@ -3,29 +3,26 @@ ::::{grid} 2 3 3 3 :gutter: 1 -:::{grid-item-card} Install -:link: install -:link-type: doc +:::{grid-item-card} +**[Installing ROCm](./install.md)** -How to install ROCm? +Installation instructions. ::: -:::{grid-item-card} Upgrade -:link: upgrade -:link-type: doc +:::{grid-item-card} +**[Upgrading ROCm](./upgrade.md)** Instructions for upgrading an existing ROCm installation. ::: -:::{grid-item-card} Uninstall -:link: uninstall -:link-type: doc +:::{grid-item-card} +**[Uninstalling ROCm](./uninstall.md)** -Steps for removing ROCm packages, libraries and tools. +Instructions for removing ROCm packages, libraries and tools. 
::: :::: ## See Also -- {doc}`../../../../about/release/linux_support` +[Linux support](../../../../about/compatibility/linux-support.md) diff --git a/docs/tutorials/install/linux/installer/install.md b/docs/tutorials/install/linux/installer/install.md index ce350f7d8..c3cf5304a 100644 --- a/docs/tutorials/install/linux/installer/install.md +++ b/docs/tutorials/install/linux/installer/install.md @@ -159,13 +159,13 @@ hiplibsdk (for application developers requiring HIP on AMD products) To install use cases specific to your requirements, use the installer `amdgpu-install` as follows: -- To install a single use case add it with the `--usecase` option: +* To install a single use case add it with the `--usecase` option: ```shell sudo amdgpu-install --usecase=rocm ``` -- For multiple use cases separate them with commas: +* For multiple use cases separate them with commas: ```shell sudo amdgpu-install --usecase=hiplibsdk,rocm diff --git a/docs/tutorials/install/linux/os-native/index.md b/docs/tutorials/install/linux/os-native/index.md index 8a992a409..49b8fc302 100644 --- a/docs/tutorials/install/linux/os-native/index.md +++ b/docs/tutorials/install/linux/os-native/index.md @@ -1,32 +1,28 @@ # Installation via Package manager -::::{grid} 2 3 3 3 +::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Install -:link: install -:link-type: doc +:::{grid-item-card} +**[Installing ROCm](./install.md)** -How to install ROCm? +Installation instructions. ::: -:::{grid-item-card} Upgrade -:link: upgrade -:link-type: doc +:::{grid-item-card} +**[Upgrading ROCm](./upgrade.md)** Instructions for upgrading an existing ROCm installation. ::: -:::{grid-item-card} Uninstall -:link: uninstall -:link-type: doc +:::{grid-item-card} +**[Uninstalling ROCm](./uninstall.md)** -Steps for removing ROCm packages libraries and tools. +Instructions for removing ROCm packages, libraries and tools. 
::: -:::{grid-item-card} Package Manager Integration -:link: package_manager_integration -:link-type: doc +:::{grid-item-card} +**[Package Manager Integration](./package-manager-integration.md)** Information about packages. ::: @@ -35,4 +31,4 @@ Information about packages. ## See Also -- {doc}`../../../../about/release/linux_support` +[Linux support](../../../../about/compatibility/linux-support.md) diff --git a/docs/tutorials/install/linux/os-native/install.md b/docs/tutorials/install/linux/os-native/install.md index 297e94d26..0e8faaf99 100644 --- a/docs/tutorials/install/linux/os-native/install.md +++ b/docs/tutorials/install/linux/os-native/install.md @@ -133,13 +133,13 @@ single/multi-version installations are, refer to {ref}`installation-types`. For a comprehensive list of meta-packages, refer to {ref}`meta-package-desc`. -- Sample Single-version installation +* Sample Single-version installation ```shell sudo apt install rocm-hip-sdk ``` -- Sample Multi-version installation +* Sample Multi-version installation ```shell sudo apt install rocm-hip-sdk5.6.1 rocm-hip-sdk5.5.3 @@ -333,13 +333,13 @@ single/multi-version installations are, refer to {ref}`installation-types`. For a comprehensive list of meta-packages, refer to {ref}`meta-package-desc`. -- Sample Single-version installation +* Sample Single-version installation ```shell sudo yum install rocm-hip-sdk ``` -- Sample Multi-version installation +* Sample Multi-version installation ```shell sudo yum install rocm-hip-sdk5.6.1 rocm-hip-sdk5.5.3 @@ -436,13 +436,13 @@ single/multi-version installations are, refer to {ref}`installation-types`. For a comprehensive list of meta-packages, refer to {ref}`meta-package-desc`. 
-- Sample Single-version installation +* Sample Single-version installation ```shell sudo zypper --gpg-auto-import-keys install rocm-hip-sdk ``` -- Sample Multi-version installation +* Sample Multi-version installation ```shell sudo zypper --gpg-auto-import-keys install rocm-hip-sdk5.6.1 rocm-hip-sdk5.5.3 diff --git a/docs/tutorials/install/linux/os-native/package_manager_integration.md b/docs/tutorials/install/linux/os-native/package-manager-integration.md similarity index 87% rename from docs/tutorials/install/linux/os-native/package_manager_integration.md rename to docs/tutorials/install/linux/os-native/package-manager-integration.md index 9a0e32df4..948e1ce94 100644 --- a/docs/tutorials/install/linux/os-native/package_manager_integration.md +++ b/docs/tutorials/install/linux/os-native/package-manager-integration.md @@ -3,9 +3,9 @@ This section provides information about the required meta-packages for the following AMD ROCm programming models: -- Heterogeneous-Computing Interface for Portability (HIP) -- OpenCL™ -- OpenMP™ +* Heterogeneous-Computing Interface for Portability (HIP) +* OpenCL™ +* OpenMP™ ## ROCm Package Naming Conventions @@ -16,8 +16,8 @@ support a specific use case. All meta-packages exist in both versioned and non-versioned forms. -- Non-versioned packages – For a single-version installation of the ROCm stack -- Versioned packages – For multi-version installations of the ROCm stack +* Non-versioned packages – For a single-version installation of the ROCm stack +* Versioned packages – For multi-version installations of the ROCm stack ```{figure} ../../../../data/tutorials/install/linux/linux002.png :name: package-naming @@ -59,9 +59,9 @@ of required packages and libraries. **Example:** -- `rocm-hip-runtime` is used to deploy on supported machines to execute HIP +* `rocm-hip-runtime` is used to deploy on supported machines to execute HIP applications. 
-- `rocm-hip-sdk` contains runtime components to deploy and execute HIP +* `rocm-hip-sdk` contains runtime components to deploy and execute HIP applications. ```{figure} ../../../../data/tutorials/install/linux/linux003.png @@ -87,8 +87,8 @@ clang compiler files. | `rocm-hip-libraries` | HIP libraries optimized for the AMD platform | | `rocm-hip-sdk` | Develop or port HIP applications and libraries for the AMD platform | | `rocm-developer-tools` | Debug and profile HIP applications | -| `rocm-ml-sdk` | Develop and run Machine Learning applications with optimized for AMD | -| `rocm-ml-libraries` | Key Machine Learning libraries, specifically MIOpen | +| `rocm-ml-sdk` | Develop and run machine-learning applications with optimized for AMD | +| `rocm-ml-libraries` | Key machine-learning libraries, specifically MIOpen | | `rocm-openmp-sdk` | Develop OpenMP-based applications for the AMD platform | | `rocm-openmp-runtime` | Run OpenMP-based applications for the AMD platform | ``` @@ -105,9 +105,9 @@ ROCm programming model. Associated Packages ``` -- Meta-packages can include another meta-package. -- `rocm-core` package is common across all the meta-packages. -- Meta-packages and associated packages are represented in the same color. +* Meta-packages can include another meta-package. +* `rocm-core` package is common across all the meta-packages. +* Meta-packages and associated packages are represented in the same color. ```{note} The preceding image is for informational purposes only, as the individual diff --git a/docs/tutorials/install/linux/prerequisites.md b/docs/tutorials/install/linux/prerequisites.md index 606f7c7ee..e33a37d90 100644 --- a/docs/tutorials/install/linux/prerequisites.md +++ b/docs/tutorials/install/linux/prerequisites.md @@ -24,7 +24,7 @@ Verify the Linux distribution using the following steps: uname -m && cat /etc/*release ``` -2. Confirm that the obtained Linux distribution information matches with those listed in {ref}`linux_support`. +2. 
Confirm that the obtained Linux distribution information matches with those listed in {ref}`linux-support`. **Example:** Running the command above on an Ubuntu system results in the following output: @@ -57,7 +57,7 @@ Verify the kernel version using the following steps: ``` 2. Confirm that the obtained kernel version information matches with system - requirements as listed in {ref}`linux_support`. + requirements as listed in {ref}`linux-support`. ## Additional package repositories diff --git a/docs/tutorials/install/magma_install.md b/docs/tutorials/install/magma-install.md similarity index 98% rename from docs/tutorials/install/magma_install.md rename to docs/tutorials/install/magma-install.md index b088b431d..e29de9ca6 100644 --- a/docs/tutorials/install/magma_install.md +++ b/docs/tutorials/install/magma-install.md @@ -13,7 +13,7 @@ more information, refer to ### Using MAGMA for PyTorch -Tensor is fundamental to Deep Learning techniques because it provides extensive +Tensor is fundamental to deep-learning techniques because it provides extensive representational functionalities and math operations. This data structure is represented as a multidimensional matrix. MAGMA accelerates tensor operations with a variety of solutions including driver routines, computational routines, diff --git a/docs/tutorials/install/pytorch_install.md b/docs/tutorials/install/pytorch-install.md similarity index 82% rename from docs/tutorials/install/pytorch_install.md rename to docs/tutorials/install/pytorch-install.md index ff505b39e..27a98a8bb 100644 --- a/docs/tutorials/install/pytorch_install.md +++ b/docs/tutorials/install/pytorch-install.md @@ -6,17 +6,17 @@ PyTorch is an open-source machine learning Python library, primarily differentiated by Tensor computing with GPU acceleration and a type-based automatic differentiation. 
Other advanced features include: -- Support for distributed training -- Native ONNX support -- C++ front-end -- The ability to deploy at scale using TorchServe -- A production-ready deployment mechanism through TorchScript +* Support for distributed training +* Native ONNX support +* C++ front-end +* The ability to deploy at scale using TorchServe +* A production-ready deployment mechanism through TorchScript ### Installing PyTorch To install ROCm on bare metal, refer to the sections -[GPU and OS Support (Linux)](../../about/release/linux_support) and -[Compatibility](../../about/compatibility/index) for hardware, software and +[GPU and OS Support (Linux)](../../about/compatibility/linux-support.md) and +[Compatibility](../../about/compatibility/index.md) for hardware, software and 3rd-party framework compatibility between ROCm and PyTorch. The recommended option to get a PyTorch environment is through Docker. However, installing the PyTorch wheels package on bare metal is also supported. @@ -53,7 +53,7 @@ Follow these steps: onto the container. ::: -(install_pytorch_using_wheels)= +(install-pytorch-using-wheels)= #### Option 2: Install PyTorch Using Wheels Package @@ -62,7 +62,7 @@ access this feature, refer to [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/) and choose the "ROCm" compute platform. The following image is a matrix from that illustrates the installation compatibility between ROCm and the PyTorch build. -```{figure} ../../data/tutorials/install/magma_install/magma006.png +```{figure} ../../data/tutorials/install/magma-install/magma006.png :name: installation-matrix-pytorch :align: center @@ -121,10 +121,10 @@ To install PyTorch using the wheels package, follow these installation steps: A prebuilt base Docker image is used to build PyTorch in this option. 
The base Docker has all dependencies installed, including: -- ROCm -- Torchvision -- Conda packages -- Compiler toolchain +* ROCm +* Torchvision +* Conda packages +* Compiler toolchain Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, and the build scripts utilize that to determine the build environment configuration. @@ -162,11 +162,11 @@ Follow these steps: :::{note} By default in the `rocm/pytorch:latest-base`, PyTorch builds for these architectures simultaneously: - - gfx900 - - gfx906 - - gfx908 - - gfx90a - - gfx1030 + * gfx900 + * gfx906 + * gfx908 + * gfx90a + * gfx1030 ::: 5. To determine your AMD uarch, run: @@ -206,10 +206,10 @@ Docker image using scripts from the PyTorch repository. This will utilize a standard Docker image from operating system maintainers and install all the dependencies required to build PyTorch, including -- ROCm -- Torchvision -- Conda packages -- Compiler toolchain +* ROCm +* Torchvision +* Conda packages +* Compiler toolchain Follow these steps: @@ -257,11 +257,11 @@ Follow these steps: :::{note} By default in the `rocm/pytorch:latest-base`, PyTorch builds for these architectures simultaneously: - - gfx900 - - gfx906 - - gfx908 - - gfx90a - - gfx1030 + * gfx900 + * gfx906 + * gfx908 + * gfx90a + * gfx1030 ::: 6. To determine your AMD uarch, run: @@ -418,31 +418,3 @@ After installing ROCm PyTorch wheels: 1. [Optional] `export GFX_ARCH=gfx90a` 2. [Optional] `export ROCM_VERSION=5.5` 3. `./install_kdb_files_for_pytorch_wheels.sh` - -## References - -C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," CoRR, p. abs/1512.00567, 2015 - -PyTorch, \[Online\]. Available: [https://pytorch.org/vision/stable/index.html](https://pytorch.org/vision/stable/index.html) - -PyTorch, \[Online\]. Available: [https://pytorch.org/hub/pytorch_vision_inception_v3/](https://pytorch.org/hub/pytorch_vision_inception_v3/) - -Stanford, \[Online\]. 
Available: [http://cs231n.stanford.edu/](http://cs231n.stanford.edu/) - -Wikipedia, \[Online\]. Available: [https://en.wikipedia.org/wiki/Cross_entropy](https://en.wikipedia.org/wiki/Cross_entropy) - -AMD, "ROCm issues," \[Online\]. Available: [https://github.com/RadeonOpenCompute/ROCm/issues](https://github.com/RadeonOpenCompute/ROCm/issues) - -PyTorch, \[Online image\]. [https://pytorch.org/assets/brand-guidelines/PyTorch-Brand-Guidelines.pdf](https://pytorch.org/assets/brand-guidelines/PyTorch-Brand-Guidelines.pdf) - -TensorFlow, \[Online image\]. [https://www.tensorflow.org/extras/tensorflow_brand_guidelines.pdf](https://www.tensorflow.org/extras/tensorflow_brand_guidelines.pdf) - -MAGMA, \[Online image\]. [https://bitbucket.org/icl/magma/src/master/docs/](https://bitbucket.org/icl/magma/src/master/docs/) - -Advanced Micro Devices, Inc., \[Online\]. Available: [https://rocmsoftwareplatform.github.io/AMDMIGraphX/doc/html/](https://rocmsoftwareplatform.github.io/AMDMIGraphX/doc/html/) - -Advanced Micro Devices, Inc., \[Online\]. Available: [https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/wiki](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/wiki) - -Docker, \[Online\]. [https://docs.docker.com/get-started/overview/](https://docs.docker.com/get-started/overview/) - -Torchvision, \[Online\]. 
Available [https://pytorch.org/vision/master/index.html?highlight=torchvision#module-torchvision](https://pytorch.org/vision/master/index.html?highlight=torchvision#module-torchvision) diff --git a/docs/tutorials/install/tensorflow_install.md b/docs/tutorials/install/tensorflow-install.md similarity index 97% rename from docs/tutorials/install/tensorflow_install.md rename to docs/tutorials/install/tensorflow-install.md index b30c46f2a..60fe8e2ac 100644 --- a/docs/tutorials/install/tensorflow_install.md +++ b/docs/tutorials/install/tensorflow-install.md @@ -2,8 +2,8 @@ ## TensorFlow -TensorFlow is an open-source library for solving Machine Learning, -Deep Learning, and Artificial Intelligence problems. It can be used to solve +TensorFlow is an open-source library for solving machine-learning, +deep-learning, and artificial-intelligence problems. It can be used to solve many problems across different sectors and industries but primarily focuses on training and inference in neural networks. It is one of the most popular and in-demand frameworks and is very active in open source contribution and @@ -56,10 +56,10 @@ To install TensorFlow using the wheels package, follow these steps: :::{note} The supported Python versions are: - - 3.7 - - 3.8 - - 3.9 - - 3.10 + * 3.7 + * 3.8 + * 3.9 + * 3.10 ::: ```bash diff --git a/docs/tutorials/install/windows/cli/index.md b/docs/tutorials/install/windows/cli/index.md index 27aab41b1..e58765f88 100644 --- a/docs/tutorials/install/windows/cli/index.md +++ b/docs/tutorials/install/windows/cli/index.md @@ -3,29 +3,26 @@ ::::{grid} 2 3 3 3 :gutter: 1 -:::{grid-item-card} Install -:link: install -:link-type: doc +:::{grid-item-card} +**[Installing ROCm](./install.md)** -How to install ROCm? +Installation instructions. ::: -:::{grid-item-card} Upgrade -:link: upgrade -:link-type: doc +:::{grid-item-card} +**[Upgrading ROCm](./upgrade.md)** Instructions for upgrading an existing ROCm installation. 
::: -:::{grid-item-card} Uninstall -:link: uninstall -:link-type: doc +:::{grid-item-card} +**[Uninstalling ROCm](./uninstall.md)** -Steps for removing ROCm packages and libraries. +Instructions for removing ROCm packages, libraries and tools. ::: :::: ## See Also -- {doc}`../../../../about/release/windows_support` +[Windows Support](../../../../about/compatibility/windows-support.md) diff --git a/docs/tutorials/install/windows/gui/index.md b/docs/tutorials/install/windows/gui/index.md index da0c73dfa..e6a3b1124 100644 --- a/docs/tutorials/install/windows/gui/index.md +++ b/docs/tutorials/install/windows/gui/index.md @@ -3,29 +3,26 @@ ::::{grid} 2 3 3 3 :gutter: 1 -:::{grid-item-card} Install -:link: install -:link-type: doc +:::{grid-item-card} +**[Installing ROCm](./install.md)** -How to install ROCm? +Installation instructions. ::: -:::{grid-item-card} Upgrade -:link: upgrade -:link-type: doc +:::{grid-item-card} +**[Upgrading ROCm](./upgrade.md)** Instructions for upgrading an existing ROCm installation. ::: -:::{grid-item-card} Uninstall -:link: uninstall -:link-type: doc +:::{grid-item-card} +**[Uninstalling ROCm](./uninstall.md)** -Steps for removing ROCm packages and libraries. +Instructions for removing ROCm packages, libraries and tools. ::: :::: ## See Also -- {doc}`../../../../about/release/windows_support` +[Windows Support](../../../../about/compatibility/windows-support.md) diff --git a/docs/tutorials/install/windows/index.md b/docs/tutorials/install/windows/index.md index 49c3de9b3..12a987e5c 100644 --- a/docs/tutorials/install/windows/index.md +++ b/docs/tutorials/install/windows/index.md @@ -1,6 +1,6 @@ # Install ROCm (HIP SDK) on Windows -Start with {doc}`../../quick_start/windows` or follow the detailed +Start with {doc}`../../quick-start/windows` or follow the detailed instructions below. ## Prepare to Install @@ -8,9 +8,8 @@ instructions below. 
::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Prerequisites -:link: prerequisites -:link-type: doc +:::{grid-item-card} +**[Prerequisites](./prerequisites.md)** The prerequisites page lists the required steps to verify that the system supports ROCm. @@ -23,16 +22,14 @@ supports ROCm. ::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} Graphical Installation -:link: gui/index -:link-type: doc +:::{grid-item-card} +**[Graphical Installation](./gui/index.md)** Use the graphical front-end of the installer. ::: -:::{grid-item-card} Command Line Installation -:link: cli/index -:link-type: doc +:::{grid-item-card} +**[Command Line Installation](./cli/index.md)** Use the command line front-end of the installer. ::: @@ -45,15 +42,13 @@ Use the command line front-end of the installer. :gutter: 1 :::{grid-item-card} ROCm-Examples -:link: https://github.com/amd/rocm-examples -:link-type: url +**[ROCm-Examples](https://github.com/amd/rocm-examples)** Learn how to use ROCm with descriptive examples for novice to intermediate users. ::: -:::{grid-item-card} Windows App Deployment Guidelines -:link: ../../../conceptual/windows-app-deployment-guidelines -:link-type: doc +:::{grid-item-card} +**[Windows App Deployment Guidelines](../../../conceptual/windows-app-deployment-guidelines.md)** Discusses strategies on how to bundle HIP libraries with an end user application. ::: @@ -62,4 +57,4 @@ Discusses strategies on how to bundle HIP libraries with an end user application ## See Also -- {doc}`../../../about/release/windows_support` +[Windows Support](../../../about/compatibility/windows-support.md) diff --git a/docs/tutorials/install/windows/prerequisites.md b/docs/tutorials/install/windows/prerequisites.md index 3140668de..745e8fd4a 100644 --- a/docs/tutorials/install/windows/prerequisites.md +++ b/docs/tutorials/install/windows/prerequisites.md @@ -25,7 +25,7 @@ Verify the Windows Edition using the following steps: ``` 2. 
Confirm that the obtained information matches with those listed in - {ref}`supported_skus`. + {ref}`windows-support`. **Example:** Running the command above on a Windows system may result in the following output: @@ -71,4 +71,4 @@ Verify the Windows Edition using the following steps: ``` 3. Confirm that the obtained information matches with those listed in - {ref}`supported_skus`. + {ref}`windows-support`. diff --git a/docs/tutorials/quick_start/index.md b/docs/tutorials/quick-start/index.md similarity index 68% rename from docs/tutorials/quick_start/index.md rename to docs/tutorials/quick-start/index.md index 9cd20a3ec..cbc8c9a32 100644 --- a/docs/tutorials/quick_start/index.md +++ b/docs/tutorials/quick-start/index.md @@ -6,11 +6,12 @@ with troubleshooting. For more in-depth installation guides, see [Installing ROC :::::{grid} 1 1 2 2 :gutter: 1 -:::{grid-item-card} [Linux quick-start guide](linux.md) +:::{grid-item-card} +**[Linux quick-start guide](linux.md)** ::: -:::{grid-item-card} [Windows quick-start guide](windows.md) -:link-type: url +:::{grid-item-card} +**[Windows quick-start guide](windows.md)** ::: ::::: diff --git a/docs/tutorials/quick_start/linux.md b/docs/tutorials/quick-start/linux.md similarity index 96% rename from docs/tutorials/quick_start/linux.md rename to docs/tutorials/quick-start/linux.md index bf0878a0d..64589c35e 100644 --- a/docs/tutorials/quick_start/linux.md +++ b/docs/tutorials/quick-start/linux.md @@ -1,4 +1,7 @@ -# Quick Start (Linux) +# Linux quick-start installation guide + +For a quick summary on installing ROCm on Linux, follow the steps listed on this page. If you +want a more in-depth installation guide, see [Installing ROCm on Linux](../install/linux/index.md). 
## Add Repositories diff --git a/docs/tutorials/quick_start/windows.md b/docs/tutorials/quick-start/windows.md similarity index 96% rename from docs/tutorials/quick_start/windows.md rename to docs/tutorials/quick-start/windows.md index 8c6b87ec7..24e859b6a 100644 --- a/docs/tutorials/quick_start/windows.md +++ b/docs/tutorials/quick-start/windows.md @@ -1,6 +1,8 @@ -# Quick Start (Windows) +# Windows quick-start installation guide -The steps to install the HIP SDK for Windows are described in this document. +For a quick summary on installing ROCm (HIP SDK) on Windows, follow the steps listed on this page. If +you want a more in-depth installation guide, see +[Installing ROCm on Windows](../install/windows/index.md). ## System Requirements diff --git a/docs/under_construction.md b/docs/under_construction.md deleted file mode 100644 index 6131bf6ee..000000000 --- a/docs/under_construction.md +++ /dev/null @@ -1,5 +0,0 @@ -# Under Construction - -ROCm beta documentation is not complete yet. - -Please check back again later. diff --git a/docs/what_is_rocm.md b/docs/what-is-rocm.md similarity index 100% rename from docs/what_is_rocm.md rename to docs/what-is-rocm.md diff --git a/docs/whats_new/rocm_on_windows.md b/docs/whats_new/rocm_on_windows.md deleted file mode 100644 index ab251307c..000000000 --- a/docs/whats_new/rocm_on_windows.md +++ /dev/null @@ -1,89 +0,0 @@ -# ROCm on Windows - -Starting with ROCm 5.5, the HIP SDK brings a subset of ROCm to developers on Windows. -The collection of features enabled on Windows is referred to as the HIP SDK. -These features allow developers to use the HIP runtime, HIP math libraries -and HIP Primitive libraries. The following table shows the differences -between Windows and Linux releases. 
- -|Component|Linux|Windows| -|---------|-----|-------| -|Driver|Radeon Software for Linux |AMD Software Pro Edition| -|Compiler|`hipcc`/`amdclang++`|`hipcc`/`clang++`| -|Debugger|`rocgdb`|no debugger available| -|Profiler|`rocprof`|[Radeon GPU Profiler](https://gpuopen.com/rgp/)| -|Porting Tools|HIPIFY|Coming Soon| -|Runtime|HIP (Open Sourced)|HIP (closed source)| -|Math Libraries|Supported|Supported| -|Primitives Libraries|Supported|Supported| -|Communication Libraries|Supported|Not Available| -|AI Libraries|MIOpen, MIGraphX|Not Available| -|System Management|`rocm-smi-lib`, RDC, `rocminfo`|`amdsmi`, `hipInfo`| -|AI Frameworks|PyTorch, TensorFlow, etc.|Not Available| -|CMake HIP Language|Enabled|Unsupported| -|Visual Studio| Not applicable| Plugin Available| -|HIP Ray Tracing| Supported|Supported| - -AMD is continuing to invest in Windows support and AMD plans to release enhanced -features in subsequent revisions. - -```{note} -The 5.5 Windows Installer collectively groups the Math and Primitives -libraries. -``` - -```{note} -GPU support on Windows and Linux may differ. You must refer to -Windows and Linux GPU support tables separately. -``` - -```{note} -HIP Ray Tracing is not distributed via ROCm in Linux. -``` - -## ROCm release versioning - -Linux OS releases set the canonical version numbers for ROCm. Windows will -follow Linux version numbers as Windows releases are based on Linux ROCm -releases. However, not all Linux ROCm releases will have a corresponding Windows -release. The following table shows the ROCm releases on Windows and Linux. Releases -with both Windows and Linux are referred to as a joint release. Releases with -only Linux support are referred to as a skipped release from the Windows -perspective. - -|Release version|Linux|Windows| -|---------------|-----|-------| -|5.5|✅|✅| -|5.6|✅|❌| - -ROCm Linux releases are versioned with following the Major.Minor.Patch -version number system. Windows releases will only be versioned with Major.Minor. 
- -In general, Windows releases will trail Linux releases. Software developers that -wish to support both Linux and Windows using a single ROCm version should -refrain from upgrading ROCm unless there is a joint release. - -## Windows Documentation implications - -The ROCm documentation website contains both Windows and Linux documentation. -Just below each article title, a convenient article information section states -whether the page applies to Linux only, Windows only or both OSes. To find the -exact Windows documentation for a release of the HIP SDK, please view the ROCm documentation with the same -Major.Minor version number while ignoring the Patch version. The Patch version -only matters for Linux releases. For convenience, -Windows documentation will continue to be included in the overall ROCm -documentation for the skipped Windows releases. - -Windows release notes will contain only information pertinent to Windows. -The software developer must read all the previous ROCm release notes (including) -skipped ROCm versions on Windows for information on all the changes present in -the Windows release. - -## Windows Builds from Source - -Not all source code required to build Windows from source is available under a -permissive open source license. Build instructions on Windows is only provided -for projects that can be built from source on Windows using a toolchain that -has closed source build prerequisites. The ROCm manifest file is not valid for -Windows. AMD does not release a manifest or tag our components in Windows. -Users may use corresponding Linux tags to build on Windows. 
diff --git a/tools/autotag/templates/rocm_changes/5.0.0.md b/tools/autotag/templates/rocm_changes/5.0.0.md index 7d6298bae..8d9ebaf85 100644 --- a/tools/autotag/templates/rocm_changes/5.0.0.md +++ b/tools/autotag/templates/rocm_changes/5.0.0.md @@ -272,13 +272,10 @@ As a workaround, users can run amdgpu.runpm=0, which temporarily disables the ru Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel: -- Dispatch - -- Start - -- End - -- Complete +* Dispatch +* Start +* End +* Complete ##### Issue @@ -317,7 +314,7 @@ Users are recommended to collect kernel execution timestamps without monitoring 3. Check the output result file from step 1. 4. The order of timestamps correctly displays as: - DispathNS < BeginNS < EndNS < CompleteNS + DispatchNS < BeginNS < EndNS < CompleteNS 5. Users can find the values of the collected counters in the output file generated in step 2. @@ -331,14 +328,14 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV #### ROCm Libraries Changes – Deprecations and Deprecation Removal -- The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too. +* The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too. -- The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class instead. +* The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class instead. -- The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. Signature diff for rocsparse_spmm +* The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. 
Signature diff for rocsparse_spmm rocsparse_spmm in 5.0 - ```h + ```cpp rocsparse_status rocsparse_spmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, @@ -356,7 +353,7 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV rocSPARSE_spmm in 4.0 - ```h + ```cpp rocsparse_status rocsparse_spmm(rocsparse_handle handle, rocsparse_operation trans_A, rocsparse_operation trans_B, @@ -377,9 +374,9 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV In this release, arithmetic operators of HIP complex and vector types are deprecated. -- As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of `std::complex` types. +* As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of `std::complex` types. -- As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types. +* As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types. During the deprecation, two macros `_HIP_ENABLE_COMPLEX_OPERATORS` and `_HIP_ENABLE_VECTOR_OPERATORS` are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types. diff --git a/tools/autotag/templates/rocm_changes/5.0.2.md b/tools/autotag/templates/rocm_changes/5.0.2.md index 8fe26acf7..2aa2e19ea 100644 --- a/tools/autotag/templates/rocm_changes/5.0.2.md +++ b/tools/autotag/templates/rocm_changes/5.0.2.md @@ -13,6 +13,6 @@ The resolution includes a compiler change, which emits the required metadata by Note: This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release. 
-Compatibility Matrix Updates to ROCm Deep Learning Guide +Compatibility Matrix Updates to the [Deep-learning guide](../../../../docs/how-to/deep-learning-rocm.md) -The compatibility matrix in the AMD Deep Learning Guide is updated for ROCm v5.0.2. +The compatibility matrix in the [Deep-learning guide](../../../../docs/how-to/deep-learning-rocm.md) is updated for ROCm v5.0.2. diff --git a/tools/autotag/templates/rocm_changes/5.1.0.md b/tools/autotag/templates/rocm_changes/5.1.0.md index 5738816f9..6fdcab47e 100644 --- a/tools/autotag/templates/rocm_changes/5.1.0.md +++ b/tools/autotag/templates/rocm_changes/5.1.0.md @@ -44,21 +44,21 @@ This enhancement enables ROCDebugger users to interact with the HIP source-level ROCDebugger Machine Interface (MI) extends support to lanes. The following enhancements are made: -- Added a new -lane-info command, listing the current thread's lanes. +* Added a new -lane-info command, listing the current thread's lanes. -- The -thread-select command now supports a lane switch to switch to a specific lane of a thread: +* The -thread-select command now supports a lane switch to switch to a specific lane of a thread: ```sh -thread-select -l LANE THREAD ``` -- The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected. +* The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected. -- The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop. +* The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop. -- MI commands now accept a global --lane option, similar to the global --thread and --frame options. 
+* MI commands now accept a global --lane option, similar to the global --thread and --frame options. -- MI varobjs are now lane-aware. +* MI varobjs are now lane-aware. For more information, refer to the ROC Debugger User Guide at {doc}`ROCgdb `. @@ -71,17 +71,17 @@ The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECI This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below: -- MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498) +* MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498) -- Fixed a correctness issue with ImplicitGemm algorithm +* Fixed a correctness issue with ImplicitGemm algorithm -- Updated the performance data for new kernel versions +* Updated the performance data for new kernel versions -- Improved MIOpen build time by splitting large kernel header files +* Improved MIOpen build time by splitting large kernel header files -- Fixed an issue in reduction kernels for padded tensors +* Fixed an issue in reduction kernels for padded tensors -- Various other bug fixes and performance improvements +* Various other bug fixes and performance improvements For more information, see {doc}`Documentation `. @@ -93,17 +93,12 @@ CRIU is a userspace tool to Checkpoint and Restore an application. CRIU lacked the support for checkpoint restore applications that used device files such as a GPU. 
With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes: -- Single and Multi GPU systems (Gfx9) - -- Checkpoint / Restore on a different system - -- Checkpoint / Restore inside a docker container - -- PyTorch - -- Tensorflow - -- Using CRIU Image Streamer +* Single and Multi GPU systems (Gfx9) +* Checkpoint / Restore on a different system +* Checkpoint / Restore inside a docker container +* PyTorch +* Tensorflow +* Using CRIU Image Streamer For more information, refer to @@ -117,9 +112,9 @@ For more information, refer to +* -- +* ### Fixed Defects diff --git a/tools/autotag/templates/rocm_changes/5.2.0.md b/tools/autotag/templates/rocm_changes/5.2.0.md index 838c9961e..4e57e2dc6 100644 --- a/tools/autotag/templates/rocm_changes/5.2.0.md +++ b/tools/autotag/templates/rocm_changes/5.2.0.md @@ -28,9 +28,9 @@ The following new HIP APIs are available in the ROCm v5.2 release. Note that thi The new device management HIP APIs are as follows: -- Gets a UUID for the device. This API returns a UUID for the device. +* Gets a UUID for the device. This API returns a UUID for the device. 
- ```h + ```cpp hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device); ``` @@ -38,25 +38,25 @@ The new device management HIP APIs are as follows: > > This new API corresponds to the following CUDA API: > - > ```h + > ```cpp > CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev); > ``` -- Gets default memory pool of the specified device +* Gets default memory pool of the specified device - ```h + ```cpp hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device); ``` -- Sets the current memory pool of a device +* Sets the current memory pool of a device - ```h + ```cpp hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool); ``` -- Gets the current memory pool for the specified device +* Gets the current memory pool for the specified device - ```h + ```cpp hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device); ``` @@ -64,69 +64,69 @@ The new device management HIP APIs are as follows: The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are as follows: -- Allocates memory with stream ordered semantics +* Allocates memory with stream ordered semantics - ```h + ```cpp hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream); ``` -- Frees memory with stream ordered semantics +* Frees memory with stream ordered semantics - ```h + ```cpp hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream); ``` -- Releases freed memory back to the OS +* Releases freed memory back to the OS - ```h + ```cpp hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold); ``` -- Sets attributes of a memory pool +* Sets attributes of a memory pool - ```h + ```cpp hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value); ``` -- Gets attributes of a memory pool +* Gets attributes of a memory pool - ```h + ```cpp hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value); ``` -- Controls visibility of the 
specified pool between devices +* Controls visibility of the specified pool between devices - ```h + ```cpp hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count); ``` -- Returns the accessibility of a pool from a device +* Returns the accessibility of a pool from a device - ```h + ```cpp hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location); ``` -- Creates a memory pool +* Creates a memory pool - ```h + ```cpp hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props); ``` -- Destroys the specified memory pool +* Destroys the specified memory pool - ```h + ```cpp hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool); ``` -- Allocates memory from a specified pool with stream ordered semantics +* Allocates memory from a specified pool with stream ordered semantics - ```h + ```cpp hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream); ``` -- Exports a memory pool to the requested handle type +* Exports a memory pool to the requested handle type - ```h + ```cpp hipError_t hipMemPoolExportToShareableHandle( void* shared_handle, hipMemPool_t mem_pool, @@ -134,9 +134,9 @@ The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory unsigned int flags); ``` -- Imports a memory pool from a shared handle +* Imports a memory pool from a shared handle - ```h + ```cpp hipError_t hipMemPoolImportFromShareableHandle( hipMemPool_t* mem_pool, void* shared_handle, @@ -144,9 +144,9 @@ The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory unsigned int flags); ``` -- Exports data to share a memory pool allocation between processes +* Exports data to share a memory pool allocation between processes - ```h + ```cpp hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr); Import a memory pool allocation from another process.t hipError_t 
hipMemPoolImportPointer( @@ -159,27 +159,27 @@ The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory The new HIP Graph Management APIs are as follows: -- Enqueues a host function call in a stream +* Enqueues a host function call in a stream - ```h + ```cpp hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData); ``` -- Swaps the stream capture mode of a thread +* Swaps the stream capture mode of a thread - ```h + ```cpp hipError_t hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode); ``` -- Sets a node attribute +* Sets a node attribute - ```h + ```cpp hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value); ``` -- Gets a node attribute +* Gets a node attribute - ```h + ```cpp hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value); ``` @@ -187,87 +187,87 @@ The new HIP Graph Management APIs are as follows: The new APIs for virtual memory management are as follows: -- Frees an address range reservation made via hipMemAddressReserve +* Frees an address range reservation made via hipMemAddressReserve - ```h + ```cpp hipError_t hipMemAddressFree(void* devPtr, size_t size); ``` -- Reserves an address range +* Reserves an address range - ```h + ```cpp hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags); ``` -- Creates a memory allocation described by the properties and size +* Creates a memory allocation described by the properties and size - ```h + ```cpp hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags); ``` -- Exports an allocation to a requested shareable handle type +* Exports an allocation to a requested shareable handle type - ```h + ```cpp hipError_t hipMemExportToShareableHandle(void* shareableHandle, 
hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags); ``` -- Gets the access flags set for the given location and ptr +* Gets the access flags set for the given location and ptr - ```h + ```cpp hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr); ``` -- Calculates either the minimal or recommended granularity +* Calculates either the minimal or recommended granularity - ```h + ```cpp hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option); ``` -- Retrieves the property structure of the given handle +* Retrieves the property structure of the given handle - ```h + ```cpp hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle); ``` -- Imports an allocation from a requested shareable handle type +* Imports an allocation from a requested shareable handle type - ```h + ```cpp hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType); ``` -- Maps an allocation handle to a reserved virtual address range +* Maps an allocation handle to a reserved virtual address range - ```h + ```cpp hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags); ``` -- Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays +* Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays - ```h + ```cpp hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream); ``` -- Release a memory handle representing a memory allocation, that was previously allocated through hipMemCreate +* Release a memory handle representing a memory allocation, that was previously allocated through hipMemCreate - ```h + ```cpp hipError_t 
hipMemRelease(hipMemGenericAllocationHandle_t handle); ``` -- Returns the allocation handle of the backing memory allocation given the address +* Returns the allocation handle of the backing memory allocation given the address - ```h + ```cpp hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr); ``` -- Sets the access flags for each location specified in desc for the given virtual address range +* Sets the access flags for each location specified in desc for the given virtual address range - ```h + ```cpp hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count); ``` -- Unmaps memory allocation of a given address range +* Unmaps memory allocation of a given address range - ```h + ```cpp hipError_t hipMemUnmap(void* ptr, size_t size); ``` @@ -289,7 +289,7 @@ This release introduces a new ROCm C++ library for accelerating mixed precision rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed. For more information, refer to -[Communication Libraries](../../../../docs/reference/libraries/gpu_libraries/communication.md). +[Communication Libraries](../../../../docs/reference/libraries/gpu-libraries/communication.md) #### OpenMP Enhancements in This Release @@ -364,7 +364,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. 
#include "hip/hip_runtime.h" @@ -372,10 +372,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -383,7 +383,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -396,7 +396,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -479,13 +479,13 @@ So, the function argument that is potentially undef (because it is not intialize ##### Workaround -- Skip adding `noundef` attribute to functions tagged with convergent attribute. Refer to for more information. +* Skip adding `noundef` attribute to functions tagged with convergent attribute. Refer to for more information. -- Introduce shuffle attribute and add it to `__shfl` like APIs at hip headers. Clang can skip adding noundef attribute, if it finds that argument is tagged with shuffle attribute. Refer to for more information. +* Introduce shuffle attribute and add it to `__shfl` like APIs at hip headers. Clang can skip adding noundef attribute, if it finds that argument is tagged with shuffle attribute. Refer to for more information. -- Introduce clang builtin for `__shfl` to identify it and skip adding `noundef` attribute. +* Introduce clang builtin for `__shfl` to identify it and skip adding `noundef` attribute. 
-- Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header need to insert freezes on the relevant inputs. +* Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header need to insert freezes on the relevant inputs. #### Issue with Applications Triggering Oversubscription diff --git a/tools/autotag/templates/rocm_changes/5.2.3.md b/tools/autotag/templates/rocm_changes/5.2.3.md index 6d628bd62..af3c376a4 100644 --- a/tools/autotag/templates/rocm_changes/5.2.3.md +++ b/tools/autotag/templates/rocm_changes/5.2.3.md @@ -10,9 +10,9 @@ HIP and Other Runtimes ##### Fixes -- A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel is called twice (with different argument values) in a graph capture, the implementation only kept the argument values for the second kernel call. +* A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel is called twice (with different argument values) in a graph capture, the implementation only kept the argument values for the second kernel call. -- A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the `hiprtcGetLoweredName` call to fail for named expressions with whitespace in it. +* A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the `hiprtcGetLoweredName` call to fail for named expressions with whitespace in it. 
Example: @@ -25,33 +25,33 @@ ROCm Libraries Compatibility with NCCL 2.12.10 -- Packages for test and benchmark executables on all supported OSes using CPack +* Packages for test and benchmark executables on all supported OSes using CPack -- Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1 +* Added custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1 - - Additional details provided if Binary File Descriptor library (BFD) is pre-installed. + * Additional details provided if Binary File Descriptor library (BFD) is pre-installed. -- Added experimental support for using multiple ranks per device +* Added experimental support for using multiple ranks per device - - Requires using a new interface to create communicator (ncclCommInitRankMulti), refer to the interface documentation for details. + * Requires using a new interface to create communicator (ncclCommInitRankMulti), refer to the interface documentation for details. - - To avoid potential deadlocks, user might have to set an environment variables increasing the number of hardware queues. For example, + * To avoid potential deadlocks, user might have to set an environment variables increasing the number of hardware queues. 
For example, ```sh export GPU_MAX_HW_QUEUES=16 ``` -- Added support for reusing ports in NET/IB channels +* Added support for reusing ports in NET/IB channels - - Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1 + * Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1 - - When "Call to bind failed: Address already in use" error happens in large-scale AlltoAll(for example, >=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the massive port usage issue + * When "Call to bind failed: Address already in use" error happens in large-scale AlltoAll(for example, >=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the massive port usage issue - - Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1 + * Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1 ##### Removed -- Removed experimental clique-based kernels +* Removed experimental clique-based kernels #### Development Tools diff --git a/tools/autotag/templates/rocm_changes/5.3.0.md b/tools/autotag/templates/rocm_changes/5.3.0.md index 01a688ba3..102c40ec0 100644 --- a/tools/autotag/templates/rocm_changes/5.3.0.md +++ b/tools/autotag/templates/rocm_changes/5.3.0.md @@ -65,7 +65,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. 
#include "hip/hip_runtime.h" @@ -73,10 +73,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -84,7 +84,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -97,7 +97,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -115,9 +115,9 @@ User code did not initialize certain data constructs, leading to a correctness i The compiler fix consists of the following patches: -- A new `noundef` attribute is added. This attribute denotes when a function call argument or return val may never contain uninitialized bits. +* A new `noundef` attribute is added. This attribute denotes when a function call argument or return val may never contain uninitialized bits. For more information, see -- The application of this attribute was refined such that it was not added to a specific compiler builtin where the compiler knows that inactive lanes do not impact program execution. +* The application of this attribute was refined such that it was not added to a specific compiler builtin where the compiler knows that inactive lanes do not impact program execution. For more information, see . 
@@ -151,8 +151,8 @@ Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV If IOMMU is enabled in SBIOS and ROCm is installed, the system may report the following failure or errors when running workloads such as bandwidth test, clinfo, and HelloWord.cl and cause a system crash. -- IO PAGE FAULT -- IRQ remapping does not support X2APIC mode -- NMI error +* IO PAGE FAULT +* IRQ remapping does not support X2APIC mode +* NMI error Workaround: To avoid the system crash, add `amd_iommu=on iommu=pt` as the kernel bootparam, as indicated in the warning message. diff --git a/tools/autotag/templates/rocm_changes/5.3.2.md b/tools/autotag/templates/rocm_changes/5.3.2.md index 746195214..f90ad9028 100644 --- a/tools/autotag/templates/rocm_changes/5.3.2.md +++ b/tools/autotag/templates/rocm_changes/5.3.2.md @@ -37,8 +37,8 @@ An updated firmware maintenance bundle consisting of an installation tool and im There is a known known issue with rocThrust and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases. -- thrust::merge no longer correctly supports different iterator types for `keys_input1` and `keys_input2`. +* thrust::merge no longer correctly supports different iterator types for `keys_input1` and `keys_input2`. -- rocprim::device_merge no longer correctly supports using different types for `keys_input1` and `keys_input2`. +* rocprim::device_merge no longer correctly supports using different types for `keys_input1` and `keys_input2`. This issue is currently under investigation and will be resolved in a future release. diff --git a/tools/autotag/templates/rocm_changes/5.3.3.md b/tools/autotag/templates/rocm_changes/5.3.3.md index bbf6dea03..02bbf9c8f 100644 --- a/tools/autotag/templates/rocm_changes/5.3.3.md +++ b/tools/autotag/templates/rocm_changes/5.3.3.md @@ -5,10 +5,10 @@ There was a known issue with rocTHRUST and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases. 
-- `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and `keys_input2`. -- `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and `keys_input2`. +* `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and `keys_input2`. +* `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and `keys_input2`. This issue is resolved with the following fixes to compilation failures: -- rocPRIM: in device_merge if the two key iterators do not match. -- rocTHRUST: in thrust::merge if the two key iterators do not match. +* rocPRIM: in device_merge if the two key iterators do not match. +* rocTHRUST: in thrust::merge if the two key iterators do not match. diff --git a/tools/autotag/templates/rocm_changes/5.4.0.md b/tools/autotag/templates/rocm_changes/5.4.0.md index 52768d443..4c26f4e64 100644 --- a/tools/autotag/templates/rocm_changes/5.4.0.md +++ b/tools/autotag/templates/rocm_changes/5.4.0.md @@ -10,7 +10,7 @@ The ROCm v5.4 release consists of the following HIP enhancements: A new timer function wall_clock64() is supported, which returns wall clock count at a constant frequency on the device. -```h +```cpp long long int wall_clock64(); ``` @@ -18,7 +18,7 @@ It returns wall clock count at a constant frequency on the device, which can be Example: -```h +```cpp int wallClkRate = 0; //in kilohertz +HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId)); ``` @@ -51,13 +51,13 @@ The following new HIP APIs are available in the ROCm v5.4 release. ##### Error Handling -```h +```cpp hipError_t hipDrvGetErrorName(hipError_t hipError, const char** errorString); ``` This returns HIP errors in the text string format. 
-```h +```cpp hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString); ``` @@ -79,11 +79,11 @@ In future ROCm releases, catch2 tests and samples will be removed from the HIP p This release consists of the following OpenMP enhancements: -- Enable new device RTL in libomptarget as default. -- New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`. -- Support for the collapse clause and non-unit stride in cases where the No-Loop specialized kernel is generated. -- Initial implementation of optimized cross-team sum reduction for float and double type scalars. -- Pool-based optimization in the OpenMP runtime to reduce locking during data transfer. +* Enable new device RTL in libomptarget as default. +* New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`. +* Support for the collapse clause and non-unit stride in cases where the No-Loop specialized kernel is generated. +* Initial implementation of optimized cross-team sum reduction for float and double type scalars. +* Pool-based optimization in the OpenMP runtime to reduce locking during data transfer. ### Deprecations and Warnings @@ -151,7 +151,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. 
#include "hip/hip_runtime.h" @@ -159,10 +159,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -170,7 +170,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -183,7 +183,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake diff --git a/tools/autotag/templates/rocm_changes/5.4.1.md b/tools/autotag/templates/rocm_changes/5.4.1.md index 0370d15c9..492fd6f42 100644 --- a/tools/autotag/templates/rocm_changes/5.4.1.md +++ b/tools/autotag/templates/rocm_changes/5.4.1.md @@ -13,7 +13,7 @@ The following new HIP API is introduced in the ROCm v5.4.1 release. > > This is a pre-official version (beta) release of the new APIs. -```h +```cpp hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData); ``` diff --git a/tools/autotag/templates/rocm_changes/5.4.2.md b/tools/autotag/templates/rocm_changes/5.4.2.md index 3530145ab..75b24ecfd 100644 --- a/tools/autotag/templates/rocm_changes/5.4.2.md +++ b/tools/autotag/templates/rocm_changes/5.4.2.md @@ -13,18 +13,18 @@ The `hipcc` and `hipconfig` Perl scripts are deprecated. 
In a future release, co The following hipcc options are being deprecated and will be removed in a future release: -- The `--amdgpu-target` option is being deprecated, and user must use the `–offload-arch` option to specify the GPU architecture. -- The `--amdhsa-code-object-version` option is being deprecated. Users can use the Clang/LLVM option `-mllvm -mcode-object-version` to debug issues related to code object versions. -- The `--hipcc-func-supp`/`--hipcc-no-func-supp` options are being deprecated, as the function calls are already supported in production on AMD GPUs. +* The `--amdgpu-target` option is being deprecated, and user must use the `–offload-arch` option to specify the GPU architecture. +* The `--amdhsa-code-object-version` option is being deprecated. Users can use the Clang/LLVM option `-mllvm -mcode-object-version` to debug issues related to code object versions. +* The `--hipcc-func-supp`/`--hipcc-no-func-supp` options are being deprecated, as the function calls are already supported in production on AMD GPUs. ### Known Issues Under certain circumstances typified by high register pressure, users may encounter a compiler abort with one of the following error messages: -- > `error: unhandled SGPR spill to memory` +* > `error: unhandled SGPR spill to memory` -- > `cannot scavenge register without an emergency spill slot!` +* > `cannot scavenge register without an emergency spill slot!` -- > `error: ran out of registers during register allocation` +* > `error: ran out of registers during register allocation` This is a known issue and will be fixed in a future release. 
diff --git a/tools/autotag/templates/rocm_changes/5.4.3.md b/tools/autotag/templates/rocm_changes/5.4.3.md index 4aadc2795..5f1abd5db 100644 --- a/tools/autotag/templates/rocm_changes/5.4.3.md +++ b/tools/autotag/templates/rocm_changes/5.4.3.md @@ -65,7 +65,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. #include "hip/hip_runtime.h" @@ -73,10 +73,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -84,7 +84,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. 
For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -97,7 +97,7 @@ All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/ ../../../../lib/cmake/hip/hip-config.cmake @@ -109,14 +109,14 @@ lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake In ROCm v5.4.3, improvements to the compiler address errors with the following signatures: -- "error: unhandled SGPR spill to memory" -- "cannot scavenge register without an emergency spill slot!" -- "error: ran out of registers during register allocation" +* "error: unhandled SGPR spill to memory" +* "cannot scavenge register without an emergency spill slot!" +* "error: ran out of registers during register allocation" ### Known Issues #### Compiler Option Error at Runtime -Some users may encounter a “Cannot find Symbol” error at runtime when using -save-temps. While most -save-temps use cases work correctly, this error may appear occasionally. +Some users may encounter a “Cannot find Symbol” error at runtime when using `-save-temps`. While most `-save-temps` use cases work correctly, this error may appear occasionally. -This issue is under investigation, and the known workaround is not to use -save-temps when the error appears. +This issue is under investigation, and the known workaround is not to use `-save-temps` when the error appears. diff --git a/tools/autotag/templates/rocm_changes/5.5.0.md b/tools/autotag/templates/rocm_changes/5.5.0.md index efb158db6..5bd771a78 100644 --- a/tools/autotag/templates/rocm_changes/5.5.0.md +++ b/tools/autotag/templates/rocm_changes/5.5.0.md @@ -15,40 +15,41 @@ Applications requiring to update the stack size can use hipDeviceSetLimit API. The following hipcc changes are implemented in this release: -- `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link time dependence for HIP programs.  
Applications that depend on these libraries must explicitly link to them. -- `-use-staticlib` and `-use-sharedlib` options are deprecated. +* `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link time dependence for HIP programs.  Applications that depend on these libraries must explicitly link to them. +* `-use-staticlib` and `-use-sharedlib` options are deprecated. ##### Future Changes -- Separation of `hipcc` binaries (Perl scripts) from HIP to `hipcc` project. Users will access separate `hipcc` package for installing `hipcc` binaries in future ROCm releases. -- In a future ROCm release, the following samples will be removed from the `hip-tests` project. - - `hipBusbandWidth` at - - `hipCommander` at +* Separation of `hipcc` binaries (Perl scripts) from HIP to `hipcc` project. Users will access separate `hipcc` package for installing `hipcc` binaries in future ROCm releases. + +* In a future ROCm release, the following samples will be removed from the `hip-tests` project. + * `hipBusbandWidth` at + * `hipCommander` at Note that the samples will continue to be available in previous release branches. 
-- Removal of gcnarch from hipDeviceProp_t structure -- Addition of new fields in hipDeviceProp_t structure - - maxTexture1D - - maxTexture2D - - maxTexture1DLayered - - maxTexture2DLayered - - sharedMemPerMultiprocessor - - deviceOverlap - - asyncEngineCount - - surfaceAlignment - - unifiedAddressing - - computePreemptionSupported - - hostRegisterSupported - - uuid -- Removal of deprecated code - - hip-hcc codes from hip code tree -- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA -- HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D() -- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' -- Correct hipGetLastError to return the last error instead of last API call's return code -- Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]" -- Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess -- Remove hiparray* and make it opaque with hipArray_t +* Removal of gcnarch from hipDeviceProp_t structure +* Addition of new fields in hipDeviceProp_t structure + * maxTexture1D + * maxTexture2D + * maxTexture1DLayered + * maxTexture2DLayered + * sharedMemPerMultiprocessor + * deviceOverlap + * asyncEngineCount + * surfaceAlignment + * unifiedAddressing + * computePreemptionSupported + * hostRegisterSupported + * uuid +* Removal of deprecated code + * hip-hcc codes from hip code tree +* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA +* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D() +* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' +* Correct hipGetLastError to return the last error instead of last API call's return code +* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved\[16]" +* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess +* Remove hiparray* and make it opaque 
with hipArray_t ##### New HIP APIs in This Release @@ -60,9 +61,9 @@ The following hipcc changes are implemented in this release: The new memory management HIP API is as follows: -- Sets information on the specified pointer [BETA]. +* Sets information on the specified pointer \[BETA]. - ```h + ```cpp hipError_t hipPointerSetAttribute(const void* value, hipPointer_attribute attribute, hipDeviceptr_t ptr); ``` @@ -70,15 +71,15 @@ The new memory management HIP API is as follows: The new module management HIP APIs are as follows: -- Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed to `kernelParams`, where thread blocks can cooperate and synchronize as they execute. +* Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed to `kernelParams`, where thread blocks can cooperate and synchronize as they execute. - ```h + ```cpp hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams); ``` -- Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute. +* Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute. 
- ```h + ```cpp hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags); ``` @@ -86,60 +87,60 @@ The new module management HIP APIs are as follows: The new HIP Graph Management APIs are as follows: -- Creates a memory allocation node and adds it to a graph [BETA] +* Creates a memory allocation node and adds it to a graph \[BETA] - ```h + ```cpp hipError_t hipGraphAddMemAllocNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, hipMemAllocNodeParams* pNodeParams); ``` -- Return parameters for memory allocation node [BETA] +* Return parameters for memory allocation node \[BETA] - ```h + ```cpp hipError_t hipGraphMemAllocNodeGetParams(hipGraphNode_t node, hipMemAllocNodeParams* pNodeParams); ``` -- Creates a memory free node and adds it to a graph [BETA] +* Creates a memory free node and adds it to a graph \[BETA] - ```h + ```cpp hipError_t hipGraphAddMemFreeNode(hipGraphNode_t* pGraphNode, hipGraph_t graph, const hipGraphNode_t* pDependencies, size_t numDependencies, void* dev_ptr); ``` -- Returns parameters for memory free node [BETA]. +* Returns parameters for memory free node \[BETA]. - ```h + ```cpp hipError_t hipGraphMemFreeNodeGetParams(hipGraphNode_t node, void* dev_ptr); ``` -- Write a DOT file describing graph structure [BETA]. +* Write a DOT file describing graph structure \[BETA]. - ```h + ```cpp hipError_t hipGraphDebugDotPrint(hipGraph_t graph, const char* path, unsigned int flags); ``` -- Copies attributes from source node to destination node [BETA]. +* Copies attributes from source node to destination node \[BETA]. 
- ```h + ```cpp hipError_t hipGraphKernelNodeCopyAttributes(hipGraphNode_t hSrc, hipGraphNode_t hDst); ``` -- Enables or disables the specified node in the given graphExec [BETA] +* Enables or disables the specified node in the given graphExec \[BETA] - ```h + ```cpp hipError_t hipGraphNodeSetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int isEnabled); ``` -- Query whether a node in the given graphExec is enabled [BETA] +* Query whether a node in the given graphExec is enabled \[BETA] - ```h + ```cpp hipError_t hipGraphNodeGetEnabled(hipGraphExec_t hGraphExec, hipGraphNode_t hNode, unsigned int* isEnabled); ``` ##### OpenMP Enhancements This release consists of the following OpenMP enhancements: -- Additional support for OMPT functions `get_device_time` and `get_record_type`. -- Add support for min/max fast fp atomics on AMD GPUs. -- Fix the use of the abs function in C device regions. +* Additional support for OMPT functions `get_device_time` and `get_record_type`. +* Add support for min/max fast fp atomics on AMD GPUs. + Fix the use of the abs function in C device regions. ### Deprecations and Warnings @@ -207,7 +208,7 @@ ROCm has moved header files and libraries to its new location as indicated in th Wrapper header files are placed in the old location (`/opt/rocm-xxx//include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below: -```h +```cpp // Code snippet from hip_runtime.h #pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”. 
#include "hip/hip_runtime.h" @@ -215,10 +216,10 @@ Wrapper header files are placed in the old location (`/opt/rocm-xxx// The wrapper header files’ backward compatibility deprecation is as follows: -- `#pragma` message announcing deprecation -- ROCm v5.2 release -- `#pragma` message changed to `#warning` -- Future release -- `#warning` changed to `#error` -- Future release -- Backward compatibility wrappers removed -- Future release +* `#pragma` message announcing deprecation -- ROCm v5.2 release +* `#pragma` message changed to `#warning` -- Future release +* `#warning` changed to `#error` -- Future release +* Backward compatibility wrappers removed -- Future release ##### Library files @@ -226,7 +227,7 @@ Library files are available in the `/opt/rocm-xxx/lib` folder. For backward comp Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/ total 4 drwxr-xr-x 4 root root 4096 May 12 10:45 cmake @@ -240,7 +241,7 @@ For backward compatibility, the old CMake locations (`/opt/rocm-xxx// Example: -```log +```bash $ ls -l /opt/rocm/hip/lib/cmake/hip/ total 0 lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake @@ -256,13 +257,13 @@ The following APIs and macros have been marked as deprecated. These are expected ##### API Changes -- `amd_comgr_action_info_set_options()` -- `amd_comgr_action_info_get_options()` +* `amd_comgr_action_info_set_options()` +* `amd_comgr_action_info_get_options()` ##### Actions and Data Types -- `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES` -- `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN` +* `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES` +* `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN` For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, and the `AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros. 
@@ -270,28 +271,28 @@ For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, an The following environment variables are removed in this ROCm release: -- `GPU_MAX_COMMAND_QUEUES` -- `GPU_MAX_WORKGROUP_SIZE_2D_X` -- `GPU_MAX_WORKGROUP_SIZE_2D_Y` -- `GPU_MAX_WORKGROUP_SIZE_3D_X` -- `GPU_MAX_WORKGROUP_SIZE_3D_Y` -- `GPU_MAX_WORKGROUP_SIZE_3D_Z` -- `GPU_BLIT_ENGINE_TYPE` -- `GPU_USE_SYNC_OBJECTS` -- `AMD_OCL_SC_LIB` -- `AMD_OCL_ENABLE_MESSAGE_BOX` -- `GPU_FORCE_64BIT_PTR` -- `GPU_FORCE_OCL20_32BIT` -- `GPU_RAW_TIMESTAMP` -- `GPU_SELECT_COMPUTE_RINGS_ID` -- `GPU_USE_SINGLE_SCRATCH` -- `GPU_ENABLE_LARGE_ALLOCATION` -- `HSA_LOCAL_MEMORY_ENABLE` -- `HSA_ENABLE_COARSE_GRAIN_SVM` -- `GPU_IFH_MODE` -- `OCL_SYSMEM_REQUIREMENT` -- `OCL_CODE_CACHE_ENABLE` -- `OCL_CODE_CACHE_RESET` +* `GPU_MAX_COMMAND_QUEUES` +* `GPU_MAX_WORKGROUP_SIZE_2D_X` +* `GPU_MAX_WORKGROUP_SIZE_2D_Y` +* `GPU_MAX_WORKGROUP_SIZE_3D_X` +* `GPU_MAX_WORKGROUP_SIZE_3D_Y` +* `GPU_MAX_WORKGROUP_SIZE_3D_Z` +* `GPU_BLIT_ENGINE_TYPE` +* `GPU_USE_SYNC_OBJECTS` +* `AMD_OCL_SC_LIB` +* `AMD_OCL_ENABLE_MESSAGE_BOX` +* `GPU_FORCE_64BIT_PTR` +* `GPU_FORCE_OCL20_32BIT` +* `GPU_RAW_TIMESTAMP` +* `GPU_SELECT_COMPUTE_RINGS_ID` +* `GPU_USE_SINGLE_SCRATCH` +* `GPU_ENABLE_LARGE_ALLOCATION` +* `HSA_LOCAL_MEMORY_ENABLE` +* `HSA_ENABLE_COARSE_GRAIN_SVM` +* `GPU_IFH_MODE` +* `OCL_SYSMEM_REQUIREMENT` +* `OCL_CODE_CACHE_ENABLE` +* `OCL_CODE_CACHE_RESET` ### Known Issues In This Release diff --git a/tools/autotag/templates/rocm_changes/5.5.1.md b/tools/autotag/templates/rocm_changes/5.5.1.md index 358a0a869..be19b8475 100644 --- a/tools/autotag/templates/rocm_changes/5.5.1.md +++ b/tools/autotag/templates/rocm_changes/5.5.1.md @@ -21,4 +21,4 @@ The following HIP API is updated in the ROCm 5.5.1 release: ##### `hipDeviceSetCacheConfig` -- The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to `hipSuccess` +* The return value for `hipDeviceSetCacheConfig` is updated from 
`hipErrorNotSupported` to `hipSuccess` diff --git a/tools/autotag/templates/rocm_changes/5.6.0.md b/tools/autotag/templates/rocm_changes/5.6.0.md index edea590d6..19782782b 100644 --- a/tools/autotag/templates/rocm_changes/5.6.0.md +++ b/tools/autotag/templates/rocm_changes/5.6.0.md @@ -1,116 +1,116 @@ -#### Release Highlights +### Release Highlights ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base. A few examples include: -- New documentation portal at https://rocm.docs.amd.com -- Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite -- OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements -- Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers -- New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER. +* New documentation portal at https://rocm.docs.amd.com +* Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite +* OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements +* Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers +* New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER. -#### OS and GPU Support Changes +### OS and GPU Support Changes -- SLES15 SP5 support was added this release. SLES15 SP3 support was dropped. -- AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date. 
- - No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7 - - Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance [EOM])(will be aligned with the closest ROCm release) - - Bug fixes during the maintenance will be made to the next ROCm point release - - Bug fixes will not be back ported to older ROCm releases for this SKU - - Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM. +* SLES15 SP5 support was added this release. SLES15 SP3 support was dropped. +* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date. + * No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7 + * Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance \[EOM])(will be aligned with the closest ROCm release) + * Bug fixes during the maintenance will be made to the next ROCm point release + * Bug fixes will not be back ported to older ROCm releases for this SKU + * Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM. 
-#### AMDSMI CLI 23.0.0.4 +### AMDSMI CLI 23.0.0.4 -##### Added +#### Added -- AMDSMI CLI tool enabled for Linux Bare Metal & Guest +* AMDSMI CLI tool enabled for Linux Bare Metal & Guest -- Package: amd-smi-lib - -##### Known Issues +* Package: amd-smi-lib -- not all Error Correction Code (ECC) fields are currently supported +#### Known Issues -- RHEL 8 & SLES 15 have extra install steps +* not all Error Correction Code (ECC) fields are currently supported -#### Kernel Modules (DKMS) +* RHEL 8 & SLES 15 have extra install steps -##### Fixes +### Kernel Modules (DKMS) -- Stability fix for multi GPU system reproducilble via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198). +#### Fixes -#### HIP 5.6 (For ROCm 5.6) +* Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198). -##### Optimizations +### HIP 5.6 (For ROCm 5.6) -- Consolidation of hipamd, rocclr and OpenCL projects in clr -- Optimized lock for graph global capture mode +#### Optimizations -##### Added +* Consolidation of hipamd, rocclr and OpenCL projects in clr +* Optimized lock for graph global capture mode -- Added hipRTC support for amd_hip_fp16 -- Added hipStreamGetDevice implementation to get the device associated with the stream -- Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats -- hipArrayGetInfo for getting information about the specified array -- hipArrayGetDescriptor for getting 1D or 2D array descriptor -- hipArray3DGetDescriptor to get 3D array descriptor +#### Added -##### Changed +* Added hipRTC support for amd_hip_fp16 +* Added hipStreamGetDevice implementation to get the device associated with the stream +* Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats +* hipArrayGetInfo for getting information about the specified array +* hipArrayGetDescriptor for getting 1D or 2D array descriptor +* hipArray3DGetDescriptor to get 3D array 
descriptor -- hipMallocAsync to return success for zero size allocation to match hipMalloc -- Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package -- Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide -- Removed hipBusBandwidth and hipCommander samples from hip-tests +#### Changed -##### Fixed +* hipMallocAsync to return success for zero size allocation to match hipMalloc +* Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package +* Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide +* Removed hipBusBandwidth and hipCommander samples from hip-tests -- Fixed regression in hipMemCpyParam3D when offset is applied +#### Fixed -##### Known Issues +* Fixed regression in hipMemCpyParam3D when offset is applied -- Limited testing on xnack+ configuration - - Multiple HIP tests failures (gpuvm fault or hangs) -- hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU -- Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in a future ROCm release +#### Known Issues -##### Upcoming changes in future release +* Limited testing on xnack+ configuration + * Multiple HIP tests failures (gpuvm fault or hangs) +* hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU +* Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. 
Issue will be fixed in a future ROCm release -- Removal of gcnarch from hipDeviceProp_t structure -- Addition of new fields in hipDeviceProp_t structure - - maxTexture1D - - maxTexture2D - - maxTexture1DLayered - - maxTexture2DLayered - - sharedMemPerMultiprocessor - - deviceOverlap - - asyncEngineCount - - surfaceAlignment - - unifiedAddressing - - computePreemptionSupported - - uuid -- Removal of deprecated code - - hip-hcc codes from hip code tree -- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA -- HIPMEMCPY_3D fields correction (unsigned int -> size_t) -- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' +#### Upcoming changes in future release -#### ROCgdb-13 (For ROCm 5.6.0) +* Removal of gcnarch from hipDeviceProp_t structure +* Addition of new fields in hipDeviceProp_t structure + * maxTexture1D + * maxTexture2D + * maxTexture1DLayered + * maxTexture2DLayered + * sharedMemPerMultiprocessor + * deviceOverlap + * asyncEngineCount + * surfaceAlignment + * unifiedAddressing + * computePreemptionSupported + * uuid +* Removal of deprecated code + * hip-hcc codes from hip code tree +* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA +* HIPMEMCPY_3D fields correction (unsigned int -> size_t) +* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' -##### Optimized +### ROCgdb-13 (For ROCm 5.6.0) -- Improved performances when handling the end of a process with a large number of threads. +#### Optimized + +* Improved performances when handling the end of a process with a large number of threads. Known Issues -- On certain configurations, ROCgdb can show the following warning message: +* On certain configurations, ROCgdb can show the following warning message: `warning: Probes-based dynamic linker interface failed. Reverting to original interface.` This does not affect ROCgdb's functionalities. 
-#### ROCprofiler (For ROCm 5.6.0) +### ROCprofiler (For ROCm 5.6.0) In ROCm 5.6 the `rocprofilerv1` and `rocprofilerv2` include and library files of ROCm 5.5 are split into separate files. The `rocmtools` files that were @@ -126,7 +126,7 @@ The ROCm Profiler Tool that uses `rocprofilerV1` can be invoked using the following command: ```sh -$ rocprof … +rocprof … ``` To write a custom tool based on the `rocprofilerV1` API do the following: @@ -143,7 +143,7 @@ int main() { This can be built in the following manner: ```sh -$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64 +gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64 ``` The resulting `a.out` will depend on @@ -153,7 +153,7 @@ The ROCm Profiler that uses `rocprofilerV2` API can be invoked using the following command: ```sh -$ rocprofv2 … +rocprofv2 … ``` To write a custom tool based on the `rocprofilerV2` API do the following: @@ -170,22 +170,22 @@ int main() { This can be built in the following manner: ```sh -$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2 +gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2 ``` The resulting `a.out` will depend on `/opt/rocm-5.6.0/lib/librocprofiler64.so.2`. -##### Optimized +#### Optimized -- Improved Test Suite +* Improved Test Suite -##### Added +#### Added -- 'end_time' need to be disabled in roctx_trace.txt +* 'end_time' need to be disabled in roctx_trace.txt -##### Fixed +#### Fixed -- rocprof in ROcm/5.4.0 gpu selector broken. -- rocprof in ROCm/5.4.1 fails to generate kernel info. -- rocprof clobbers LD_PRELOAD. +* rocprof in ROcm/5.4.0 gpu selector broken. +* rocprof in ROCm/5.4.1 fails to generate kernel info. +* rocprof clobbers LD_PRELOAD. 
diff --git a/tools/autotag/templates/rocm_changes/5.6.1.md b/tools/autotag/templates/rocm_changes/5.6.1.md index cc4fffb7e..8ec42b9a5 100644 --- a/tools/autotag/templates/rocm_changes/5.6.1.md +++ b/tools/autotag/templates/rocm_changes/5.6.1.md @@ -9,7 +9,7 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime. ### Fixed Defects -- *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host -- Enabled xnack+ check in HIP catch2 tests hang when executing tests -- Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs -- Using *hipGraphAddMemFreeNode* no longer results in a crash +* *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host +* Enabled xnack+ check in HIP catch2 tests hang when executing tests +* Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs +* Using *hipGraphAddMemFreeNode* no longer results in a crash