diff --git a/RELEASE.md b/RELEASE.md
index 8a23c474e..fd86f107c 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,4 +1,4 @@
-# Release notes
+# ROCm 6.2.0 release notes
@@ -11,671 +11,1998 @@
-This page contains the release notes for AMD ROCm™ Software.
-
--------------------
-
-## ROCm 6.1.2
-
-ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries.
-
-### OS support
-
-ROCm 6.1.2 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).
-
-### Library changes in ROCm 6.1.2
-
-| Category | Group | Name | Version | Repository |
-|----------|-------|------|---------|------------|
-| Libraries | Machine Learning and Computer Vision | [composable_kernel](https://rocm.docs.amd.com/projects/ROCm/composable_kernel/en/latest) | [0.2.0](https://github.com/ROCm/composable_kernel/releases/tag/rocm-6.1.2) | [ROCm/composable_kernel](https://github.com/ROCm/ROCm/composable_kernel) |
-| | | [AMDMIGraphX](https://rocm.docs.amd.com/projects/ROCm/AMDMIGraphX/en/latest) | [2.9](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.1.2) | [ROCm/AMDMIGraphX](https://github.com/ROCm/ROCm/AMDMIGraphX) |
-| | | [MIOpen](https://rocm.docs.amd.com/projects/ROCm/MIOpen/en/latest) | [3.1.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.1.2) | [ROCm/MIOpen](https://github.com/ROCm/ROCm/MIOpen) |
-| | | [MIVisionX](https://rocm.docs.amd.com/projects/ROCm/MIVisionX/en/latest) | [2.5.0](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.1.2) | [ROCm/MIVisionX](https://github.com/ROCm/ROCm/MIVisionX) |
-| | | [rpp](https://rocm.docs.amd.com/projects/ROCm/rpp/en/latest) | [1.5.0](https://github.com/ROCm/rpp/releases/tag/rocm-6.1.2) | [ROCm/rpp](https://github.com/ROCm/ROCm/rpp) |
-| | Communication | [rccl](https://rocm.docs.amd.com/projects/ROCm/rccl/en/latest) | [2.18.6](https://github.com/ROCm/rccl/releases/tag/rocm-6.1.2) | [ROCm/rccl](https://github.com/ROCm/ROCm/rccl) |
-| | | [hipBLAS](https://rocm.docs.amd.com/projects/ROCm/hipBLAS/en/latest) | [2.1.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.1.2) | [ROCm/hipBLAS](https://github.com/ROCm/ROCm/hipBLAS) |
-| | | [hipBLASLt](https://rocm.docs.amd.com/projects/ROCm/hipBLASLt/en/latest) | [0.7.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.1.2) | [ROCm/hipBLASLt](https://github.com/ROCm/ROCm/hipBLASLt) |
-| | | [hipFFT](https://rocm.docs.amd.com/projects/ROCm/hipFFT/en/latest) | [1.0.14](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.1.2) | [ROCm/hipFFT](https://github.com/ROCm/ROCm/hipFFT) |
-| | | [hipRAND](https://rocm.docs.amd.com/projects/ROCm/hipRAND/en/latest) | [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.1.2) | [ROCm/hipRAND](https://github.com/ROCm/ROCm/hipRAND) |
-| | | [hipSOLVER](https://rocm.docs.amd.com/projects/ROCm/hipSOLVER/en/latest) | [2.1.1](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.1.2) | [ROCm/hipSOLVER](https://github.com/ROCm/ROCm/hipSOLVER) |
-| | | [hipSPARSE](https://rocm.docs.amd.com/projects/ROCm/hipSPARSE/en/latest) | [3.0.1](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.1.2) | [ROCm/hipSPARSE](https://github.com/ROCm/ROCm/hipSPARSE) |
-| | | [hipSPARSELt](https://rocm.docs.amd.com/projects/ROCm/hipSPARSELt/en/latest) | [0.2.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.1.2) | [ROCm/hipSPARSELt](https://github.com/ROCm/ROCm/hipSPARSELt) |
-| | | [rocALUTION](https://rocm.docs.amd.com/projects/ROCm/rocALUTION/en/latest) | [3.1.1](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.1.2) | [ROCm/rocALUTION](https://github.com/ROCm/ROCm/rocALUTION) |
-| | | [rocBLAS](https://rocm.docs.amd.com/projects/ROCm/rocBLAS/en/latest) | [4.1.2](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) | [ROCm/rocBLAS](https://github.com/ROCm/ROCm/rocBLAS) |
-| | | [rocFFT](https://rocm.docs.amd.com/projects/ROCm/rocFFT/en/latest) | [1.0.27](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.1.2) | [ROCm/rocFFT](https://github.com/ROCm/ROCm/rocFFT) |
-| | | [rocRAND](https://rocm.docs.amd.com/projects/ROCm/rocRAND/en/latest) | [3.0.1](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.1.2) | [ROCm/rocRAND](https://github.com/ROCm/ROCm/rocRAND) |
-| | | [rocSOLVER](https://rocm.docs.amd.com/projects/ROCm/rocSOLVER/en/latest) | [3.25.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.1.2) | [ROCm/rocSOLVER](https://github.com/ROCm/ROCm/rocSOLVER) |
-| | | [rocSPARSE](https://rocm.docs.amd.com/projects/ROCm/rocSPARSE/en/latest) | [3.1.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.1.2) | [ROCm/rocSPARSE](https://github.com/ROCm/ROCm/rocSPARSE) |
-| | | [rocWMMA](https://rocm.docs.amd.com/projects/ROCm/rocWMMA/en/latest) | [1.4.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.1.2) | [ROCm/rocWMMA](https://github.com/ROCm/ROCm/rocWMMA) |
-| | | [Tensile](https://rocm.docs.amd.com/projects/ROCm/Tensile/en/latest) | [4.40.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.2) | [ROCm/Tensile](https://github.com/ROCm/ROCm/Tensile) |
-| | Primitives | [hipCUB](https://rocm.docs.amd.com/projects/ROCm/hipCUB/en/latest) | [3.1.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.1.2) | [ROCm/hipCUB](https://github.com/ROCm/ROCm/hipCUB) |
-| | | [hipTensor](https://rocm.docs.amd.com/projects/ROCm/hipTensor/en/latest) | [1.2.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.1.2) | [ROCm/hipTensor](https://github.com/ROCm/ROCm/hipTensor) |
-| | | [rocPRIM](https://rocm.docs.amd.com/projects/ROCm/rocPRIM/en/latest) | [3.1.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.1.2) | [ROCm/rocPRIM](https://github.com/ROCm/ROCm/rocPRIM) |
-| | | [rocThrust](https://rocm.docs.amd.com/projects/ROCm/rocThrust/en/latest) | [3.0.1](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.1.2) | [ROCm/rocThrust](https://github.com/ROCm/ROCm/rocThrust) |
-| | | [ROCdbgapi](https://rocm.docs.amd.com/projects/ROCm/ROCdbgapi/en/latest) | [0.71.0](https://github.com/ROCm/ROCdbgapi/releases/tag/rocm-6.1.2) | [ROCm/ROCdbgapi](https://github.com/ROCm/ROCm/ROCdbgapi) |
-| | | [rocm-cmake](https://rocm.docs.amd.com/projects/ROCm/rocm-cmake/en/latest) | [0.12.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.1.2) | [ROCm/rocm-cmake](https://github.com/ROCm/ROCm/rocm-cmake) |
-
-#### AMDMIGraphX
-
-MIGraphX 2.9 for ROCm 6.1.2
-
-##### Additions
-
-* Added FP8 support
-* Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch
-* Added support for the `Hardmax`, `DynamicQuantizeLinear`, `Qlinearconcat`, `Unique`, `QLinearAveragePool`, `QLinearSigmoid`, `QLinearLeakyRelu`, `QLinearMul`, `IsInf` operators
-* Created web site examples for `Whisper`, `Llama-2`, and `Stable Diffusion 2.1`
-* Created examples of using the ONNX Runtime MIGraphX Execution Provider with the `InceptionV3` and `Resnet50` models
-* Updated operators to support ONNX Opset 19
-* Enable fuse_pointwise and fuse_reduce in the driver
-* Add support for dot-(mul)-softmax-dot offloads to MLIR
-* Added Blas auto-tuning for GEMMs
-* Added dynamic shape support for the multinomial operator
-* Added fp16 to accuracy checker
-* Added initial code for running on Windows OS
-
-##### Optimizations
-
-* Improved the output of migraphx-driver command
-* Documentation now shows all environment variables
-* Updates needed for general stride support
-* Enabled Asymmetric Quantization
-* Added ScatterND unsupported reduction modes
-* Rewrote softmax for better performance
-* General improvement to how quantization is performed to support INT8
-* Used problem_cache for gemm tuning
-* Improved performance by always using rocMLIR for quantized convolution
-* Improved group convolutions by using rocMLIR
-* Improved accuracy of fp16 models
-* ScatterElements unsupported reduction
-* Added concat fusions
-* Improved INT8 support to include UINT8
-* Allow reshape ops between dq and quant_op
-* Improve dpp reductions on navi
-* Have the accuracy checker print the whole final buffer
-* Added support for handling dynamic Slice and ConstantOfShape ONNX operators
-* Add support for the dilations attribute to Pooling ops
-* Add layout attribute support for LSTM operator
-* Improved performance by removing contiguous for reshapes
-* Handle all slice input variations
-* Add scales attribute parse in upsample for older opset versions
-* Added support for uneven Split operations
-* Improved unit testing to run in python virtual environments
-
-##### Fixes
-
-* Fixed outstanding issues in autogenerated documentation
-* Update model zoo paths for examples
-* Fixed promote_literals_test by using additional if condition
-* Fixed export API symbols from dynamic library
-* Fixed bug in pad operator from dimension reduction
-* Fixed using the LD to embed files and enable by default when building shared libraries on linux
-* fixed get_version()
-* Fixed Round operator inaccuracy
-* Fixed wrong size check when axes not present for slice
-* Set the .SO version correctly
-
-##### Changes
-
-* Cleanup LSTM and RNN activation functions
-* Placed gemm_pointwise at a higher priority than layernorm_pointwise
-* Updated README to mention the need to include GPU_TARGETS when building MIGraphX
-
-##### Removals
-
-* Removed unused device kernels from Gather and Pad operators
-* Removed int8x4 format
-
-#### composable_kernel
+The release notes provide a comprehensive summary of changes since the previous ROCm release.
-CK 0.2.0 for ROCm 6.1.2
+- [Release highlights](#release-highlights)
-##### Fixes
+- [Operating system and hardware support changes](#operating-system-and-hardware-support-changes)
-* Fixed a bug in 6-dimensional kernels (#555)
-* Fixed a test case failure with grouped convolution backward weight (#524)
+- [ROCm components versioning](#rocm-components)
-##### Optimizations
+- [Detailed component changes](#detailed-component-changes)
-* Improved the performance of the normalization kernel
+- [ROCm known issues](#rocm-known-issues)
-##### Additions
+- [ROCm upcoming changes](#rocm-upcoming-changes)
-* New CMake flags:
- * "DL_KERNELS" -- Must be set to "ON" in order to build the gemm_dl and batched_gemm_multi_d_dl instances
- * "DTYPES" -- Can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build an instance of the specified data types
- * "INSTANCES_ONLY" -- Only builds CK library and instances without tests, examples, or profiler
-* New feature: if GPU_TARGETS is not set in the CMake command line, CK will be built for all targets supported by the compiler
-* Support for MI300A/MI300X
-* Support for AMD RDNA 3
-* New user tutorial (#563)
-* Additional instances for irregular GEMM sizes (#560)
-* New inter-wave consumer-producer programming model for GEMM kernels (#310)
-* GEMM with support multiple elementwise fusions (multi-D) (#534)
-* Multi-embeddings support (#542)
-* AMD RDNA 3 blockwise GEMM and real GEMM support (#541)
-* AMD RDNA grouped convolution backward weight support (#505)
-* MaxPool and AvgPool forward (#815); MaxPool backward (#750)
+The [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html)
+provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.
-##### Changes
+Release notes for previous ROCm releases are available in earlier versions of the documentation.
+See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest/release/versions.html).
-None
+## Release highlights
-#### hipBLAS
+This section introduces notable new features and improvements in ROCm 6.2. See the
+[Detailed component changes](#detailed-component-changes) for individual component changes.
-hipBLAS 2.1.0 for ROCm 6.1.2
+### New components
-##### Additions
+ROCm 6.2.0 introduces the following new components to the ROCm software stack.
-* New build option to automatically use hipconfig --platform to determine HIP platform
-* Level 1 functions have additional ILP64 API for both C and Fortran (`_64` name
- suffix) with int64_t function arguments
-* New functions hipblasGetMathMode and hipblasSetMathMode
+- **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads
+ running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line
+ or a GUI dashboard. For more information, see
+ [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest).
-##### Deprecations
+- **Omnitrace** -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU.
+ It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining
+ which function and line number are executing. For more information, see
+ [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/latest).
-* USE_CUDA build option; use HIP_PLATFORM=amd or HIP_PLATFORM=nvidia to override hipconfig
+- **rocPyDecode** -- A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries,
+ enabling function calling and data passing between the two languages. The `rocpydecode.so` wrapper library exposes
+ the rocDecode APIs, which are written primarily in C/C++, to Python. For more information, see
+ [rocPyDecode](https://rocm.docs.amd.com/projects/rocPyDecode/en/latest).
-##### Changes
+- **ROCprofiler-SDK** -- A profiling and tracing library for HIP and ROCm applications on AMD ROCm software,
+ used to identify application performance bottlenecks and optimize performance. The new APIs add restrictions for more
+ efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use.
+ ROCprofiler-SDK also provides a tool library to help you write your tool implementations. `rocprofv3` uses this tool library
+ to profile and trace applications for performance bottlenecks. Examples include API tracing, kernel tracing, and so on.
+ For more information, see [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest).
-* Some Level 2 function argument names have changed from `m` to `n` to match legacy BLAS; there
- was no change in implementation.
-* Updated client code to use YAML-based testing
-* Renamed `.doxygen` and `.sphinx` folders to `doxygen` and `sphinx`, respectively
-* Added CMake support for documentation
+ ```{note}
+ ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
+ ```
-#### hipBLASLt
+### ROCm Offline Installer Creator introduced
-hipBLASLt 0.7.0 for ROCm 6.1.2
+The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU
+driver, or a combination of the two on a target system without network access. The tool lets you define
+multiple custom configurations for installing ROCm on a target. Other notable features include:
-##### Additions
+* A lightweight, easy-to-use user interface for configuring the creation of the installer
-* Added `hipblasltExtSoftmax` extension API
-* Added `hipblasltExtLayerNorm` extension API
-* Added `hipblasltExtAMax` extension API
-* Added `GemmTuning` extension parameter to set split-k by user
-* Support for mix precision datatype: fp16/fp8 in with fp16 out
+* Support for multiple Linux distributions
-##### Deprecations
+* Installer support for different ROCm releases and specific ROCm components
-* algoGetHeuristic() ext API for GroupGemm will be deprecated in a future release of hipBLASLt
+* Optional driver or driver-only installer creation
-#### hipCUB
+* Optional post-install preferences
-hipCUB 3.1.0 for ROCm 6.1.2
+* Lightweight installer packages, which are unique to the preconfigured ROCm setup
-##### Changed
+* Resolution and inclusion of dependency packages for offline installation
-- CUB backend references CUB and Thrust version 2.1.0.
-- Updated `HIPCUB_HOST_WARP_THREADS` macro definition to match `host_warp_size` changes from rocPRIM 3.0.
-- Implemented `__int128_t` and `__uint128_t` support for radix_sort.
+For more information, see
+[ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/rocm-install-on-linux/en/docs-6.2.0/install/rocm-offline-installer.html).
-##### Fixed
+### Math libraries default to Clang instead of HIPCC
-- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.
+The default compiler used to build the math libraries on Linux changes from `hipcc` to `amdclang++`.
+Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when
+building the libraries. Applications using the libraries can continue to be compiled using `hipcc` or `amdclang++` as
+described in [ROCm compiler reference](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/reference/rocmcc.html).
+The math libraries can also be built with `hipcc` using any of the previously available methods (for example, the `CXX`
+environment variable, the `CMAKE_CXX_COMPILER` CMake variable, and so on). This change shouldn't affect performance or
+functionality.
-##### Added
+### Framework and library changes
-- Added interface `DeviceMemcpy::Batched` for batched memcpy from rocPRIM and CUB.
+This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.
-#### hipFFT
+#### Additional PyTorch and TensorFlow support
-hipFFT 1.0.14 for ROCm 6.1.2
+ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.
-##### Changes
+See [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html)
+and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html)
+for installation instructions.
-* When building hipFFT from source, rocFFT code no longer needs to be initialized as a git submodule.
+Refer to the
+[Third-party support matrix](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/reference/3rd-party-support-matrix.html#deep-learning)
+for a comprehensive list of third-party frameworks and libraries supported by ROCm.
-##### Fixes
+#### Optimized framework support for OpenXLA
-* Fixed error when creating length-1 plans.
+PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler
+ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see
+[Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html)
+and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html).
-#### hipRAND
+#### PyTorch support for Autocast (automatic mixed precision)
-hipRAND 2.10.17 for ROCm 6.1.2
+PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational
+workloads and improve performance. Based on the information about the magnitude of values, Autocast can substitute the
+original `float32` linear layers and convolutions with their `float16` or `bfloat16` variants. For more information, see
+[Automatic mixed precision](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/train-a-model#automatic-mixed-precision-amp).
-##### Fixes
+#### Memory savings for bitsandbytes model quantization
-* Fixed benchmark and unit test builds on Windows
+The [ROCm-aware bitsandbytes library](https://github.com/ROCm/bitsandbytes) is a lightweight Python wrapper around HIP
+custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
+ROCm 6.2.0 introduces the following bitsandbytes changes:
-#### hipSOLVER
+- `Int8` matrix multiplication is enabled, and it includes the following functions:
+ - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
+ - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation.
+ - `igemmlt` – new function for GEMM computation A*B^T. It uses
+ [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.2.0/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations.
+ - `dequant_mm` – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
+- Blockwise quantization – input tensors are quantized for a fixed block size.
+- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled.
+- 8-bit and 32-bit optimizers are enabled.
-hipSOLVER 2.1.1 for ROCm 6.1.2
+```{note}
+These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and
+features to run them.
+```
-##### Changed
+For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/model-quantization.html).
-- `BUILD_WITH_SPARSE` now defaults to OFF on Windows.
+#### Improved vLLM support
-##### Fixed
+ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding
+capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama.
+ROCm 6.2.0 adds support for the following vLLM features:
-- Fixed benchmark client build when `BUILD_WITH_SPARSE` is OFF.
+- MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP,
+ use `--distributed-executor-backend=mp`. The default is in flux and depends on the commit.
-#### hipSPARSE
+- FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
+ The QUARK quantizer currently only supports Llama.
-hipSPARSE 3.0.1 for ROCm 6.1.2
+- Triton Flash Attention:
-##### Fixes
+ ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this
+ setting using the `VLLM_USE_FLASH_ATTN_TRITON=False` environment variable.
-* Fixes to the build chain
+- PyTorch TunableOp:
-#### hipSPARSELt
+ Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.
-hipSPARSELt 0.2.0 for ROCm 6.1.2
+For more information about enabling these features, see
+[vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).
-##### Added
+ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing.
+These features include:
-- Support Matrix B is a Structured Sparsity Matrix.
+- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used
+ in the model's execution. Only Llama is supported because the QUARK quantizer currently only supports Llama.
-#### hipTensor
+- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention
+ computation in large-scale models. This benefits all workloads in `FP16` configurations.
-hipTensor 1.2.0 for ROCm 6.1.2
+To enable these experimental new features, see
+[vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).
+Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines
+all the accessible features, and the `vllm/Dockerfile.rocm` file can be used to build the Docker image.
-##### Additions
+### Enhanced performance tuning on AMD Instinct accelerators
-* API support for permutation of rank 4 tensors: f16 and f32
-* New datatype support in contractions of rank 4: f16, bf16, complex f32, complex f64
-* Added scale and bilinear contraction samples and tests for new supported data types
-* Added permutation samples and tests for f16, f32 types
+ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
+The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
+detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
+accelerators for optimal performance. For more information, see
+[AMD MI300X tuning guides](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/index.html) and
+[AMD MI300A system optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html).
-##### Fixes
+### Removed clang-ocl
-* Fixed bug in contraction calculation with data type f32
+As of version 6.2, ROCm no longer provides the `clang-ocl` package.
+See the [clang-ocl README](https://github.com/ROCm/clang-ocl).
-#### MIOpen
+### ROCm documentation changes
-MIOpen 3.1.0 for ROCm 6.1.2
+The documentation for the ROCm components has been reorganized and reformatted with a standard look and feel. This
+improves the usability and readability of the documentation. For more information about the ROCm components, see
+[What is ROCm?](https://rocm.docs.amd.com/en/docs-6.2.0/what-is-rocm.html).
-##### Added
-
-- CK-based 2d/3d convolution solvers to support nchw/ncdhw layout
-- Fused solver for Fwd Convolution with Residual, Bias and activation
-- AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver
-- Forward, backward data and backward weight convolution solver with fp8/bfp8
-- check for packed tensors for convolution solvers
-- Integrate CK's layer norm
-- Combine gtests into single binary
-
-##### Fixed
-
-- fix for backward passes bwd/wrw for CK group conv 3d
-- Fixed out-of-bounds memory access : ConvOclDirectFwdGen
-- fixed build failure due to hipRTC
-
-##### Changed
-
-- Standardize workspace abstraction
-- Use split CK libraries
-
-##### Removed
-
-- clamping to MAX from CastTensor used in Bwd and WrW convolution
-
-#### MIVisionX
-
-MIVisionX for ROCm 6.1.2
-
-##### Additions
-
-* CTest: Tests for install verification
-* Hardware support updates
-* Doxygen support for API documentation
-
-##### Optimizations
-
-* CMakeList Cleanup
-* Readme
-
-##### Changes
-
-* rocAL: PyBind Link to prebuilt library
- * PyBind11
- * RapidJSON
-* Setup Updates
-* RPP - Use package install
-* Dockerfiles: Updates & bugfix
-* CuPy - No longer installed with setup.py
-
-##### Fixes
-
-* rocAL bug fix and updates
-
-##### Tested Configurations
-
-* Windows `10` / `11`
-* Linux distribution
- * Ubuntu - `20.04` / `22.04`
- * CentOS - `7` / `8`
- * RHEL - `8` / `9`
- * SLES - `15-SP4`
-* ROCm: rocm-core - `5.7.0.50700-6`
-* miopen-hip - `2.20.0.50700-63`
-* MIGraphX - `2.7.0.50700-63`
-* Protobuf - [V3.12.4](https://github.com/protocolbuffers/protobuf/releases/tag/v3.12.4)
-* OpenCV - [4.6.0](https://github.com/opencv/opencv/releases/tag/4.6.0)
-* RPP - [1.5.0]
-* FFMPEG - [n4.4.2](https://github.com/FFmpeg/FFmpeg/releases/tag/n4.4.2)
-* Dependencies for all preceding packages
-* MIVisionX setup script - `V2.6.1`
-
-##### Known Issues
-
-* OpenCV 4.X support for some applications is missing
-* MIVisionX package install requires manual prerequisites installation
-
-#### rccl
-
-RCCL 2.18.6 for ROCm 6.1.2
-
-##### Changed
-
-- Reduced NCCL_TOPO_MAX_NODES to limit stack usage and avoid overflow
-
-#### rocALUTION
-
-rocALUTION 3.1.1 for ROCm 6.1.2
-
-##### Additions
-
-* `TripleMatrixProduct` functionality for `GlobalMatrix`
-* Multi-Node/GPU support for `UA-AMG`, `SA-AMG` and `RS-AMG`
-* Iterative ILU0 preconditioner `ItILU0`
-* Iterative triangular solve, selectable via `SolverDecr` class
-
-##### Deprecations
-
-* `LocalMatrix::AMGConnect`
-* `LocalMatrix::AMGAggregate`
-* `LocalMatrix::AMGPMISAggregate`
-* `LocalMatrix::AMGSmoothedAggregation`
-* `LocalMatrix::AMGAggregation`
-* `PairwiseAMG`
-
-##### Known Issues
-
-* `PairwiseAMG` does currently not support matrix sizes that exceed int32 range
-* `PairwiseAMG` might fail building the hierarchy on certain input matrices
-
-#### rocBLAS
-
-rocBLAS 4.1.2 for ROCm 6.1.2
-
-##### Fixes
-
-* Fixes BF16 TT get_solutions
-
-##### Optimizations
-
-* Tune gfx942 BBS TN, TT
-
-#### ROCdbgapi
-
-rocm-dbgapi 0.71.0 for ROCm 6.1.2
-
-##### Added
-
-- Add support for gfx940, gfx941 and gfx942 architectures.
-
-#### rocFFT
-
-rocFFT 1.0.27 for ROCm 6.1.2
-
-##### Fixes
-
-* Fixed kernel launch failure on execute of very large odd-length real-complex transforms.
-
-##### Additions
-
-* Enable multi-gpu testing on systems without direct GPU-interconnects
-
-#### rocm-cmake
-
-rocm-cmake 0.12.0 for ROCm 6.1.2
-
-##### Changed
-
-- ROCMSphinxDoc: Allow separate source and config directories.
-- ROCMCreatePackage: Allow additional `PROVIDES` on header-only packages.
-- ROCMInstallTargets: Don't install executable targets by default for ASAN builds.
-- ROCMTest: Add RPATH for installed tests.
-- Finalize rename to ROCmCMakeBuildTools
-
-##### Fixed
-
-- ROCMClangTidy: Fixed invalid list index.
-- Test failures when ROCM_CMAKE_GENERATOR is empty.
-
-#### rocPRIM
-
-rocPRIM 3.1.0 for ROCm 6.1.2
-
-##### Additions
-
-* New primitive: `block_run_length_decode`
-* New primitive: `batch_memcpy`
-
-##### Changes
-
-* Renamed:
- * `scan_config_v2` to `scan_config`
- * `scan_by_key_config_v2` to `scan_by_key_config`
- * `radix_sort_config_v2` to `radix_sort_config`
- * `reduce_by_key_config_v2` to `reduce_by_key_config`
- * `radix_sort_config_v2` to `radix_sort_config`
-* Removed support for custom config types for device algorithms
-* `host_warp_size()` was moved into `rocprim/device/config_types.hpp`; it now uses either `device_id` or
- a `stream` parameter to query the proper device and a `device_id` out parameter
- * The return type is `hipError_t`
-* Added support for `__int128_t` in `device_radix_sort` and `block_radix_sort`
-* Improved the performance of `match_any`, and `block_histogram` which uses it
-
-##### Deprecations
-
-* Removed `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config`, and
- `radix_sort_config`
-
-##### Fixes
-
-* Build issues with `rmake.py` on Windows when using VS 2017 15.8 or later (due to a breaking fix with
- extended aligned storage)
-
-#### rocRAND
-
-rocRAND 3.0.1 for ROCm 6.1.2
-
-##### Fixes
-
-* Implemented workaround for regressions in XORWOW and LFSR on MI200
-
-#### rocSOLVER
-
-rocSOLVER 3.25.0 for ROCm 6.1.2
-
-##### Added
-
-- Eigensolver routines for symmetric/hermitian matrices using Divide & Conquer and Jacobi algorithm:
- - SYEVDJ (with batched and strided\_batched versions)
- - HEEVDJ (with batched and strided\_batched versions)
-- Generalized symmetric/hermitian-definite eigensolvers using Divide & Conquer and Jacobi algorithm:
- - SYGVDJ (with batched and strided\_batched versions)
- - HEGVDJ (with batched and strided\_batched versions)
-
-##### Changed
-
-- Relaxed array length requirements for GESVDX with `rocblas_srange_index`.
-
-##### Removed
-
-- Removed gfx803 and gfx900 from default build targets.
-
-##### Fixed
-
-- Corrected singular vector normalization in BDSVDX and GESVDX
-- Fixed potential memory access fault in STEIN, SYEVX/HEEVX, SYGVX/HEGVX, BDSVDX and GESVDX
-
-#### rocSPARSE
-
-rocSPARSE 3.1.2 for ROCm 6.1.2
-
-##### Additions
-
-* New LRB algorithm to SpMV, supporting CSR format
-* rocBLAS as now an optional dependency for SDDMM algorithms
-* Additional verbose output for `csrgemm` and `bsrgemm`
-
-##### Optimizations
-
-* Triangular solve with multiple rhs (SpSM, csrsm, ...) now calls SpSV, csrsv, etcetera when nrhs equals 1
-* Improved user manual section *Installation and Building for Linux and Windows*
-* Improved SpMV in CSR format on MI300
-
-#### rocThrust
-
-rocThrust 3.0.1 for ROCm 6.1.2
-
-##### Fixes
-
-* Ported a fix from thrust 2.2 that ensures `thrust::optional` is trivially copyable.
-
-#### rocWMMA
-
-rocWMMA 1.4.0 for ROCm 6.1.2
-
-##### Additions
-
-* Added bf16 support for hipRTC sample
-
-##### Changes
-
-* Changed Clang C++ version to C++17
-* Updated rocwmma_coop API
-* Linked rocWMMA to hiprtc
-
-##### Fixes
-
-* Fixed compile/runtime arch checks
-* Built all test in large code model
-* Removed inefficient branching in layout loop unrolling
-
-#### rpp
-
-rpp for ROCm 6.1.2
-
-##### Changes
-
-* Prerequisites
-
-##### Tested Configurations
-
-* Linux distribution
- * Ubuntu - `20.04` / `22.04`
- * CentOS - `7`
- * RHEL - `8`/`9`
-* ROCm: rocm-core - `5.5.0.50500-63`
-* Clang - Version `5.0.1` and above
-* CMake - Version `3.22.3`
-* IEEE 754-based half-precision floating-point library - Version `1.12.0`
-
-#### Tensile
-
-Tensile 4.40.0 for ROCm 6.1.2
-
-##### Additions
-
-- new DisableKernelPieces values to invalidate local read, local write, and global read
-- stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
-- feature to allow testing stream-k grid multipliers
-- debug output to check occupancy for Stream-K
-- reject condition for FractionalLoad + DepthU!=power of 2
-- new TENSILE_DB debugging value to dump the common kernel parameters
-- predicate for APU libs
-- new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
-- new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
-- new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
-- initialization type and general batched options to the rocblas-bench input creator script
-
-##### Optimizations
-
-- enabled MFMA + LocalSplitU=4 for MT16x16
-- enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
-- optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
-- auto LdsPad calculation for TileMajorLds + MI16x16
-- auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
-
-##### Changes
-
-- cleared hipErrorNotFound error since it is an expected part of the search
-- modified hipcc search path for Linux
-- changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
-- changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
-- changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
-- updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
-- Replaced std::regex with fnmatch()/PathMatchSpec as a workaround to std::regex stack overflow known bug
-
-##### Fixes
-
-- hipcc compile append flag parallel-jobs=4
-- race condition in Stream-K that appeared with large grids and small sizes
-- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
-- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
-- incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
-- small fix for LdsPad optimization (LdsElement calculation)
-
-### AMD SMI
-
-AMD SMI for ROCm 6.1.2
-
-#### Additions
-
-* Added process isolation and clean shader APIs and CLI commands.
- * `amdsmi_get_gpu_process_isolation()`
- * `amdsmi_set_gpu_process_isolation()`
- * `amdsmi_set_gpu_clear_sram_data()`
-* Added the `MIN_POWER` metric to output provided by `amd-smi static --limit`.
-
-#### Optimizations
-
-* Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command.
+Since the release of ROCm 6.1, the documentation has added some key topics including:
+
+- [AMD Instinct MI300X workload tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/workload.html)
+- [AMD Instinct MI300X system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300x.html)
+- [AMD Instinct MI300A system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html)
+- [Using ROCm for AI](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/index.html)
+- [Using ROCm for HPC](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-hpc/index.html)
+- [Fine-tuning LLMs and inference optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/index.html)
+- [LLVM reference documentation](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/)
+
+The following topics have been significantly improved, expanded, or both:
+
+- [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-6.2.0/)
+- [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html)
+
+```{note}
+All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the
+[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
+```
+
+## Operating system and hardware support changes
+
+ROCm 6.2.0 adds support for the following operating system and kernel versions.
+
+- Ubuntu 24.04 LTS (kernel: 6.8 [GA])
+
+- RHEL 8.10 (kernel: 4.18.0-544)
+
+- SLES 15 SP6 (kernel: 6.4)
+
+ROCm 6.2.0 marks the end of support (EoS) for:
+
+- Ubuntu 22.04.3
+
+- RHEL 9.2
+
+- RHEL 8.8
+
+- SLES 15 SP4
+
+- CentOS 7.9
+
+ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).
+
+See the [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) for an
+overview of supported operating systems and hardware architectures.
+
+## ROCm components
+
+The following table lists ROCm components and their individual versions for ROCm 6.2.0. Follow the
+links in the *Version* column to go to the detailed component changelogs.
+
+
+
+## Detailed component changes
+
+The following sections describe key changes to ROCm components.
+
+### **AMD SMI** (24.6.2)
#### Changes
-* Updated `amismi_get_power_cap_info` to return values in uW instead of W.
-* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`.
-* Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks.
-
-#### Fixes
-
-* `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
-* `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices.
-* Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices.
-* Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls.
+- Added the following functionality:
+ - `amd-smi dmon` is now available as an alias to `amd-smi monitor`.
+ - An optional process table under `amd-smi monitor -q`.
+ - Handling to detect VMs with passthrough configurations in the CLI tool.
+ - Process Isolation and Clear SRAM functionality to the CLI tool for VMs.
+ - Ring Hang event.
+- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`.
+- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`.
#### Removals
-* Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release.
+- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported.
+- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`.
+
+#### Optimizations
+
+- Updated CLI error strings to specify invalid device type queried.
+- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest.
+ - Added `amdsmi.h` and `amdsmi_interface.py`.
+ - `amdsmi_clk_info_t` struct
+ - Added `AMDSMI` prefix to multiple structures.
+- Updated `dpm_policy` references to `soc_pstate`.
+- Updated `amdsmi_get_gpu_board_info()` so that `product_name` falls back to the `pciids` file.
+- `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`.
+- Updated CLI voltage curve command output.
+
+#### Resolved issues
+
+- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format.
+- `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
+- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`.
+- Fixed a leftover mutex deadlock when running multiple instances of the CLI tool. When running
+ `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any
+ other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either
+ clearing the mutex in `/dev/shm` or rebooting the machine.
+
+#### Known issues
+
+- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work.
+ They will be supported in a future release.
```{note}
-See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information.
+See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md)
+on GitHub for more information.
```
-### ROCm SMI
-
-ROCm SMI for ROCm 6.1.2
-
-#### Additions
-
-* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum.
-
-#### Fixes
-
-* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112).
-* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware.
-
-
-
-### HIPCC
-
-HIPCC for ROCm 6.1.2
+### **Composable Kernel** (1.1.0)
#### Changes
-* **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`.
-* **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly.
-* **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`.
+- Added support for:
+ - Permute scale for any dimension (#1198).
+ - Combined elementwise op (#1217).
+ - Multi D in grouped convolution backward weight (#1280).
+ - K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280).
+ - Large batch in grouped convolution forward (#1332).
+- Added `CK_TILE` layernorm example (#1339).
+- The `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as the ROCm backend.
+
+#### Optimizations
+
+- Added support for universal GEMM in grouped convolution forward (#1320).
+- Optimizations for low M and N in grouped convolution backward weight (#1303).
+- Added a functional enhancement and compiler bug fix for the FlashAttention forward kernel.
+- `FP8` GEMM performance optimization and tuning (#1384).
+- Added FlashAttention backward pass performance optimization (#1397).
+
+### **HIP** (6.2.0)
+
+#### Changes
+
+- Added the `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()`. These take
+ a 64-bit integer as an explicit mask argument.
+ - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled
+ unconditionally in a future HIP release.
+
+- Added new HIP APIs:
+ - `hipGetProcAddress` returns the pointer to the driver function that corresponds to the given driver function symbol.
+ - `hipGetFuncBySymbol` returns the pointer to the device entry function that matches the entry function `symbolPtr`.
+ - `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph.
+ - `hipGraphInstantiateWithParams` creates an executable graph from a graph.
+
+- Added a new `integrated` flag to the device properties.
+
+ - The `integrated` flag is added to the struct `hipDeviceProp_t`. On an integrated APU system, the runtime driver
+ detects and sets this flag to `1`. In that case, `hipDeviceGetAttribute` returns the value `1` for the
+ `hipDeviceAttributeIntegrated` attribute (a member of the enum `hipDeviceAttribute_t`).
+
+- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include `.
+
+- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`.
+
+#### Resolved issues
+
+- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime.
+- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`.
+
+#### Upcoming changes
+
+- The `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release.
+
+### **hipBLAS** (2.2.0)
+
+#### Changes
+
+* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments.
+* Added a new ILP64 API for level 1 `_ex` functions.
+
+* The `install.sh` script now invokes the `rmake.py` script. Made various other improvements to the build scripts.
+* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages
+ `rocblas-dev` and `rocsolver-dev`.
+* Updated Linux AOCL dependency to release 4.2 `gcc` build.
+* Updated Windows `vcpkg` dependencies to release 2024.02.14.
+
+### **hipBLASLt** (0.8.0)
+
+#### Changes
+
+* Added extension APIs:
+ * `hipblasltExtAMaxWithScale`.
+ * `GemmTuning` extension parameter to set `wgm` by user.
+* Added support for:
+ * `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype.
+ * `FP8`/`BF8` input, `FP32`/`FP16`/`BF16`/`FP8`/`BF8` output (gfx94x platform only).
+ * `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA.
+* Added support for gfx110x.
+
+#### Optimizations
+
+* Improved library loading time.
+
+### **HIPCC** (1.1.1)
+
+#### Changes
+
+* Split `hipcc` package into two packages for different hardware platforms.
+
+* Cleaned up references to environment variables.
+
+* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by
+ default, instead of their Perl counterparts.
+
+* Enabled function calls.
+
+* Added support for generating packages for ROCm stack targeting static libraries.
+
+#### Resolved issues
+
+* Implemented numerous bug fixes and quality improvements.
+
+### **hipCUB** (3.2.0)
+
+#### Changes
+
+* Added `DeviceCopy` function for parity with CUB.
+* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as
+ the new optional template argument for `WarpExchange`.
+ * The potential values for the enum are `WARP_EXCHANGE_SMEM` and
+ `WARP_EXCHANGE_SHUFFLE`.
+ * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while
+ `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
+ * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory,
+ but the `ItemsPerThread` must be a divisor of `WarpSize`.
+* Added `tuple.hpp` which defines templates `hipcub::tuple`,
+ `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
+* Added new overloaded member functions to `BlockRadixSort` and
+ `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type
+ (`key_type`) can be sorted via these overloads, if an appropriate decomposer
+ is passed. The decomposer has to implement `operator()(const key_type&)`, which
+ returns a `hipcub::tuple` of references pointing to members of `key_type`.
+
+* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of
+ HIP graphs, with several exceptions:
+ * `CachingDeviceAllocator`
+ * `GridBarrierLifetime`
+ * `DeviceSegmentedRadixSort`
+ * `DeviceRunLengthEncode`
+ Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs.
+
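The decomposer contract described above can be sketched on the host. This is a minimal illustration, not hipCUB code: `std::tuple` stands in for `hipcub::tuple`, `std::sort` stands in for the radix sort overloads, and `custom_key`, `custom_key_decomposer`, and `sort_keys` are hypothetical names. It only shows the shape of a decomposer that returns references to the members of a key, most significant member first.

```cpp
// Sketch only: std::tuple and std::sort are stand-ins (assumptions) for
// hipcub::tuple and the radix sort overloads, so this compiles without ROCm.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical custom key type whose two members define the sort order.
struct custom_key {
    std::int32_t major;
    std::uint16_t minor;
};

// Hypothetical decomposer: operator() takes the key by const reference and
// returns a tuple of references to its members, most significant first.
struct custom_key_decomposer {
    std::tuple<const std::int32_t&, const std::uint16_t&>
    operator()(const custom_key& key) const {
        return std::tie(key.major, key.minor);
    }
};

// Host-side stand-in for the sort: orders keys by comparing the decomposed
// tuples lexicographically (major first, then minor).
inline void sort_keys(std::vector<custom_key>& keys) {
    const custom_key_decomposer decompose;
    std::sort(keys.begin(), keys.end(),
              [&](const custom_key& a, const custom_key& b) {
                  return decompose(a) < decompose(b);
              });
}
```

Calling `sort_keys` on the keys `{2, 1}`, `{1, 9}`, `{1, 3}` orders them as `{1, 3}`, `{1, 9}`, `{2, 1}`, because the decomposed tuples are compared member by member. With hipCUB, a decomposer of this shape would be passed to the new `BlockRadixSort` or `DeviceRadixSort` overloads; consult the hipCUB API reference for the exact signatures.
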
+#### Removals
+
+* `debug_synchronous` was deprecated in hipCUB-2.13.2 and no longer has any effect. With this release, passing `debug_synchronous`
+ to the device functions results in a deprecation warning both at runtime and at compile time.
+ * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time
+ by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
+ * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.
+
+#### Resolved issues
+
+* Fixed the derivation of the accumulator type for device scan algorithms in the rocPRIM backend, which differed from CUB.
+ The accumulator type is now derived as the result type of the binary operator.
+
+### **hipFFT** (1.0.15)
+
+#### Resolved issues
+
+* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers.
+* Prevented C++ exceptions leaking from public API functions.
+* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs.
+
+
+### **HIPIFY** (18.0.0)
+
+#### Changes
+
+- Added support for:
+ - NVIDIA CUDA 12.4.1
+ - cuDNN 9.1.1
+ - LLVM 18.1.6
+- Added full hipBLASLt support.
+
+#### Resolved issues
+
+- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object;
+ affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`.
+
+### **hipRAND** (2.11.0)
+
+#### Changes
+
+* Added support for setting generator output ordering in C and C++ API.
+* `hiprandCreateGeneratorHost` now dispatches to the host generator in the rocRAND backend instead of returning with
+ `HIPRAND_STATUS_NOT_IMPLEMENTED`.
+* Added options to create:
+ * A host generator to the Fortran wrapper.
+ * A host generator to the Python wrapper.
+* For internal testing with HMM, previous versions used the environment variable `ROCRAND_USE_HMM`.
+ The environment variable is now named `HIPRAND_USE_HMM`.
+* In the static library, moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
+* Improved the device API documentation.
+
+#### Removals
+
+* Removed the option to build hipRAND as a submodule to rocRAND.
+* Removed references to, and workarounds for, the deprecated `hcc`.
+* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`.
+ Use `ROCRAND_PATH` instead.
+
+#### Resolved issues
+
+* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`.
+
+### **hipSOLVER** (2.2.0)
+
+#### Changes
+
+- Added compatibility-only functions:
+ - `auxiliary`
+ - `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions`
+ - `getrf`
+ - `hipsolverDnXgetrf_bufferSize`
+ - `hipsolverDnXgetrf`
+ - `getrs`
+ - `hipsolverDnXgetrs`
+- Added support for building on Ubuntu 24.04 and CBL-Mariner.
+- Added `hip::host` to `roc::hipsolver` usage requirements.
+- Added functions
+ - `syevdx`/`heevdx`
+ - `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize`
+ - `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx`
+ - `sygvdx`/`hegvdx`
+ - `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize`
+ - `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx`
+
+- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.
+- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`.
+
+#### Removals
+
+- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients.
+
+### **hipSPARSE** (3.1.1)
+
+#### Changes
+
+* Added the missing `hipsparseCscGet()` routine.
+
+* All internal hipSPARSE functions now exist inside a namespace.
+* Matched deprecations found in cuSPARSE 12.x.x when using the cuSPARSE backend.
+* Improved the user manual and contribution guidelines.
+
+#### Resolved issues
+
+* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior.
+
+#### Known issues
+
+* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API
+ and this extra external buffer parameter will be removed in a future release. For now, this extra parameter can be
+ ignored and passed as `nullptr` because it is unused internally by `hipsparseSpSM_solve()`.
+
+### **hipSPARSELt** (0.2.1)
+
+#### Optimizations
+
+* Refined test cases.
+
+### **hipTensor** (1.3.0)
+
+#### Changes
+
+* Added support for:
+ * Tensor permutation of ranks of 2, 3, 4, 5, and 6
+ * Tensor contraction of M6N6K6: M, N, K up to rank 6
+* Added tests for:
+ * Tensor permutation of ranks of 2, 3, 4, 5, and 6
+ * Tensor contraction of M6N6K6: M, N, K up to rank 6
+ * YAML parsing to support sequential parameters ordering.
+* The `amd-llvm-devel` package is now preferred over the system LLVM library.
+* Changed the preferred compilers to `CC=amdclang` `CXX=amdclang++`.
+* Updated actor-critic selection for new contraction kernel additions.
+* Updated installation, programmer's guide, and API reference documentation.
+
+#### Resolved issues
+
+* Fixed LLVM parsing crash.
+* Fixed memory consumption issue in complex kernels.
+* Implemented a workaround for a compiler crash during debug builds.
+* Allowed random mode ordering for tensor contractions.
+
+### **llvm-project** (18.0.0)
+
+#### Changes
+
+* LLVM IR
+
+ * The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address
+ spaces.
+
+ * Added `llvm.exp10` intrinsic.
+
+* LLVM infrastructure
+
+ * The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`.
+
+* TableGen
+
+ * Added constructs for debugging TableGen files:
+
+ * `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793).
+
+ * `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716).
+
+* AArch64 backend
+
+ * Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.
+
+* AMDGPU backend
+
+ * `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access.
+
+ * Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics.
+
+ * Implemented `llvm.get.rounding`.
+
+* ARM backend
+
+ * Added support for Cortex-M52 CPUs.
+
+ * Added execute-only support for Armv6-M.
+
+* RISC-V backend
+
+ * The `Zfa` extension version was upgraded to 1.0 and is no longer experimental.
+
+ * `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental.
+
+ * Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`. See
+ [Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification.
+
+ * Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64.
+
+ * The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`,
+ `Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension version was upgraded to 1.0 and is no longer experimental. However,
+ the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions,
+ `-menable-experimental-extensions` needs to be passed to Clang.
+
+ * `-mcpu=sifive-p450` was added.
+
+ * CodeGen of `RV32E` and `RV64E` is supported experimentally.
+
+ * CodeGen of `ilp32e` and `lp64e` is supported experimentally.
+
+* X86 backend
+
+ * Added support for the RDMSRLIST and WRMSRLIST instructions.
+
+ * Added support for the WRMSRNS instruction.
+
+ * Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction.
+
+ * Support ISA of CMPCCXADD.
+
+ * Support ISA of AVX-IFMA.
+
+ * Support ISA of AVX-VNNI-INT8.
+
+ * Support ISA of AVX-NE-CONVERT.
+
+ * `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported.
+
+ * `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported.
+
+ * `__builtin_unpredictable` (unpredictable metadata in LLVM IR) is now handled by the X86 backend. The X86CmovConversion
+ pass now respects this builtin and does not convert CMOVs to branches.
+
+ * Added support for the PBNDKB instruction.
+
+ * Support ISA of SHA512.
+
+ * Support ISA of SM3.
+
+ * Support ISA of SM4.
+
+ * Support ISA of AVX-VNNI-INT16.
+
+ * `-mcpu=graniterapids-d` is now supported.
+
+ * The `i128` type now matches GCC and clang’s `__int128` type. This mainly benefits external projects such as Rust
+ which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type
+ matched and called into `libgcc` helper functions.
+
+ * Support ISA of USER_MSR.
+
+ * Support ISA of AVX10.1-256 and AVX10.1-512.
+
+ * `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported.
+
+ * `-mapxf` is supported.
+
+ * Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow
+ 32-bit relocations or require 64-bit relocations to the global variable.
+
+ * The medium code model’s code generation was audited to be more similar to the small code model where possible.
+
+* C API
+
+ * Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions.
+
+ * Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an
+ alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target
+ ABI) and provides default values for unspecified settings.
+
+ * Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and
+ `LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions.
+
+ * Added the following functions for manipulating operand bundles, as well as building call and invoke instructions
+ that use operand bundles:
+
+ * `LLVMBuildCallWithOperandBundles`
+
+ * `LLVMBuildInvokeWithOperandBundles`
+
+ * `LLVMCreateOperandBundle`
+
+ * `LLVMDisposeOperandBundle`
+
+ * `LLVMGetNumOperandBundles`
+
+ * `LLVMGetOperandBundleAtIndex`
+
+ * `LLVMGetNumOperandBundleArgs`
+
+ * `LLVMGetOperandBundleArgAtIndex`
+
+ * `LLVMGetOperandBundleTag`
+
+ * Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an
+ instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flags.
+
+* CodeGen infrastructure
+
+ * A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e.
+ `-debug-only=isel-dump`). This new debug type can be filtered by function names using
+ `-filter-print-funcs=`, the same flag used to filter IR dumps after each Pass. Note that the
+ existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single
+ function regardless of `-filter-print-funcs`’s values.
+
+* Metadata info
+
+ * Added a new loop metadata `!{!"llvm.loop.align", i32 64}`.
+
+* LLVM tools
+
+ * `llvm-symbolizer` now treats invalid input as an address for which source information is not found.
+
+ * `llvm-readelf` now supports `--extra-sym-info` (`-X`) to display extra information (section name) when showing
+ symbols.
+
+ * `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each
+ object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now
+ `{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each
+ object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers
+ accordingly.
+
+ * `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`.
+
+ * `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers.
+
+ * `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names.
+
+ * `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only.
+
+* LLDB
+
+ * `SBType::FindDirectNestedType` function is added. It’s useful for formatters to quickly find directly nested type
+ when it’s known where to search for it, avoiding more expensive global search via `SBTarget::FindFirstType`.
+
+ * Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying
+ functionality remains unchanged.
+
+ * The `mte_ctrl` register can now be read from AArch64 Linux core files.
+
+ * LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2
+ (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.
+
+ * LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism
+ of specifying DEBUGINFOD servers in the `DEBUGINFOD_URLS` environment variable is used by default. In addition, users
+ can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`,
+ overriding or adding to the environment variable.
+
+ * When running on AArch64 Linux, `lldb-server` now provides register field information for the following registers:
+ `cpsr`, `fpcr`, `fpsr`, `svcr` and `mte_ctrl`.
+
+* Sanitizers
+
+ * HWASan now defaults to detecting use-after-scope bugs.
+
+#### Removals
+
+* LLVM IR
+
+ * The constant expression variants of the following instructions have been removed:
+
+ * `and`
+
+ * `or`
+
+ * `lshr`
+
+ * `ashr`
+
+ * `zext`
+
+ * `sext`
+
+ * `fptrunc`
+
+ * `fpext`
+
+ * `fptoui`
+
+ * `fptosi`
+
+ * `uitofp`
+
+ * `sitofp`
+
+* RISC-V backend
+
+ * XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and
+ instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with
+ SiFive to define and document real extension names for individual CSRs and instructions.
+
+* Python bindings
+
+ * The Python bindings have been removed.
+
+* C API
+
+ * The following functions for creating constant expressions have been removed, because the underlying constant
+ expressions are no longer supported. Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which
+ will constant fold the operands if possible and create an instruction otherwise:
+
+ * `LLVMConstAnd`
+
+ * `LLVMConstOr`
+
+ * `LLVMConstLShr`
+
+ * `LLVMConstAShr`
+
+ * `LLVMConstZExt`
+
+ * `LLVMConstSExt`
+
+ * `LLVMConstZExtOrBitCast`
+
+ * `LLVMConstSExtOrBitCast`
+
+ * `LLVMConstIntCast`
+
+ * `LLVMConstFPTrunc`
+
+ * `LLVMConstFPExt`
+
+ * `LLVMConstFPToUI`
+
+ * `LLVMConstFPToSI`
+
+ * `LLVMConstUIToFP`
+
+ * `LLVMConstSIToFP`
+
+ * `LLVMConstFPCast`
+
+* CodeGen infrastructure
+
+ * `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets
+ should use backwards frame index elimination instead.
+
+ * `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging
+ instead, which is preferred because it does not depend on accurate kill flags.
+
+* LLDB
+
+ * `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable.
+
+ * Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value
+ indicating failure.
+
+#### Resolved issues
+
+* AArch64 backend
+
+ * Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto
+ options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for
+ this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the
+ exception of crypto.
+
+* Windows target
+
+ * The LLVM filesystem class `UniqueID` and function `equivalent()` no longer determine that distinct path
+ names for the same hard-linked file are equal. This is an intentional tradeoff in a bug fix, where the bug
+ used to cause distinct files to be considered equivalent on some file systems. This change fixed the issues
+ [https://github.com/llvm/llvm-project/issues/61401](https://github.com/llvm/llvm-project/issues/61401) and [https://github.com/llvm/llvm-project/issues/22079](https://github.com/llvm/llvm-project/issues/22079).
+
+#### Known issues
+
+The compiler may incorrectly compile a program that uses the
+``__shfl(var, srcLane, width)`` function when one of the parameters to
+the function is undefined along some path to the function. For most functions,
+uninitialized inputs cause undefined behavior.
+
+```{note}
+The ``-Wall`` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.
+```
+
+As a workaround, initialize the parameters to ``__shfl``. For example:
+
+```{code-block} cpp
+unsigned long istring = 0; // Initialize the input to __shfl
+return __shfl(istring, 0, 64);
+```
+
+### **MIGraphX** (2.10.0)
+
+#### Changes
+
+- Added support for ONNX Runtime MIGraphX EP on Windows.
+- Added `FP8` Python API.
+- Added examples for SD 2.1 and SDXL.
+- Added support for BERT to Dynamic Batch.
+- Added a `--test` flag in `migraphx-driver` to validate the installation.
+- Added support for ONNX Operator: Einsum.
+- Added `uint8` support in ONNX Operators.
+- Added Split-k kernel configurations for performance improvements.
+- Added fusion for group convolutions.
+- Added rocMLIR conv3d support.
+- Added rocgdb to the Dockerfile.
+- Changed default location of libraries with release specific ABI changes.
+- Reorganized documentation in GitHub.
+
+#### Removals
+
+- Removed the `--model` flag with `migraphx-driver`.
+
+#### Optimizations
+
+- Improved ONNX Model Zoo coverage.
+- Reorganized `memcpys` with ONNX Runtime to improve performance.
+- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.
+- Improved MLIR kernel selection for multibroadcasted GEMMs.
+- Improved details of the perf report.
+- Enabled MLIR by default for GEMMs with small K.
+- Allowed specifying dot or convolution fusion for MLIR with an environment flag.
+- Improved performance on small reductions by doing multiple reductions per wavefront.
+- Added additional algebraic simplifications for mul-add-dot sequences of operations involving constants.
+- Used MLIR attention kernels in more cases.
+- Enabled MIOpen and CK fusions for MI300 gfx arches.
+- Added support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs.
+- Added Fusion of "contiguous + pointwise" and "layout + pointwise" operations which may result in performance gains in certain cases.
+- Added Fusion for "pointwise + layout" and "pointwise + contiguous" operations which may result in performance gains when using NHWC layout.
+- Added Fusion for "pointwise + concat" operation which may help in performance in certain cases.
+- Fixed a bug in "concat + pointwise" fusion where output shape memory layout wasn't maintained.
+- Simplified the "slice + concat" pattern in SDXL UNet.
+- Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if zero point values are zeros.
+- Improved inference performance by fusing Reduce to Broadcast.
+- Added additional information when printing the perf report.
+- Improved scalar fusions when not all strides are 0.
+- Added support for multi outputs in pointwise ops.
+- Improved reduction fusion with reshape operators.
+- Used the quantized output when an operator is used again.
+- Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all Hardware.
+
+#### Resolved issues
+
+- Fixed Super Resolution model verification failure with `FP16`.
+- Fixed confusing messages by suppressing them when compiling the model.
+- Fixed an issue with the mod operator for `int8` and `int32` inputs.
+- Fixed an issue by preventing the spawning of too many threads for constant propagation when parallel STL is not enabled.
+- Fixed a bug when running `migraphx-driver` with the `--run 1` option.
+- Fixed Layernorm accuracy by performing calculations in `FP32`.
+- Updated the Docker generator script to ROCm 6.1 to point at Jammy.
+- Fixed a floating point exception for `dim (-1)` in the reshape operator.
+- Fixed issue with `int8` accuracy and models which were failing due to requiring a fourth bias input.
+- Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix.
+- Fixed order of operations for `int8` quantization which were causing inaccuracies and slowdowns.
+- Fixed an issue during compilation caused by the incorrect constructor being used at compile time.
+ Removed list initializer of `prefix_scan_sum` which was causing issues during compilation.
+- Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control number of threads used for parallel compilation.
+
+### **MIOpen** (3.2.0)
+
+#### Changes
+
+- Added:
+ - [Conv] bilinear (alpha beta) solvers.
+ - [Conv] enable bf16 for ck-based solvers.
+ - [Conv] Add split_k tuning to 2d wrw ck-based solver.
+ - [MHA] graph API fp8 fwd.
+ - [RNN] multi-stream as default solution.
+- Added TunaNetv2.0 for MI300.
+- Added Adam and AMP Adam optimizer.
+
+#### Resolved issues
+
+- Memory access fault caused by `GemmBwdRest`.
+- Context configuration in `GetWorkSpaceSize`.
+- Fixes to support huge tensors.
+
+#### Optimizations
+
+- Find: improved precision of benchmarking.
+
+### **MIVisionX** (3.0.0)
+
+#### Changes
+
+- Added support for:
+ - Advanced GPUs
+ - PreEmphasis Filter augmentation in openVX extensions
+ - Spectrogram augmentation in openVX extensions
+ - Downmix and ToDecibels augmentations in openVX extensions
+ - Resample augmentation and Operator overloading nodes in openVX extensions
+ - NonSilentRegion and Slice augmentations in openVX extensions
+ - Mel-Filter bank and Normalize augmentations in openVX extensions
+
+#### Removals
+
+- Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL).
+
+#### Resolved issues
+
+- Fixed issues with dependencies.
+
+#### Known issues
+
+- MIVisionX package install requires manual prerequisites installation.
+
+### **Omniperf** (2.0.1)
+
+#### Known issues
+
+- Error when running Omniperf with an application with command line arguments. As a workaround, create an
+ intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This
+ issue will be fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/omniperf/issues/347).
+
+- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error:
+ "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the
+ environment variable `export ROCPROF=rocprofv2`.
+
+- Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message:
+
+ "*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.*
+
+ *[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.*
+
+ *Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.*
+
+ *See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*"
+
+ As a workaround, install these Python requirements manually: `pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`.
+
+### **OpenMP** (17.0.0)
+
+#### Changes
+
+- Added basic experimental support for ``libc`` functions on the GPU via the
+ LLVM C Library for GPUs.
+- Added minimal support for calling host functions from the device using the
+ `libc` interface.
+- Added vendor agnostic OMPT callback support for OpenMP-based device offload.
+
+#### Removals
+
+- Removed the "old" device plugins along with support for the `remote` and
+ `ve` plugins.
+
+#### Resolved issues
+
+- Fixed the implementation of `omp_get_wtime` for AMDGPU targets.
+
+### **RCCL** (2.20.5)
+
+#### Changes
+
+- Added support for `fp8` and `rccl_bfloat8`.
+- Added support for using HIP contiguous memory.
+- Added ROC-TX for host-side profiling.
+- Added new rome model.
+- Added `fp16` and `fp8` cases to unit tests.
+- Added a new unit test for main kernel stack size.
+- Added the new `-n` option for `topo_expl` to override the number of nodes.
+- Improved debug messages of memory allocations.
+- Enabled static build.
+- Enabled compatibility with:
+ - NCCL 2.20.5.
+ - NCCL 2.19.4.
+- Performance tuning for some collective operations on MI300.
+- Enabled NVTX code in RCCL.
+- Replaced `rccl_bfloat16` with `hip_bfloat16`.
+- NPKit updates:
+ - Warm-up iteration removal is no longer enabled by default; you must now opt in.
+ - Doubled the size of buffers to accommodate more channels.
+- Modified rings to be rail-optimized topology friendly.
+
+#### Resolved issues
+
+- Fixed a bug when configuring RCCL for only LL128 protocol.
+- Fixed scratch memory allocation after API change for MSCCL.
+
+### **rocAL** (1.0.0)
+
+#### Changes
+
+- Added tests and samples.
+
+#### Removals
+
+- Removed CuPy from `setup.py`.
+
+
+#### Optimizations
+
+- Added setup and install updates.
+
+#### Resolved issues
+
+- Minor bug fixes.
+
+### **rocALUTION** (3.2.0)
+
+#### Changes
+
+* Added new file I/O based on rocSPARSE I/O format.
+* Added `GetConvergenceHistory` for ItILU0 preconditioner.
+
+#### Removals
+
+* Deprecated the following:
+ * `LocalMatrix::ReadFileCSR`
+ * `LocalMatrix::WriteFileCSR`
+ * `GlobalMatrix::ReadFileCSR`
+ * `GlobalMatrix::WriteFileCSR`
+
+### **rocBLAS** (4.2.0)
+
+#### Changes
+
+* Added an ILP64 API for Level 2 functions and Level 3 `trsm`, for both C and Fortran (`_64` name suffix), with `int64_t` function arguments.
+* Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`.
+* Added Benchmark class for common timing code.
+* Added an environment variable `ROCBLAS_DEFAULT_ATOMICS_MODE` to set the default atomics mode during creation of `rocblas_handle`.
+* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`.
+
+* Updated Linux AOCL dependency to release 4.2 gcc build.
+* Updated Windows vcpkg dependencies to release 2024.02.14.
+* Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40.
+
+#### Optimizations
+
+* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator.
+
+#### Removals
+
+* Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next
+ major release of rocBLAS. Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage.
+
+### **ROCdbgapi** (0.75.0)
+
+#### Removals
+
+- Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`.
+ The old names are still accessible but deprecated.
+
+### **rocDecode** (0.6.0)
+
+#### Changes
+
+- Added full H.264 support and bug fixes.
+
+### **rocFFT** (1.0.28)
+
+#### Changes
+
+* Randomly generated accuracy tests are now disabled by default. They can be enabled using
+ the `--nrand` option (which defaults to 0).
+
+#### Optimizations
+
+* Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks
+ are transformed locally, with global transposes to make remaining dimensions contiguous.
+
+### **rocm-cmake** (0.13.0)
+
+#### Changes
+
+- `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds.
+ - Package names are no longer pulled from `CPACK__PACKAGE_NAME`.
+ - Runtime packages will no longer be generated for static builds.
+
+### **ROCm Data Center Tool** (1.0.0)
+
+#### Changes
+
+- Added ROCProfiler `dmon` metrics.
+- Added new ECC metrics.
+- Added ROCm Validation Suite diagnostic command.
+- Fully migrated to AMD SMI.
+
+#### Removals
+
+- Removed RASLIB dependency and blobs.
+- Removed `rocm_smi_lib` dependency due to migration to AMD SMI.
+
+### **ROCm Debugger (ROCgdb)** (14.2)
+
+#### Changes
+
+- Introduced the `coremerge` utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile.
+- Added support for generating and opening core files for heterogeneous processes.
+
+### **ROCm SMI** (7.3.0)
+
+#### Changes
+
+- Added Partition ID API (`rsmi_dev_partition_id_get(..)`).
+
+#### Resolved issues
+
+- Fixed Partition ID CLI output.
+
+```{note}
+See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md)
+on GitHub for more information.
+```
+
+### **ROCm Validation Suite** (1.0.0)
+
+#### Changes
+
+* Added stress tests:
+
+ * IET (power) stress test for MI300A.
+
+ * IET (power transition) test for MI300X.
+
+* Added support:
+
+ * GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output.
+
+ * Trigonometric float and random integer matrix data initialization support.
+
+* Updated GST performance benchmark test for better numbers.
+
+### **rocPRIM** (3.2.0)
+
+#### Changes
+
+* Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
+* The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter.
+ * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan).
+ This is the same behaviour as before this change.
+* Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element.
+* Added new public APIs for deriving resulting type on device-only functions:
+ * `rocprim::invoke_result`
+ * `rocprim::invoke_result_t`
+ * `rocprim::invoke_result_binary_op`
+ * `rocprim::invoke_result_binary_op_t`
+* Added the new `rocprim::batch_copy` function. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
+* Added more test cases, to better cover supported data types.
+* Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
+ To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
+ of references to arithmetic types from the key.
+* Added `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
+ It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is `true`.
+ It can be instantiated with:
+ * `rocprim::make_predicate_iterator`
+ * `rocprim::make_mask_iterator`
+* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4; it can be a number between 0 and 32.
+* Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
+* Updated some tests to work with supported data types.
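+
+The decomposer described above can be pictured with a small host-only sketch. It does not call rocPRIM: `std::tuple` stands in for `rocprim::tuple`, and `custom_key`, `custom_key_decomposer`, and `sort_by_decomposed_key` are hypothetical names chosen for illustration. It also assumes that the first tuple element is treated as the most significant part of the key.
+
+```cpp
+#include <algorithm>
+#include <cstdint>
+#include <tuple>
+#include <vector>
+
+// Hypothetical user-defined key type (not part of rocPRIM).
+struct custom_key
+{
+    std::uint16_t major;
+    float         minor;
+};
+
+// Hypothetical decomposer: returns a tuple of references to the arithmetic
+// members of the key. std::tuple is a host-side stand-in for rocprim::tuple.
+struct custom_key_decomposer
+{
+    std::tuple<std::uint16_t&, float&> operator()(custom_key& key) const
+    {
+        return std::tie(key.major, key.minor);
+    }
+};
+
+// Host-only illustration of the ordering a decomposer defines: keys are
+// compared member by member, in the order the decomposer lists them.
+inline void sort_by_decomposed_key(std::vector<custom_key>& keys)
+{
+    custom_key_decomposer decompose;
+    std::sort(keys.begin(), keys.end(), [&](custom_key a, custom_key b) {
+        return decompose(a) < decompose(b); // lexicographic tuple comparison
+    });
+}
+```
+
+On the device, the same functor shape would be passed as the optional `decomposer` argument instead of being used in a comparison; refer to the rocPRIM API reference for the exact overloads.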
+
+#### Optimizations
+
+* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
+* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warp size is equal to the items per thread.
+
+#### Resolved issues
+
+* Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is
+ larger than the logical warp size. The test suite has been updated with such cases.
+* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`.
+* Fixed incorrect output type used in `device_adjacent_difference`.
+* Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types.
+* Fixed an issue with `device_adjacent_difference`. It now considers both the
+ input and the output type for selecting the appropriate kernel launch config.
+ Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
+* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`.
+* Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.
+
+#### Removals
+
+* Deprecated the internal header `detail/match_result_type.hpp`.
+* Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`.
+* Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name.
+
+### **ROCProfiler** (2.0.0)
+
+#### Removals
+
+- Removed `pcsampler` sample code due to deprecation from version 2.
+
+### **rocRAND** (3.1.0)
+
+#### Changes
+
+* Added `rocrand_create_generator_host`.
+ * The following generators are supported:
+ * `ROCRAND_RNG_PSEUDO_MRG31K3P`
+ * `ROCRAND_RNG_PSEUDO_MRG32K3A`
+ * `ROCRAND_RNG_PSEUDO_PHILOX4_32_10`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20`
+ * `ROCRAND_RNG_PSEUDO_XORWOW`
+ * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32`
+ * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64`
+ * `ROCRAND_RNG_QUASI_SOBOL32`
+ * `ROCRAND_RNG_QUASI_SOBOL64`
+ * The host-side generators support multi-core processing. On Linux, this requires the TBB (Thread Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives).
+ * If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread.
+* Added the option to create a host generator to the Python wrapper.
+* Added the option to create a host generator to the Fortran wrapper.
+* Added dynamic ordering. This ordering is free to rearrange the produced numbers,
+ which can be specific to devices and distributions. It is implemented for:
+ * XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry
+* Added support for using Clang as the host compiler for alternative platform compilation.
+* C++ wrapper:
+ * Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`.
+ * Added optional order parameter to the constructor of `mt19937_engine`.
+* Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator:
+ * `rocrand_normal2`
+ * `rocrand_normal_double2`
+ * `rocrand_log_normal2`
+ * `rocrand_log_normal_double2`
+* Added `rocrand_create_generator_host_blocking` which dispatches without stream semantics.
+* Added host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`.
+* Added offset and skipahead functionality to LFSR113 generator.
+* Added dynamic ordering for architecture `gfx1102`.
+
+* For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few
+ things to be aware of:
+ - Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph.
+ - After the generator is created, you may call API functions to set its seed, offset, and order.
+ - After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph.
+ - A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour.
+ - API calls for the Poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs.
+ - For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp`.
+* Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However, consuming rocRAND is still possible from C++11 as public headers don't make use of the new features.
+* Building rocRAND should be faster on machines with multiple CPU cores as the library has been
+ split to multiple compilation units.
+* C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`.
+* Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE`.
+* Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
+
+#### Removals
+
+* Deprecated the following typedefs. Please use the unified `state_type` alias instead.
+ * `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state`
+ * `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state`
+ * `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state`
+ * `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state`
+* Deprecated the following internal headers:
+ * `src/rng/distribution/distributions.hpp`.
+ * `src/rng/device_engines.hpp`.
+* Removed references to and workarounds for deprecated hcc.
+* Removed support for HIP-CPU.
+
+#### Known issues
+
+- `SOBOL64` and `SCRAMBLED_SOBOL64` generate Poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release.
+
+### **ROCr Runtime** (1.14.0)
+
+#### Changes
+
+- Added PC sampling feature (experimental feature).
+
+### **rocSOLVER** (3.26.0)
+
+#### Changes
+
+- Added 64-bit APIs for existing functions:
+ - GETF2_64 (with `batched` and `strided_batched` versions)
+ - GETRF_64 (with `batched` and `strided_batched` versions)
+ - GETRS_64 (with `batched` and `strided_batched` versions)
+- Added gfx900 to default build targets.
+- Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection:
+ - SYEVDX (with `batched` and `strided_batched` versions)
+ - HEEVDX (with `batched` and `strided_batched` versions)
+- Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection:
+ - SYGVDX (with `batched` and `strided_batched` versions)
+ - HEGVDX (with `batched` and `strided_batched` versions)
+- Renamed install script arguments of the form `*_dir` to `*-path`. Arguments of the form `*_dir` remain functional for
+ backwards compatibility.
+- Functions working with arrays of size n - 1 can now accept null pointers when n = 1.
+
+#### Optimizations
+
+- Improved performance of Cholesky factorization.
+- Improved performance of `splitlu` to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U.
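+
+The relation M = (L - eye) + U noted above can be illustrated with a dense, host-only sketch. This is not the rocSOLVER API, which operates on sparse matrices on the device; `dense_matrix` and `split_lu_dense` are hypothetical names, and row-major dense storage is assumed purely to keep the example short.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Row-major dense n x n matrix, used only to illustrate the identity.
+using dense_matrix = std::vector<double>;
+
+// Hypothetical host-side helper: split a combined factor M = (L - I) + U
+// into unit lower-triangular L and upper-triangular U.
+inline void split_lu_dense(const dense_matrix& M, std::size_t n,
+                           dense_matrix& L, dense_matrix& U)
+{
+    L.assign(n * n, 0.0);
+    U.assign(n * n, 0.0);
+    for(std::size_t i = 0; i < n; ++i)
+    {
+        for(std::size_t j = 0; j < n; ++j)
+        {
+            if(j < i)
+                L[i * n + j] = M[i * n + j]; // strictly lower part belongs to L
+            else
+                U[i * n + j] = M[i * n + j]; // diagonal and upper part belong to U
+        }
+        L[i * n + i] = 1.0; // the unit diagonal of L is implicit in M
+    }
+}
+```
+
+Because the diagonal of L is always 1, it does not need to be stored in M, which is why the diagonal of M can hold the diagonal of U.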
+
+#### Resolved issues
+
+- Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues.
+
+### **rocSPARSE** (3.2.0)
+
+#### Changes
+
+* Added a new Merge-Path algorithm to SpMM, supporting CSR format.
+* Added support for row order to SpSM.
+* Added rocsparseio I/O functionality to the library.
+* Added `rocsparse_set_identity_permutation`.
+
+* Adjusted rocSPARSE dependencies to related HIP packages.
+* Binary size has been reduced.
+* A namespace has been wrapped around internal rocSPARSE functions and kernels.
+* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be nullptr if `nnz` is 0.
+* gfx803 target has been removed from address sanitizer builds.
+
+#### Optimizations
+
+* SpMV adaptive and LRB algorithms have been further optimized on CSR format.
+* Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format.
+* Improved documentation and contribution guidelines.
+
+#### Resolved issues
+
+* Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`.
+
+### **rocThrust** (3.1.0)
+
+#### Changes
+
+* Added changes from upstream CCCL/thrust 2.2.0.
+ * Updated the contents of `system/hip` and `test` with the upstream changes.
+* Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`.
+* Updated to use `rocprim::device_adjacent_difference` for `adjacent_difference` API call.
+* Updated internal use of custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`.
+* Updated `adjacent_difference` to make use of `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, it uses `rocprim::adjacent_difference_inplace`.
+
+#### Known issues
+
+* `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device.
+* Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs.
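+
+The reproducibility note for `thrust::reduce_by_key` comes down to floating-point addition not being associative: the same values summed in a different order can round differently. The host-only sketch below shows the effect with two fixed summation orders. It does not use rocThrust, and `sum_forward` and `sum_backward` are hypothetical helper names.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Sum left to right: ((v0 + v1) + v2) + ...
+inline float sum_forward(const std::vector<float>& v)
+{
+    float acc = 0.0f;
+    for(std::size_t i = 0; i < v.size(); ++i)
+        acc += v[i];
+    return acc;
+}
+
+// Sum right to left: ... + (v[n-2] + v[n-1])
+inline float sum_backward(const std::vector<float>& v)
+{
+    float acc = 0.0f;
+    for(std::size_t i = v.size(); i > 0; --i)
+        acc += v[i - 1];
+    return acc;
+}
+```
+
+For the input `{1e8f, -1e8f, 1.0f}`, the forward sum keeps the final `1.0f`, while the backward sum absorbs it into `-1e8f` first and loses it. A parallel reduction may combine partial results in an order that varies between runs, which is why results can differ bit-wise.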
+
+### **rocWMMA** (1.5.0)
+
+#### Changes
+
+* Added internal utilities for:
+ * Element-wise vector transforms.
+ * Cross-lane vector transforms.
+* Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16.
+* Added tests for new internal transforms.
+
+* Improved loading layouts by increasing vector width for fragments with `blockDim > 32`.
+* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments.
+* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary.
+* Refactored entry-point of std library usage to improve hipRTC support.
+* Updated installation, programmer's guide and API reference documentation.
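+
+For readers unfamiliar with the aos<->soa terminology used above, the following host-only sketch shows the two layouts and the conversion between them. It is conceptual only: rocWMMA's internal transforms operate on fragment data within a wave, and `vec4`, `vec4_soa`, `aos_to_soa`, and `soa_to_aos` are hypothetical names that are not part of the rocWMMA API.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Array-of-structures (AoS): the components of one element are adjacent.
+struct vec4
+{
+    float x, y, z, w;
+};
+
+// Structure-of-arrays (SoA): each component is stored contiguously.
+struct vec4_soa
+{
+    std::vector<float> x, y, z, w;
+};
+
+inline vec4_soa aos_to_soa(const std::vector<vec4>& aos)
+{
+    vec4_soa soa;
+    for(const vec4& v : aos)
+    {
+        soa.x.push_back(v.x);
+        soa.y.push_back(v.y);
+        soa.z.push_back(v.z);
+        soa.w.push_back(v.w);
+    }
+    return soa;
+}
+
+// Assumes all four component arrays have the same length.
+inline std::vector<vec4> soa_to_aos(const vec4_soa& soa)
+{
+    std::vector<vec4> aos;
+    for(std::size_t i = 0; i < soa.x.size(); ++i)
+        aos.push_back(vec4{soa.x[i], soa.y[i], soa.z[i], soa.w[i]});
+    return aos;
+}
+```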
+
+#### Resolved issues
+
+* Fixed the ordering of some header includes to improve portability.
+
+### **RPP** (1.8.0)
+
+#### Changes
+
+- Prerequisites - ROCm install requires only `--usecase=rocm`.
+- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory.
+- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP.
+- RPP Tensor Audio support on HOST for Spectrogram.
+- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version.
+- RPP Tensor Audio support on HOST for Mel Filter Bank.
+- RPP Tensor Normalize ND support on HOST and `HIP`.
+
+### **Tensile** (4.41.0)
+
+#### Changes
+
+- New tuning script to summarize rocBLAS log file
+- New environment variable to test fixed grid size with Stream-K kernels
+- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
+- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
+- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation))
+- Support NT flag for global load and store for gfx94x
+- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
+- Add option for rotating buffers timing with cache eviction
+- Add predicate for arithmetic intensity
+- Add DirectToVgpr + packing for f8/f16 + TLU cases
+- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
+- Add test cases for DirectToVgpr + packing
+- Add batch support for Stream-K kernels and new test cases
+- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
+- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
+- Support VectorWidthB (new parameter VectorWidthB)
+- Support VectorWidth + non SourceSwap
+- Add test cases for VectorWidthB, VectorWidth + non SourceSwap
+- Add code owners file
+- New environment variables to dynamically adjust number of CUs used in Stream-K
+- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1))
+- Add xf32 option to rocblas-bench input creator
+
+- Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
+- Limit build threads based on CPUs/RAM available on system (for tests)
+- Update required workspace size for Stream-K, skip kernel initialization when possible
+- Use fallback libraries for archs without optimized logic
+- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
+- Remove OCL tests
+- Disable HostLibraryTests
+- Reduce extended test time by removing extra parameters in the test config files
+- Disable InitAccVgprOpt for Stream-K
+- Skip sgemm 64bit offset tests for gfx94x
+- Skip DTV, DTL, LSU+MFMA tests for gfx908
+- Increase extended test timeout to 720 min
+- Update xfail test (1sum tests only failing on gfx90a)
+- Update lib logic convertor script
+- Test limiting CI threads for only gfx11
+- wGM related kernargs are removed if they are not needed (WGM=-1,0,1)
+- Cleanup on unused old code, mostly related to old client
+- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
+- Update efficiency script for new architecture and xf32 datatype
+- Re-enable negative values for WorkGroupMapping (asm kernel only)
+- Disable HW monitor for aquvavanjaram941
+- Pre-apply offsets for strided batch kernels
+- Update tensile build with 16 threads
+
+#### Optimizations
+
+- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).
+- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
+- Reversed MFMA order in inner loop for odd outer iteration
+- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
+- Enhanced maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)
+
+#### Resolved issues
+
+- Fixed `WorkspaceCheck` implementation when used in rocBLAS.
+- Fixed Stream-K partials cache behavior.
+- Fixed `MasterSolutionLibrary` indexing for multiple architecture build.
+- Fixed memory allocation failure with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).
+- Fixed BufferLoad=False with Stream-K.
+- Fixed mismatch issue with `GlobalReadCoalesceGroup`.
+- Fixed rocBLAS build failure on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]).
+- Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).
+- Fixed inaccurate vgpr allocation for ClusterLocalRead.
+- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2.
+- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0).
+- Fixed HostLibraryTests on gfx942 and gfx941.
+- Fixed LLVM crash issue.
+- Fixed the build for newer Windows vcpkg msgpack and vcpkg version package name.
+- Fixed an error with DisableKernelPieces + 32bit ShadowLimit.
+- Ignored asm cap check for kernel arg preload for ROCm 6.0 and older.
+
+## ROCm known issues
+
+ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known
+issues related to individual components, review the [Detailed component changes](detailed-component-changes).
+
+### Default processor affinity behavior for helper threads
+
+Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm
+helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to
+threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if
+the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If
+`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when
+creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads
+by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment,
+batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing
+the setting may be advantageous to performance.
+
+To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the
+`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:
+
+```{code} shell
+export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
+```
+
+To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable
+as follows:
+
+``` shell
+export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
+```
+
+Alternatively, unset the variable to restore the default behavior:
+
+``` shell
+unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
+```
+
+If unsure of the default processor affinity settings for your environment, run the following command from the shell:
+
+``` shell
+bash -c "taskset -p \$\$"
+```
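+
+As a minimal sketch of the rules described in this note (not part of ROCm; the function name
+`describe_helper_affinity` is hypothetical), the following shell function reports which helper-thread behavior the
+current value of `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` selects:
+
+```shell
+# Hypothetical helper: maps the variable's state to the behavior described in this note.
+describe_helper_affinity() {
+    # ${VAR+set} expands to "set" only when the variable is defined, even if it is empty.
+    if [ -z "${HSA_OVERRIDE_CPU_AFFINITY_DEBUG+set}" ]; then
+        echo "unset: helper threads are spawned on all available cores (default)"
+        return 0
+    fi
+    case "$HSA_OVERRIDE_CPU_AFFINITY_DEBUG" in
+        0) echo "0: helper threads use the parent's core affinity mask" ;;
+        1) echo "1: helper threads are spawned on all available cores" ;;
+        *) echo "$HSA_OVERRIDE_CPU_AFFINITY_DEBUG: value not covered by this note" ;;
+    esac
+}
+
+describe_helper_affinity
+```
+
+Calling the function in a job script before launching a ROCm application can help confirm which setting the
+application will inherit.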
+
+### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver
+
+AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video
+output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox
+G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions.
+
+The AMDGPU driver shipped with ROCm interferes with the operation of the display card video output. On Dell systems,
+this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading
+the `amdgpu` driver modules. The issue affects both terminal access when running in `runlevel 3` and GUI access when
+running in `runlevel 5`. Server functionality can still be accessed via SSH or other remote connection methods.
+
+### KFDTest failure on Instinct MI300X with Oracle Linux 8.9
+
+The `KFDEvictTest.QueueTest` is failing on the MI300X platform during KFD (Kernel Fusion Driver) tests, causing the full
+suite to not execute properly. This issue is suspected to be hardware-related.
+
+### Bandwidth limitation in gang and non-gang modes on Instinct MI300A
+
+Expected target peak non-gang performance (~60GB/s) and target peak gang performance (~90GB/s) are not achieved. Both gang
+and non-gang performance are observed to be limited at 45GB/s.
+
+This issue will be addressed in a future ROCm release.
+
+### rocm-llvm-alt
+
+ROCm provides an optional package -- `rocm-llvm-alt` -- that provides a closed-source compiler for
+users interested in additional closed-source CPU optimizations. This feature is not functional in
+the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an
+LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows
+use of the closed-source compiler. It is recommended to compile using the default open-source
+compiler, which generates high-quality AMD CPU and AMD GPU code.
+
+## ROCm upcoming changes
+
+This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review
+the [Detailed component changes](detailed-component-changes).
+
+### rocm-llvm-alt
+
+The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the
+functionality provided by the closed-source compiler should transition to the open-source compiler.
+Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by
+the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler
+has been removed*".
diff --git a/docs/conf.py b/docs/conf.py
index 929046e75..70b642277 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -25,16 +25,16 @@ latex_elements = {
project = "ROCm Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
-version = "6.1.2"
-release = "6.1.2"
+version = "6.2.0"
+release = "6.2.0"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_author = ""
# pages with specific settings
article_pages = [
- {"file": "about/release-notes", "os": ["linux", "windows"], "date": "2024-06-04"},
- {"file": "about/changelog", "os": ["linux", "windows"], "date": "2024-06-04"},
+ {"file": "about/release-notes", "os": ["linux", "windows"], "date": "2024-08-02"},
+ {"file": "about/changelog", "os": ["linux", "windows"], "date": "2024-08-02"},
{"file": "how-to/deep-learning-rocm", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/index", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/install", "os": ["linux"]},
@@ -104,7 +104,7 @@ html_theme = "rocm_docs_theme"
html_theme_options = {"flavor": "rocm-docs-home"}
html_static_path = ["sphinx/static/css"]
-html_css_files = ["rocm_custom.css"]
+html_css_files = ["rocm_custom.css", "rocm_rn.css"]
html_title = "ROCm Documentation"
diff --git a/docs/how-to/system-optimization/mi300a.rst b/docs/how-to/system-optimization/mi300a.rst
index 0c19e3245..8fb65e6f8 100644
--- a/docs/how-to/system-optimization/mi300a.rst
+++ b/docs/how-to/system-optimization/mi300a.rst
@@ -191,6 +191,8 @@ This section describes performance-based settings.
echo 20 > /proc/sys/vm/compaction_proactiveness
echo 1 > /proc/sys/vm/compact_unevictable_allowed
+.. _mi300a-processor-affinity:
+
* **Change affinity of ROCm helper threads**
This change prevents internal ROCm threads from having their CPU core affinity mask
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index aefb9fb2c..f1c76c05a 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -9,12 +9,6 @@ subtrees:
- file: what-is-rocm.rst
- file: about/release-notes.md
title: Release notes
- subtrees:
- - entries:
- - file: about/changelog.md
- title: Changelog
- - url: https://github.com/ROCm/ROCm/labels/Verified%20Issue
- title: Known issues
- caption: Install
entries:
diff --git a/docs/sphinx/static/css/rocm_custom.css b/docs/sphinx/static/css/rocm_custom.css
index 3649fe537..476a90041 100644
--- a/docs/sphinx/static/css/rocm_custom.css
+++ b/docs/sphinx/static/css/rocm_custom.css
@@ -1,6 +1,21 @@
+/* Override PyData Sphinx Theme default colors */
+html[data-theme='light'] {
+ --pst-color-table-row-hover-bg: #E2E8F0;
+}
+
+html[data-theme='dark'] {
+ --pst-color-table-row-hover-bg: #1E293B;
+}
+
+a svg {
+ color: var(--pst-color-text-base);
+}
+
+a svg:hover {
+ color: var(--pst-color-link-hover);
+}
/* Adds container for big tables, used for Compatibility Matrix */
-
.format-big-table {
white-space: nowrap;
- }
+}
diff --git a/docs/sphinx/static/css/rocm_rn.css b/docs/sphinx/static/css/rocm_rn.css
new file mode 100644
index 000000000..dd82ed4b7
--- /dev/null
+++ b/docs/sphinx/static/css/rocm_rn.css
@@ -0,0 +1,126 @@
+#rocm-rn-components col {
+ width: 6rem;
+}
+#rocm-rn-components col:nth-child(2) {
+ width: 12rem;
+}
+#rocm-rn-components td {
+ white-space: nowrap;
+}
+#rocm-rn-components td:last-of-type {
+ text-align: center;
+}
+#rocm-rn-components a svg {
+ color: var(--pst-color-text-base);
+}
+#rocm-rn-components a svg:hover {
+ color: var(--pst-color-link-hover);
+}
+#rocm-rn-components .tbody-reverse-zebra tr:nth-child(2n + 1) td {
+ background-color: var(--pst-color-table-row-zebra-high-bg);
+}
+#rocm-rn-components .tbody-reverse-zebra tr:nth-child(2n) td {
+ background-color: var(--pst-color-table-row-zebra-low-bg);
+}
+
+#rocm-rn-components:has(tbody.rocm-components-libs th[rowspan]:first-of-type:hover) .rocm-components-libs,
+#rocm-rn-components:has(tbody.rocm-components-libs th[rowspan]:first-of-type:hover) .rocm-components-libs td,
+#rocm-rn-components:has(tbody.rocm-components-libs th[rowspan]:first-of-type:hover) tbody.rocm-components-libs th {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) .rocm-components-tools,
+#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) .rocm-components-tools td,
+#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) tbody.rocm-components-tools th {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-compilers th[rowspan]:first-of-type:hover) .rocm-components-compilers,
+#rocm-rn-components:has(tbody.rocm-components-compilers th[rowspan]:first-of-type:hover) .rocm-components-compilers td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-runtimes th[rowspan]:first-of-type:hover) .rocm-components-runtimes,
+#rocm-rn-components:has(tbody.rocm-components-runtimes th[rowspan]:first-of-type:hover) .rocm-components-runtimes td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+
+#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) .rocm-components-tools td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-compilers th[rowspan]:first-of-type:hover) .rocm-components-compilers td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-runtimes th[rowspan]:first-of-type:hover) .rocm-components-runtimes td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+
+#rocm-rn-components:has(tbody.rocm-components-ml th[rowspan]:nth-of-type(2):hover) .rocm-components-ml td,
+#rocm-rn-components:has(tbody.rocm-components-ml th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-communication th[rowspan]:nth-of-type(2):hover) .rocm-components-communication td,
+#rocm-rn-components:has(tbody.rocm-components-communication th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-math th[rowspan]:nth-of-type(2):hover) .rocm-components-math td,
+#rocm-rn-components:has(tbody.rocm-components-math th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-primitives th[rowspan]:nth-of-type(2):hover) .rocm-components-primitives td,
+#rocm-rn-components:has(tbody.rocm-components-primitives th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-dev th[rowspan]:nth-of-type(2):hover) .rocm-components-dev td,
+#rocm-rn-components:has(tbody.rocm-components-dev th[rowspan]:nth-of-type(2):hover) .rocm-components-tools th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-perf th[rowspan]:nth-of-type(2):hover) .rocm-components-perf td,
+#rocm-rn-components:has(tbody.rocm-components-perf th[rowspan]:nth-of-type(2):hover) .rocm-components-tools th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-system th[rowspan]:nth-of-type(2):hover) .rocm-components-system td,
+#rocm-rn-components:has(tbody.rocm-components-system th[rowspan]:nth-of-type(2):hover) .rocm-components-tools th:first-of-type {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+
+#rocm-rn-components:has(tbody.rocm-components-ml td:hover) .rocm-components-ml th,
+#rocm-rn-components:has(tbody.rocm-components-ml td:hover) .rocm-components-libs th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-ml td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-communication td:hover) .rocm-components-communication th,
+#rocm-rn-components:has(tbody.rocm-components-communication td:hover) .rocm-components-libs th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-communication td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-math td:hover) .rocm-components-math th,
+#rocm-rn-components:has(tbody.rocm-components-math td:hover) .rocm-components-libs th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-math td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-primitives td:hover) .rocm-components-primitives th,
+#rocm-rn-components:has(tbody.rocm-components-primitives td:hover) .rocm-components-libs th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-primitives td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-dev td:hover) .rocm-components-dev th,
+#rocm-rn-components:has(tbody.rocm-components-dev td:hover) .rocm-components-tools th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-dev td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-perf td:hover) .rocm-components-perf th,
+#rocm-rn-components:has(tbody.rocm-components-perf td:hover) .rocm-components-tools th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-perf td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-system td:hover) .rocm-components-system th,
+#rocm-rn-components:has(tbody.rocm-components-system td:hover) .rocm-components-tools th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-system td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-compilers td:hover) .rocm-components-compilers th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-compilers td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
+#rocm-rn-components:has(tbody.rocm-components-runtimes td:hover) .rocm-components-runtimes th:first-of-type,
+#rocm-rn-components:has(tbody.rocm-components-runtimes td:hover) tr:hover > td {
+ background-color: var(--pst-color-table-row-hover-bg);
+}
diff --git a/temp.md b/temp.md
new file mode 100644
index 000000000..8234d4a67
--- /dev/null
+++ b/temp.md
@@ -0,0 +1,58 @@
+## Components
+
+The following table lists ROCm components and their individual versions for ROCm 6.2.0. Find an overview of officially
+supported versions of ROCm components, third-party libraries, and frameworks in the
+[Compatibility matrix](https://rocm.docs.amd.com/en/latest/release/docs/6.2.0/compatibility/compatibility-matrix).
+
+| Category | Group | Name | Version | |
+|----------|-------|------|---------|:-:|
+| **Libraries** | **Machine learning and computer vision** | [Composable Kernel](https://rocm.docs.amd.com/projects/composable_kernel/en/docs/6.2.0) | 1.1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/composable_kernel/releases/tag/rocm-6.2.0) |
+| | | [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/docs/6.2.0) | 2.9 ⇒ [2.10](migraphx-2-10-0) | [{fab}`github fa-lg`](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.2.0) |
+| | | [MIOpen](https://rocm.docs.amd.com/projects/MIOpen/en/docs/6.2.0) | 3.1.0 ⇒ [3.2.0](miopen-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.2.0) |
+| | | [MIVisionX](https://rocm.docs.amd.com/projects/MIVisionX/en/docs/6.2.0) | 2.5.0 ⇒ [3.0.0](mivisionx-3-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.2.0) |
+| | | [rocAL](https://rocm.docs.amd.com/projects/rocAL/en/docs/6.2.0) | 2.0.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocAL/releases/tag/rocm-6.2.0) |
+| | | [rocDecode](https://rocm.docs.amd.com/projects/rocDecode/en/docs/6.2.0) | 0.6.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocDecode/releases/tag/rocm-6.2.0) |
+| | | [rocPyDecode](https://rocm.docs.amd.com/projects/rocPyDecode/en/docs/6.2.0) | 0.1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocPyDecode/releases/tag/rocm-6.2.0) |
+| | | [RPP](https://rocm.docs.amd.com/projects/rpp/en/docs/6.2.0) | 1.5.0 ⇒ [1.8.0](rpp-1-8-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rpp/releases/tag/rocm-6.2.0) |
+| | **Communication** | [rccl](https://rocm.docs.amd.com/projects/rccl/en/docs/6.2.0) | 2.18.6 ⇒ [2.20.5](rccl-2-20-5) | [{fab}`github fa-lg`](https://github.com/ROCm/rccl/releases/tag/rocm-6.2.0) |
+| | **Math** | [hipBLAS](https://rocm.docs.amd.com/projects/hipBLAS/en/docs/6.2.0) | 2.1.0 ⇒ [2.2.0](hipblas-2-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.2.0) |
+| | | [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs/6.2.0) | 0.7.0 ⇒ [0.8.0](hipblaslt-0-8-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.2.0) |
+| | | [hipFFT](https://rocm.docs.amd.com/projects/hipFFT/en/docs/6.2.0) | [1.0.14](hipfft-1-0-14) | [{fab}`github fa-lg`](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.2.0) |
+| | | [hipfort](https://rocm.docs.amd.com/projects/hipfort/en/docs/6.2.0) | 0.4-0 | [{fab}`github fa-lg`](https://github.com/ROCm/hipfort/releases/tag/rocm-6.2.0) |
+| | | [hipRAND](https://rocm.docs.amd.com/projects/hipRAND/en/docs/6.2.0) | 2.10.17 ⇒ [2.11.0](hiprand-2-11-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.2.0) |
+| | | [hipSOLVER](https://rocm.docs.amd.com/projects/hipSOLVER/en/docs/6.2.0) | 2.1.1 ⇒ [2.2.0](hipsolver-2-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.2.0) |
+| | | [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/docs/6.2.0) | 3.0.1 ⇒ [3.1.1](hipsparse-3-1-1) | [{fab}`github fa-lg`](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.2.0) |
+| | | [hipSPARSELt](https://rocm.docs.amd.com/projects/hipSPARSELt/en/docs/6.2.0) | 0.2.0 ⇒ [0.2.1](hipsparselt-0-2-1) | [{fab}`github fa-lg`](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.2.0) |
+| | | [rocALUTION](https://rocm.docs.amd.com/projects/rocALUTION/en/docs/6.2.0) | 3.1.1 ⇒ [3.2.0](rocalution-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.2.0) |
+| | | [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/docs/6.2.0) | 4.1.0 ⇒ [4.2.0](rocblas-4-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.2.0) |
+| | | [rocFFT](https://rocm.docs.amd.com/projects/rocFFT/en/docs/6.2.0) | 1.0.27 ⇒ [1.0.28](rocfft-1-0-28) | [{fab}`github fa-lg`](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.2.0) |
+| | | [rocRAND](https://rocm.docs.amd.com/projects/rocRAND/en/docs/6.2.0) | 3.0.0 ⇒ [3.1.0](rocrand-3-1-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.2.0) |
+| | | [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/docs/6.2.0) | 3.25.0 ⇒ [3.26.0](rocsolver-3-26-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.2.0) |
+| | | [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/docs/6.2.0) | 3.1.1 ⇒ [3.2.0](rocsparse-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.2.0) |
+| | | [rocWMMA](https://rocm.docs.amd.com/projects/rocWMMA/en/docs/6.2.0) | 1.4.0 ⇒ [1.5.0](rocwmma-1-5-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.2.0) |
+| | | [Tensile](https://rocm.docs.amd.com/projects/tensile/en/docs/6.2.0) | 4.40.0 ⇒ [4.41.0](tensile-4-41-0) | [{fab}`github fa-lg`](https://github.com/ROCm/tensile/releases/tag/rocm-6.2.0) |
+| | **Primitives** | [hipCUB](https://rocm.docs.amd.com/projects/hipCUB/en/docs/6.2.0) | 3.1.0 ⇒ [3.2.0](hipcub-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.2.0) |
+| | | [hipTensor](https://rocm.docs.amd.com/projects/hipTensor/en/docs/6.2.0) | 1.2.0 ⇒ [1.3.0](hiptensor-1-3-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.2.0) |
+| | | [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/docs/6.2.0) | 3.1.0 ⇒ [3.2.0](rocprim-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.2.0) |
+| | | [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/docs/6.2.0) | 3.0.0 ⇒ [3.1.0](rocthrust-3-1-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.2.0) |
+| **Tools** | **Development** | [HIPIFY](https://rocm.docs.amd.com/projects/HIPIFY/docs/6.2.0) | 17.0.0 ⇒ [18.0.0](hipify-18-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/HIPIFY/releases/tag/rocm-6.2.0) |
+| | | [ROCdbgapi](https://rocm.docs.amd.com/projects/ROCdbgapi/en/docs/6.2.0) | 0.71.0 ⇒ [0.76.0](rocdbgapi-0-76-0) | [{fab}`github fa-lg`](https://github.com/ROCm/ROCdbgapi/releases/tag/rocm-6.2.0) |
+| | | [ROCm CMake](https://rocm.docs.amd.com/projects/rocm-cmake/en/docs/6.2.0) | 0.12.0 ⇒ [0.13.0](rocm-cmake-0-13-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.2.0) |
+| | | [ROCm Debugger (ROCgdb)](https://rocm.docs.amd.com/projects/ROCgdb/en/docs/6.2.0) | 13 ⇒ [15](rocgdb-15) | [{fab}`github fa-lg`](https://github.com/ROCm/ROCgdb/releases/tag/rocm-6.2.0) |
+| | | [ROCr Debug Agent](https://rocm.docs.amd.com/projects/rocr_debug_agent/en/docs/6.2.0) | 2.0.3 | [{fab}`github fa-lg`](https://github.com/ROCm/rocr_debug_agent/releases/tag/rocm-6.2.0) |
+| | **Performance** | [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/docs/6.2.0) | 2.0.1 | [{fab}`github fa-lg`](https://github.com/ROCm/omniperf/releases/tag/rocm-6.2.0) |
+| | | [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/docs/6.2.0) | 1.11.2 | [{fab}`github fa-lg`](https://github.com/ROCm/omnitrace/releases/tag/rocm-6.2.0) |
+| | | [ROCm Bandwidth Test](https://rocm.docs.amd.com/projects/rocm_bandwidth_test/en/docs/6.2.0) | 1.4.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocm_bandwidth_test/releases/tag/rocm-6.2.0) |
+| | | [ROCProfiler](https://rocm.docs.amd.com/projects/ROCProfiler/en/docs/6.2.0) | 2.0.0 ⇒ [2.0.0](rocprofiler-2-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocprofiler/releases/tag/rocm-6.2.0) |
+| | | [ROCProfiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs/6.2.0) | 0.4.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocprofiler-sdk/releases/tag/rocm-6.2.0) |
+| | | [ROCTracer](https://rocm.docs.amd.com/projects/ROCTracer/en/docs/6.2.0) | 4.1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/roctracer/releases/tag/rocm-6.2.0) |
+| | **System** | [AMD SMI](https://rocm.docs.amd.com/projects/amdsmi/en/docs/6.2.0) | 24.5.2 ⇒ [24.6.1](amd-smi-24-6-1) | [{fab}`github fa-lg`](https://github.com/ROCm/amdsmi/releases/tag/rocm-6.2.0) |
+| | | [rocminfo](https://rocm.docs.amd.com/projects/rocminfo/en/docs/6.2.0) | 1.0.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocminfo/releases/tag/rocm-6.2.0) |
+| | | [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/docs/6.2.0) | 0.3.0 ⇒ [1.0.0](rocm-data-center-tool-1-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) |
+| | | [ROCm SMI](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/docs/6.2.0) | 7.2.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocm_smi_lib/releases/tag/rocm-6.2.0) |
+| | | [ROCm Validation Suite](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/docs/6.2.0) | 1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/ROCmValidationSuite/releases/tag/rocm-6.2.0) |
+| | | [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/docs/6.2.0) | 1.5.0 | [{fab}`github fa-lg`](https://github.com/ROCm/TransferBench/releases/tag/rocm-6.2.0) |
+| | **Compilers** | [hipCC](https://rocm.docs.amd.com/projects/hipCC/en/docs/6.2.0) | 1.0.0 ⇒ [1.1.1](hipcc-1-1-1) | [{fab}`github fa-lg`](https://github.com/ROCm/llvm-project/releases/tag/rocm-6.2.0) |
+| | | [llvm-project](https://rocm.docs.amd.com/projects/llvm-project/en/docs/6.2.0) | 17.0.0 ⇒ [18.0.0](llvm-project-18-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/llvm-project/releases/tag/rocm-6.2.0) |
+| **Runtimes** | | [HIP](https://rocm.docs.amd.com/projects/HIP/en/docs/6.2.0) | 6.1 ⇒ [6.2](hip-6-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/HIP/releases/tag/rocm-6.2.0) |
+| | | [ROCr Runtime](https://rocm.docs.amd.com/projects/ROCr-Runtime/en/docs/6.2.0) | 6.1 ⇒ [6.2](hip-6-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/ROCR-Runtime/releases/tag/rocm-6.2.0) |
diff --git a/tools/autotag/templates/extra_components/6.2.0.md b/tools/autotag/templates/extra_components/6.2.0.md
new file mode 100644
index 000000000..c5d1467e9
--- /dev/null
+++ b/tools/autotag/templates/extra_components/6.2.0.md
@@ -0,0 +1,1307 @@
+
+## Detailed component changes
+
+The following sections briefly describe key changes to ROCm components.
+
+### **AMD SMI** (24.6.2)
+
+#### Changes
+
+- Added the following functionality:
+ - `amd-smi dmon` is now available as an alias to `amd-smi monitor`.
+ - An optional process table under `amd-smi monitor -q`.
+ - Handling to detect VMs with passthrough configurations in CLI tool.
+ - Process Isolation and Clear SRAM functionality to the CLI tool for VMs.
+  - Ring Hang event.
+- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`.
+- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`.
+
+#### Removals
+
+- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported.
+- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`.
+
+#### Optimizations
+
+- Updated CLI error strings to specify invalid device type queried.
+- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest.
+ - Added `amdsmi.h` and `amdsmi_interface.py`.
+ - `amdsmi_clk_info_t` struct
+ - Added `AMDSMI` prefix to multiple structures.
+- Updated `dpm_policy` references to `soc_pstate`.
+- Updated `amdsmi_get_gpu_board_info()` so that `product_name` falls back to the `pciids` file.
+- `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`.
+- Updated CLI voltage curve command output.
+
+#### Resolved issues
+
+- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format.
+- `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
+- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`.
+- Fixed a leftover mutex deadlock when running multiple instances of the CLI tool. When running
+  `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any
+  other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either
+  clearing the mutex in `/dev/shm` or rebooting the machine.
+
+#### Known issues
+
+- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work.
+ They will be supported in a future release.
+
+```{note}
+See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md)
+on GitHub for more information.
+```
+
+### **Composable Kernel** (1.1.0)
+
+#### Changes
+
+- Added support for:
+ - Permute scale for any dimension (#1198).
+ - Combined elementwise op (#1217).
+ - Multi D in grouped convolution backward weight (#1280).
+ - K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280).
+ - Large batch in grouped convolution forward (#1332).
+- Added `CK_TILE` layernorm example (#1339).
+- `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as ROCm backend.
+
+#### Optimizations
+
+- Support universal GEMM in grouped convolution forward (#1320).
+- Optimizations for low M and N in grouped convolution backward weight (#1303).
+- Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel.
+- `FP8` GEMM performance optimization and tuning (#1384).
+- Added FlashAttention backward pass performance optimization (#1397).
+
+### **HIP** (6.2.0)
+
+#### Changes
+
+- Added the `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()` and `__any_sync()`. These take
+ a 64-bit integer as an explicit mask argument.
+ - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled
+ unconditionally in a future HIP release.
+
+- Added new HIP APIs:
+ - `hipGetProcAddress` returns the pointer to driver function, corresponding to the defined driver function symbol.
+ - `hipGetFuncBySymbol` returns the pointer to device entry function that matches entry function `symbolPtr`.
+ - `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph.
+ - `hipGraphInstantiateWithParams` creates an executable graph from a graph.
+
+- Added a new `integrated` flag to the device properties.
+
+  - The `integrated` flag is added to the struct `hipDeviceProp_t`. On an integrated APU system, the runtime driver
+    detects and sets this flag to `1`, in which case the API `hipDeviceGetAttribute` returns the value `1` for the
+    `hipDeviceAttribute_t` enum `hipDeviceAttributeIntegrated`, indicating an integrated GPU device.
+
+- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include `.
+
+- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`.
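+
+The `_sync()` builtins listed above are only available in HIP 6.2 when `HIP_ENABLE_WARP_SYNC_BUILTINS` is defined.
+One way to define a preprocessor macro is on the compiler command line. The following is a minimal sketch, assuming a
+hypothetical source file `kernel.hip` that calls these builtins and assuming `hipcc` is on `PATH`:
+
+```shell
+# Assumption: kernel.hip is a hypothetical HIP source file that uses __shfl_sync() and related builtins.
+HIP_SRC="kernel.hip"
+# -D defines the preprocessor macro that enables the _sync() builtins in HIP 6.2.
+HIPCC_FLAGS="-DHIP_ENABLE_WARP_SYNC_BUILTINS"
+
+if command -v hipcc >/dev/null 2>&1 && [ -f "$HIP_SRC" ]; then
+    # $HIPCC_FLAGS is intentionally unquoted so that multiple flags would be split into words.
+    hipcc $HIPCC_FLAGS "$HIP_SRC" -o kernel
+else
+    echo "skipping build; would run: hipcc $HIPCC_FLAGS $HIP_SRC -o kernel"
+fi
+```
+
+Defining the macro in the source file before including the HIP runtime header is an alternative when changing the
+build flags is not practical.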
+
+#### Resolved issues
+
+- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime.
+- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`.
+
+#### Upcoming changes
+
+- The `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release.
+
+### **hipBLAS** (2.2.0)
+
+#### Changes
+
+* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments.
+* Added a new ILP64 API for level 1 `_ex` functions.
+
+* The `install.sh` script now invokes the `rmake.py` script. Made various other improvements to the build scripts.
+* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages
+ `rocblas-dev` and `rocsolver-dev`.
+* Updated Linux AOCL dependency to release 4.2 `gcc` build.
+* Updated Windows `vcpkg` dependencies to release 2024.02.14.
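+
+The ILP64 additions above matter once a dimension or leading dimension no longer fits in a 32-bit `int`. The following
+host-only sketch illustrates that boundary. `fits_in_int32` and `needs_ilp64` are hypothetical helpers written for this
+illustration; they are not part of hipBLAS, and the sketch does not call any hipBLAS function.
+
+```cpp
+#include <cstdint>
+#include <limits>
+
+// Hypothetical helper (not part of hipBLAS): reports whether a size still
+// fits the 32-bit integer arguments of the non-ILP64 interface.
+constexpr bool fits_in_int32(std::int64_t n)
+{
+    return n >= std::numeric_limits<std::int32_t>::min()
+        && n <= std::numeric_limits<std::int32_t>::max();
+}
+
+// Hypothetical helper: true when any argument would require the `int64_t`
+// arguments that the `_64` entry points accept.
+constexpr bool needs_ilp64(std::int64_t rows, std::int64_t cols, std::int64_t lda)
+{
+    return !(fits_in_int32(rows) && fits_in_int32(cols) && fits_in_int32(lda));
+}
+
+static_assert(!needs_ilp64(4096, 4096, 4096), "small problems fit in 32 bits");
+static_assert(needs_ilp64(3000000000LL, 1, 3000000000LL), "3e9 exceeds INT32_MAX");
+```
+
+A caller might use a check like this to decide between the existing entry points and their `_64` counterparts.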
+
+### **hipBLASLt** (0.8.0)
+
+#### Changes
+
+* Added extension APIs:
+ * `hipblasltExtAMaxWithScale`.
+ * `GemmTuning` extension parameter that lets the user set `wgm`.
+* Added support for:
+ * `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype.
+ * `FP8`/`BF8` input, `FP32`/`FP16`/`BF16`/`FP8`/`BF8` output (gfx94x platform only).
+ * `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA.
+* Added support for gfx110x.
+
+#### Optimizations
+
+* Improved library loading time.
+
+### **HIPCC** (1.1.1)
+
+#### Changes
+
+* Split `hipcc` package into 2 packages for different hardware platforms.
+
+* Cleaned up references to environment variables.
+
+* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by
+ default, instead of their Perl counterparts.
+
+* Enabled function calls.
+
+* Added support for generating packages for ROCm stack targeting static libraries.
+
+#### Resolved issues
+
+* Implemented numerous bug fixes and quality improvements.
+
+### **hipCUB** (3.2.0)
+
+#### Changes
+
+* Added `DeviceCopy` function for parity with CUB.
+* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as
+ the new optional template argument for `WarpExchange`.
+ * The potential values for the enum are `WARP_EXCHANGE_SMEM` and
+ `WARP_EXCHANGE_SHUFFLE`.
+ * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while
+ `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
+ * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory,
+ but the `ItemsPerThread` must be a divisor of `WarpSize`.
+* Added `tuple.hpp` which defines templates `hipcub::tuple`,
+ `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
+* Added new overloaded member functions to `BlockRadixSort` and
+ `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type
+ (`key_type`) can be sorted via these overloads, if an appropriate decomposer
+ is passed. The decomposer has to implement `operator()(const key_type&)`, which
+ returns a `hipcub::tuple` of references pointing to members of `key_type`.
+
+* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of
+ HIP graphs, with several exceptions:
+ * `CachingDeviceAllocator`
+ * `GridBarrierLifetime`
+ * `DeviceSegmentedRadixSort`
+ * `DeviceRunLengthEncode`
+ Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs.
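+
+The divisibility requirement for `WARP_EXCHANGE_SHUFFLE` noted above can be checked at compile time. The sketch below is
+host-only and does not use hipCUB; `shuffle_exchange_allowed` is a hypothetical helper introduced purely for illustration.
+
+```cpp
+// Hypothetical helper (not part of hipCUB): mirrors the documented constraint
+// that ItemsPerThread must be a divisor of WarpSize for WARP_EXCHANGE_SHUFFLE.
+constexpr bool shuffle_exchange_allowed(int items_per_thread, int warp_size)
+{
+    return items_per_thread > 0 && warp_size % items_per_thread == 0;
+}
+
+static_assert(shuffle_exchange_allowed(4, 64), "4 divides 64");
+static_assert(shuffle_exchange_allowed(8, 32), "8 divides 32");
+static_assert(!shuffle_exchange_allowed(3, 64), "3 does not divide 64");
+```
+
+Configurations that fail such a check would need to stay with `WARP_EXCHANGE_SMEM`.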
+
+#### Removals
+
+* `debug_synchronous` was deprecated in hipCUB-2.13.2 and no longer has any effect. With this release, passing `debug_synchronous`
+ to the device functions results in a deprecation warning both at runtime and at compile time.
+ * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time
+ by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
+ * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.
+
+#### Resolved issues
+
+* Fixed the derivation of the accumulator type for device scan algorithms in the rocPRIM backend, which differed from CUB.
+ It now derives the accumulator type as the result of the binary operator.
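+
+To see why the accumulator type of a scan matters, consider the following host-only sketch. It uses `std::inclusive_scan`
+from the C++ standard library as a stand-in and does not call hipCUB or rocPRIM. The two function names are hypothetical
+and exist only for this illustration.
+
+```cpp
+#include <cstdint>
+#include <functional>
+#include <numeric>
+#include <vector>
+
+// Stand-in for a scan whose running sum is held in the input's value type
+// (std::uint8_t here), as common standard library implementations do when
+// no initial value is given.
+std::vector<int> scan_with_input_type_accumulator(const std::vector<std::uint8_t>& in)
+{
+    std::vector<int> out(in.size());
+    std::inclusive_scan(in.begin(), in.end(), out.begin());
+    return out;
+}
+
+// Stand-in for a scan whose running sum is held in `int`: the initial value
+// is an `int`, and the transparent std::plus<> yields `int` for each step.
+std::vector<int> scan_with_int_accumulator(const std::vector<std::uint8_t>& in)
+{
+    std::vector<int> out(in.size());
+    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, 0);
+    return out;
+}
+```
+
+On common standard library implementations, the input `{200, 100}` wraps around to 44 in the second output element of
+the first variant, while the second variant produces 300.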
+
+### **hipFFT** (1.0.15)
+
+#### Resolved issues
+
+* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers.
+* Prevented C++ exceptions leaking from public API functions.
+* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs.
+
+
+### **HIPIFY** (18.0.0)
+
+#### Changes
+
+- Added support for:
+ - NVIDIA CUDA 12.4.1
+ - cuDNN 9.1.1
+ - LLVM 18.1.6
+- Added full hipBLASLt support.
+
+#### Resolved issues
+
+- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object;
+ affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`.
+
+### **hipRAND** (2.11.0)
+
+#### Changes
+
+* Added support for setting generator output ordering in C and C++ API.
+* `hiprandCreateGeneratorHost` now dispatches to the host generator in the rocRAND backend instead of returning with
+ `HIPRAND_STATUS_NOT_IMPLEMENTED`.
+* Added options to create:
+ * A host generator to the Fortran wrapper.
+ * A host generator to the Python wrapper.
+* Previous versions used the environment variable `ROCRAND_USE_HMM` for internal testing with HMM.
+ The environment variable is now named `HIPRAND_USE_HMM`.
+* For the static library, moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
+* Device API documentation is improved in this version.
+
+#### Removals
+
+* Removed the option to build hipRAND as a submodule to rocRAND.
+* Removed references to, and workarounds for, the deprecated `hcc`.
+* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`.
+ Use `ROCRAND_PATH` instead.
+
+#### Resolved issues
+
+* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`.
+
+### **hipSOLVER** (2.2.0)
+
+#### Changes
+
+- Added compatibility-only functions:
+ - `auxiliary`
+ - `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions`
+ - `getrf`
+ - `hipsolverDnXgetrf_bufferSize`
+ - `hipsolverDnXgetrf`
+ - `getrs`
+ - `hipsolverDnXgetrs`
+- Added support for building on Ubuntu 24.04 and CBL-Mariner.
+- Added `hip::host` to `roc::hipsolver` usage requirements.
+- Added functions
+ - `syevdx`/`heevdx`
+ - `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize`
+ - `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx`
+ - `sygvdx`/`hegvdx`
+ - `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize`
+ - `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx`
+
+- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.
+- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`.
+
+#### Removals
+
+- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients.
+
+### **hipSPARSE** (3.1.1)
+
+#### Changes
+
+* Added the missing `hipsparseCscGet()` routine.
+
+* All internal hipSPARSE functions now exist inside a namespace.
+* Matched deprecations found in cuSPARSE 12.x.x when using the cuSPARSE backend.
+* Improved the user manual and contribution guidelines.
+
+#### Resolved issues
+
+* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior.
+
+#### Known issues
+
+* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API,
+ and this extra external buffer parameter will be removed in a future release. For now, this extra parameter can be
+ ignored and passed as `nullptr`, because it is unused internally by `hipsparseSpSM_solve()`.
+
+### **hipSPARSELt** (0.2.1)
+
+#### Optimizations
+
+* Refined test cases.
+
+### **hipTensor** (1.3.0)
+
+#### Changes
+
+* Added support for:
+ * Tensor permutation of ranks of 2, 3, 4, 5, and 6
+ * Tensor contraction of M6N6K6: M, N, K up to rank 6
+* Added tests for:
+ * Tensor permutation of ranks of 2, 3, 4, 5, and 6
+ * Tensor contraction of M6N6K6: M, N, K up to rank 6
+ * YAML parsing to support sequential parameters ordering.
+* The `amd-llvm-devel` package is now preferred over the system LLVM library.
+* Changed the preferred compilers to `CC=amdclang` and `CXX=amdclang++`.
+* Updated actor-critic selection for new contraction kernel additions.
+* Updated installation, programmer's guide, and API reference documentation.
+
+#### Resolved issues
+
+* Fixed LLVM parsing crash.
+* Fixed memory consumption issue in complex kernels.
+* Implemented a workaround for a compiler crash during debug builds.
+* Allowed random mode ordering for tensor contractions.
+
+### **llvm-project** (18.0.0)
+
+#### Changes
+
+* LLVM IR
+
+ * The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address
+ spaces.
+
+ * Added `llvm.exp10` intrinsic.
+
+* LLVM infrastructure
+
+ * The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`.
+
+* TableGen
+
+ * Added constructs for debugging TableGen files:
+
+ * `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793).
+
+ * `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716).
+
+* AArch64 backend
+
+ * Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.
+
+* AMDGPU backend
+
+ * `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access.
+
+ * Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics.
+
+ * Implemented `llvm.get.rounding`.
+
+* ARM backend
+
+ * Added support for Cortex-M52 CPUs.
+
+ * Added execute-only support for Armv6-M.
+
+* RISC-V backend
+
+ * The `Zfa` extension version was upgraded to 1.0 and is no longer experimental.
+
+ * The `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental.
+
+ * Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`. See
+ [Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification.
+
+ * Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64.
+
+ * The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`,
+ `Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension versions were upgraded to 1.0 and are no longer experimental. However,
+ the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions,
+ `-menable-experimental-extensions` needs to be passed to Clang.
+
+ * `-mcpu=sifive-p450` was added.
+
+ * CodeGen of `RV32E` and `RV64E` is supported experimentally.
+
+ * CodeGen of `ilp32e` and `lp64e` is supported experimentally.
+
+* X86 backend
+
+ * Added support for the RDMSRLIST and WRMSRLIST instructions.
+
+ * Added support for the WRMSRNS instruction.
+
+ * Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction.
+
+ * Support ISA of CMPCCXADD.
+
+ * Support ISA of AVX-IFMA.
+
+ * Support ISA of AVX-VNNI-INT8.
+
+ * Support ISA of AVX-NE-CONVERT.
+
+ * `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported.
+
+ * `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported.
+
+ * `__builtin_unpredictable` (unpredictable metadata in LLVM IR), is handled by X86 Backend. X86CmovConversion pass now
+ respects this builtin and does not convert CMOVs to branches.
+
+ * Added support for the PBNDKB instruction.
+
+ * Support ISA of SHA512.
+
+ * Support ISA of SM3.
+
+ * Support ISA of SM4.
+
+ * Support ISA of AVX-VNNI-INT16.
+
+ * `-mcpu=graniterapids-d` is now supported.
+
+ * The `i128` type now matches GCC and clang’s `__int128` type. This mainly benefits external projects such as Rust
+ which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type
+ matched and called into `libgcc` helper functions.
+
+ * Support ISA of USER_MSR.
+
+ * Support ISA of AVX10.1-256 and AVX10.1-512.
+
+ * `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported.
+
+ * `-mapxf` is supported.
+
+ * Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow
+ 32-bit relocations or require 64-bit relocations to the global variable.
+
+ * The medium code model’s code generation was audited to be more similar to the small code model where possible.
+
+* C API
+
+ * Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions.
+
+ * Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an
+ alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target
+ ABI) and provides default values for unspecified settings.
+
+ * Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and
+ `LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions.
+
+ * Added the following functions for manipulating operand bundles, as well as building call and invoke instructions
+ that use operand bundles:
+
+ * `LLVMBuildCallWithOperandBundles`
+
+ * `LLVMBuildInvokeWithOperandBundles`
+
+ * `LLVMCreateOperandBundle`
+
+ * `LLVMDisposeOperandBundle`
+
+ * `LLVMGetNumOperandBundles`
+
+ * `LLVMGetOperandBundleAtIndex`
+
+ * `LLVMGetNumOperandBundleArgs`
+
+ * `LLVMGetOperandBundleArgAtIndex`
+
+ * `LLVMGetOperandBundleTag`
+
+ * Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an
+ instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flag.
+
+* CodeGen infrastructure
+
+ * A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e.
+ `-debug-only=isel-dump`). This new debug type can be filtered by function names using
+ `-filter-print-funcs=`, the same flag used to filter IR dumps after each Pass. Note that the
+ existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single
+ function regardless of `-filter-print-funcs`’s values.
+
+* Metadata info
+
+ * Added a new loop metadata `!{!"llvm.loop.align", i32 64}`.
+
+* LLVM tools
+
+ * `llvm-symbolizer` now treats invalid input as an address for which source information is not found.
+
+ * `llvm-readelf` now supports `--extra-sym-info` (-X) to display extra information (section name) when showing
+ symbols.
+
+ * `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each
+ object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now
+ `{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each
+ object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers
+ accordingly.
+
+ * `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`.
+
+ * `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers.
+
+ * `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names.
+
+ * `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only.
+
+* LLDB
+
+ * `SBType::FindDirectNestedType` function is added. It’s useful for formatters to quickly find directly nested type
+ when it’s known where to search for it, avoiding more expensive global search via `SBTarget::FindFirstType`.
+
+ * Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying
+ functionality remains unchanged.
+
+ * The `mte_ctrl` register can now be read from AArch64 Linux core files.
+
+ * LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2
+ (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.
+
+ * LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism
+ of specifying DEBUGINFOD servers in the `DEBUGINFOD_URLS` environment variable is used by default. In addition, users
+ can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`,
+ overriding or adding to the environment variable.
+
+ * When running on AArch64 Linux, `lldb-server` now provides register field information for the following registers:
+ `cpsr`, `fpcr`, `fpsr`, `svcr` and `mte_ctrl`.
+
+* Sanitizers
+
+ * HWASan now defaults to detecting use-after-scope bugs.
+
+#### Removals
+
+* LLVM IR
+
+ * The constant expression variants of the following instructions have been removed:
+
+ * `and`
+
+ * `or`
+
+ * `lshr`
+
+ * `ashr`
+
+ * `zext`
+
+ * `sext`
+
+ * `fptrunc`
+
+ * `fpext`
+
+ * `fptoui`
+
+ * `fptosi`
+
+ * `uitofp`
+
+ * `sitofp`
+
+* RISC-V backend
+
+ * XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and
+ instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with
+ SiFive to define and document real extension names for individual CSRs and instructions.
+
+* Python bindings
+
+ * The Python bindings have been removed.
+
+* C API
+
+ * The following functions for creating constant expressions have been removed, because the underlying constant
+ expressions are no longer supported. Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which
+ will constant fold the operands if possible and create an instruction otherwise:
+
+ * `LLVMConstAnd`
+
+ * `LLVMConstOr`
+
+ * `LLVMConstLShr`
+
+ * `LLVMConstAShr`
+
+ * `LLVMConstZExt`
+
+ * `LLVMConstSExt`
+
+ * `LLVMConstZExtOrBitCast`
+
+ * `LLVMConstSExtOrBitCast`
+
+ * `LLVMConstIntCast`
+
+ * `LLVMConstFPTrunc`
+
+ * `LLVMConstFPExt`
+
+ * `LLVMConstFPToUI`
+
+ * `LLVMConstFPToSI`
+
+ * `LLVMConstUIToFP`
+
+ * `LLVMConstSIToFP`
+
+ * `LLVMConstFPCast`
+
+* CodeGen infrastructure
+
+ * `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets
+ should use backwards frame index elimination instead.
+
+ * `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging
+ instead, which is preferred because it does not depend on accurate kill flags.
+
+* LLDB
+
+ * `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable.
+
+ * Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value
+ indicating failure.
+
+#### Resolved issues
+
+* AArch64 backend
+
+ * Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto
+ options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for
+ this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the
+ exception of crypto.
+
+* Windows target
+
+ * The LLVM filesystem class `UniqueID` and function `equivalent()` no longer determine that distinct path
+ names for the same hard linked file are equal. This is an intentional tradeoff in a bug fix, where the bug
+ used to cause distinct files to be considered equivalent on some file systems. This change fixed the issues
+ [#61401](https://github.com/llvm/llvm-project/issues/61401) and [#22079](https://github.com/llvm/llvm-project/issues/22079).
+
+#### Known issues
+
+The compiler may incorrectly compile a program that uses the
+``__shfl(var, srcLane, width)`` function when one of the parameters to
+the function is undefined along some path to the function. For most functions,
+uninitialized inputs cause undefined behavior.
+
+```{note}
+The ``-Wall`` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.
+```
+
+As a workaround, initialize the parameters to ``__shfl``. For example:
+
+```{code-block} cpp
+unsigned long istring = 0; // Initialize the input to __shfl
+return __shfl(istring, 0, 64);
+```
+
+### **MIGraphX** (2.10.0)
+
+#### Changes
+
+- Added support for ONNX Runtime MIGraphX EP on Windows.
+- Added `FP8` Python API.
+- Added examples for SD 2.1 and SDXL.
+- Added support for BERT to Dynamic Batch.
+- Added a `--test` flag in `migraphx-driver` to validate the installation.
+- Added support for ONNX Operator: Einsum.
+- Added `uint8` support in ONNX Operators.
+- Added Split-k kernel configurations for performance improvements.
+- Added fusion for group convolutions.
+- Added rocMLIR conv3d support.
+- Added rocgdb to the Dockerfile.
+- Changed default location of libraries with release specific ABI changes.
+- Reorganized documentation in GitHub.
+
+#### Removals
+
+- Removed the `--model` flag with `migraphx-driver`.
+
+#### Optimizations
+
+- Improved ONNX Model Zoo coverage.
+- Reorganized `memcpys` with ONNX Runtime to improve performance.
+- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.
+- Improved MLIR kernel selection for multibroadcasted GEMMs.
+- Improved details of the perf report.
+- Enabled MLIR by default for GEMMs with small K.
+- Allowed specifying dot or convolution fusion for MLIR with an environment flag.
+- Improved performance on small reductions by doing multiple reductions per wavefront.
+- Added additional algebraic simplifications for mul-add-dot sequences of operations involving constants.
+- Used MLIR attention kernels in more cases.
+- Enabled MIOpen and CK fusions for MI300 gfx arches.
+- Added support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs.
+- Added fusion of "contiguous + pointwise" and "layout + pointwise" operations, which may result in performance gains in certain cases.
+- Added fusion for "pointwise + layout" and "pointwise + contiguous" operations, which may result in performance gains when using NHWC layout.
+- Added fusion for the "pointwise + concat" operation, which may help performance in certain cases.
+- Fixed a bug in "concat + pointwise" fusion where the output shape memory layout wasn't maintained.
+- Simplified the "slice + concat" pattern in SDXL UNet.
+- Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if the zero-point values are zero.
+- Improved inference performance by fusing Reduce to Broadcast.
+- Added additional information when printing the perf report.
+- Improved scalar fusions when not all strides are 0.
+- Added support for multi outputs in pointwise ops.
+- Improved reduction fusion with reshape operators.
+- The quantized output is now used when an operator is used again.
+- Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all Hardware.
+
+#### Resolved issues
+
+- Fixed a Super Resolution model verification failure with `FP16`.
+- Fixed confusing messages by suppressing them when compiling the model.
+- Fixed an issue with the mod operator when using `int8` and `int32` inputs.
+- Fixed an issue by preventing too many threads from being spawned for constant propagation when parallel STL is not enabled.
+- Fixed a bug when running `migraphx-driver` with the `--run 1` option.
+- Fixed Layernorm accuracy by performing calculations in `FP32`.
+- Updated the Docker generator script to ROCm 6.1 to point at Jammy.
+- Fixed a floating point exception for `dim (-1)` in the reshape operator.
+- Fixed an issue with `int8` accuracy and models that were failing due to requiring a fourth bias input.
+- Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix.
+- Fixed the order of operations for `int8` quantization, which was causing inaccuracies and slowdowns.
+- Fixed issues during compilation caused by the incorrect constructor being used at compile time.
+ Removed the list initializer of `prefix_scan_sum`, which was causing issues during compilation.
+- Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control the number of threads used for parallel compilation.
+
+### **MIOpen** (3.2.0)
+
+#### Changes
+
+- Added:
+ - [Conv] bilinear (alpha beta) solvers.
+ - [Conv] enable bf16 for ck-based solvers.
+ - [Conv] Add split_k tuning to 2d wrw ck-based solver.
+ - [MHA] graph API fp8 fwd.
+ - [RNN] multi-stream as default solution.
+- Added TunaNetv2.0 for MI300.
+- Added Adam and AMP Adam optimizer.
+
+#### Resolved issues
+
+- Memory access fault caused by `GemmBwdRest`.
+- Context configuration in `GetWorkSpaceSize`.
+- Fixes to support huge tensors.
+
+#### Optimizations
+
+- Find: improved precision of benchmarking.
+
+### **MIVisionX** (3.0.0)
+
+#### Changes
+
+- Added support for:
+ - Advanced GPUs
+ - PreEmphasis Filter augmentation in openVX extensions
+ - Spectrogram augmentation in openVX extensions
+ - Downmix and ToDecibels augmentations in openVX extensions
+ - Resample augmentation and Operator overloading nodes in openVX extensions
+ - NonSilentRegion and Slice augmentations in openVX extensions
+ - Mel-Filter bank and Normalize augmentations in openVX extensions
+
+#### Removals
+
+- Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL).
+
+#### Resolved issues
+
+- Fixed issues with dependencies.
+
+#### Known issues
+
+- MIVisionX package install requires manual prerequisites installation.
+
+### **Omniperf** (2.0.1)
+
+#### Known issues
+
+- Error when running Omniperf with an application with command line arguments. As a workaround, create an
+ intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This
+ issue will be fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/omniperf/issues/347).
+
+- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error:
+ "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the
+ environment variable `export ROCPROF=rocprofv2`.
+
+- Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message:
+
+ "*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.*
+
+ *[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.*
+
+ *Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.*
+
+ *See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*"
+
+ As a workaround, install these Python requirements manually: `pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`.
+
+### **OpenMP** (17.0.0)
+
+#### Changes
+
+- Added basic experimental support for ``libc`` functions on the GPU via the
+ LLVM C Library for GPUs.
+- Added minimal support for calling host functions from the device using the
+ `libc` interface.
+- Added vendor agnostic OMPT callback support for OpenMP-based device offload.
+
+#### Removals
+
+- Removed the "old" device plugins along with support for the `remote` and
+ `ve` plugins.
+
+#### Resolved issues
+
+- Fixed the implementation of `omp_get_wtime` for AMDGPU targets.
+
+### **RCCL** (2.20.5)
+
+#### Changes
+
+- Added support for `fp8` and `rccl_bfloat8`.
+- Added support for using HIP contiguous memory.
+- Added ROC-TX for host-side profiling.
+- Added new rome model.
+- Added `fp16` and `fp8` cases to unit tests.
+- Added a new unit test for main kernel stack size.
+- Added the new `-n` option for `topo_expl` to override the number of nodes.
+- Improved debug messages of memory allocations.
+- Enabled static build.
+- Enabled compatibility with:
+ - NCCL 2.20.5.
+ - NCCL 2.19.4.
+- Performance tuning for some collective operations on MI300.
+- Enabled NVTX code in RCCL.
+- Replaced `rccl_bfloat16` with `hip_bfloat16`.
+- NPKit updates:
+ - Warm-up iteration removal is no longer enabled by default and is now opt-in.
+ - Doubled the size of buffers to accommodate more channels.
+- Modified rings to be rail-optimized topology friendly.
+
+#### Resolved issues
+
+- Fixed a bug when configuring RCCL for only LL128 protocol.
+- Fixed scratch memory allocation after API change for MSCCL.
+
+### **rocAL** (1.0.0)
+
+#### Changes
+
+- Added tests and samples.
+
+#### Removals
+
+- Removed CuPy from `setup.py`.
+
+
+#### Optimizations
+
+- Added setup and install updates.
+
+#### Resolved issues
+
+- Minor bug fixes.
+
+### **rocALUTION** (3.2.0)
+
+#### Changes
+
+* Added new file I/O based on rocSPARSE I/O format.
+* Added `GetConvergenceHistory` for ItILU0 preconditioner.
+
+#### Removals
+
+* Deprecated the following:
+ * `LocalMatrix::ReadFileCSR`
+ * `LocalMatrix::WriteFileCSR`
+ * `GlobalMatrix::ReadFileCSR`
+ * `GlobalMatrix::WriteFileCSR`
+
+### **rocBLAS** (4.2.0)
+
+#### Changes
+
+* Added an ILP64 API for Level 2 functions and Level 3 `trsm` for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments.
+* Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`.
+* Added Benchmark class for common timing code.
+* Added the environment variable `ROCBLAS_DEFAULT_ATOMICS_MODE` to set the default atomics mode during creation of `rocblas_handle`.
+* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`.
+
+* Updated Linux AOCL dependency to release 4.2 `gcc` build.
+* Updated Windows `vcpkg` dependencies to release 2024.02.14.
+* Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40.
+
+#### Optimizations
+
+* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance improved by 6 times for larger problem sizes, as measured on an Instinct MI210 accelerator.
+
+#### Removals
+
+* Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next
+ major release of rocBLAS. Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage.
+
+### **ROCdbgapi** (0.75.0)
+
+#### Removals
+
+- Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`.
+ The old names are still accessible but deprecated.
+
+### **rocDecode** (0.6.0)
+
+#### Changes
+
+- Added full H.264 support and bug fixes.
+
+### **rocFFT** (1.0.28)
+
+#### Changes
+
+* Randomly generated accuracy tests are now disabled by default. They can be enabled using
+ the `--nrand` option (which defaults to 0).
+
+#### Optimizations
+
+* Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks
+ are transformed locally, with global transposes to make remaining dimensions contiguous.
+
+### **rocm-cmake** (0.13.0)
+
+#### Changes
+
+- `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds.
+ - Package names are no longer pulled from `CPACK__PACKAGE_NAME`.
+ - Runtime packages will no longer be generated for static builds.
+
+### **ROCm Data Center Tool** (1.0.0)
+
+#### Changes
+
+- Added ROCProfiler `dmon` metrics.
+- Added new ECC metrics.
+- Added ROCm Validation Suite diagnostic command.
+- Fully migrated to AMD SMI.
+
+#### Removals
+
+- Removed RASLIB dependency and blobs.
+- Removed `rocm_smi_lib` dependency due to migration to AMD SMI.
+
+### **ROCm Debugger (ROCgdb)** (14.2)
+
+#### Changes
+
+- Introduced the `coremerge` utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile.
+- Added support for generating and opening core files for heterogeneous processes.
+
+### **ROCm SMI** (7.3.0)
+
+#### Changes
+
+- Added Partition ID API (`rsmi_dev_partition_id_get(..)`).
+
+#### Resolved issues
+
+- Fixed Partition ID CLI output.
+
+```{note}
+See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md)
+on GitHub for more information.
+```
+
+### **ROCm Validation Suite** (1.0.0)
+
+#### Changes
+
+* Added stress tests:
+
+ * IET (power) stress test for MI300A.
+
+ * IET (power transition) test for MI300X.
+
+* Added support:
+
+ * GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output.
+
+ * Trigonometric float and random integer matrix data initialization support.
+
+* Updated GST performance benchmark test for better numbers.
+
+### **rocPRIM** (3.2.0)
+
+#### Changes
+
+* Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
+* The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter.
+ * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan).
+ This is the same behaviour as before this change.
+* Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element.
+* Added new public APIs for deriving resulting type on device-only functions:
+ * `rocprim::invoke_result`
+ * `rocprim::invoke_result_t`
+ * `rocprim::invoke_result_binary_op`
+ * `rocprim::invoke_result_binary_op_t`
+* Added the new `rocprim::batch_copy` function. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
+* Added more test cases, to better cover supported data types.
+* Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
+ To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
+ of references to arithmetic types from the key.
+* Added `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
+ It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is `true`.
+ It can be instantiated with:
+ * `rocprim::make_predicate_iterator`
+ * `rocprim::make_mask_iterator`
+* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4; it can be a number between 0 and 32.
+* Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
+* Updated some tests to work with supported data types.
+
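The decomposer idea behind `block_radix_sort`, `device_radix_sort`, and `radix_key_codec` can be illustrated with a small host-side Python sketch. This is not rocPRIM code: `Key`, `decompose`, and `radix_sort` are hypothetical names, each field is assumed to fit in 8 bits, and the sort is a plain least-significant-digit radix sort over the fields the decomposer returns, using 4-bit digits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    major: int  # most significant field (assumed to fit in 8 bits)
    minor: int  # least significant field (assumed to fit in 8 bits)


def decompose(key):
    """Hypothetical decomposer: return the key's fields, most significant first."""
    return (key.major, key.minor)


def radix_sort(keys, decomposer, bits_per_field=8, radix_bits=4):
    """Stable LSD radix sort driven by a decomposer (illustrative only)."""
    bucket_count = 1 << radix_bits
    mask = bucket_count - 1
    out = list(keys)
    num_fields = len(decomposer(out[0])) if out else 0
    # Visit fields from least to most significant so earlier passes act as tie-breakers.
    for field in reversed(range(num_fields)):
        for shift in range(0, bits_per_field, radix_bits):
            buckets = [[] for _ in range(bucket_count)]
            for k in out:
                digit = (decomposer(k)[field] >> shift) & mask
                buckets[digit].append(k)
            # Concatenating buckets in order keeps the pass stable.
            out = [k for bucket in buckets for k in bucket]
    return out


keys = [Key(2, 1), Key(1, 200), Key(2, 0), Key(1, 3)]
sorted_keys = radix_sort(keys, decompose)
```

The sort itself never inspects `Key` directly; it only sees the arithmetic fields the decomposer exposes, which is the role the changelog assigns to the decomposer functor.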
+#### Optimizations
+
+* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
+* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warp size is equal to the items per thread.
+
+#### Resolved issues
+
+* Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is
+ larger than the logical warp size. The test suite has been updated with such cases.
+* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`.
+* Fixed incorrect output type used in `device_adjacent_difference`.
+* Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types.
+* Fixed an issue with `device_adjacent_difference`. It now considers both the
+ input and the output type for selecting the appropriate kernel launch config.
+ Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
+* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`.
+* Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.
+
+#### Removals
+
+* Deprecated the internal header `detail/match_result_type.hpp`.
+* Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`.
+* Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name.
+
+### **ROCProfiler** (2.0.0)
+
+#### Removals
+
+- Removed `pcsampler` sample code due to deprecation from version 2.
+
+### **rocRAND** (3.1.0)
+
+#### Changes
+
+* Added `rocrand_create_generator_host`.
+ * The following generators are supported:
+ * `ROCRAND_RNG_PSEUDO_MRG31K3P`
+ * `ROCRAND_RNG_PSEUDO_MRG32K3A`
+ * `ROCRAND_RNG_PSEUDO_PHILOX4_32_10`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20`
+ * `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20`
+ * `ROCRAND_RNG_PSEUDO_XORWOW`
+ * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32`
+ * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64`
+ * `ROCRAND_RNG_QUASI_SOBOL32`
+ * `ROCRAND_RNG_QUASI_SOBOL64`
+ * The host-side generators support multi-core processing. On Linux, this requires the TBB (Thread Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives).
+ * If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread.
+* Added the option to create a host generator to the Python wrapper.
+* Added the option to create a host generator to the Fortran wrapper.
+* Added dynamic ordering. This ordering is free to rearrange the produced numbers,
+ which can be specific to devices and distributions. It is implemented for:
+ * XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry
+* Added support for using Clang as the host compiler for alternative platform compilation.
+* C++ wrapper:
+ * Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`.
+ * Added optional order parameter to the constructor of `mt19937_engine`.
+* Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator:
+ * `rocrand_normal2`
+ * `rocrand_normal_double2`
+ * `rocrand_log_normal2`
+ * `rocrand_log_normal_double2`
+* Added `rocrand_create_generator_host_blocking` which dispatches without stream semantics.
+* Added host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`.
+* Added offset and skipahead functionality to LFSR113 generator.
+* Added dynamic ordering for architecture `gfx1102`.
+
+* For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few
+ things to be aware of:
+ - Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph.
+ - After the generator is created, you may call API functions to set its seed, offset, and order.
+ - After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph.
+ - A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour.
+ - API calls for the Poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs.
+ - For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp`.
+* Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However, consuming rocRAND is still possible from C++11, as the public headers don't make use of the new features.
+* Building rocRAND should be faster on machines with multiple CPU cores as the library has been
+ split to multiple compilation units.
+* C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`.
+* Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE`.
+* Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
+
+#### Removals
+
+* Deprecated the following typedefs. Please use the unified `state_type` alias instead.
+ * `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state`
+ * `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state`
+ * `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state`
+ * `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state`
+* Deprecated the following internal headers:
+ * `src/rng/distribution/distributions.hpp`.
+ * `src/rng/device_engines.hpp`.
+* Removed references to and workarounds for deprecated hcc.
+* Removed support for HIP-CPU.
+
+#### Known Issues
+
+- `SOBOL64` and `SCRAMBLED_SOBOL64` generate Poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release.
+
+### **ROCr Runtime** (1.14.0)
+
+#### Changes
+
+- Added PC sampling feature (experimental feature).
+
+### **rocSOLVER** (3.26.0)
+
+#### Changes
+
+- Added 64-bit APIs for existing functions:
+ - GETF2_64 (with `batched` and `strided_batched` versions)
+ - GETRF_64 (with `batched` and `strided_batched` versions)
+ - GETRS_64 (with `batched` and `strided_batched` versions)
+- Added gfx900 to default build targets.
+- Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection:
+ - SYEVDX (with `batched` and `strided_batched` versions)
+ - HEEVDX (with `batched` and `strided_batched` versions)
+- Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection:
+ - SYGVDX (with `batched` and `strided_batched` versions)
+ - HEGVDX (with `batched` and `strided_batched` versions)
+- Renamed install script arguments of the form `*_dir` to `*-path`. Arguments of the form `*_dir` remain functional for
+ backwards compatibility.
+- Functions working with arrays of size n - 1 can now accept null pointers when n = 1.
+
+#### Optimizations
+
+- Improved performance of Cholesky factorization.
+- Improved performance of `splitlu` to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U.
+
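The relation M = (L - eye) + U mentioned for `splitlu` can be sketched on a small dense matrix. This is only an illustration of the storage convention, not rocSOLVER's implementation (which operates on sparse matrices): `split_lu` is a hypothetical helper, and L is assumed to be unit lower triangular, as the `L - eye` term suggests.

```python
def split_lu(m):
    """Split a combined factor M = (L - I) + U into L and U (dense lists of rows)."""
    n = len(m)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i > j:
                lower[i][j] = m[i][j]  # strictly lower part belongs to L
            else:
                upper[i][j] = m[i][j]  # diagonal and above belong to U
        lower[i][i] = 1.0  # assumed implicit unit diagonal of L
    return lower, upper


m = [[4.0, 3.0],
     [0.5, 2.0]]
lower, upper = split_lu(m)

# Rebuild M from the two factors to check the relation M = (L - I) + U.
n = len(m)
rebuilt = [
    [lower[i][j] - (1.0 if i == j else 0.0) + upper[i][j] for j in range(n)]
    for i in range(n)
]
```

Because the diagonal of L is implied rather than stored, both factors fit in the storage of a single matrix, and splitting them only requires routing each entry by its position relative to the diagonal.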
+#### Resolved issues
+
+- Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues.
+
+### **rocSPARSE** (3.2.0)
+
+#### Changes
+
+* Added a new Merge-Path algorithm to SpMM, supporting CSR format.
+* Added support for row order to SpSM.
+* Added rocsparseio I/O functionality to the library.
+* Added `rocsparse_set_identity_permutation`.
+
+* Adjusted rocSPARSE dependencies to related HIP packages.
+* Binary size has been reduced.
+* A namespace has been wrapped around internal rocSPARSE functions and kernels.
+* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be nullptr if `nnz` is 0.
+* gfx803 target has been removed from address sanitizer builds.
+
+#### Optimizations
+
+* SpMV adaptive and LRB algorithms have been further optimized on CSR format.
+* Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format.
+* Improved documentation and contribution guidelines.
+
+#### Resolved issues
+
+* Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`.
+
+### **rocThrust** (3.1.0)
+
+#### Changes
+
+* Added changes from upstream CCCL/thrust 2.2.0.
+ * Updated the contents of `system/hip` and `test` with the upstream changes.
+* Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`.
+* Updated to use `rocprim::device_adjacent_difference` for `adjacent_difference` API call.
+* Updated internal use of custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`.
+* Updated `adjacent_difference` to make use of `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, it uses `rocprim::adjacent_difference_inplace`.
+
+#### Known Issues
+
+* `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device.
+* Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs.
+
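The `reduce_by_key` known issue follows from floating-point addition not being truly associative: when a parallel reduction groups the same values differently from run to run, the rounded result can change in the last bits. A minimal Python illustration (plain host arithmetic, not rocThrust code):

```python
# The same three values, summed with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two results differ only by rounding error, but they are not bit-identical.
orders_differ = left_to_right != right_to_left
difference = abs(left_to_right - right_to_left)
```

Comparing such results with a tolerance, rather than bit-for-bit, avoids spurious mismatches.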
+### **rocWMMA** (1.5.0)
+
+#### Changes
+
+* Added internal utilities for:
+ * Element-wise vector transforms.
+ * Cross-lane vector transforms.
+* Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16.
+* Added tests for new internal transforms.
+
+* Improved loading layouts by increasing vector width for fragments with `blockDim > 32`.
+* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments.
+* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary.
+* Refactored entry-point of std library usage to improve hipRTC support.
+* Updated installation, programmer's guide and API reference documentation.
+
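For readers unfamiliar with the aos<->soa terminology used above, the following Python sketch shows the general array-of-structures to structure-of-arrays regrouping on a flat list. It only illustrates the data layout idea; `aos_to_soa` and `soa_to_aos` are hypothetical helpers and do not reflect how rocWMMA performs these transforms on registers.

```python
def aos_to_soa(flat, vector_width):
    """Regroup [x0, y0, x1, y1, ...] into [[x0, x1, ...], [y0, y1, ...]]."""
    if len(flat) % vector_width:
        raise ValueError("length must be a multiple of vector_width")
    # Element c of every structure is gathered with a strided slice.
    return [flat[c::vector_width] for c in range(vector_width)]


def soa_to_aos(columns):
    """Inverse transform: interleave the columns back into one flat list."""
    return [value for row in zip(*columns) for value in row]


aos = [1, 10, 2, 20, 3, 30]          # three (x, y) pairs stored back to back
soa = aos_to_soa(aos, vector_width=2)
round_trip = soa_to_aos(soa)
```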
+#### Resolved issues
+
+* Fixed the ordering of some header includes to improve portability.
+
+### **RPP** (1.8.0)
+
+#### Changes
+
+- Prerequisites - ROCm install requires only `--usecase=rocm`.
+- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory.
+- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP.
+- RPP Tensor Audio support on HOST for Spectrogram.
+- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version.
+- RPP Tensor Audio support on HOST for Mel Filter Bank.
+- RPP Tensor Normalize ND support on HOST and `HIP`.
+
+### **Tensile** (4.41.0)
+
+#### Changes
+
+- New tuning script to summarize rocBLAS log file
+- New environment variable to test fixed grid size with Stream-K kernels
+- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
+- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
+- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation))
+- Support NT flag for global load and store for gfx94x
+- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
+- Add option for rotating buffers timing with cache eviction
+- Add predicate for arithmetic intensity
+- Add DirectToVgpr + packing for f8/f16 + TLU cases
+- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
+- Add test cases for DirectToVgpr + packing
+- Add batch support for Stream-K kernels and new test cases
+- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
+- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
+- Support VectorWidthB (new parameter VectorWidthB)
+- Support VectorWidth + non SourceSwap
+- Add test cases for VectorWidthB, VectorWidth + non SourceSwap
+- Add code owners file
+- New environment variables to dynamically adjust number of CUs used in Stream-K
+- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1))
+- Add xf32 option to rocblas-bench input creator
+
+- Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
+- Limit build threads based on CPUs/RAM available on system (for tests)
+- Update required workspace size for Stream-K, skip kernel initialization when possible
+- Use fallback libraries for archs without optimized logic
+- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
+- Remove OCL tests
+- Disable HostLibraryTests
+- Reduce extended test time by removing extra parameters in the test config files
+- Disable InitAccVgprOpt for Stream-K
+- Skip sgemm 64bit offset tests for gfx94x
+- Skip DTV, DTL, LSU+MFMA tests for gfx908
+- Increase extended test timeout to 720 min
+- Update xfail test (1sum tests only failing on gfx90a)
+- Update lib logic convertor script
+- Test limiting CI threads for only gfx11
+- wGM related kernargs are removed if they are not needed (WGM=-1,0,1)
+- Cleanup on unused old code, mostly related to old client
+- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
+- Update efficiency script for new architecture and xf32 datatype
+- Re-enable negative values for WorkGroupMapping (asm kernel only)
+- Disable HW monitor for aquvavanjaram941
+- Pre-apply offsets for strided batch kernels
+- Update tensile build with 16 threads
+
+#### Optimizations
+
+- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).
+- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
+- Reversed MFMA order in inner loop for odd outer iteration
+- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
+- Enhanced maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)
+
+#### Resolved issues
+
+- Fixed `WorkspaceCheck` implementation when used in rocBLAS.
+- Fixed Stream-K partials cache behavior.
+- Fixed `MasterSolutionLibrary` indexing for multiple architecture build.
+- Fixed memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).
+- Fixed BufferLoad=False with Stream-K.
+- Fixed mismatch issue with `GlobalReadCoalesceGroup`.
+- Fixed rocBLAS build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]).
+- Fixed for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).
+- Fixed inaccurate vgpr allocation for ClusterLocalRead.
+- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2.
+- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0).
+- Fixed HostLibraryTests on gfx942 and gfx941.
+- Fixed LLVM crash issue.
+- Fixed for newer Windows vcpkg msgpack and vcpkg version package name.
+- Fixed an error with DisableKernelPieces + 32bit ShadowLimit.
+- Ignore asm cap check for kernel arg preload for rocm6.0 and older.
diff --git a/tools/autotag/templates/highlights/6.2.0.md b/tools/autotag/templates/highlights/6.2.0.md
new file mode 100644
index 000000000..cdb355185
--- /dev/null
+++ b/tools/autotag/templates/highlights/6.2.0.md
@@ -0,0 +1,223 @@
+
+The release notes provide a comprehensive summary of changes since the previous ROCm release.
+
+- [Release highlights](release-highlights)
+
+- [Operating system and hardware support changes](operating-system-and-hardware-support-changes)
+
+- [ROCm components versioning](rocm-components)
+
+- [Detailed component changes](detailed-component-changes)
+
+- [ROCm known issues](rocm-known-issues)
+
+- [ROCm upcoming changes](rocm-upcoming-changes)
+
+The [Compatibility matrix](https://rocm.docs.amd.com/en/latest/release/docs/6.2.0/compatibility/compatibility-matrix)
+provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.
+
+Release notes for previous ROCm releases are available in earlier versions of the documentation.
+See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest/release/versions).
+
+## Release highlights
+
+This section introduces notable new features and improvements in ROCm 6.2. See the
+[Detailed component changes](#detailed-component-changes) for individual component changes.
+
+### New components
+
+ROCm 6.2.0 introduces the following new components to the ROCm software stack.
+
+- **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads
+ running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line
+ or a GUI dashboard. For more information, see
+ [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest).
+
+- **Omnitrace** -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU.
+ It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining
+ which function and line number are executing. For more information, see
+ [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/latest).
+
+- **rocPyDecode** -- A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries,
+ enabling function calling and data passing between the two languages. The `rocpydecode.so` library, a wrapper, uses
+ rocDecode APIs written primarily in C/C++ within Python. For more information, see
+ [rocPyDecode](https://rocm.docs.amd.com/projects/rocpydecode/en/latest).
+
+- **ROCprofiler-SDK** -- A profiling and tracing library for HIP and ROCm applications on AMD ROCm software,
+ used to identify application performance bottlenecks and optimize their performance. The new APIs add restrictions for more
+ efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use.
+ ROCprofiler-SDK also provides a tool library to help you write your tool implementations. `rocprofv3` uses this tool library
+ to profile and trace applications for performance bottlenecks. Examples include API tracing, kernel tracing, and so on.
+ For more information, see [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest).
+
+ ```{note}
+ ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
+ ```
+
+### ROCm Offline Installer Creator introduced
+
+The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU
+driver, or a combination of the two on a target system without network access. This new tool customizes
+multiple unique configurations for use when installing ROCm on a target. Other notable features include:
+
+* A lightweight, easy-to-use user interface for configuring the creation of the installer
+
+* Support for multiple Linux distributions
+
+* Installer support for different ROCm releases and specific ROCm components
+
+* Optional driver or driver-only installer creation
+
+* Optional post-install preferences
+
+* Lightweight installer packages, which are unique to the preconfigured ROCm setup
+
+* Resolution and inclusion of dependency packages for offline installation
+
+For more information, see
+[ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/rocm-install-on-linux/en/latest/install/rocm-offline-installer.html).
+
+### Math libraries default to Clang instead of HIPCC
+
+The default compiler used to build the math libraries on Linux changes from `hipcc` to `amdclang++`.
+Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when
+building the libraries. Applications using the libraries can continue to be compiled using `hipcc` or `amdclang++` as
+described in [ROCm compiler reference](https://rocm.docs.amd.com/projects/llvm-project/en/latest/reference/rocmcc.html).
+The math libraries can also be built with `hipcc` using any of the previously available methods (for example, the `CXX`
+environment variable, the `CMAKE_CXX_COMPILER` CMake variable, and so on). This change shouldn't affect performance or
+functionality.
+
+### Framework and library changes
+
+This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.
+
+#### Additional PyTorch and TensorFlow support
+
+ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.
+
+See [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html)
+and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/tensorflow-install.html)
+for installation instructions.
+
+Refer to the
+[Third-party support matrix](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html#deep-learning)
+for a comprehensive list of third-party frameworks and libraries supported by ROCm.
+
+#### Optimized framework support for OpenXLA
+
+PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler
+ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see
+[Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html)
+and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/tensorflow-install.html).
+
+#### PyTorch support for Autocast (automatic mixed precision)
+
+PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational
+workloads and improve performance. Based on the information about the magnitude of values, Autocast can substitute the
+original `float32` linear layers and convolutions with their `float16` or `bfloat16` variants. For more information, see
+[Automatic mixed precision](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/train-a-model#automatic-mixed-precision-amp).
+
+#### Memory savings for bitsandbytes model quantization
+
+The [ROCm-aware bitsandbytes library](https://github.com/ROCm/bitsandbytes) is a lightweight Python wrapper around HIP
+custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
+ROCm 6.2.0 introduces the following bitsandbytes changes:
+
+- `Int8` matrix multiplication is enabled, and it includes the following functions:
+ - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
+ - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation.
+ - `igemmlt` – new function for GEMM computation A*B^T. It uses
+ [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations.
+ - `dequant_mm` – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
+- Blockwise quantization – input tensors are quantized for a fixed block size.
+- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled.
+- 8-bit and 32-bit optimizers are enabled.
+
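Blockwise quantization, as described above, can be sketched in a few lines of Python. This is a simplified linear absmax scheme with signed 8-bit codes, chosen here for illustration; it is not the bitsandbytes implementation, and `quantize_blockwise`, `dequantize_blockwise`, and the block size of 4 are hypothetical.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8-range codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # all-zero block: avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map each code back to a float using the scale of its block."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


data = [0.1, -0.2, 0.4, 0.05, 10.0, -5.0, 2.5, 0.0]
codes, scales = quantize_blockwise(data)
restored = dequantize_blockwise(codes, scales)
```

Keeping one scale per block means the large values in the second block do not degrade the resolution available to the small values in the first block.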
+```{note}
+These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and
+features to run them.
+```
+
+For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/model-quantization.html).
+
+#### Improved vLLM support
+
+ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding
+capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama.
+ROCm 6.2.0 adds support for the following vLLM features:
+
+- MP:
+
+ Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP,
+ use `--distributed-executor-backend=mp`. The default is still in flux and depends on the commit.
+
+- FP8 KV cache:
+
+ Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
+ The QUARK quantizer currently only supports Llama.
+
+- Triton Flash Attention:
+
+ ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this
+ setting using the `VLLM_USE_FLASH_ATTN_TRITON=False` environment variable.
+
+- PyTorch TunableOp:
+
+ Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.
+
+For more information about enabling these features, see
+[vLLM inference](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).
+
+ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing.
+These features include:
+
+- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used
+ in the model's execution. It only supports Llama because the QUARK quantizer currently only supports Llama.
+
+- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention
+ computation in large-scale models. This benefits all workloads in `FP16` configurations.
+
+To enable these experimental new features, see
+[vLLM inference](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).
+Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines
+all the accessible features, and the `vllm/Dockerfile.rocm` file can be used.
+
+### Enhanced performance tuning on AMD Instinct accelerators
+
+ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
+The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
+detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
+accelerators for optimal performance. For more information, see
+[AMD MI300X tuning guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) and
+[AMD MI300A system optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300a.html).
+
+### Removed clang-ocl
+
+As of version 6.2, ROCm no longer provides the `clang-ocl` package. The project will be archived in the future.
+See the [clang-ocl README](https://github.com/ROCm/clang-ocl).
+
+### ROCm documentation changes
+
+The documentation for the ROCm components has been reorganized and reformatted with a consistent look and feel. This
+improves the usability and readability of the documentation. For more information about the ROCm components, see
+[What is ROCm?](https://rocm.docs.amd.com/en/latest/what-is-rocm.html).
+
+Since the release of ROCm 6.1, several key topics have been added to the documentation, including:
+
+- [AMD Instinct MI300X workload tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html)
+- [AMD Instinct MI300X system tuning guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
+- [AMD Instinct MI300A system tuning guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300a.html)
+- [Using ROCm for AI](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/index.html)
+- [Using ROCm for HPC](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-hpc/index.html)
+- [Fine-tuning LLMs and inference optimization](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/index.html)
+- [LLVM reference documentation](https://rocm.docs.amd.com/projects/llvm-project/en/latest/)
+
+The following topics have been significantly improved, expanded, or both:
+
+- [HIP programming manual](https://rocm.docs.amd.com/projects/HIP/en/latest/)
+- [Compatibility matrix](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html)
+
+```{note}
+All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the
+[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
+```
diff --git a/tools/autotag/templates/support/6.2.0.md b/tools/autotag/templates/support/6.2.0.md
new file mode 100644
index 000000000..1585ef2b6
--- /dev/null
+++ b/tools/autotag/templates/support/6.2.0.md
@@ -0,0 +1,27 @@
+
+## Operating system and hardware support changes
+
+ROCm 6.2.0 adds support for the following operating system and kernel versions.
+
+- Ubuntu 24.04 LTS (kernel: 6.8 [GA])
+
+- RHEL 8.10 (kernel: 4.18.0-544)
+
+- SLES 15 SP6 (kernel: 6.4)
+
+ROCm 6.2.0 marks the end of support (EoS) for:
+
+- Ubuntu 22.04.3
+
+- RHEL 9.2
+
+- RHEL 8.8
+
+- SLES 15 SP4
+
+- CentOS 7.9
+
+ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).
+
+See the [Compatibility matrix](https://rocm-stg.amd.com/en/docs/6.2.0/compatibility/compatibility-matrix.html) for an
+overview of supported operating systems and hardware architectures.
diff --git a/tools/autotag/templates/upcoming_changes/6.2.0.md b/tools/autotag/templates/upcoming_changes/6.2.0.md
new file mode 100644
index 000000000..191951f81
--- /dev/null
+++ b/tools/autotag/templates/upcoming_changes/6.2.0.md
@@ -0,0 +1,79 @@
+
+## ROCm known issues
+
+ROCm known issues are noted on [{fab}`github` GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known
+issues related to individual components, review the [Detailed component changes](detailed-component-changes).
+
+### Default processor affinity behavior for helper threads
+
+Processor affinity determines which cores ROCm helper threads run on. By default, ROCm
+helper threads are spawned on all available cores, ignoring the parent thread's processor affinity. This can lead to
+threads competing for available cores, which may result in suboptimal performance. This behavior occurs if
+the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If
+`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when
+creating helper threads. In that case, make sure the parent's affinity mask contains enough cores to accommodate
+the additional helper threads. Depending on the affinity settings of the software environment,
+batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing
+the setting may improve performance.
+
+To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the
+`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:
+
+```{code} shell
+export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
+```
+
+To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable
+as follows:
+
+``` shell
+export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
+```
+
+Or restore the default behavior by unsetting the variable:
+
+``` shell
+
+unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
+```
+
+If unsure of the default processor affinity settings for your environment, run the following command from the shell:
+
+``` shell
+
+bash -c "taskset -p \$\$"
+```
+
+### KFDTest failure on Instinct MI300X with Oracle Linux 8.9
+
+The `KFDEvictTest.QueueTest` test fails on the MI300X platform during KFD (Kernel Fusion Driver) tests, preventing the full
+suite from executing properly. This issue is suspected to be hardware-related.
+
+### Bandwidth limitation in gang and non-gang modes on Instinct MI300A
+
+The expected target peak non-gang performance (~60 GB/s) and target peak gang performance (~90 GB/s) are not achieved. Both gang
+and non-gang performance are observed to be limited to 45 GB/s.
+
+This issue will be addressed in a future ROCm release.
+
+### rocm-llvm-alt
+
+ROCm provides an optional package, `rocm-llvm-alt`, that contains a closed-source compiler for
+users interested in additional closed-source CPU optimizations. This feature is not functional in
+the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an
+LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows
+use of the closed-source compiler. It is recommended to compile using the default open-source
+compiler, which generates high-quality AMD CPU and AMD GPU code.
+
+## ROCm upcoming changes
+
+This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review
+the [Detailed component changes](detailed-component-changes).
+
+### rocm-llvm-alt
+
+The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the
+functionality provided by the closed-source compiler should transition to the open-source compiler.
+Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by
+the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler
+has been removed*".