diff --git a/RELEASE.md b/RELEASE.md index 8a23c474e..fd86f107c 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,4 +1,4 @@ -# Release notes +# ROCm 6.2.0 release notes @@ -11,671 +11,1998 @@ -This page contains the release notes for AMD ROCm™ Software. - -------------------- - -## ROCm 6.1.2 - -ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries. - -### OS support - -ROCm 6.1.2 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).### Library changes in ROCm 6.1.2 - -| Category | Group | Name | Version | Repository | -|----------|-------|------|---------|------------| -| Libraries | Machine Learning and Computer Vision | [composable_kernel](https://rocm.docs.amd.com/projects/ROCm/composable_kernel/en/latest) | [0.2.0](https://github.com/ROCm/composable_kernel/releases/tag/rocm-6.1.2) | [ROCm/composable_kernel](https://github.com/ROCm/ROCm/composable_kernel) | -| | | [AMDMIGraphX](https://rocm.docs.amd.com/projects/ROCm/AMDMIGraphX/en/latest) | [2.9](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.1.2) | [ROCm/AMDMIGraphX](https://github.com/ROCm/ROCm/AMDMIGraphX) | -| | | [MIOpen](https://rocm.docs.amd.com/projects/ROCm/MIOpen/en/latest) | [3.1.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.1.2) | [ROCm/MIOpen](https://github.com/ROCm/ROCm/MIOpen) | -| | | [MIVisionX](https://rocm.docs.amd.com/projects/ROCm/MIVisionX/en/latest) | [2.5.0](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.1.2) | [ROCm/MIVisionX](https://github.com/ROCm/ROCm/MIVisionX) | -| | | [rpp](https://rocm.docs.amd.com/projects/ROCm/rpp/en/latest) | [1.5.0](https://github.com/ROCm/rpp/releases/tag/rocm-6.1.2) | [ROCm/rpp](https://github.com/ROCm/ROCm/rpp) | -| | Communication | [rccl](https://rocm.docs.amd.com/projects/ROCm/rccl/en/latest) | [2.18.6](https://github.com/ROCm/rccl/releases/tag/rocm-6.1.2) | [ROCm/rccl](https://github.com/ROCm/ROCm/rccl) | -| | | 
[hipBLAS](https://rocm.docs.amd.com/projects/ROCm/hipBLAS/en/latest) | [2.1.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.1.2) | [ROCm/hipBLAS](https://github.com/ROCm/ROCm/hipBLAS) | -| | | [hipBLASLt](https://rocm.docs.amd.com/projects/ROCm/hipBLASLt/en/latest) | [0.7.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.1.2) | [ROCm/hipBLASLt](https://github.com/ROCm/ROCm/hipBLASLt) | -| | | [hipFFT](https://rocm.docs.amd.com/projects/ROCm/hipFFT/en/latest) | [1.0.14](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.1.2) | [ROCm/hipFFT](https://github.com/ROCm/ROCm/hipFFT) | -| | | [hipRAND](https://rocm.docs.amd.com/projects/ROCm/hipRAND/en/latest) | [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.1.2) | [ROCm/hipRAND](https://github.com/ROCm/ROCm/hipRAND) | -| | | [hipSOLVER](https://rocm.docs.amd.com/projects/ROCm/hipSOLVER/en/latest) | [2.1.1](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.1.2) | [ROCm/hipSOLVER](https://github.com/ROCm/ROCm/hipSOLVER) | -| | | [hipSPARSE](https://rocm.docs.amd.com/projects/ROCm/hipSPARSE/en/latest) | [3.0.1](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.1.2) | [ROCm/hipSPARSE](https://github.com/ROCm/ROCm/hipSPARSE) | -| | | [hipSPARSELt](https://rocm.docs.amd.com/projects/ROCm/hipSPARSELt/en/latest) | [0.2.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.1.2) | [ROCm/hipSPARSELt](https://github.com/ROCm/ROCm/hipSPARSELt) | -| | | [rocALUTION](https://rocm.docs.amd.com/projects/ROCm/rocALUTION/en/latest) | [3.1.1](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.1.2) | [ROCm/rocALUTION](https://github.com/ROCm/ROCm/rocALUTION) | -| | | [rocBLAS](https://rocm.docs.amd.com/projects/ROCm/rocBLAS/en/latest) | [4.1.2](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) | [ROCm/rocBLAS](https://github.com/ROCm/ROCm/rocBLAS) | -| | | [rocFFT](https://rocm.docs.amd.com/projects/ROCm/rocFFT/en/latest) | [1.0.27](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.1.2) | 
[ROCm/rocFFT](https://github.com/ROCm/ROCm/rocFFT) | -| | | [rocRAND](https://rocm.docs.amd.com/projects/ROCm/rocRAND/en/latest) | [3.0.1](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.1.2) | [ROCm/rocRAND](https://github.com/ROCm/ROCm/rocRAND) | -| | | [rocSOLVER](https://rocm.docs.amd.com/projects/ROCm/rocSOLVER/en/latest) | [3.25.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.1.2) | [ROCm/rocSOLVER](https://github.com/ROCm/ROCm/rocSOLVER) | -| | | [rocSPARSE](https://rocm.docs.amd.com/projects/ROCm/rocSPARSE/en/latest) | [3.1.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.1.2) | [ROCm/rocSPARSE](https://github.com/ROCm/ROCm/rocSPARSE) | -| | | [rocWMMA](https://rocm.docs.amd.com/projects/ROCm/rocWMMA/en/latest) | [1.4.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.1.2) | [ROCm/rocWMMA](https://github.com/ROCm/ROCm/rocWMMA) | -| | | [Tensile](https://rocm.docs.amd.com/projects/ROCm/Tensile/en/latest) | [4.40.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.2) | [ROCm/Tensile](https://github.com/ROCm/ROCm/Tensile) | -| | Primitives | [hipCUB](https://rocm.docs.amd.com/projects/ROCm/hipCUB/en/latest) | [3.1.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.1.2) | [ROCm/hipCUB](https://github.com/ROCm/ROCm/hipCUB) | -| | | [hipTensor](https://rocm.docs.amd.com/projects/ROCm/hipTensor/en/latest) | [1.2.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.1.2) | [ROCm/hipTensor](https://github.com/ROCm/ROCm/hipTensor) | -| | | [rocPRIM](https://rocm.docs.amd.com/projects/ROCm/rocPRIM/en/latest) | [3.1.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.1.2) | [ROCm/rocPRIM](https://github.com/ROCm/ROCm/rocPRIM) | -| | | [rocThrust](https://rocm.docs.amd.com/projects/ROCm/rocThrust/en/latest) | [3.0.1](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.1.2) | [ROCm/rocThrust](https://github.com/ROCm/ROCm/rocThrust) | -| | | [ROCdbgapi](https://rocm.docs.amd.com/projects/ROCm/ROCdbgapi/en/latest) | 
[0.71.0](https://github.com/ROCm/ROCdbgapi/releases/tag/rocm-6.1.2) | [ROCm/ROCdbgapi](https://github.com/ROCm/ROCm/ROCdbgapi) | -| | | [rocm-cmake](https://rocm.docs.amd.com/projects/ROCm/rocm-cmake/en/latest) | [0.12.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.1.2) | [ROCm/rocm-cmake](https://github.com/ROCm/ROCm/rocm-cmake) | - -#### AMDMIGraphX - -MIGraphX 2.9 for ROCm 6.1.2 - -##### Additions - -* Added FP8 support -* Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch -* Added support for the `Hardmax`, `DynamicQuantizeLinear`, `Qlinearconcat`, `Unique`, `QLinearAveragePool`, `QLinearSigmoid`, `QLinearLeakyRelu`, `QLinearMul`, `IsInf` operators -* Created web site examples for `Whisper`, `Llama-2`, and `Stable Diffusion 2.1` -* Created examples of using the ONNX Runtime MIGraphX Execution Provider with the `InceptionV3` and `Resnet50` models -* Updated operators to support ONNX Opset 19 -* Enable fuse_pointwise and fuse_reduce in the driver -* Add support for dot-(mul)-softmax-dot offloads to MLIR -* Added Blas auto-tuning for GEMMs -* Added dynamic shape support for the multinomial operator -* Added fp16 to accuracy checker -* Added initial code for running on Windows OS - -##### Optimizations - -* Improved the output of migraphx-driver command -* Documentation now shows all environment variables -* Updates needed for general stride support -* Enabled Asymmetric Quantization -* Added ScatterND unsupported reduction modes -* Rewrote softmax for better performance -* General improvement to how quantization is performed to support INT8 -* Used problem_cache for gemm tuning -* Improved performance by always using rocMLIR for quantized convolution -* Improved group convolutions by using rocMLIR -* Improved accuracy of fp16 models -* ScatterElements unsupported reduction -* Added concat fusions -* Improved INT8 support to include UINT8 -* Allow reshape ops between dq and quant_op -* Improve dpp reductions on navi -* Have the accuracy checker 
print the whole final buffer -* Added support for handling dynamic Slice and ConstantOfShape ONNX operators -* Add support for the dilations attribute to Pooling ops -* Add layout attribute support for LSTM operator -* Improved performance by removing contiguous for reshapes -* Handle all slice input variations -* Add scales attribute parse in upsample for older opset versions -* Added support for uneven Split operations -* Improved unit testing to run in python virtual environments - -##### Fixes - -* Fixed outstanding issues in autogenerated documentation -* Update model zoo paths for examples -* Fixed promote_literals_test by using additional if condition -* Fixed export API symbols from dynamic library -* Fixed bug in pad operator from dimension reduction -* Fixed using the LD to embed files and enable by default when building shared libraries on linux -* fixed get_version() -* Fixed Round operator inaccuracy -* Fixed wrong size check when axes not present for slice -* Set the .SO version correctly - -##### Changes - -* Cleanup LSTM and RNN activation functions -* Placed gemm_pointwise at a higher priority than layernorm_pointwise -* Updated README to mention the need to include GPU_TARGETS when building MIGraphX - -##### Removals - -* Removed unused device kernels from Gather and Pad operators -* Removed int8x4 format - -#### composable_kernel +The release notes provide a comprehensive summary of changes since the previous ROCm release. 
-CK 0.2.0 for ROCm 6.1.2 +- [Release highlights](release-highlights) -##### Fixes +- [Operating system and hardware support changes](operating-system-and-hardware-support-changes) -* Fixed a bug in 6-dimensional kernels (#555) -* Fixed a test case failure with grouped convolution backward weight (#524) +- [ROCm components versioning](rocm-components) -##### Optimizations +- [Detailed component changes](detailed-component-changes) -* Improved the performance of the normalization kernel +- [ROCm known issues](rocm-known-issues) -##### Additions +- [ROCm upcoming changes](rocm-upcoming-changes) -* New CMake flags: - * "DL_KERNELS"-* Must be set to "ON" in order to build the gemm_dl and batched_gemm_multi_d_dl instances - * "DTYPES" -- Can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build an instance of the specified data types - * "INSTANCES_ONLY" -- Only builds CK library and instances without tests, examples, or profiler -* New feature: if GPU_TARGETS is not set in the CMake command line, CK will be built for all targets supported by the compiler -* Support for MI300A/MI300X -* Support for AMD RDNA 3 -* New user tutorial (#563) -* Additional instances for irregular GEMM sizes (#560) -* New inter-wave consumer-producer programming model for GEMM kernels (#310) -* GEMM with support multiple elementwise fusions (multi-D) (#534) -* Multi-embeddings support (#542) -* AMD RDNA 3 blockwise GEMM and real GEMM support (#541) -* AMD RDNA grouped convolution backward weight support (#505) -* MaxPool and AvgPool forward (#815); MaxPool backward (#750) +The [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) +provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases. -##### Changes +Release notes for previous ROCm releases are available in earlier versions of the documentation. 
+See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest/release/versions.html). -None +## Release highlights -#### hipBLAS +This section introduces notable new features and improvements in ROCm 6.2. See the +[Detailed component changes](#detailed-component-changes) for individual component changes. -hipBLAS 2.1.0 for ROCm 6.1.2 +### New components -##### Additions +ROCm 6.2.0 introduces the following new components to the ROCm software stack. -* New build option to automatically use hipconfig --platform to determine HIP platform -* Level 1 functions have additional ILP64 API for both C and Fortran (`_64` name - suffix) with int64_t function arguments -* New functions hipblasGetMathMode and hipblasSetMathMode +- **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads + running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line + or a GUI dashboard. For more information, see + [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest). -##### Deprecations +- **Omnitrace** -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU. + It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining + which function and line number are executing. For more information, see + [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/latest). -* USE_CUDA build option; use HIP_PLATFORM=amd or HIP_PLATFORM=nvidia to override hipconfig +- **rocPyDecode** -- A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries, + enabling function calling and data passing between the two languages. The `rocpydecode.so` library, a wrapper, uses + rocDecode APIs written primarily in C/C++ within Python. For more information, see + [rocPyDecode](https://rocm.docs.amd.com/projects/rocPyDecode/en/latest). 
-##### Changes +- **ROCprofiler-SDK** -- ROCprofiler-SDK is a profiling and tracing library for HIP and ROCm applications on AMD ROCm software + used to identify application performance bottlenecks and optimize their performance. The new APIs add restrictions for more + efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use. + ROCprofiler-SDK also provides a tool library to help you write your tool implementations. `rocprofv3` uses this tool library + to profile and trace applications for performance bottlenecks. Examples include API tracing, kernel tracing, and so on. + For more information, see [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest). -* Some Level 2 function argument names have changed from `m` to `n` to match legacy BLAS; there - was no change in implementation. -* Updated client code to use YAML-based testing -* Renamed `.doxygen` and `.sphinx` folders to `doxygen` and `sphinx`, respectively -* Added CMake support for documentation + ```{note} + ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change. + ``` -#### hipBLASLt +### ROCm Offline Installer Creator introduced -hipBLASLt 0.7.0 for ROCm 6.1.2 +The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU +driver, or a combination of the two on a target system without network access. This new tool customizes +multiple unique configurations for use when installing ROCm on a target. 
Other notable features include: -##### Additions +* A lightweight, easy-to-use user interface for configuring the creation of the installer -* Added `hipblasltExtSoftmax` extension API -* Added `hipblasltExtLayerNorm` extension API -* Added `hipblasltExtAMax` extension API -* Added `GemmTuning` extension parameter to set split-k by user -* Support for mix precision datatype: fp16/fp8 in with fp16 out +* Support for multiple Linux distributions -##### Deprecations +* Installer support for different ROCm releases and specific ROCm components -* algoGetHeuristic() ext API for GroupGemm will be deprecated in a future release of hipBLASLt +* Optional driver or driver-only installer creation -#### hipCUB +* Optional post-install preferences -hipCUB 3.1.0 for ROCm 6.1.2 +* Lightweight installer packages, which are unique to the preconfigured ROCm setup -##### Changed +* Resolution and inclusion of dependency packages for offline installation -- CUB backend references CUB and Thrust version 2.1.0. -- Updated `HIPCUB_HOST_WARP_THREADS` macro definition to match `host_warp_size` changes from rocPRIM 3.0. -- Implemented `__int128_t` and `__uint128_t` support for radix_sort. +For more information, see +[ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/rocm-install-on-linux/en/docs-6.2.0/install/rocm-offline-installer.html). -##### Fixed +### Math libraries default to Clang instead of HIPCC -- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage. +The default compiler used to build the math libraries on Linux changes from `hipcc` to `amdclang++`. +Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when +building the libraries. 
Applications using the libraries can continue to be compiled using `hipcc` or `amdclang++` as +described in [ROCm compiler reference](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/reference/rocmcc.html). +The math libraries can also be built with `hipcc` using any of the previously available methods (for example, the `CXX` +environment variable, the `CMAKE_CXX_COMPILER` CMake variable, and so on). This change shouldn't affect performance or +functionality. -##### Added +### Framework and library changes -- Added interface `DeviceMemcpy::Batched` for batched memcpy from rocPRIM and CUB. +This section highlights updates to supported deep learning frameworks and notable third-party library optimizations. -#### hipFFT +#### Additional PyTorch and TensorFlow support -hipFFT 1.0.14 for ROCm 6.1.2 +ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16. -##### Changes +See [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html) +and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html) +for installation instructions. -* When building hipFFT from source, rocFFT code no longer needs to be initialized as a git submodule. +Refer to the +[Third-party support matrix](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/reference/3rd-party-support-matrix.html#deep-learning) +for a comprehensive list of third-party frameworks and libraries supported by ROCm. -##### Fixes +#### Optimized framework support for OpenXLA -* Fixed error when creating length-1 plans. +PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler +ecosystem that enables developers to compile and optimize models from all leading ML frameworks. 
For more information, see +[Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html) +and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html). -#### hipRAND +#### PyTorch support for Autocast (automatic mixed precision) -hipRAND 2.10.17 for ROCm 6.1.2 +PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational +workloads and improve performance. Based on the information about the magnitude of values, Autocast can substitute the +original `float32` linear layers and convolutions with their `float16` or `bfloat16` variants. For more information, see +[Automatic mixed precision](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/train-a-model#automatic-mixed-precision-amp). -##### Fixes +#### Memory savings for bitsandbytes model quantization -* Fixed benchmark and unit test builds on Windows +The [ROCm-aware bitsandbytes library](https://github.com/ROCm/bitsandbytes) is a lightweight Python wrapper around HIP +custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions. +ROCm 6.2.0 introduces the following bitsandbytes changes: -#### hipSOLVER +- `Int8` matrix multiplication is enabled, and it includes the following functions: + - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization. + - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation. + - `igemmlt` – new function for GEMM computation A*B^T. It uses + [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.2.0/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations. 
+ - `dequant_mm` – dequantizes output matrix to original data type using scaling factors from vector-wise quantization. +- Blockwise quantization – input tensors are quantized for a fixed block size. +- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled. +- 8-bit and 32-bit optimizers are enabled. -hipSOLVER 2.1.1 for ROCm 6.1.2 +```{note} +These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and +features to run them. +``` -##### Changed +For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/model-quantization.html). -- `BUILD_WITH_SPARSE` now defaults to OFF on Windows. +#### Improved vLLM support -##### Fixed +ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding +capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama. +ROCm 6.2.0 adds support for the following vLLM features: -- Fixed benchmark client build when `BUILD_WITH_SPARSE` is OFF. +- MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP, + use `--distributed-executor-backend=mp`. The default is still in flux and depends on the vLLM commit in use. -#### hipSPARSE +- FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements. + The QUARK quantizer currently only supports Llama. -hipSPARSE 3.0.1 for ROCm 6.1.2 +- Triton Flash Attention: -##### Fixes + ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this + setting using the `VLLM_USE_FLASH_ATTN_TRITON=False` environment variable. -* Fixes to the build chain +- PyTorch TunableOp: -#### hipSPARSELt + Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later. 
-hipSPARSELt 0.2.0 for ROCm 6.1.2 +For more information about enabling these features, see +[vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference). -##### Added +ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing. +These features include: -- Support Matrix B is a Structured Sparsity Matrix. +- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used + in the model's execution. It only supports Llama because the QUARK quantizer currently only supports Llama. -#### hipTensor +- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention + computation in large-scale models. This benefits all workloads in `FP16` configurations. -hipTensor 1.2.0 for ROCm 6.1.2 +To enable these experimental new features, see +[vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference). +Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines +all the accessible features, and the `vllm/Dockerfile.rocm` file can be used to build the container image. -##### Additions +### Enhanced performance tuning on AMD Instinct accelerators -* API support for permutation of rank 4 tensors: f16 and f32 -* New datatype support in contractions of rank 4: f16, bf16, complex f32, complex f64 -* Added scale and bilinear contraction samples and tests for new supported data types -* Added permutation samples and tests for f16, f32 types +ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing. +The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. 
It includes +detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these +accelerators for optimal performance. For more information, see +[AMD MI300X tuning guides](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/index.html) and +[AMD MI300A system optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300x.html). -##### Fixes +### Removed clang-ocl -* Fixed bug in contraction calculation with data type f32 +As of version 6.2, ROCm no longer provides the `clang-ocl` package. +See the [clang-ocl README](https://github.com/ROCm/clang-ocl). -#### MIOpen +### ROCm documentation changes -MIOpen 3.1.0 for ROCm 6.1.2 +The documentation for the ROCm components has been reorganized and reformatted in a standard look and feel. This +improves the usability and readability of the documentation. For more information about the ROCm components, see +[What is ROCm?](https://rocm.docs.amd.com/en/docs-6.2.0/what-is-rocm.html). 
-##### Added - -- CK-based 2d/3d convolution solvers to support nchw/ncdhw layout -- Fused solver for Fwd Convolution with Residual, Bias and activation -- AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver -- Forward, backward data and backward weight convolution solver with fp8/bfp8 -- check for packed tensors for convolution solvers -- Integrate CK's layer norm -- Combine gtests into single binary - -##### Fixed - -- fix for backward passes bwd/wrw for CK group conv 3d -- Fixed out-of-bounds memory access : ConvOclDirectFwdGen -- fixed build failure due to hipRTC - -##### Changed - -- Standardize workspace abstraction -- Use split CK libraries - -##### Removed - -- clamping to MAX from CastTensor used in Bwd and WrW convolution - -#### MIVisionX - -MIVisionX for ROCm 6.1.2 - -##### Additions - -* CTest: Tests for install verification -* Hardware support updates -* Doxygen support for API documentation - -##### Optimizations - -* CMakeList Cleanup -* Readme - -##### Changes - -* rocAL: PyBind Link to prebuilt library - * PyBind11 - * RapidJSON -* Setup Updates -* RPP - Use package install -* Dockerfiles: Updates & bugfix -* CuPy - No longer installed with setup.py - -##### Fixes - -* rocAL bug fix and updates - -##### Tested Configurations - -* Windows `10` / `11` -* Linux distribution - * Ubuntu - `20.04` / `22.04` - * CentOS - `7` / `8` - * RHEL - `8` / `9` - * SLES - `15-SP4` -* ROCm: rocm-core - `5.7.0.50700-6` -* miopen-hip - `2.20.0.50700-63` -* MIGraphX - `2.7.0.50700-63` -* Protobuf - [V3.12.4](https://github.com/protocolbuffers/protobuf/releases/tag/v3.12.4) -* OpenCV - [4.6.0](https://github.com/opencv/opencv/releases/tag/4.6.0) -* RPP - [1.5.0] -* FFMPEG - [n4.4.2](https://github.com/FFmpeg/FFmpeg/releases/tag/n4.4.2) -* Dependencies for all preceding packages -* MIVisionX setup script - `V2.6.1` - -##### Known Issues - -* OpenCV 4.X support for some applications is missing -* MIVisionX package install requires manual 
prerequisites installation - -#### rccl - -RCCL 2.18.6 for ROCm 6.1.2 - -##### Changed - -- Reduced NCCL_TOPO_MAX_NODES to limit stack usage and avoid overflow - -#### rocALUTION - -rocALUTION 3.1.1 for ROCm 6.1.2 - -##### Additions - -* `TripleMatrixProduct` functionality for `GlobalMatrix` -* Multi-Node/GPU support for `UA-AMG`, `SA-AMG` and `RS-AMG` -* Iterative ILU0 preconditioner `ItILU0` -* Iterative triangular solve, selectable via `SolverDecr` class - -##### Deprecations - -* `LocalMatrix::AMGConnect` -* `LocalMatrix::AMGAggregate` -* `LocalMatrix::AMGPMISAggregate` -* `LocalMatrix::AMGSmoothedAggregation` -* `LocalMatrix::AMGAggregation` -* `PairwiseAMG` - -##### Known Issues - -* `PairwiseAMG` does currently not support matrix sizes that exceed int32 range -* `PairwiseAMG` might fail building the hierarchy on certain input matrices - -#### rocBLAS - -rocBLAS 4.1.2 for ROCm 6.1.2 - -##### Fixes - -* Fixes BF16 TT get_solutions - -##### Optimizations - -* Tune gfx942 BBS TN, TT - -#### ROCdbgapi - -rocm-dbgapi 0.71.0 for ROCm 6.1.2 - -##### Added - -- Add support for gfx940, gfx941 and gfx942 architectures. - -#### rocFFT - -rocFFT 1.0.27 for ROCm 6.1.2 - -##### Fixes - -* Fixed kernel launch failure on execute of very large odd-length real-complex transforms. - -##### Additions - -* Enable multi-gpu testing on systems without direct GPU-interconnects - -#### rocm-cmake - -rocm-cmake 0.12.0 for ROCm 6.1.2 - -##### Changed - -- ROCMSphinxDoc: Allow separate source and config directories. -- ROCMCreatePackage: Allow additional `PROVIDES` on header-only packages. -- ROCMInstallTargets: Don't install executable targets by default for ASAN builds. -- ROCMTest: Add RPATH for installed tests. -- Finalize rename to ROCmCMakeBuildTools - -##### Fixed - -- ROCMClangTidy: Fixed invalid list index. -- Test failures when ROCM_CMAKE_GENERATOR is empty. 
- -#### rocPRIM - -rocPRIM 3.1.0 for ROCm 6.1.2 - -##### Additions - -* New primitive: `block_run_length_decode` -* New primitive: `batch_memcpy` - -##### Changes - -* Renamed: - * `scan_config_v2` to `scan_config` - * `scan_by_key_config_v2` to `scan_by_key_config` - * `radix_sort_config_v2` to `radix_sort_config` - * `reduce_by_key_config_v2` to `reduce_by_key_config` - * `radix_sort_config_v2` to `radix_sort_config` -* Removed support for custom config types for device algorithms -* `host_warp_size()` was moved into `rocprim/device/config_types.hpp`; it now uses either `device_id` or - a `stream` parameter to query the proper device and a `device_id` out parameter - * The return type is `hipError_t` -* Added support for `__int128_t` in `device_radix_sort` and `block_radix_sort` -* Improved the performance of `match_any`, and `block_histogram` which uses it - -##### Deprecations - -* Removed `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config`, and - `radix_sort_config` - -##### Fixes - -* Build issues with `rmake.py` on Windows when using VS 2017 15.8 or later (due to a breaking fix with - extended aligned storage) - -#### rocRAND - -rocRAND 3.0.1 for ROCm 6.1.2 - -##### Fixes - -* Implemented workaround for regressions in XORWOW and LFSR on MI200 - -#### rocSOLVER - -rocSOLVER 3.25.0 for ROCm 6.1.2 - -##### Added - -- Eigensolver routines for symmetric/hermitian matrices using Divide & Conquer and Jacobi algorithm: - - SYEVDJ (with batched and strided\_batched versions) - - HEEVDJ (with batched and strided\_batched versions) -- Generalized symmetric/hermitian-definite eigensolvers using Divide & Conquer and Jacobi algorithm: - - SYGVDJ (with batched and strided\_batched versions) - - HEGVDJ (with batched and strided\_batched versions) - -##### Changed - -- Relaxed array length requirements for GESVDX with `rocblas_srange_index`. - -##### Removed - -- Removed gfx803 and gfx900 from default build targets. 
- -##### Fixed - -- Corrected singular vector normalization in BDSVDX and GESVDX -- Fixed potential memory access fault in STEIN, SYEVX/HEEVX, SYGVX/HEGVX, BDSVDX and GESVDX - -#### rocSPARSE - -rocSPARSE 3.1.2 for ROCm 6.1.2 - -##### Additions - -* New LRB algorithm to SpMV, supporting CSR format -* rocBLAS as now an optional dependency for SDDMM algorithms -* Additional verbose output for `csrgemm` and `bsrgemm` - -##### Optimizations - -* Triangular solve with multiple rhs (SpSM, csrsm, ...) now calls SpSV, csrsv, etcetera when nrhs equals 1 -* Improved user manual section *Installation and Building for Linux and Windows* -* Improved SpMV in CSR format on MI300 - -#### rocThrust - -rocThrust 3.0.1 for ROCm 6.1.2 - -##### Fixes - -* Ported a fix from thrust 2.2 that ensures `thrust::optional` is trivially copyable. - -#### rocWMMA - -rocWMMA 1.4.0 for ROCm 6.1.2 - -##### Additions - -* Added bf16 support for hipRTC sample - -##### Changes - -* Changed Clang C++ version to C++17 -* Updated rocwmma_coop API -* Linked rocWMMA to hiprtc - -##### Fixes - -* Fixed compile/runtime arch checks -* Built all test in large code model -* Removed inefficient branching in layout loop unrolling - -#### rpp - -rpp for ROCm 6.1.2 - -##### Changes - -* Prerequisites - -##### Tested Configurations - -* Linux distribution - * Ubuntu - `20.04` / `22.04` - * CentOS - `7` - * RHEL - `8`/`9` -* ROCm: rocm-core - `5.5.0.50500-63` -* Clang - Version `5.0.1` and above -* CMake - Version `3.22.3` -* IEEE 754-based half-precision floating-point library - Version `1.12.0` - -#### Tensile - -Tensile 4.40.0 for ROCm 6.1.2 - -##### Additions - -- new DisableKernelPieces values to invalidate local read, local write, and global read -- stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3 -- feature to allow testing stream-k grid multipliers -- debug output to check occupancy for Stream-K -- reject condition for FractionalLoad + DepthU!=power of 2 -- new TENSILE_DB 
debugging value to dump the common kernel parameters -- predicate for APU libs -- new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS -- new parameter (ExtraLatencyForLR) to add extra interval between local read and wait -- new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows -- initialization type and general batched options to the rocblas-bench input creator script - -##### Optimizations - -- enabled MFMA + LocalSplitU=4 for MT16x16 -- enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile -- optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size -- auto LdsPad calculation for TileMajorLds + MI16x16 -- auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth - -##### Changes - -- cleared hipErrorNotFound error since it is an expected part of the search -- modified hipcc search path for Linux -- changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor -- changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately -- changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1 -- updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead -- Replaced std::regex with fnmatch()/PathMatchSpec as a workaround to std::regex stack overflow known bug - -##### Fixes - -- hipcc compile append flag parallel-jobs=4 -- race condition in Stream-K that appeared with large grids and small sizes -- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop -- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds -- incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case -- small fix for LdsPad optimization (LdsElement calculation) - -### AMD SMI - -AMD SMI for ROCm 6.1.2 - -#### Additions - -* Added process isolation and clean shader APIs and CLI commands. 
- * `amdsmi_get_gpu_process_isolation()` - * `amdsmi_set_gpu_process_isolation()` - * `amdsmi_set_gpu_clear_sram_data()` -* Added the `MIN_POWER` metric to output provided by `amd-smi static --limit`. - -#### Optimizations - -* Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command. +Since the release of ROCm 6.1, the documentation has added some key topics including: + +- [AMD Instinct MI300X workload tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/workload.html) +- [AMD Instinct MI300X system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300x.html) +- [AMD Instinct MI300A system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html) +- [Using ROCm for AI](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/index.html) +- [Using ROCm for HPC](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-hpc/index.html) +- [Fine-tuning LLMs and inference optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/index.html) +- [LLVM reference documentation](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/) + +The following topics have been significantly improved, expanded, or both: + +- [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-6.2.0/) +- [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) + +```{note} +All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the +[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html). +``` + +## Operating system and hardware support changes + +ROCm 6.2.0 adds support for the following operating system and kernel versions. 
+
+- Ubuntu 24.04 LTS (kernel: 6.8 [GA])
+
+- RHEL 8.10 (kernel: 4.18.0-544)
+
+- SLES 15 SP6 (kernel: 6.4)
+
+ROCm 6.2.0 marks the end of support (EoS) for:
+
+- Ubuntu 22.04.3
+
+- RHEL 9.2
+
+- RHEL 8.8
+
+- SLES 15 SP4
+
+- CentOS 7.9
+
+ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).
+
+See the [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) for an
+overview of supported operating systems and hardware architectures.
+
+## ROCm components
+
+The following table lists ROCm components and their individual versions for ROCm 6.2.0. Follow the
+links in the *Version* column to go to the detailed component changelogs.
+

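+| Category | Group | Name | Version |
+|----------|-------|------|---------|
+| Libraries | Machine learning and computer vision | Composable Kernel | 1.1.0 ⇒ 1.1.0 |
+| | | MIGraphX | 2.9 ⇒ 2.10 |
+| | | MIOpen | 3.1.0 ⇒ 3.2.0 |
+| | | MIVisionX | 2.5.0 ⇒ 3.0.0 |
+| | | rocAL | 1.0.0 ⇒ 1.0.0 |
+| | | rocDecode | 0.6.0 ⇒ 0.6.0 |
+| | | rocPyDecode | 0.1.0 |
+| | | RPP | 1.5.0 ⇒ 1.8.0 |
+| | Communication | RCCL | 2.18.6 ⇒ 2.20.5 |
+| | Math | hipBLAS | 2.1.0 ⇒ 2.2.0 |
+| | | hipBLASLt | 0.7.0 ⇒ 0.8.0 |
+| | | hipFFT | 1.0.14 ⇒ 1.0.15 |
+| | | hipfort | 0.4.0 |
+| | | hipRAND | 2.10.17 ⇒ 2.11.0 |
+| | | hipSOLVER | 2.1.1 ⇒ 2.2.0 |
+| | | hipSPARSE | 3.0.1 ⇒ 3.1.1 |
+| | | hipSPARSELt | 0.2.0 ⇒ 0.2.1 |
+| | | rocALUTION | 3.1.1 ⇒ 3.2.0 |
+| | | rocBLAS | 4.1.2 ⇒ 4.2.0 |
+| | | rocFFT | 1.0.27 ⇒ 1.0.28 |
+| | | rocRAND | 3.0.1 ⇒ 3.1.0 |
+| | | rocSOLVER | 3.25.0 ⇒ 3.26.0 |
+| | | rocSPARSE | 3.1.2 ⇒ 3.2.0 |
+| | | rocWMMA | 1.4.0 ⇒ 1.5.0 |
+| | | Tensile | 4.40.0 ⇒ 4.41.0 |
+| | Primitives | hipCUB | 3.1.0 ⇒ 3.2.0 |
+| | | hipTensor | 1.2.0 ⇒ 1.3.0 |
+| | | rocPRIM | 3.1.0 ⇒ 3.2.0 |
+| | | rocThrust | 3.0.0 ⇒ 3.1.0 |
+| Tools | System management | AMD SMI | 24.5.2 ⇒ 24.6.2 |
+| | | rocminfo | 1.0.0 |
+| | | ROCm Data Center Tool | 0.3.0 ⇒ 1.0.0 |
+| | | ROCm SMI | 7.2.0 ⇒ 7.3.0 |
+| | | ROCm Validation Suite | 1.0.0 ⇒ 1.0.0 |
+| | Performance | Omniperf | 2.0.1 |
+| | | Omnitrace | 1.11.2 |
+| | | ROCm Bandwidth Test | 1.4.0 |
+| | | ROCProfiler | 2.0.0 ⇒ 2.0.0 |
+| | | ROCprofiler-SDK | 0.4.0 |
+| | | ROCTracer | 4.1.0 |
+| | Development | HIPIFY | 17.0.0 ⇒ 18.0.0 |
+| | | ROCdbgapi | 0.71.0 ⇒ 0.75.0 |
+| | | ROCm CMake | 0.12.0 ⇒ 0.13.0 |
+| | | ROCm Debugger (ROCgdb) | 14.1 ⇒ 14.2 |
+| | | ROCr Debug Agent | 2.0.3 |
+| Compilers | | HIPCC | 1.0.0 ⇒ 1.1.1 |
+| | | llvm-project | 17.0.0 ⇒ 18.0.0 |
+| Runtimes | | HIP | 6.1 ⇒ 6.2.0 |
+| | | ROCr Runtime | 1.13.0 ⇒ 1.14.0 |
+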
+ +## Detailed component changes + +The following sections describe key changes to ROCm components. + +### **AMD SMI** (24.6.2) #### Changes -* Updated `amismi_get_power_cap_info` to return values in uW instead of W. -* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. -* Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks. - -#### Fixes - -* `amdsmi_get_gpu_board_info()` no longer returns junk character strings. -* `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices. -* Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices. -* Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls. +- Added the following functionality: + - `amd-smi dmon` is now available as an alias to `amd-smi monitor`. + - An optional process table under `amd-smi monitor -q`. + - Handling to detect VMs with passthrough configurations in CLI tool. + - Process Isolation and Clear SRAM functionality to the CLI tool for VMs. + - Added Ring Hang event. +- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`. +- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`. #### Removals -* Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release. +- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported. +- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`. + +#### Optimizations + +- Updated CLI error strings to specify invalid device type queried. +- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest. + - Added `amdsmi.h` and `amdsmi_interface.py`. + - `amdsmi_clk_info_t` struct + - Added `AMDSMI` prefix to multiple structures. 
+- Updated `dpm_policy` references to `soc_pstate`. +- Updated `amdsmi_get_gpu_board_info()` product_name to fallback to `pciids` file. +- Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`. +- Updated CLI voltage curve command output. + +#### Resolved issues + +- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format. +- `amdsmi_get_gpu_board_info()` no longer returns junk character strings. +- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`. +- Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool. When running + `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any + other subcommand that access the GPUs) a mutex would lock and not return requiring either a + clear of the mutex in `/dev/shm` or rebooting the machine. + +#### Known issues + +- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work. + They will be supported in a future release. ```{note} -See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information. +See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md) +on GitHub for more information. ``` -### ROCm SMI - -ROCm SMI for ROCm 6.1.2 - -#### Additions - -* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. - -#### Fixes - -* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112). -* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware. - - - -### HIPCC - -HIPCC for ROCm 6.1.2 +### **Composable Kernel** (1.1.0) #### Changes -* **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. 
No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`.
-* **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly.
-* **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`.
+- Added support for:
+  - Permute scale for any dimension (#1198).
+  - Combined elementwise op (#1217).
+  - Multi D in grouped convolution backward weight (#1280).
+  - K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280).
+  - Large batch in grouped convolution forward (#1332).
+- Added `CK_TILE` layernorm example (#1339).
+- The `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as the ROCm backend.
+
+#### Optimizations
+
+- Added support for universal GEMM in grouped convolution forward (#1320).
+- Added optimizations for low M and N in grouped convolution backward weight (#1303).
+- Added a functional enhancement and compiler bug fix for the FlashAttention forward kernel.
+- Optimized and tuned `FP8` GEMM performance (#1384).
+- Optimized FlashAttention backward pass performance (#1397).
+
+### **HIP** (6.2.0)
+
+#### Changes
+
+- Added the `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()` and `__any_sync()`. These take
+  a 64-bit integer as an explicit mask argument.
+  - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled
+    unconditionally in a future HIP release.
+ +- Added new HIP APIs: + - `hipGetProcAddress` returns the pointer to driver function, corresponding to the defined driver function symbol. + - `hipGetFuncBySymbol` returns the pointer to device entry function that matches entry function `symbolPtr`. + - `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph. + - `hipGraphInstantiateWithParams` creates an executable graph from a graph. + +- Added a new flag `integrated` -- supported in device property. + + - The integrated flag is added in the struct `hipDeviceProp_t`. On the integrated APU system, the runtime driver + detects and sets this flag to `1`, in which case the API `hipDeviceGetAttribute` returns enum `hipDeviceAttribute_t` for + `hipDeviceAttributeIntegrated` as value 1, for integrated GPU device. + +- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include `. + +- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`. + +#### Resolved issues + +- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime. +- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`. + +#### Upcoming changes + +- The `_sync()` version of crosslane builtins such as `shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release. + +### **hipBLAS** (2.2.0) + +#### Changes + +* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments. +* Added a new ILP64 API for level 1 `_ex` functions. + +* The `install.sh` script now invokes the `rmake.py` script. Made other various improvements to the build scripts. +* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages + `rocblas-dev` and `rocsolver-dev`. +* Updated Linux AOCL dependency to release 4.2 `gcc` build. 
+* Updated Windows `vcpkg` dependencies to release 2024.02.14. + +### **hipBLASLt** (0.8.0) + +#### Changes + +* Added extension APIs: + *`hipblasltExtAMaxWithScale`. + * `GemmTuning` extension parameter to set `wgm` by user. +* Added support for: + * `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype. + * `FP8`/`BF8` input, `FP32/FP16/BF16/F8/BF8` output (gfx94x platform only). + * `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA. +* Added support for gfx110x. + +#### Optimizations + +* Improved library loading time. + +### **HIPCC** (1.1.1) + +#### Changes + +* Split `hipcc` package into two packages for different hardware platforms. + +* Cleaned up references to environment variables. + +* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by + default, instead of their Perl counterparts. + +* Enabled function calls. + +* Added support for generating packages for ROCm stack targeting static libraries. + +#### Resolved issues + +* Implemented numerous bug fixes and quality improvements. + +### **hipCUB** (3.2.0) + +#### Changes + +* Added `DeviceCopy` function for parity with CUB. +* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as + the new optional template argument for `WarpExchange`. + * The potential values for the enum are `WARP_EXCHANGE_SMEM` and + `WARP_EXCHANGE_SHUFFLE`. + * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while + `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations. + * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, + but the `ItemsPerThread` must be a divisor of `WarpSize`. +* Added `tuple.hpp` which defines templates `hipcub::tuple`, + `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`. 
+* Added new overloaded member functions to `BlockRadixSort` and + `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type + (`key_type`) can be sorted via these overloads, if an appropriate decomposer + is passed. The decomposer has to implement `operator(const key_type&)` which + returns a `hipcub::tuple` of references pointing to members of `key_type`. + +* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of + HIP graphs, with several exceptions: + * `CachingDeviceAllocator` + * `GridBarrierLifetime` + * `DeviceSegmentedRadixSort` + * `DeviceRunLengthEncode` + Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs. + +#### Removals + +* Deprecated `debug_synchronous` in hipCUB-2.13.2, and it no longer has any effect. With this release, passing `debug_synchronous` + to the device functions results in a deprecation warning both at runtime and at compile time. + * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time + by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition. + * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition. + +#### Resolved issues + +* Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. + It now derives the accumulator type as the result of the binary operator. + +### **hipFFT** (1.0.15) + +#### Resolved issues + +* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers. +* Prevented C++ exceptions leaking from public API functions. +* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs. 
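The hipFFT item above about keeping C++ exceptions from escaping public API functions follows a common pattern for C-callable libraries, which can be sketched in portable C++. This is an illustration only, not hipFFT's implementation: `sketch_status`, `guard_api_call`, and `sketch_plan_create` are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical status codes standing in for a C API's result enum.
enum sketch_status
{
    SKETCH_SUCCESS        = 0,
    SKETCH_ALLOC_FAILED   = 1,
    SKETCH_INTERNAL_ERROR = 2
};

// Runs `body` and converts any C++ exception into a status code,
// so that nothing propagates across the C-linkage boundary.
template <typename Body>
sketch_status guard_api_call(Body&& body) noexcept
{
    try
    {
        body();
        return SKETCH_SUCCESS;
    }
    catch (const std::bad_alloc&)
    {
        return SKETCH_ALLOC_FAILED;
    }
    catch (...)
    {
        return SKETCH_INTERNAL_ERROR;
    }
}

// Hypothetical public entry point: C linkage, failure reported by return value only.
extern "C" sketch_status sketch_plan_create(std::size_t length)
{
    return guard_api_call([&] {
        if (length == 0)
            throw std::invalid_argument("length must be nonzero");
        std::vector<float> workspace(length); // may throw std::bad_alloc
        (void)workspace;
    });
}
```

With this shape, a caller written in C or Fortran only ever sees a status value: in the sketch, `sketch_plan_create(0)` returns `SKETCH_INTERNAL_ERROR` instead of throwing.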
+
+
+### **HIPIFY** (18.0.0)
+
+#### Changes
+
+- Added support for:
+  - NVIDIA CUDA 12.4.1
+  - cuDNN 9.1.1
+  - LLVM 18.1.6
+- Added full hipBLASLt support.
+
+#### Resolved issues
+
+- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object;
+  affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`.
+
+### **hipRAND** (2.11.0)
+
+#### Changes
+
+* Added support for setting generator output ordering in the C and C++ API.
+* `hiprandCreateGeneratorHost` now dispatches to the host generator in the rocRAND backend instead of returning
+  `HIPRAND_STATUS_NOT_IMPLEMENTED`.
+* Added options to create:
+  * A host generator to the Fortran wrapper.
+  * A host generator to the Python wrapper.
+* Previous versions used the environment variable `ROCRAND_USE_HMM` for internal testing with HMM.
+  The environment variable is now named `HIPRAND_USE_HMM`.
+* In the static library, moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
+* Improved the device API documentation.
+
+#### Removals
+
+* Removed the option to build hipRAND as a submodule to rocRAND.
+* Removed references to, and workarounds for, the deprecated `hcc`.
+* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`.
+  Use `ROCRAND_PATH` instead.
+
+#### Resolved issues
+
+* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`.
+
+### **hipSOLVER** (2.2.0)
+
+#### Changes
+
+- Added compatibility-only functions:
+  - `auxiliary`
+    - `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions`
+  - `getrf`
+    - `hipsolverDnXgetrf_bufferSize`
+    - `hipsolverDnXgetrf`
+  - `getrs`
+    - `hipsolverDnXgetrs`
+- Added support for building on Ubuntu 24.04 and CBL-Mariner.
+- Added `hip::host` to `roc::hipsolver` usage requirements. +- Added functions + - `syevdx`/`heevdx` + - `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize` + - `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx` + - `sygvdx`/`hegvdx` + - `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize` + - `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx` + +- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU. +- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`. + +#### Removals + +- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients. + +### **hipSPARSE** (3.1.1) + +#### Changes + +* Added the missing `hipsparseCscGet()` routine. + +* All internal hipSPARSE functions now exist inside a namespace. +* Match deprecations found in cuSPARSE 12.x.x when using cuSPARSE backend. +* Improved the user manual and contribution guidelines. + +#### Resolved issues + +* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior. + +#### Known Issues + +* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API + and this extra external buffer parameter will be removed in a future release. For now this extra parameter can be + ignored and passed a `nullptr` as it is unused internally by `hipsparseSpSM_solve()`. + +### **hipSPARSELt** (0.2.1) + +#### Optimizations + +* Refined test cases. 
+
+### **hipTensor** (1.3.0)
+
+#### Changes
+
+* Added support for:
+  * Tensor permutation of ranks of 2, 3, 4, 5, and 6
+  * Tensor contraction of M6N6K6: M, N, K up to rank 6
+* Added tests for:
+  * Tensor permutation of ranks of 2, 3, 4, 5, and 6
+  * Tensor contraction of M6N6K6: M, N, K up to rank 6
+  * YAML parsing to support sequential parameters ordering.
+* The `amd-llvm-devel` package is now preferred over the system LLVM library.
+* Preferred compilers changed to `CC=amdclang` `CXX=amdclang++`.
+* Updated actor-critic selection for new contraction kernel additions.
+* Updated installation, programmer's guide, and API reference documentation.
+
+#### Resolved issues
+
+* Fixed an LLVM parsing crash.
+* Fixed a memory consumption issue in complex kernels.
+* Implemented a workaround for a compiler crash during debug builds.
+* Allowed random mode ordering for tensor contractions.
+
+### **llvm-project** (18.0.0)
+
+#### Changes
+
+* LLVM IR
+
+  * The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address
+    spaces.
+
+  * Added `llvm.exp10` intrinsic.
+
+* LLVM infrastructure
+
+  * The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`.
+
+* TableGen
+
+  * Added constructs for debugging TableGen files:
+
+    * `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793).
+
+    * `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716).
+
+* AArch64 backend
+
+  * Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.
+
+* AMDGPU backend
+
+  * `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access.
+
+  * Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics.
+
+  * Implemented `llvm.get.rounding`.
+
+* ARM backend
+
+  * Added support for Cortex-M52 CPUs.
+
+  * Added execute-only support for Armv6-M.
+ +* RISC-V backend + + * The `Zfa` extension version was upgraded to 1.0 and is no longer experimental. + + * `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental. + + * Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`. See + [Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification. + + * Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64. + + * The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`, + `Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension version was upgraded to 1.0 and is no longer experimental. However, + the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions, + `-menable-experimental-extensions` needs to be passed to Clang. + + * `-mcpu=sifive-p450` was added. + + * CodeGen of `RV32E` and `RV64E` is supported experimentally. + + * CodeGen of `ilp32e` and `lp64e` is supported experimentally. + +* X86 backend + + * Added support for the RDMSRLIST and WRMSRLIST instructions. + + * Added support for the WRMSRNS instruction. + + * Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction. + + * Support ISA of CMPCCXADD. + + * Support ISA of AVX-IFMA. + + * Support ISA of AVX-VNNI-INT8. + + * Support ISA of AVX-NE-CONVERT. + + * `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported. + + * `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported. + + * `__builtin_unpredictable` (unpredictable metadata in LLVM IR), is handled by X86 Backend. X86CmovConversion pass now + respects this builtin and does not convert CMOVs to branches. + + * Add support for the PBNDKB instruction. + + * Support ISA of SHA512. + + * Support ISA of SM3. + + * Support ISA of SM4. 
+ + * Support ISA of AVX-VNNI-INT16. + + * `-mcpu=graniterapids-d` is now supported. + + * The `i128` type now matches GCC and clang’s `__int128` type. This mainly benefits external projects such as Rust + which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type + matched and called into `libgcc` helper functions. + + * Support ISA of USER_MSR. + + * Support ISA of AVX10.1-256 and AVX10.1-512. + + * `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported. + + * `-mapxf` is supported. + + * Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow + 32-bit relocations or require 64-bit relocations to the global variable. + + * The medium code model’s code generation was audited to be more similar to the small code model where possible. + +* C API + + * Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions. + + * Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an + alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target + ABI) and provides default values for unspecified settings. + + * Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and + `LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions. 
+
+  * Added the following functions for manipulating operand bundles, as well as building call and invoke instructions
+    that use operand bundles:
+
+    * `LLVMBuildCallWithOperandBundles`
+
+    * `LLVMBuildInvokeWithOperandBundles`
+
+    * `LLVMCreateOperandBundle`
+
+    * `LLVMDisposeOperandBundle`
+
+    * `LLVMGetNumOperandBundles`
+
+    * `LLVMGetOperandBundleAtIndex`
+
+    * `LLVMGetNumOperandBundleArgs`
+
+    * `LLVMGetOperandBundleArgAtIndex`
+
+    * `LLVMGetOperandBundleTag`
+
+  * Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an
+    instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flags.
+
+* CodeGen infrastructure
+
+  * A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e.
+    `-debug-only=isel-dump`). This new debug type can be filtered by function names using
+    `-filter-print-funcs=`, the same flag used to filter IR dumps after each Pass. Note that the
+    existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single
+    function regardless of `-filter-print-funcs`’s values.
+
+* Metadata info
+
+  * Added a new loop metadata `!{!"llvm.loop.align", i32 64}`.
+
+* LLVM tools
+
+  * `llvm-symbolizer` now treats invalid input as an address for which source information is not found.
+
+  * `llvm-readelf` now supports `--extra-sym-info` (`-X`) to display extra information (section name) when showing
+    symbols.
+
+  * `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each
+    object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now
+    `{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each
+    object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers
+    accordingly.
+
+  * `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`.
+
+  * `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers.
+
+  * `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names.
+
+  * `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only.
+
+* LLDB
+
+  * Added the `SBType::FindDirectNestedType` function. It’s useful for formatters to quickly find a directly nested type
+    when it’s known where to search for it, avoiding a more expensive global search via `SBTarget::FindFirstType`.
+
+  * Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying
+    functionality remains unchanged.
+
+  * The `mte_ctrl` register can now be read from AArch64 Linux core files.
+
+  * LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2
+    (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.
+
+  * LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism
+    of specifying DEBUGINFOD servers in the `DEBUGINFOD_URLS` environment variable is used by default. In addition, users
+    can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`,
+    overriding or adding to the environment variable.
+
+  * When running on AArch64 Linux, `lldb-server` now provides register field information for the following registers:
+    `cpsr`, `fpcr`, `fpsr`, `svcr` and `mte_ctrl`.
+
+* Sanitizers
+
+  * HWASan now defaults to detecting use-after-scope bugs.
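The X86 backend entry above notes that the `X86CmovConversion` pass now respects `__builtin_unpredictable`. A minimal sketch of how source code might use that Clang builtin follows. The `SKETCH_UNPREDICTABLE` macro and the `select_larger` function are hypothetical names introduced for illustration, and the fallback simply evaluates the condition on compilers that lack the builtin. Whether a conditional move is actually emitted still depends on the target and the optimization level.

```cpp
#include <cstdint>

// Use the Clang builtin when it is available; otherwise fall back to the
// plain condition so the code still compiles with other compilers.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_unpredictable)
#    define SKETCH_UNPREDICTABLE(cond) __builtin_unpredictable(cond)
#  endif
#endif
#ifndef SKETCH_UNPREDICTABLE
#  define SKETCH_UNPREDICTABLE(cond) (cond)
#endif

// Hypothetical helper: returns the larger of two values. The comparison is
// marked as unpredictable to hint that a branch would likely mispredict.
std::int64_t select_larger(std::int64_t a, std::int64_t b)
{
    if (SKETCH_UNPREDICTABLE(a > b))
        return a;
    return b;
}
```

The hint does not change program semantics: `select_larger` returns the same value with or without the builtin.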
+ +#### Removals + +* LLVM IR + + * The constant expression variants of the following instructions have been removed: + + * `and` + + * `or` + + * `lshr` + + * `ashr` + + * `zext` + + * `sext` + + * `fptrunc` + + * `fpext` + + * `fptoui` + + * `fptosi` + + * `uitofp` + + * `sitofp` + +* RISC-V backend + + * XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and + instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with + SiFive to define and document real extension names for individual CSRs and instructions. + +* Python bindings + + * The Python bindings have been removed. + +* C API + + * The following functions for creating constant expressions have been removed, because the underlying constant + expressions are no longer supported. Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which + will constant fold the operands if possible and create an instruction otherwise: + + * `LLVMConstAnd` + + * `LLVMConstOr` + + * `LLVMConstLShr` + + * `LLVMConstAShr` + + * `LLVMConstZExt` + + * `LLVMConstSExt` + + * `LLVMConstZExtOrBitCast` + + * `LLVMConstSExtOrBitCast` + + * `LLVMConstIntCast` + + * `LLVMConstFPTrunc` + + * `LLVMConstFPExt` + + * `LLVMConstFPToUI` + + * `LLVMConstFPToSI` + + * `LLVMConstUIToFP` + + * `LLVMConstSIToFP` + + * `LLVMConstFPCast` + +* CodeGen infrastructure + + * `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets + should use backwards frame index elimination instead. + + * `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging + instead, which is preferred because it does not depend on accurate kill flags. + +* LLDB + + * `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable. 
+ + * Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value + indicating failure. + +#### Resolved issues + +* AArch64 backend + + * Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto + options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for + this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the + exception of crypto. + +* Windows target + + * The LLVM filesystem class `UniqueID` and function `equivalent()` no longer determine that distinct path + names for the same hard-linked file are equal. This is an intentional tradeoff in a bug fix, where the bug + used to cause distinct files to be considered equivalent on some file systems. This change fixes the issues + [#61401](https://github.com/llvm/llvm-project/issues/61401) and [#22079](https://github.com/llvm/llvm-project/issues/22079). + +#### Known issues + +The compiler may incorrectly compile a program that uses the +``__shfl(var, srcLane, width)`` function when one of the parameters to +the function is undefined along some path to the function. For most functions, +uninitialized inputs cause undefined behavior. + +```{note} +The ``-Wall`` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path. +``` + +As a workaround, initialize the parameters to ``__shfl``. For example: + +```{code-block} cpp +unsigned long istring = 0; // Initialize the input to __shfl +return __shfl(istring, 0, 64); +``` + +### **MIGraphX** (2.10.0) + +#### Changes + +- Added support for ONNX Runtime MIGraphX EP on Windows. +- Added `FP8` Python API. +- Added examples for SD 2.1 and SDXL. +- Added support for BERT to Dynamic Batch. +- Added a `--test` flag in `migraphx-driver` to validate the installation. 
+- Added support for ONNX Operator: Einsum. +- Added `uint8` support in ONNX Operators. +- Added Split-k kernel configurations for performance improvements. +- Added fusion for group convolutions. +- Added rocMLIR conv3d support. +- Added rocgdb to the Dockerfile. +- Changed default location of libraries with release-specific ABI changes. +- Reorganized documentation in GitHub. + +#### Removals + +- Removed the `--model` flag from `migraphx-driver`. + +#### Optimizations + +- Improved ONNX Model Zoo coverage. +- Reorganized `memcpys` with ONNX Runtime to improve performance. +- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast. +- Improved MLIR kernel selection for multibroadcasted GEMMs. +- Improved details of the perf report. +- Enabled MLIR by default for GEMMs with small K. +- Added the ability to specify dot or convolution fusion for MLIR with an environment flag. +- Improved performance on small reductions by doing multiple reductions per wavefront. +- Added algebraic simplifications for mul-add-dot sequences of operations involving constants. +- Used MLIR attention kernels in more cases. +- Enabled MIOpen and CK fusions for MI300 gfx arches. +- Added support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs. +- Added fusion of "contiguous + pointwise" and "layout + pointwise" operations, which may result in performance gains in certain cases. +- Added fusion for "pointwise + layout" and "pointwise + contiguous" operations, which may result in performance gains when using NHWC layout. +- Added fusion for the "pointwise + concat" operation, which may help performance in certain cases. +- Fixed a bug in "concat + pointwise" fusion where output shape memory layout wasn't maintained. +- Simplified the "slice + concat" pattern in SDXL UNet. +- Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if zero point values are zeros. +- Improved inference performance by fusing Reduce to Broadcast. 
+- Added additional information when printing the perf report. +- Improve scalar fusions when not all strides are 0. +- Added support for multi outputs in pointwise ops. +- Improve reduction fusion with reshape operators. +- Use the quantized output when an operator is used again. +- Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all Hardware. + +#### Resolved issues + +- Fixed Super Resolution model verification failed with `FP16`. +- Fixed confusing messages by suppressing them when compiling the model. +- Fixed an issue causing the mod operator with `int8` and `int32` inputs. +- Fixed an issue by preventing the spawning too many threads for constant propagation when parallel STL is not enabled. +- Fixed a bug when running `migraphx-driver` with the `--run 1` option. +- Fixed Layernorm accuracy: calculations in `FP32`. +- Fixed update Docker generator script to ROCm 6.1 to point at Jammy. +- Fixed a floating point exception for `dim (-1)` in the reshape operator. +- Fixed issue with `int8` accuracy and models which were failing due to requiring a fourth bias input. +- Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix. +- Fixed order of operations for `int8` quantization which were causing inaccuracies and slowdowns. +- Fixed an issues during compilation caused by the incorrect constructor being used at compile time. + Removed list initializer of `prefix_scan_sum` which was causing issues during compilation. +- Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control number of threads used for parallel compilation. + +### **MIOpen** (3.2.0) + +#### Changes + +- Added: + - [Conv] bilinear (alpha beta) solvers. + - [Conv] enable bf16 for ck-based solvers. + - [Conv] Add split_k tuning to 2d wrw ck-based solver. + - [MHA] graph API fp8 fwd. + - [RNN] multi-stream as default solution. +- Added TunaNetv2.0 for MI300. 
+- Added Adam and AMP Adam optimizer. + +#### Resolved issues + +- Memory access fault caused by `GemmBwdRest`. +- Context configuration in `GetWorkSpaceSize`. +- Fixes to support huge tensors. + +#### Optimizations + +- Find: improved precision of benchmarking. + +### **MIVisionX** (3.0.0) + +#### Changes + +- Added support for: + - Advanced GPUs + - PreEmphasis Filter augmentation in openVX extensions + - Spectrogram augmentation in openVX extensions + - Downmix and ToDecibels augmentations in openVX extensions + - Resample augmentation and Operator overloading nodes in openVX extensions + - NonSilentRegion and Slice augmentations in openVX extensions + - Mel-Filter bank and Normalize augmentations in openVX extensions + +#### Removals + +- Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL). + +#### Resolved issues + +- Fixed issues with dependencies. + +#### Known issues + +- MIVisionX package install requires manual prerequisites installation. + +### **Omniperf** (2.0.1) + +#### Known issues + +- Error when running Omniperf with an application with command line arguments. As a workaround, create an + intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This + issue is fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/omniperf/issues/347). + +- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: + "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the + environment variable `export ROCPROF=rocprofv2`. 
+ +- Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message: + + "*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.* + + *[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.* + + *Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.* + + *See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*" + + As a workaround, install these Python requirements manually: `pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`. + +### **OpenMP** (17.0.0) + +#### Changes + +- Added basic experimental support for ``libc`` functions on the GPU via the + LLVM C Library for GPUs. +- Added minimal support for calling host functions from the device using the + `libc` interface. +- Added vendor-agnostic OMPT callback support for OpenMP-based device offload. + +#### Removals + +- Removed the "old" device plugins along with support for the `remote` and + `ve` plugins. + +#### Resolved issues + +- Fixed the implementation of `omp_get_wtime` for AMDGPU targets. + +### **RCCL** (2.20.5) + +#### Changes + +- Added support for `fp8` and `rccl_bfloat8`. +- Added support for using HIP contiguous memory. +- Added ROC-TX for host-side profiling. +- Added a new Rome model. +- Added `fp16` and `fp8` cases to unit tests. +- Added a new unit test for main kernel stack size. +- Added the new `-n` option for `topo_expl` to override the number of nodes. +- Improved debug messages of memory allocations. +- Enabled static build. +- Enabled compatibility with: + - NCCL 2.20.5. + - NCCL 2.19.4. +- Performance tuning for some collective operations on MI300. +- Enabled NVTX code in RCCL. +- Replaced `rccl_bfloat16` with `hip_bfloat16`. +- NPKit updates: + - Warm-up iteration removal is no longer enabled by default; you must now opt in. + - Doubled the size of buffers to accommodate more channels. 
+- Modified rings to be rail-optimized topology friendly. + +#### Resolved issues + +- Fixed a bug when configuring RCCL for only LL128 protocol. +- Fixed scratch memory allocation after API change for MSCCL. + +### **rocAL** (1.0.0) + +#### Changes + +- Added tests and samples. + +#### Removals + +- Removed CuPy from `setup.py`. + +#### Optimizations + +- Added setup and install updates. + +#### Resolved issues + +- Minor bug fixes. + +### **rocALUTION** (3.2.0) + +#### Changes + +* Added new file I/O based on rocSPARSE I/O format. +* Added `GetConvergenceHistory` for ItILU0 preconditioner. + +#### Removals + +* Deprecated the following: + * `LocalMatrix::ReadFileCSR` + * `LocalMatrix::WriteFileCSR` + * `GlobalMatrix::ReadFileCSR` + * `GlobalMatrix::WriteFileCSR` + +### **rocBLAS** (4.2.0) + +#### Changes + +* Added an ILP64 API for Level 2 functions and Level 3 `trsm`, for both C and Fortran (`_64` name suffix), with `int64_t` function arguments. +* Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`. +* Added Benchmark class for common timing code. +* Added the `ROCBLAS_DEFAULT_ATOMICS_MODE` environment variable to set the default atomics mode during creation of `rocblas_handle`. +* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`. + +* Updated Linux AOCL dependency to release 4.2 gcc build. +* Updated Windows vcpkg dependencies to release 2024.02.14. +* Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40. + +#### Optimizations + +* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance improved 6-fold for larger problem sizes, as measured on an Instinct MI210 accelerator. + +#### Removals + +* Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next + major release of rocBLAS. 
Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage. + +### **ROCdbgapi** (0.75.0) + +#### Removals +- Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`. + The old names are still accessible but deprecated. + +### **rocDecode** (0.6.0) + +#### Changes + +- Added full H.264 support and bug fixes. + +### **rocFFT** (1.0.28) + +#### Changes + +* Randomly generated accuracy tests are now disabled by default. They can be enabled using + the `--nrand` option (which defaults to 0). + +#### Optimizations + +* Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks + are transformed locally, with global transposes to make remaining dimensions contiguous. + +### **rocm-cmake** (0.13.0) + +#### Changes + +- `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds. + - Package names are no longer pulled from `CPACK__PACKAGE_NAME`. + - Runtime packages will no longer be generated for static builds. + +### **ROCm Data Center Tool** (1.0.0) + +#### Changes + +- Added ROCProfiler `dmon` metrics. +- Added new ECC metrics. +- Added ROCm Validation Suite diagnostic command. +- Fully migrated to AMD SMI. + +#### Removals + +- Removed RASLIB dependency and blobs. +- Removed `rocm_smi_lib` dependency due to migration to AMD SMI. + +### **ROCm Debugger (ROCgdb)** (14.2) + +#### Changes + +- Introduce the coremerge utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile. +- Added support for generating and opening core files for heterogeneous processes. + +### **ROCm SMI** (7.3.0) + +#### Changes + +- Added Partition ID API (`rsmi_dev_partition_id_get(..)`). + +#### Resolved issues + +- Fixed Partition ID CLI output. 
+ +```{note} +See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md) +on GitHub for more information. +``` + +### **ROCm Validation Suite** (1.0.0) + +#### Changes + +* Added stress tests: + + * IET (power) stress test for MI300A. + + * IET (power transition) test for MI300X. + +* Added support: + + * GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output. + + * Trigonometric float and random integer matrix data initialization support. + +* Updated GST performance benchmark test for better numbers. + +### **rocPRIM** (3.2.0) + +#### Changes + +* Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp. +* The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter. + * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan). + This is the same behaviour as before this change. +* Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element. +* Added new public APIs for deriving the resulting type on device-only functions: + * `rocprim::invoke_result` + * `rocprim::invoke_result_t` + * `rocprim::invoke_result_binary_op` + * `rocprim::invoke_result_binary_op_t` +* Added the new `rocprim::batch_copy` function. It is similar to `rocprim::batch_memcpy`, but copies by element, not with `memcpy`. +* Added more test cases, to better cover supported data types. +* Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`. + To sort keys of a user-defined type, a decomposer functor should be passed. 
The decomposer should produce a `rocprim::tuple` + of references to arithmetic types from the key. +* Added `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate. + It iterates over proxies that holds the references to the underlying values, but only allow reading and writing if the predicate is `true`. + It can be instantiated with: + * `rocprim::make_predicate_iterator` + * `rocprim::make_mask_iterator` +* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4, it can be a number between 0 and 32. +* Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed. +* Updated some tests to work with supported data types. + +#### Optimizations + +* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`. +* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warpsize is equal to the items per thread. + +#### Resolved issues + +* Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is + larger than the logical warp size. The test suite has been updated with such cases. +* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`. +* Fixed incorrect output type used in `device_adjacent_difference`. +* Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types. +* Fixed an issue with `device_adjacent_difference`. It now considers both the + input and the output type for selecting the appropriate kernel launch config. 
+ Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage. +* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`. +* Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64. + +#### Removals + +* Deprecated the internal header `detail/match_result_type.hpp`. +* Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`. +* Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name. + +### **ROCProfiler** (2.0.0) + +#### Removals + +- Removed `pcsampler` sample code due to deprecation from version 2. + +### **rocRAND** (3.1.0) + +#### Changes + +* Added `rocrand_create_generator_host`. + * The following generators are supported: + * `ROCRAND_RNG_PSEUDO_MRG31K3P` + * `ROCRAND_RNG_PSEUDO_MRG32K3A` + * `ROCRAND_RNG_PSEUDO_PHILOX4_32_10` + * `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20` + * `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20` + * `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20` + * `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20` + * `ROCRAND_RNG_PSEUDO_XORWOW` + * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32` + * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64` + * `ROCRAND_RNG_QUASI_SOBOL32` + * `ROCRAND_RNG_QUASI_SOBOL64` + * The host-side generators support multi-core processing. On Linux, this requires the TBB (Thread Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives). + * If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread. +* Added the option to create a host generator to the Python wrapper. +* Added the option to create a host generator to the Fortran wrapper +* Added dynamic ordering. 
This ordering is free to rearrange the produced numbers, + which can be specific to devices and distributions. It is implemented for: + * XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry +* Added support for using Clang as the host compiler for alternative platform compilation. +* C++ wrapper: + * Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`. + * Added optional order parameter to the constructor of `mt19937_engine`. +* Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator: + * `rocrand_normal2` + * `rocrand_normal_double2` + * `rocrand_log_normal2` + * `rocrand_log_normal_double2` +* Added `rocrand_create_generator_host_blocking` which dispatches without stream semantics. +* Added host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`. +* Added offset and skipahead functionality to LFSR113 generator. +* Added dynamic ordering for architecture `gfx1102`. + +* For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few + things to be aware of: + - Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph. + - After the generator is created, you may call API functions to set its seed, offset, and order. + - After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph. + - A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour. + - API calls for the Poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs. 
+ - For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp` +* Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However consuming rocRAND is still possible from C++11 as public headers don't make use of the new features. +* Building rocRAND should be faster on machines with multiple CPU cores as the library has been + split to multiple compilation units. +* C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`. +* Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE` +* Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking. + +#### Removals + +* Deprecated the following typedefs. Please use the unified `state_type` alias instead. + * `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state` + * `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state` + * `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state` + * `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state` +* Deprecated the following internal headers: + * `src/rng/distribution/distributions.hpp`. + * `src/rng/device_engines.hpp`. +* Removed references to and workarounds for deprecated hcc. +* Removed support for HIP-CPU. + +#### Known Issues + +- `SOBOL64` and `SCRAMBLED_SOBOL64` generate poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release. + +### **ROCr Runtime** (1.14.0) + +#### Changes + +- Added PC sampling feature (experimental feature). + +### **rocSOLVER** (3.26.0) + +#### Changes + +- Added 64-bit APIs for existing functions: + - GETF2_64 (with `batched` and `strided_batched` versions) + - GETRF_64 (with `batched` and `strided_batched` versions) + - GETRS_64 (with `batched` and `strided_batched` versions) +- Added gfx900 to default build targets. 
+- Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection: + - SYEVDX (with `batched` and `strided_batched` versions) + - HEEVDX (with `batched` and `strided_batched` versions) +- Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection: + - SYGVDX (with `batched` and `strided_batched` versions) + - HEGVDX (with `batched` and `strided_batched` versions) +- Renamed install script arguments of the form `*_dir to *-path`. Arguments of the form `*_dir` remain functional for + backwards compatibility. +- Functions working with arrays of size n - 1 can now accept null pointers when n = 1. + +#### Optimizations + +- Improved performance of Cholesky factorization. +- Improved performance of `splitlu` to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U. + +#### Resolved issues + +- Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues. + +### **rocSPARSE** (3.2.0) + +#### Changes + +* Added a new Merge-Path algorithm to SpMM, supporting CSR format. +* Added support for row order to SpSM. +* Added rocsparseio I/O functionality to the library. +* Added `rocsparse_set_identity_permutation`. + +* Adjusted rocSPARSE dependencies to related HIP packages. +* Binary size has been reduced. +* A namespace has been wrapped around internal rocSPARSE functions and kernels. +* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be nullptr if `nnz` is 0. +* gfx803 target has been removed from address sanitizer builds. + +#### Optimizations + +* SpMV adaptive and LRB algorithms have been further optimized on CSR format +* Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format +* Improved documentation and contribution guidelines. 
+ +#### Resolved issues + +* Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`. + +### **rocThrust** (3.1.0) + +#### Changes + +* Added changes from upstream CCCL/thrust 2.2.0. + * Updated the contents of `system/hip` and `test` with the upstream changes. +* Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`. +* Updated to use `rocprim::device_adjacent_difference` for `adjacent_difference` API call. +* Updated internal use of custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`. +* Updated `adjacent_difference` to make use of `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, it uses `rocprim::adjacent_difference_inplace`. + +#### Known Issues + +* `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device. +* Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs. + +### **rocWMMA** (1.5.0) + +#### Changes + +* Added internal utilities for: + * Element-wise vector transforms. + * Cross-lane vector transforms. +* Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16. +* Added tests for new internal transforms. + +* Improved loading layouts by increasing vector width for fragments with `blockDim > 32`. +* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments. +* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary. +* Refactored entry-point of std library usage to improve hipRTC support. +* Updated installation, programmer's guide and API reference documentation. + +#### Resolved issues + +* Fixed the ordering of some header includes to improve portability. 
+ +### **RPP** (1.8.0) + +#### Changes + +- Prerequisites - ROCm install requires only `--usecase=rocm`. +- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory. +- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP. +- RPP Tensor Audio support on HOST for Spectrogram. +- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version. +- RPP Tensor Audio support on HOST for Mel Filter Bank. +- RPP Tensor Normalize ND support on HOST and `HIP`. + +### **Tensile** (4.41.0) + +#### Changes + +- New tuning script to summarize rocBLAS log file +- New environment variable to test fixed grid size with Stream-K kernels +- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power +- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile +- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation)) +- Support NT flag for global load and store for gfx94x +- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload) +- Add option for rotating buffers timing with cache eviction +- Add predicate for arithmetic intensity +- Add DirectToVgpr + packing for f8/f16 + TLU cases +- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV +- Add test cases for DirectToVgpr + packing +- Add batch support for Stream-K kernels and new test cases +- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic +- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1) +- Support VectorWidthB (new parameter VectorWidthB) +- 
Support VectorWidth + non SourceSwap +- Add test cases for VectorWidthB, VectorWidth + non SourceSwap +- Add code owners file +- New environment variables to dynamically adjust number of CUs used in Stream-K +- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1)) +- Add xf32 option to rocblas-bench input creator + +- Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag) +- Limit build threads based on CPUs/RAM available on system (for tests) +- Update required workspace size for Stream-K, skip kernel initialization when possible +- Use fallback libraries for archs without optimized logic +- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator) +- Remove OCL tests +- Disable HostLibraryTests +- Reduce extended test time by removing extra parameters in the test config files +- Disable InitAccVgprOpt for Stream-K +- Skip sgemm 64bit offset tests for gfx94x +- Skip DTV, DTL, LSU+MFMA tests for gfx908 +- Increase extended test timeout to 720 min +- Update xfail test (1sum tests only failing on gfx90a) +- Update lib logic convertor script +- Test limiting CI threads for only gfx11 +- wGM related kernargs are removed if they are not needed (WGM=-1,0,1) +- Cleanup on unused old code, mostly related to old client +- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it +- Update efficiency script for new architecture and xf32 datatype +- Re-enable negative values for WorkGroupMapping (asm kernel only) +- Disable HW monitor for aquvavanjaram941 +- Pre-apply offsets for strided batch kernels +- Update tensile build with 16 threads + +#### Optimizations + +- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register 
initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py). +- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype) +- Reversed MFMA order in inner loop for odd outer iteration +- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync) +- Enhanced maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU) + +#### Resolved issues + +- Fixed `WorkspaceCheck` implementation when used in rocBLAS. +- Fixed Stream-K partials cache behavior. +- Fixed `MasterSolutionLibrary` indexing for multiple architecture build. +- Fixed memory allocation failure with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size). +- Fixed BufferLoad=False with Stream-K. +- Fixed mismatch issue with `GlobalReadCoalesceGroup`. +- Fixed rocBLAS build failure on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]). +- Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2). +- Fixed inaccurate vgpr allocation for ClusterLocalRead. +- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2. +- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0). +- Fixed HostLibraryTests on gfx942 and gfx941. +- Fixed LLVM crash issue. +- Fixed package names for newer Windows vcpkg msgpack and vcpkg versions. +- Fixed an error with DisableKernelPieces + 32bit ShadowLimit. +- Ignore asm cap check for kernel arg preload for rocm6.0 and older. 
+ +## ROCm known issues + +ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known +issues related to individual components, review the [Detailed component changes](detailed-component-changes). + +### Default processor affinity behavior for helper threads + +Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm +helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to +threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if +the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If +`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when +creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads +by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment, +batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing +the setting may be advantageous to performance. 
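
The rule described above can be restated as a small decision function. The following is a minimal Python sketch, not ROCm code: the helper name `helper_thread_affinity_mode` and its return strings are hypothetical, and it only mirrors the behavior documented in this section (unset or `1` means helper threads are spawned on all cores; `0` means they use the parent's mask). It does not query the ROCr runtime, and `os.sched_getaffinity` is only available on some platforms such as Linux.

```python
import os


def helper_thread_affinity_mode(env=None):
    """Return which documented behavior applies for the given environment.

    Hypothetical helper: it restates the rule from the release notes and
    does not inspect the ROCr runtime itself.
    """
    env = os.environ if env is None else env
    value = env.get("HSA_OVERRIDE_CPU_AFFINITY_DEBUG")
    if value is None or value == "1":
        # Default: helper threads are spawned on all available cores.
        return "all-cores"
    if value == "0":
        # Helper threads use the parent process's core affinity mask.
        return "inherit-parent-mask"
    # Other values are not described in the release notes.
    return "undocumented"


if __name__ == "__main__":
    print("mode:", helper_thread_affinity_mode())
    if hasattr(os, "sched_getaffinity"):
        # Cores the current process is allowed to run on.
        print("parent affinity mask:", sorted(os.sched_getaffinity(0)))
```

When the mode is `inherit-parent-mask`, the printed parent mask is the set of cores that the helper threads would share with the application threads, so it should contain enough cores for both.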
+ +To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the +`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows: + +```{code} shell +export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 +``` + +To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable +as follows: + +``` shell +export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1 +``` + +Or restore the default by unsetting the variable: + +``` shell + +unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG +``` + +If unsure of the default processor affinity settings for your environment, run the following command from the shell: + +``` shell + +bash -c "taskset -p \$\$" +``` + +### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver + +AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video +output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox +G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions. + +The AMDGPU driver shipped with ROCm interferes with the operation of the display card video output. On Dell systems, +this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading +the `amdgpu` driver modules. The loss of video output affects both terminal access when running in `runlevel 3` and GUI access when +running in `runlevel 5`. Server functionality can still be accessed via SSH or other remote connection methods. + +### KFDTest failure on Instinct MI300X with Oracle Linux 8.9 + +The `KFDEvictTest.QueueTest` fails on the MI300X platform during KFD (Kernel Fusion Driver) tests, which prevents the full +suite from executing properly. This issue is suspected to be hardware-related. 
+ +### Bandwidth limitation in gang and non-gang modes on Instinct MI300A + +Expected target peak non-gang performance (~60 GB/s) and target peak gang performance (~90 GB/s) are not achieved. Both gang +and non-gang performance are observed to be limited to 45 GB/s. + +This issue will be addressed in a future ROCm release. + +### rocm-llvm-alt + +ROCm provides an optional package -- `rocm-llvm-alt` -- containing a closed-source compiler for +users interested in additional closed-source CPU optimizations. This feature is not functional in +the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an +LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows +use of the closed-source compiler. It is recommended to compile using the default open-source +compiler, which generates high-quality AMD CPU and AMD GPU code. + +## ROCm upcoming changes + +This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review +the [Detailed component changes](detailed-component-changes). + +### rocm-llvm-alt + +The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the +functionality provided by the closed-source compiler should transition to the open-source compiler. +Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by +the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler +has been removed*". diff --git a/docs/conf.py b/docs/conf.py index 929046e75..70b642277 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -25,16 +25,16 @@ latex_elements = { project = "ROCm Documentation" author = "Advanced Micro Devices, Inc." copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." 
-version = "6.1.2" -release = "6.1.2" +version = "6.2.0" +release = "6.2.0" setting_all_article_info = True all_article_info_os = ["linux", "windows"] all_article_info_author = "" # pages with specific settings article_pages = [ - {"file": "about/release-notes", "os": ["linux", "windows"], "date": "2024-06-04"}, - {"file": "about/changelog", "os": ["linux", "windows"], "date": "2024-06-04"}, + {"file": "about/release-notes", "os": ["linux", "windows"], "date": "2024-08-02"}, + {"file": "about/changelog", "os": ["linux", "windows"], "date": "2024-08-02"}, {"file": "how-to/deep-learning-rocm", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/index", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/install", "os": ["linux"]}, @@ -104,7 +104,7 @@ html_theme = "rocm_docs_theme" html_theme_options = {"flavor": "rocm-docs-home"} html_static_path = ["sphinx/static/css"] -html_css_files = ["rocm_custom.css"] +html_css_files = ["rocm_custom.css", "rocm_rn.css"] html_title = "ROCm Documentation" diff --git a/docs/how-to/system-optimization/mi300a.rst b/docs/how-to/system-optimization/mi300a.rst index 0c19e3245..8fb65e6f8 100644 --- a/docs/how-to/system-optimization/mi300a.rst +++ b/docs/how-to/system-optimization/mi300a.rst @@ -191,6 +191,8 @@ This section describes performance-based settings. echo 20 > /proc/sys/vm/compaction_proactiveness echo 1 > /proc/sys/vm/compact_unevictable_allowed +.. 
_mi300a-processor-affinity: + * **Change affinity of ROCm helper threads** This change prevents internal ROCm threads from having their CPU core affinity mask diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index aefb9fb2c..f1c76c05a 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -9,12 +9,6 @@ subtrees: - file: what-is-rocm.rst - file: about/release-notes.md title: Release notes - subtrees: - - entries: - - file: about/changelog.md - title: Changelog - - url: https://github.com/ROCm/ROCm/labels/Verified%20Issue - title: Known issues - caption: Install entries: diff --git a/docs/sphinx/static/css/rocm_custom.css b/docs/sphinx/static/css/rocm_custom.css index 3649fe537..476a90041 100644 --- a/docs/sphinx/static/css/rocm_custom.css +++ b/docs/sphinx/static/css/rocm_custom.css @@ -1,6 +1,21 @@ +/* Override PyData Sphinx Theme default colors */ +html[data-theme='light'] { + --pst-color-table-row-hover-bg: #E2E8F0; +} + +html[data-theme='dark'] { + --pst-color-table-row-hover-bg: #1E293B; +} + +a svg { + color: var(--pst-color-text-base); +} + +a svg:hover { + color: var(--pst-color-link-hover); +} /* Adds container for big tables, used for Compatibility Matrix */ - .format-big-table { white-space: nowrap; - } +} diff --git a/docs/sphinx/static/css/rocm_rn.css b/docs/sphinx/static/css/rocm_rn.css new file mode 100644 index 000000000..dd82ed4b7 --- /dev/null +++ b/docs/sphinx/static/css/rocm_rn.css @@ -0,0 +1,126 @@ +#rocm-rn-components col { + width: 6rem; +} +#rocm-rn-components col:nth-child(2) { + width: 12rem; +} +#rocm-rn-components td { + white-space: nowrap; +} +#rocm-rn-components td:last-of-type { + text-align: center; +} +#rocm-rn-components a svg { + color: var(--pst-color-text-base); +} +#rocm-rn-components a svg:hover { + color: var(--pst-color-link-hover); +} +#rocm-rn-components .tbody-reverse-zebra tr:nth-child(2n + 1) td { + background-color: var(--pst-color-table-row-zebra-high-bg); +} +#rocm-rn-components 
.tbody-reverse-zebra tr:nth-child(2n) td { + background-color: var(--pst-color-table-row-zebra-low-bg); +} + +#rocm-rn-components:has(tbody.rocm-components-libs th[rowspan]:first-of-type:hover) .rocm-components-libs, +#rocm-rn-components:has(tbody.rocm-components-libs th[rowspan]:first-of-type:hover) .rocm-components-libs td, +#rocm-rn-components:has(tbody.rocm-components-libs th[rowspan]:first-of-type:hover) tbody.rocm-components-libs th { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) .rocm-components-tools, +#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) .rocm-components-tools td, +#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) tbody.rocm-components-tools th { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-compilers th[rowspan]:first-of-type:hover) .rocm-components-compilers, +#rocm-rn-components:has(tbody.rocm-components-compilers th[rowspan]:first-of-type:hover) .rocm-components-compilers td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-runtimes th[rowspan]:first-of-type:hover) .rocm-components-runtimes, +#rocm-rn-components:has(tbody.rocm-components-runtimes th[rowspan]:first-of-type:hover) .rocm-components-runtimes td { + background-color: var(--pst-color-table-row-hover-bg); +} + +#rocm-rn-components:has(tbody.rocm-components-tools th[rowspan]:first-of-type:hover) .rocm-components-tools td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-compilers th[rowspan]:first-of-type:hover) .rocm-components-compilers td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-runtimes th[rowspan]:first-of-type:hover) .rocm-components-runtimes td { + 
background-color: var(--pst-color-table-row-hover-bg); +} + +#rocm-rn-components:has(tbody.rocm-components-ml th[rowspan]:nth-of-type(2):hover) .rocm-components-ml td, +#rocm-rn-components:has(tbody.rocm-components-ml th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-communication th[rowspan]:nth-of-type(2):hover) .rocm-components-communication td, +#rocm-rn-components:has(tbody.rocm-components-communication th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-math th[rowspan]:nth-of-type(2):hover) .rocm-components-math td, +#rocm-rn-components:has(tbody.rocm-components-math th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-primitives th[rowspan]:nth-of-type(2):hover) .rocm-components-primitives td, +#rocm-rn-components:has(tbody.rocm-components-primitives th[rowspan]:nth-of-type(2):hover) .rocm-components-libs th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-dev th[rowspan]:nth-of-type(2):hover) .rocm-components-dev td, +#rocm-rn-components:has(tbody.rocm-components-dev th[rowspan]:nth-of-type(2):hover) .rocm-components-tools th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-perf th[rowspan]:nth-of-type(2):hover) .rocm-components-perf td, +#rocm-rn-components:has(tbody.rocm-components-perf th[rowspan]:nth-of-type(2):hover) .rocm-components-tools th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-system th[rowspan]:nth-of-type(2):hover) 
.rocm-components-system td, +#rocm-rn-components:has(tbody.rocm-components-system th[rowspan]:nth-of-type(2):hover) .rocm-components-tools th:first-of-type { + background-color: var(--pst-color-table-row-hover-bg); +} + +#rocm-rn-components:has(tbody.rocm-components-ml td:hover) .rocm-components-ml th, +#rocm-rn-components:has(tbody.rocm-components-ml td:hover) .rocm-components-libs th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-ml td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-communication td:hover) .rocm-components-communication th, +#rocm-rn-components:has(tbody.rocm-components-communication td:hover) .rocm-components-libs th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-communication td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-math td:hover) .rocm-components-math th, +#rocm-rn-components:has(tbody.rocm-components-math td:hover) .rocm-components-libs th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-math td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-primitives td:hover) .rocm-components-primitives th, +#rocm-rn-components:has(tbody.rocm-components-primitives td:hover) .rocm-components-libs th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-primitives td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-dev td:hover) .rocm-components-dev th, +#rocm-rn-components:has(tbody.rocm-components-dev td:hover) .rocm-components-tools th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-dev td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-perf td:hover) .rocm-components-perf th, 
+#rocm-rn-components:has(tbody.rocm-components-perf td:hover) .rocm-components-tools th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-perf td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-system td:hover) .rocm-components-system th, +#rocm-rn-components:has(tbody.rocm-components-system td:hover) .rocm-components-tools th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-system td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-compilers td:hover) .rocm-components-compilers th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-compilers td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} +#rocm-rn-components:has(tbody.rocm-components-runtimes td:hover) .rocm-components-runtimes th:first-of-type, +#rocm-rn-components:has(tbody.rocm-components-runtimes td:hover) tr:hover > td { + background-color: var(--pst-color-table-row-hover-bg); +} diff --git a/temp.md b/temp.md new file mode 100644 index 000000000..8234d4a67 --- /dev/null +++ b/temp.md @@ -0,0 +1,58 @@ +## Components + +The following table lists ROCm components and their individual versions for ROCm 6.2.0. Find an overview of officially +supported versions of ROCm components, third-party libraries, and frameworks in the +[Compatibility matrix](https://rocm.docs.amd.com/en/latest/release/docs/6.2.0/compatibility/compatibility-matrix). 
+ +| Category | Group | Name | Version | | +|----------|-------|------|---------|:-:| +| **Libraries** | **Machine learning and computer vision** | [Composable Kernel](https://rocm.docs.amd.com/projects/composable_kernel/en/docs/6.2.0) | 1.1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/composable_kernel/releases/tag/rocm-6.2.0) | +| | | [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/docs/6.2.0) | 2.9 ⇒ [2.10](migraphx-2-10-0) | [{fab}`github fa-lg`](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.2.0) | +| | | [MIOpen](https://rocm.docs.amd.com/projects/MIOpen/en/docs/6.2.0) | 3.1.0 ⇒ [3.2.0](miopen-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.2.0) | +| | | [MIVisionX](https://rocm.docs.amd.com/projects/MIVisionX/en/docs/6.2.0) | 2.5.0 ⇒ [3.0.0](mivisionx-3-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.2.0) | +| | | [rocAL](https://rocm.docs.amd.com/projects/rocAL/en/docs/6.2.0) | 2.0.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocAL/releases/tag/rocm-6.2.0) | +| | | [rocDecode](https://rocm.docs.amd.com/projects/rocDecode/en/docs/6.2.0) | 0.6.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocDecode/releases/tag/rocm-6.2.0) | +| | | [rocPyDecode](https://rocm.docs.amd.com/projects/rocPyDecode/en/docs/6.2.0) | 0.1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocPyDecode/releases/tag/rocm-6.2.0) | +| | | [RPP](https://rocm.docs.amd.com/projects/rpp/en/docs/6.2.0) | 1.5.0 ⇒ [1.8.0](rpp-1-8-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rpp/releases/tag/rocm-6.2.0) | +| | **Communication** | [rccl](https://rocm.docs.amd.com/projects/rccl/en/docs/6.2.0) | 2.18.6 ⇒ [2.20.5](rccl-2-20-5) | [{fab}`github fa-lg`](https://github.com/ROCm/rccl/releases/tag/rocm-6.2.0) | +| | **Math** | [hipBLAS](https://rocm.docs.amd.com/projects/hipBLAS/en/docs/6.2.0) | 2.1.0 ⇒ [2.2.0](hipblas-2-2-0) | [{fab}`github 
fa-lg`](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.2.0) | +| | | [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs/6.2.0) | 0.7.0 ⇒ [0.8.0](hipblaslt-0-8-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.2.0) | +| | | [hipFFT](https://rocm.docs.amd.com/projects/hipFFT/en/docs/6.2.0) | [1.0.14](hipfft-1-0-14) | [{fab}`github fa-lg`](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.2.0) | +| | | [hipfort](https://rocm.docs.amd.com/projects/hipfort/en/docs/6.2.0) | 0.4-0 | [{fab}`github fa-lg`](https://github.com/ROCm/hipfort/releases/tag/rocm-6.2.0) | +| | | [hipRAND](https://rocm.docs.amd.com/projects/hipRAND/en/docs/6.2.0) | 2.10.17 ⇒ [2.11.0](hiprand-2-11-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.2.0) | +| | | [hipSOLVER](https://rocm.docs.amd.com/projects/hipSOLVER/en/docs/6.2.0) | 2.1.1 ⇒ [2.2.0](hipsolver-2-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.2.0) | +| | | [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/docs/6.2.0) | 3.0.1 ⇒ [3.1.1](hipsparse-3-1-1) | [{fab}`github fa-lg`](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.2.0) | +| | | [hipSPARSELt](https://rocm.docs.amd.com/projects/hipSPARSELt/en/docs/6.2.0) | 0.2.0 ⇒ [0.2.1](hipsparselt-0-2-1) | [{fab}`github fa-lg`](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.2.0) | +| | | [rocALUTION](https://rocm.docs.amd.com/projects/rocALUTION/en/docs/6.2.0) | 3.1.1 ⇒ [3.2.0](rocalution-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.2.0) | +| | | [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/docs/6.2.0) | 4.1.0 ⇒ [4.2.0](rocblas-4-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.2.0) | +| | | [rocFFT](https://rocm.docs.amd.com/projects/rocFFT/en/docs/6.2.0) | 1.0.27 ⇒ [1.0.28](rocfft-1-0-28) | [{fab}`github fa-lg`](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.2.0) | +| 
| | [rocRAND](https://rocm.docs.amd.com/projects/rocRAND/en/docs/6.2.0) | 3.0.0 ⇒ [3.1.0](rocrand-3-1-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.2.0) | +| | | [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/docs/6.2.0) | 3.25.0 ⇒ [3.26.0](rocsolver-3-26-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.2.0) | +| | | [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/docs/6.2.0) | 3.1.1 ⇒ [3.2.0](rocsparse-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.2.0) | +| | | [rocWMMA](https://rocm.docs.amd.com/projects/rocWMMA/en/docs/6.2.0) | 1.4.0 ⇒ [1.5.0](rocwmma-1-5-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.2.0) | +| | | [Tensile](https://rocm.docs.amd.com/projects/tensile/en/docs/6.2.0) | 4.40.0 ⇒ [4.41.0](tensile-4-41-0) | [{fab}`github fa-lg`](https://github.com/ROCm/tensile/releases/tag/rocm-6.2.0) | +| | **Primitives** | [hipCUB](https://rocm.docs.amd.com/projects/hipCUB/en/docs/6.2.0) | 3.1.0 ⇒ [3.2.0](hipcub-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.2.0) | +| | | [hipTensor](https://rocm.docs.amd.com/projects/hipTensor/en/docs/6.2.0) | 1.2.0 ⇒ [1.3.0](hiptensor-1-3-0) | [{fab}`github fa-lg`](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.2.0) | +| | | [rocPRIM](https://rocm.docs.amd.com/projects/rocPRIM/en/docs/6.2.0) | 3.1.0 ⇒ [3.2.0](rocprim-3-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.2.0) | +| | | [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/docs/6.2.0) | 3.0.0 ⇒ [3.1.0](rocthrust-3-1-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.2.0) | +| **Tools** | **Development** | [HIPIFY](https://rocm.docs.amd.com/projects/HIPIFY/docs/6.2.0) | 17.0.0 ⇒ [18.0.0](hipify-18-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/HIPIFY/releases/tag/rocm-6.2.0) | +| | | 
[ROCdbgapi](https://rocm.docs.amd.com/projects/ROCdbgapi/en/docs/6.2.0) | 0.71.0 ⇒ [0.76.0](rocdbgapi-0-76-0) | [{fab}`github fa-lg`](https://github.com/ROCm/ROCdbgapi/releases/tag/rocm-6.2.0) | +| | | [ROCm CMake](https://rocm.docs.amd.com/projects/rocm-cmake/en/docs/6.2.0) | 0.12.0 ⇒ [0.13.0](rocm-cmake-0-13-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.2.0) | +| | | [ROCm Debugger (ROCgdb)](https://rocm.docs.amd.com/projects/rocm-cmake/en/docs/6.2.0) | 13 ⇒ [15](rocgdb-15) | [{fab}`github fa-lg`](https://github.com/ROCm/ROCgdb/releases/tag/rocm-6.2.0) | +| | | [ROCr Debug Agent](https://rocm.docs.amd.com/projects/rocr_debug_agent/en/docs/6.2.0) | 2.0.3 | [{fab}`github fa-lg`](https://github.com/ROCm/rocr_debug_agent/releases/tag/rocm-6.2.0) | +| | **Performance** | [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/docs/6.2.0) | 2.0.1 | [{fab}`github fa-lg`](https://github.com/ROCm/omniperf/releases/tag/rocm-6.2.0) | +| | | [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/docs/6.2.0) | 1.11.2 | [{fab}`github fa-lg`](https://github.com/ROCm/omnitrace/releases/tag/rocm-6.2.0) | +| | | [ROCm Bandwidth Test](https://rocm.docs.amd.com/projects/rocm_bandwidth_test/en/docs/6.2.0) | 1.4.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocm_bandwidth_test/releases/tag/rocm-6.2.0) | +| | | [ROCProfiler](https://rocm.docs.amd.com/projects/ROCProfiler/en/docs/6.2.0) | 2.0.0 ⇒ [2.0.0](rocprofiler-2-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rocm_bandwidth_test/releases/tag/rocm-6.2.0) | +| | | [ROCProfiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs/6.2.0) | 0.4.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocm_bandwidth_test/releases/tag/rocm-6.2.0) | +| | | [ROCTracer](https://rocm.docs.amd.com/projects/ROCTracer/en/docs/6.2.0) | 4.1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rocm_bandwidth_test/releases/tag/rocm-6.2.0) | +| | **System** | [AMD 
SMI](https://rocm.docs.amd.com/projects/amdsmi/en/docs/6.2.0) | 24.5.2 ⇒ [24.6.1](amd-smi-24-6-1) | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) | +| | | [rocminfo](https://rocm.docs.amd.com/projects/rdc/en/docs/6.2.0) | 1.0.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) | +| | | [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/docs/6.2.0) | 0.3.0 ⇒ [1.0.0](rocm-data-center-tool-1-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) | +| | | [ROCm SMI](https://rocm.docs.amd.com/projects/rdc/en/docs/6.2.0) | 7.2.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) | +| | | [ROCm Validation Suite](https://rocm.docs.amd.com/projects/rdc/en/docs/6.2.0) | 1.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) | +| | | [TransferBench](https://rocm.docs.amd.com/projects/rdc/en/docs/6.2.0) | 1.5.0 | [{fab}`github fa-lg`](https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0) | +| | **Compilers** | [hipCC](https://rocm.docs.amd.com/projects/hipCC/en/docs/6.2.0) | 1.0.0 ⇒ [1.1.1](hipcc-1-1-1) | [{fab}`github fa-lg`](https://github.com/ROCm/llvm-project/releases/tag/rocm-6.2.0) | +| | | [llvm-project](https://rocm.docs.amd.com/projects/llvm-project/en/docs/6.2.0) | 17.0.0 ⇒ [18.0.0](llvm-project-18-0-0) | [{fab}`github fa-lg`](https://github.com/ROCm/llvm-project/releases/tag/rocm-6.2.0) | +| **Runtimes** | | [HIP](https://rocm.docs.amd.com/projects/HIP/en/docs/6.2.0) | 6.1 ⇒ [6.2](hip-6-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/HIP/releases/tag/rocm-6.2.0) | +| | | [ROCr Runtime](https://rocm.docs.amd.com/projects/ROCr-Runtime/en/docs/6.2.0) | 6.1 ⇒ [6.2](hip-6-2-0) | [{fab}`github fa-lg`](https://github.com/ROCm/ROCR-Runtime/releases/tag/rocm-6.2.0) | diff --git a/tools/autotag/templates/extra_components/6.2.0.md b/tools/autotag/templates/extra_components/6.2.0.md new file mode 100644 index 
000000000..c5d1467e9 --- /dev/null +++ b/tools/autotag/templates/extra_components/6.2.0.md @@ -0,0 +1,1307 @@ + +## Detailed component changes + +The following sections briefly describe key changes to ROCm components. + +### **AMD SMI** (24.6.2) + +#### Changes + +- Added the following functionality: + - `amd-smi dmon` is now available as an alias to `amd-smi monitor`. + - An optional process table under `amd-smi monitor -q`. + - Handling to detect VMs with passthrough configurations in the CLI tool. + - Process Isolation and Clear SRAM functionality to the CLI tool for VMs. + - A Ring Hang event. +- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`. +- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`. + +#### Removals + +- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported. +- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`. + +#### Optimizations + +- Updated CLI error strings to specify the invalid device type queried. +- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest. + - Added `amdsmi.h` and `amdsmi_interface.py`. + - `amdsmi_clk_info_t` struct + - Added `AMDSMI` prefix to multiple structures. +- Updated `dpm_policy` references to `soc_pstate`. +- Updated `amdsmi_get_gpu_board_info()` `product_name` to fall back to the `pciids` file. +- Updated `amdsmi_get_gpu_board_info()` to use larger structure sizes for `amdsmi_board_info_t`. +- Updated CLI voltage curve command output. + +#### Resolved issues + +- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format. +- `amdsmi_get_gpu_board_info()` no longer returns junk character strings. +- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`. +- Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool. 
When running + `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any + other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either + clearing the mutex in `/dev/shm` or rebooting the machine. + +#### Known issues + +- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work. + They will be supported in a future release. + +```{note} +See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md) +on GitHub for more information. +``` + +### **Composable Kernel** (1.1.0) + +#### Changes + +- Added support for: + - Permute scale for any dimension (#1198). + - Combined elementwise op (#1217). + - Multi D in grouped convolution backward weight (#1280). + - K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280). + - Large batch in grouped convolution forward (#1332). +- Added `CK_TILE` layernorm example (#1339). +- `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as the ROCm backend. + +#### Optimizations + +- Added support for universal GEMM in grouped convolution forward (#1320). +- Optimizations for low M and N in grouped convolution backward weight (#1303). +- Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel. +- `FP8` GEMM performance optimization and tuning (#1384). +- Added FlashAttention backward pass performance optimization (#1397). + +### **HIP** (6.2.0) + +#### Changes + +- Added the `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()` and `__any_sync()`. These take + a 64-bit integer as an explicit mask argument. + - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled + unconditionally in a future HIP release. + +- Added new HIP APIs: + - `hipGetProcAddress` returns the pointer to the driver function corresponding to the defined driver function symbol. 
+ - `hipGetFuncBySymbol` returns the pointer to the device entry function that matches entry function `symbolPtr`. + - `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph. + - `hipGraphInstantiateWithParams` creates an executable graph from a graph. + +- Added support for a new `integrated` flag in the device properties. + + - The `integrated` flag is added to the struct `hipDeviceProp_t`. On an integrated APU system, the runtime driver + detects and sets this flag to `1`, in which case the API `hipDeviceGetAttribute` returns the value `1` for the + `hipDeviceAttributeIntegrated` attribute (of enum `hipDeviceAttribute_t`), indicating an integrated GPU device. + +- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include `. + +- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`. + +#### Resolved issues + +- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime. +- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`. + +#### Upcoming changes + +- The `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release. + +### **hipBLAS** (2.2.0) + +#### Changes + +* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments. +* Added a new ILP64 API for level 1 `_ex` functions. + +* The `install.sh` script now invokes the `rmake.py` script. Made various other improvements to the build scripts. +* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages + `rocblas-dev` and `rocsolver-dev`. +* Updated Linux AOCL dependency to release 4.2 `gcc` build. +* Updated Windows `vcpkg` dependencies to release 2024.02.14. + +### **hipBLASLt** (0.8.0) + +#### Changes + +* Added extension APIs: + * `hipblasltExtAMaxWithScale`. 
+ * `GemmTuning` extension parameter to set `wgm` by user. +* Added support for: + * `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype. + * `FP8`/`BF8` input, `FP32/FP16/BF16/F8/BF8` output (gfx94x platform only). + * `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA. +* Added support for gfx110x. + +#### Optimizations + +* Improved library loading time. + +### **HIPCC** (1.1.1) + +#### Changes + +* Split `hipcc` package into 2 packages for different hardware platforms. + +* Cleaned up references to environment variables. + +* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by + default, instead of their Perl counterparts. + +* Enabled function calls. + +* Added support for generating packages for ROCm stack targeting static libraries. + +#### Resolved issues + +* Implemented numerous bug fixes and quality improvements. + +### **hipCUB** (3.2.0) + +#### Changes + +* Added `DeviceCopy` function for parity with CUB. +* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as + the new optional template argument for `WarpExchange`. + * The potential values for the enum are `WARP_EXCHANGE_SMEM` and + `WARP_EXCHANGE_SHUFFLE`. + * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while + `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations. + * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, + but the `ItemsPerThread` must be a divisor of `WarpSize`. +* Added `tuple.hpp` which defines templates `hipcub::tuple`, + `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`. +* Added new overloaded member functions to `BlockRadixSort` and + `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type + (`key_type`) can be sorted via these overloads, if an appropriate decomposer + is passed. 
The decomposer has to implement `operator(const key_type&)` which + returns a `hipcub::tuple` of references pointing to members of `key_type`. + +* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of + HIP graphs, with several exceptions: + * `CachingDeviceAllocator` + * `GridBarrierLifetime` + * `DeviceSegmentedRadixSort` + * `DeviceRunLengthEncode` + Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs. + +#### Removals + +* Deprecated `debug_synchronous` in hipCUB-2.13.2, and it no longer has any effect. With this release, passing `debug_synchronous` + to the device functions results in a deprecation warning both at runtime and at compile time. + * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time + by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition. + * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition. + +#### Resolved issues + +* Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. + It now derives the accumulator type as the result of the binary operator. + +### **hipFFT** (1.0.15) + +#### Resolved issues + +* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers. +* Prevented C++ exceptions leaking from public API functions. +* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs. + + +### **HIPIFY** (18.0.0) + +#### Changes + +- Added support for: + - NVIDIA CUDA 12.4.1 + - cuDNN 9.1.1 + - LLVM 18.1.6 +- Added full hipBLASLt support. 
+ +#### Resolved issues + +- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object; + affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`. + +### **hipRAND** (2.11.0) + +#### Changes + +* Added support for setting generator output ordering in the C and C++ APIs. +* `hiprandCreateGeneratorHost` dispatches to the host generator in the rocRAND backend instead of returning with + `HIPRAND_STATUS_NOT_IMPLEMENTED`. +* Added options to create: + * A host generator to the Fortran wrapper. + * A host generator to the Python wrapper. +* For internal testing with HMM, previous versions used the environment variable `ROCRAND_USE_HMM`. + The environment variable is now named `HIPRAND_USE_HMM`. +* Static library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking. +* Device API documentation is improved in this version. + +#### Removals + +* Removed the option to build hipRAND as a submodule to rocRAND. +* Removed references to, and workarounds for, the deprecated `hcc`. +* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`. + Use `ROCRAND_PATH` instead. + +#### Resolved issues + +* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`. + +### **hipSOLVER** (2.2.0) + +#### Changes + +- Added compatibility-only functions: + - `auxiliary` + - `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions` + - `getrf` + - `hipsolverDnXgetrf_bufferSize` + - `hipsolverDnXgetrf` + - `getrs` + - `hipsolverDnXgetrs` +- Added support for building on Ubuntu 24.04 and CBL-Mariner. +- Added `hip::host` to `roc::hipsolver` usage requirements.
+- Added functions + - `syevdx`/`heevdx` + - `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize` + - `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx` + - `sygvdx`/`hegvdx` + - `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize` + - `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx` + +- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU. +- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`. + +#### Removals + +- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients. + +### **hipSPARSE** (3.1.1) + +#### Changes + +* Added the missing `hipsparseCscGet()` routine. + +* All internal hipSPARSE functions now exist inside a namespace. +* Match deprecations found in cuSPARSE 12.x.x when using cuSPARSE backend. +* Improved the user manual and contribution guidelines. + +#### Resolved issues + +* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior. + +#### Known Issues + +* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API + and this extra external buffer parameter will be removed in a future release. For now this extra parameter can be + ignored and passed a `nullptr` as it is unused internally by `hipsparseSpSM_solve()`. + +### **hipSPARSELt** (0.2.1) + +#### Optimizations + +* Refined test cases. + +### **hipTensor** (1.3.0) + +#### Changes + +* Added support for: + * Tensor permutation of ranks of 2, 3, 4, 5, and 6 + * Tensor contraction of M6N6K6: M, N, K up to rank 6 +* Added tests for: + * Tensor permutation of ranks of 2, 3, 4, 5, and 6 + * Tensor contraction of M6N6K6: M, N, K up to rank 6 + * YAML parsing to support sequential parameters ordering. 
+* The `amd-llvm-devel` package is now preferred over the system LLVM library. +* Changed the preferred compilers to `CC=amdclang` `CXX=amdclang++`. +* Updated actor-critic selection for new contraction kernel additions. +* Updated installation, programmer's guide, and API reference documentation. + +#### Resolved issues + +* Fixed LLVM parsing crash. +* Fixed memory consumption issue in complex kernels. +* Implemented a workaround for a compiler crash during debug builds. +* Allowed random modes ordering for tensor contractions. + +### **llvm-project** (18.0.0) + +#### Changes + +* LLVM IR + + * The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address + spaces. + + * Added `llvm.exp10` intrinsic. + +* LLVM infrastructure + + * The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`. + +* TableGen + + * Added constructs for debugging TableGen files: + + * `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793). + + * `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716). + +* AArch64 backend + + * Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs. + +* AMDGPU backend + + * `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access. + + * Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics. + + * Implemented `llvm.get.rounding`. + +* ARM backend + + * Added support for Cortex-M52 CPUs. + + * Added execute-only support for Armv6-M. + +* RISC-V backend + + * The `Zfa` extension version was upgraded to 1.0 and is no longer experimental. + + * The `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental. + + * Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`.
See + [Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification. + + * Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64. + + * The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`, + `Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension version was upgraded to 1.0 and is no longer experimental. However, + the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions, + `-menable-experimental-extensions` needs to be passed to Clang. + + * `-mcpu=sifive-p450` was added. + + * CodeGen of `RV32E` and `RV64E` is supported experimentally. + + * CodeGen of `ilp32e` and `lp64e` is supported experimentally. + +* X86 backend + + * Added support for the RDMSRLIST and WRMSRLIST instructions. + + * Added support for the WRMSRNS instruction. + + * Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction. + + * Support ISA of CMPCCXADD. + + * Support ISA of AVX-IFMA. + + * Support ISA of AVX-VNNI-INT8. + + * Support ISA of AVX-NE-CONVERT. + + * `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported. + + * `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported. + + * `__builtin_unpredictable` (unpredictable metadata in LLVM IR), is handled by X86 Backend. X86CmovConversion pass now + respects this builtin and does not convert CMOVs to branches. + + * Add support for the PBNDKB instruction. + + * Support ISA of SHA512. + + * Support ISA of SM3. + + * Support ISA of SM4. + + * Support ISA of AVX-VNNI-INT16. + + * `-mcpu=graniterapids-d` is now supported. + + * The `i128` type now matches GCC and clang’s `__int128` type. 
This mainly benefits external projects such as Rust + which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type + matched and called into `libgcc` helper functions. + + * Support ISA of USER_MSR. + + * Support ISA of AVX10.1-256 and AVX10.1-512. + + * `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported. + + * `-mapxf` is supported. + + * Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow + 32-bit relocations or require 64-bit relocations to the global variable. + + * The medium code model’s code generation was audited to be more similar to the small code model where possible. + +* C API + + * Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions. + + * Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an + alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target + ABI) and provides default values for unspecified settings. + + * Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and + `LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions. 
+ + * Added the following functions for manipulating operand bundles, as well as building call and invoke instructions + that use operand bundles: + + * `LLVMBuildCallWithOperandBundles` + + * `LLVMBuildInvokeWithOperandBundles` + + * `LLVMCreateOperandBundle` + + * `LLVMDisposeOperandBundle` + + * `LLVMGetNumOperandBundles` + + * `LLVMGetOperandBundleAtIndex` + + * `LLVMGetNumOperandBundleArgs` + + * `LLVMGetOperandBundleArgAtIndex` + + * `LLVMGetOperandBundleTag` + + * Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an + instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flags. + +* CodeGen infrastructure + + * A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e. + `-debug-only=isel-dump`). This new debug type can be filtered by function names using + `-filter-print-funcs=`, the same flag used to filter IR dumps after each Pass. Note that the + existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single + function regardless of `-filter-print-funcs`’s values. + +* Metadata info + + * Added a new loop metadata `!{!"llvm.loop.align", i32 64}`. + +* LLVM tools + + * `llvm-symbolizer` now treats invalid input as an address for which source information is not found. + + * `llvm-readelf` now supports `--extra-sym-info` (`-X`) to display extra information (section name) when showing + symbols. + + * `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each + object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now + `{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each + object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers + accordingly.
+ + * `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`. + + * `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers. + + * `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names. + + * `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only. + +* LLDB + + * `SBType::FindDirectNestedType` function is added. It’s useful for formatters to quickly find a directly nested type + when it’s known where to search for it, avoiding a more expensive global search via `SBTarget::FindFirstType`. + + * Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying + functionality remains unchanged. + + * The `mte_ctrl` register can now be read from AArch64 Linux core files. + + * LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2 + (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation. + + * LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism + of specifying DEBUGINFOD servers in the `DEBUGINFOD_URLS` environment variable is used by default. In addition, users + can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`, + overriding or adding to the environment variable. + +* Sanitizers + + * HWASan now defaults to detecting use-after-scope bugs.
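
As a small illustration of the `__builtin_unpredictable` item in the X86 backend notes above, the following sketch wraps the builtin in a macro so that the same source still builds with compilers that do not provide it. The macro name `UNPREDICTABLE` and the function `select_value` are hypothetical names chosen for this example; they are not part of LLVM or ROCm. Whether a conditional move is actually emitted depends on the target and the optimization level.

```cpp
// Hypothetical helper macro (an assumption for this sketch): use Clang's
// __builtin_unpredictable when the compiler provides it, otherwise pass the
// condition through unchanged.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_unpredictable)
#    define UNPREDICTABLE(cond) __builtin_unpredictable(cond)
#  endif
#endif
#ifndef UNPREDICTABLE
#  define UNPREDICTABLE(cond) (cond)
#endif

// Hypothetical example function: selects one of two values based on a
// condition that the programmer expects to be hard to predict. The builtin
// only attaches a hint; the returned value is the same with or without it.
int select_value(bool take_first, int first, int second) {
    return UNPREDICTABLE(take_first) ? first : second;
}
```

The hint does not change program semantics, so `select_value(true, 1, 2)` yields `1` and `select_value(false, 1, 2)` yields `2` regardless of whether the builtin is available.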
+ +#### Removals + +* LLVM IR + + * The constant expression variants of the following instructions have been removed: + + * `and` + + * `or` + + * `lshr` + + * `ashr` + + * `zext` + + * `sext` + + * `fptrunc` + + * `fpext` + + * `fptoui` + + * `fptosi` + + * `uitofp` + + * `sitofp` + +* RISC-V backend + + * XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and + instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with + SiFive to define and document real extension names for individual CSRs and instructions. + +* Python bindings + + * The Python bindings have been removed. + +* C API + + * The following functions for creating constant expressions have been removed, because the underlying constant + expressions are no longer supported. Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which + will constant fold the operands if possible and create an instruction otherwise: + + * `LLVMConstAnd` + + * `LLVMConstOr` + + * `LLVMConstLShr` + + * `LLVMConstAShr` + + * `LLVMConstZExt` + + * `LLVMConstSExt` + + * `LLVMConstZExtOrBitCast` + + * `LLVMConstSExtOrBitCast` + + * `LLVMConstIntCast` + + * `LLVMConstFPTrunc` + + * `LLVMConstFPExt` + + * `LLVMConstFPToUI` + + * `LLVMConstFPToSI` + + * `LLVMConstUIToFP` + + * `LLVMConstSIToFP` + + * `LLVMConstFPCast` + +* CodeGen infrastructure + + * `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets + should use backwards frame index elimination instead. + + * `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging + instead, which is preferred because it does not depend on accurate kill flags. + +* LLDB + + * `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable. 
+ + * Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value + indicating failure. + +#### Resolved issues + +* AArch64 backend + + * Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto + options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for + this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the + exception of crypto. + +* Windows target + + * The LLVM filesystem class `UniqueID` and function `equivalent()` no longer determine that distinct path + names for the same hard linked file actually are equal. This is an intentional tradeoff in a bug fix, where the bug + used to cause distinct files to be considered equivalent on some file systems. This change fixed the issues + [#61401](https://github.com/llvm/llvm-project/issues/61401) and [#22079](https://github.com/llvm/llvm-project/issues/22079). + +#### Known issues + +The compiler may incorrectly compile a program that uses the +``__shfl(var, srcLane, width)`` function when one of the parameters to +the function is undefined along some path to the function. For most functions, +uninitialized inputs cause undefined behavior. + +```{note} +The ``-Wall`` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path. +``` + +As a workaround, initialize the parameters to ``__shfl``. For example: + +```{code-block} cpp +unsigned long istring = 0; // Initialize the input to __shfl +return __shfl(istring, 0, 64); +``` + +### **MIGraphX** (2.10.0) + +#### Changes + +- Added support for ONNX Runtime MIGraphX EP on Windows. +- Added `FP8` Python API. +- Added examples for SD 2.1 and SDXL. +- Added support for BERT to Dynamic Batch. +- Added a `--test` flag in `migraphx-driver` to validate the installation.
+- Added support for ONNX Operator: Einsum. +- Added `uint8` support in ONNX Operators. +- Added Split-k kernel configurations for performance improvements. +- Added fusion for group convolutions. +- Added rocMLIR conv3d support. +- Added rocgdb to the Dockerfile. +- Changed default location of libraries with release specific ABI changes. +- Reorganized documentation in GitHub. + +#### Removals + +- Removed the `--model` flag from `migraphx-driver`. + +#### Optimizations + +- Improved ONNX Model Zoo coverage. +- Reorganized `memcpys` with ONNX Runtime to improve performance. +- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast. +- Improved MLIR kernel selection for multibroadcasted GEMMs. +- Improved details of the perf report. +- Enabled MLIR by default for GEMMs with small K. +- Allowed specifying dot or convolution fusion for MLIR with an environment flag. +- Improved performance on small reductions by doing multiple reductions per wavefront. +- Added algebraic simplifications for mul-add-dot sequences of operations involving constants. +- Used MLIR attention kernels in more cases. +- Enabled MIOpen and CK fusions for MI300 gfx architectures. +- Added support for QDQ quantization patterns from Brevitas, which have explicit cast/convert nodes before and after QDQ pairs. +- Added fusion of "contiguous + pointwise" and "layout + pointwise" operations, which may result in performance gains in certain cases. +- Added fusion for "pointwise + layout" and "pointwise + contiguous" operations, which may result in performance gains when using NHWC layout. +- Added fusion for the "pointwise + concat" operation, which may help performance in certain cases. +- Fixed a bug in "concat + pointwise" fusion where the output shape memory layout wasn't maintained. +- Simplified the "slice + concat" pattern in SDXL UNet. +- Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if the zero point values are zero. +- Improved inference performance by fusing Reduce to Broadcast.
+- Added additional information when printing the perf report. +- Improved scalar fusions when not all strides are 0. +- Added support for multi outputs in pointwise ops. +- Improved reduction fusion with reshape operators. +- The quantized output is now used when an operator is used again. +- Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all hardware. + +#### Resolved issues + +- Fixed a Super Resolution model verification failure with `FP16`. +- Fixed confusing messages by suppressing them when compiling the model. +- Fixed an issue with the mod operator with `int8` and `int32` inputs. +- Fixed an issue where too many threads were spawned for constant propagation when parallel STL is not enabled. +- Fixed a bug when running `migraphx-driver` with the `--run 1` option. +- Fixed Layernorm accuracy by performing calculations in `FP32`. +- Updated the Docker generator script to ROCm 6.1 to point at Jammy. +- Fixed a floating point exception for `dim (-1)` in the reshape operator. +- Fixed an issue with `int8` accuracy and models which were failing due to requiring a fourth bias input. +- Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix. +- Fixed the order of operations for `int8` quantization, which was causing inaccuracies and slowdowns. +- Fixed an issue during compilation caused by the incorrect constructor being used at compile time. + Removed list initializer of `prefix_scan_sum` which was causing issues during compilation. +- Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control the number of threads used for parallel compilation. + +### **MIOpen** (3.2.0) + +#### Changes + +- Added: + - [Conv] bilinear (alpha beta) solvers. + - [Conv] enable bf16 for ck-based solvers. + - [Conv] Add split_k tuning to 2d wrw ck-based solver. + - [MHA] graph API fp8 fwd. + - [RNN] multi-stream as default solution. +- Added TunaNetv2.0 for MI300.
+- Added Adam and AMP Adam optimizers. + +#### Resolved issues + +- Fixed a memory access fault caused by `GemmBwdRest`. +- Fixed context configuration in `GetWorkSpaceSize`. +- Added fixes to support huge tensors. + +#### Optimizations + +- Find: improved precision of benchmarking. + +### **MIVisionX** (3.0.0) + +#### Changes + +- Added support for: + - Advanced GPUs + - PreEmphasis Filter augmentation in openVX extensions + - Spectrogram augmentation in openVX extensions + - Downmix and ToDecibels augmentations in openVX extensions + - Resample augmentation and Operator overloading nodes in openVX extensions + - NonSilentRegion and Slice augmentations in openVX extensions + - Mel-Filter bank and Normalize augmentations in openVX extensions + +#### Removals + +- Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL). + +#### Resolved issues + +- Fixed issues with dependencies. + +#### Known issues + +- MIVisionX package install requires manual prerequisites installation. + +### **Omniperf** (2.0.1) + +#### Known issues + +- An error occurs when running Omniperf with an application that takes command line arguments. As a workaround, create an + intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This + issue will be fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/omniperf/issues/347). + +- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: + "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, set the + environment variable with `export ROCPROF=rocprofv2`.
+ +- Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message: + + "*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.* + + *[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.* + + *Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.* + + *See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*" + + As a workaround, install these Python requirements manually: `pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`. + +### **OpenMP** (17.0.0) + +#### Changes + +- Added basic experimental support for ``libc`` functions on the GPU via the + LLVM C Library for GPUs. +- Added minimal support for calling host functions from the device using the + `libc` interface. +- Added vendor agnostic OMPT callback support for OpenMP-based device offload. + +#### Removals + +- Removed the "old" device plugins along with support for the `remote` and + `ve` plugins. + +#### Resolved issues + +- Fixed the implementation of `omp_get_wtime` for AMDGPU targets. + +### **RCCL** (2.20.5) + +#### Changes + +- Added support for `fp8` and `rccl_bfloat8`. +- Added support for using HIP contiguous memory. +- Added ROC-TX for host-side profiling. +- Added new rome model. +- Added `fp16` and `fp8` cases to unit tests. +- Added a new unit test for main kernel stack size. +- Added the new `-n` option for `topo_expl` to override the number of nodes. +- Improved debug messages of memory allocations. +- Enabled static build. +- Enabled compatibility with: + - NCCL 2.20.5. + - NCCL 2.19.4. +- Tuned performance for some collective operations on MI300. +- Enabled NVTX code in RCCL. +- Replaced `rccl_bfloat16` with `hip_bfloat16`. +- NPKit updates: + - Warm-up iteration removal is no longer enabled by default; it is now opt-in. + - Doubled the size of buffers to accommodate more channels.
+- Modified rings to be rail-optimized topology friendly. + +#### Resolved issues + +- Fixed a bug when configuring RCCL for only LL128 protocol. +- Fixed scratch memory allocation after API change for MSCCL. + +### **rocAL** (1.0.0) + +#### Changes + +- Added tests and samples. + +#### Removals + +- Removed CuPy from `setup.py`. + + +#### Optimizations + +- Added setup and install updates. + +#### Resolved issues + +- Minor bug fixes. + +### **rocALUTION** (3.2.0) + +#### Changes + +* Added new file I/O based on rocSPARSE I/O format. +* Added `GetConvergenceHistory` for ItILU0 preconditioner. + +#### Removals + +* Deprecated the following: + * `LocalMatrix::ReadFileCSR` + * `LocalMatrix::WriteFileCSR` + * `GlobalMatrix::ReadFileCSR` + * `GlobalMatrix::WriteFileCSR` + +### **rocBLAS** (4.2.0) + +#### Changes + +* Added an additional ILP64 API for Level 2 functions and Level 3 `trsm` for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments. +* Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`. +* Added Benchmark class for common timing code. +* Added the environment variable `ROCBLAS_DEFAULT_ATOMICS_MODE` to set the default atomics mode during creation of `rocblas_handle`. +* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`. + +* Updated Linux AOCL dependency to release 4.2 gcc build. +* Updated Windows vcpkg dependencies to release 2024.02.14. +* Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40. + +#### Optimizations + +* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator. + +#### Removals + +* Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next + major release of rocBLAS.
Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage. + +### **ROCdbgapi** (0.75.0) + +#### Removals +- Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`. + The old names are still accessible but deprecated. + +### **rocDecode** (0.6.0) + +#### Changes + +- Added full H.264 support and bug fixes. + +### **rocFFT** (1.0.28) + +#### Changes + +* Randomly generated accuracy tests are now disabled by default. They can be enabled using + the `--nrand` option (which defaults to 0). + +#### Optimizations + +* Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks + are transformed locally, with global transposes to make remaining dimensions contiguous. + +### **rocm-cmake** (0.13.0) + +#### Changes + +- `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds. + - Package names are no longer pulled from `CPACK__PACKAGE_NAME`. + - Runtime packages will no longer be generated for static builds. + +### **ROCm Data Center Tool** (1.0.0) + +#### Changes + +- Added ROCProfiler `dmon` metrics. +- Added new ECC metrics. +- Added ROCm Validation Suite diagnostic command. +- Fully migrated to AMD SMI. + +#### Removals + +- Removed RASLIB dependency and blobs. +- Removed `rocm_smi_lib` dependency due to migration to AMD SMI. + +### **ROCm Debugger (ROCgdb)** (14.2) + +#### Changes + +- Introduce the coremerge utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile. +- Added support for generating and opening core files for heterogeneous processes. + +### **ROCm SMI** (7.3.0) + +#### Changes + +- Added Partition ID API (`rsmi_dev_partition_id_get(..)`). + +#### Resolved issues + +- Fixed Partition ID CLI output. 
+ +```{note} +See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md) +on GitHub for more information. +``` + +### **ROCm Validation Suite** (1.0.0) + +#### Changes + +* Added stress tests: + + * IET (power) stress test for MI300A. + + * IET (power transition) test for MI300X. + +* Added support: + + * GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output. + + * Trigonometric float and random integer matrix data initialization support. + +* Updated GST performance benchmark test for better numbers. + +### **rocPRIM** (3.2.0) + +#### Changes + +* Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp. +* The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter. + * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan). + This is the same behaviour as before this change. +* Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element. +* Added new public APIs for deriving the resulting type on device-only functions: + * `rocprim::invoke_result` + * `rocprim::invoke_result_t` + * `rocprim::invoke_result_binary_op` + * `rocprim::invoke_result_binary_op_t` +* Added the new `rocprim::batch_copy` function. It is similar to `rocprim::batch_memcpy`, but copies by element rather than with `memcpy`. +* Added more test cases, to better cover supported data types. +* Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`. + To sort keys of a user-defined type, a decomposer functor should be passed.
The decomposer should produce a `rocprim::tuple` + of references to arithmetic types from the key. +* Added `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate. + It iterates over proxies that hold the references to the underlying values, but only allows reading and writing if the predicate is `true`. + It can be instantiated with: + * `rocprim::make_predicate_iterator` + * `rocprim::make_mask_iterator` +* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4; it can be a number between 0 and 32. +* Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed. +* Updated some tests to work with supported data types. + +#### Optimizations + +* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`. +* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warp size is equal to the items per thread. + +#### Resolved issues + +* Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is + larger than the logical warp size. The test suite has been updated with such cases. +* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`. +* Fixed incorrect output type used in `device_adjacent_difference`. +* Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types. +* Fixed an issue with `device_adjacent_difference`. It now considers both the + input and the output type for selecting the appropriate kernel launch config.
+ Previously, only the input type was considered, which could result in compilation errors due to excessive shared memory usage. +* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`. +* Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64. + +#### Removals + +* Deprecated the internal header `detail/match_result_type.hpp`. +* Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`. +* Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name. + +### **ROCProfiler** (2.0.0) + +#### Removals + +- Removed `pcsampler` sample code due to deprecation from version 2. + +### **rocRAND** (3.1.0) + +#### Changes + +* Added `rocrand_create_generator_host`. + * The following generators are supported: + * `ROCRAND_RNG_PSEUDO_MRG31K3P` + * `ROCRAND_RNG_PSEUDO_MRG32K3A` + * `ROCRAND_RNG_PSEUDO_PHILOX4_32_10` + * `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20` + * `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20` + * `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20` + * `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20` + * `ROCRAND_RNG_PSEUDO_XORWOW` + * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32` + * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64` + * `ROCRAND_RNG_QUASI_SOBOL32` + * `ROCRAND_RNG_QUASI_SOBOL64` + * The host-side generators support multi-core processing. On Linux, this requires the TBB (Threading Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives). + * If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread. +* Added the option to create a host generator to the Python wrapper. +* Added the option to create a host generator to the Fortran wrapper. +* Added dynamic ordering.
This ordering is free to rearrange the produced numbers, + which can be specific to devices and distributions. It is implemented for: + * XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry +* Added support for using Clang as the host compiler for alternative platform compilation. +* C++ wrapper: + * Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`. + * Added an optional order parameter to the constructor of `mt19937_engine`. +* Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator: + * `rocrand_normal2` + * `rocrand_normal_double2` + * `rocrand_log_normal2` + * `rocrand_log_normal_double2` +* Added `rocrand_create_generator_host_blocking`, which dispatches without stream semantics. +* Added a host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`. +* Added offset and skipahead functionality to the LFSR113 generator. +* Added dynamic ordering for architecture `gfx1102`. + +* For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few + things to be aware of: + - Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph. + - After the generator is created, you may call API functions to set its seed, offset, and order. + - After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph. + - A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behavior. + - API calls for the Poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs.
+ - For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp`. +* Building rocRAND now requires a C++17-capable compiler, as the internal library sources now require it. However, consuming rocRAND is still possible from C++11, as the public headers don't make use of the new features. +* Building rocRAND should be faster on machines with multiple CPU cores as the library has been + split into multiple compilation units. +* C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`. +* Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE`. +* Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking. + +#### Removals + +* Deprecated the following typedefs. Please use the unified `state_type` alias instead. + * `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state` + * `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state` + * `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state` + * `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state` +* Deprecated the following internal headers: + * `src/rng/distribution/distributions.hpp`. + * `src/rng/device_engines.hpp`. +* Removed references to and workarounds for deprecated hcc. +* Removed support for HIP-CPU. + +#### Known Issues + +- `SOBOL64` and `SCRAMBLED_SOBOL64` generate Poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release. + +### **ROCr Runtime** (1.14.0) + +#### Changes + +- Added a PC sampling feature (experimental). + +### **rocSOLVER** (3.26.0) + +#### Changes + +- Added 64-bit APIs for existing functions: + - GETF2_64 (with `batched` and `strided_batched` versions) + - GETRF_64 (with `batched` and `strided_batched` versions) + - GETRS_64 (with `batched` and `strided_batched` versions) +- Added gfx900 to the default build targets.
+- Added partial eigenvalue decomposition routines for symmetric/Hermitian matrices using Divide & Conquer and Bisection: + - SYEVDX (with `batched` and `strided_batched` versions) + - HEEVDX (with `batched` and `strided_batched` versions) +- Added partial generalized symmetric/Hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection: + - SYGVDX (with `batched` and `strided_batched` versions) + - HEGVDX (with `batched` and `strided_batched` versions) +- Renamed install script arguments of the form `*_dir` to `*-path`. Arguments of the form `*_dir` remain functional for + backwards compatibility. +- Functions working with arrays of size n - 1 can now accept null pointers when n = 1. + +#### Optimizations + +- Improved performance of Cholesky factorization. +- Improved performance of `splitlu` to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U. + +#### Resolved issues + +- Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues. + +### **rocSPARSE** (3.2.0) + +#### Changes + +* Added a new Merge-Path algorithm to SpMM, supporting CSR format. +* Added support for row order to SpSM. +* Added rocsparseio I/O functionality to the library. +* Added `rocsparse_set_identity_permutation`. + +* Adjusted rocSPARSE dependencies to related HIP packages. +* Binary size has been reduced. +* A namespace has been wrapped around internal rocSPARSE functions and kernels. +* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be `nullptr` if `nnz` is 0. +* gfx803 target has been removed from address sanitizer builds. + +#### Optimizations + +* SpMV adaptive and LRB algorithms have been further optimized on CSR format. +* Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format. +* Improved documentation and contribution guidelines.
+ +#### Resolved issues + +* Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`. + +### **rocThrust** (3.1.0) + +#### Changes + +* Added changes from upstream CCCL/thrust 2.2.0. + * Updated the contents of `system/hip` and `test` with the upstream changes. +* Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`. +* Updated to use `rocprim::device_adjacent_difference` for the `adjacent_difference` API call. +* Updated internal use of custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`. +* Updated `adjacent_difference` to make use of `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, it uses `rocprim::adjacent_difference_inplace`. + +#### Known Issues + +* `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device. +* Currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs. + +### **rocWMMA** (1.5.0) + +#### Changes + +* Added internal utilities for: + * Element-wise vector transforms. + * Cross-lane vector transforms. +* Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16. +* Added tests for new internal transforms. + +* Improved loading layouts by increasing vector width for fragments with `blockDim > 32`. +* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments. +* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary. +* Refactored entry-point of std library usage to improve hipRTC support. +* Updated installation, programmer's guide and API reference documentation. + +#### Resolved issues + +* Fixed the ordering of some header includes to improve portability.
+ +### **RPP** (1.8.0) + +#### Changes + +- Prerequisites - ROCm install requires only `--usecase=rocm`. +- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory. +- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP. +- RPP Tensor Audio support on HOST for Spectrogram. +- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version. +- RPP Tensor Audio support on HOST for Mel Filter Bank. +- RPP Tensor Normalize ND support on HOST and `HIP`. + +### **Tensile** (4.41.0) + +#### Changes + +- New tuning script to summarize rocBLAS log file +- New environment variable to test fixed grid size with Stream-K kernels +- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power +- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile +- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation)) +- Support NT flag for global load and store for gfx94x +- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload) +- Add option for rotating buffers timing with cache eviction +- Add predicate for arithmetic intensity +- Add DirectToVgpr + packing for f8/f16 + TLU cases +- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV +- Add test cases for DirectToVgpr + packing +- Add batch support for Stream-K kernels and new test cases +- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic +- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1) +- Support VectorWidthB (new parameter VectorWidthB) +- 
Support VectorWidth + non SourceSwap +- Add test cases for VectorWidthB, VectorWidth + non SourceSwap +- Add code owners file +- New environment variables to dynamically adjust number of CUs used in Stream-K +- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1)) +- Add xf32 option to rocblas-bench input creator + +- Update rocBLAS-bench-input-create script (added number of iterations based on performance, rotating buffer flag) +- Limit build threads based on CPUs/RAM available on system (for tests) +- Update required workspace size for Stream-K, skip kernel initialization when possible +- Use fallback libraries for archs without optimized logic +- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator) +- Remove OCL tests +- Disable HostLibraryTests +- Reduce extended test time by removing extra parameters in the test config files +- Disable InitAccVgprOpt for Stream-K +- Skip sgemm 64bit offset tests for gfx94x +- Skip DTV, DTL, LSU+MFMA tests for gfx908 +- Increase extended test timeout to 720 min +- Update xfail test (1sum tests only failing on gfx90a) +- Update lib logic converter script +- Test limiting CI threads for only gfx11 +- WGM-related kernargs are removed if they are not needed (WGM=-1,0,1) +- Clean up unused old code, mostly related to old client +- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it +- Update efficiency script for new architecture and xf32 datatype +- Re-enable negative values for WorkGroupMapping (asm kernel only) +- Disable HW monitor for aquavanjaram941 +- Pre-apply offsets for strided batch kernels +- Update tensile build with 16 threads + +#### Optimizations + +- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register
initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py). +- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype) +- Reversed MFMA order in inner loop for odd outer iteration +- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync) +- Increased maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU) + +#### Resolved issues + +- Fixed `WorkspaceCheck` implementation when used in rocBLAS. +- Fixed Stream-K partials cache behavior. +- Fixed `MasterSolutionLibrary` indexing for multiple architecture build. +- Fixed memory allocation failure with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size). +- Fixed BufferLoad=False with Stream-K. +- Fixed mismatch issue with `GlobalReadCoalesceGroup`. +- Fixed rocBLAS build failure on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]). +- Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2). +- Fixed inaccurate vgpr allocation for ClusterLocalRead. +- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2. +- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0). +- Fixed HostLibraryTests on gfx942 and gfx941. +- Fixed LLVM crash issue. +- Fixed for newer Windows vcpkg msgpack and vcpkg version package name. +- Fixed an error with DisableKernelPieces + 32bit ShadowLimit. +- Ignore asm cap check for kernel arg preload for ROCm 6.0 and older.
diff --git a/tools/autotag/templates/highlights/6.2.0.md b/tools/autotag/templates/highlights/6.2.0.md new file mode 100644 index 000000000..cdb355185 --- /dev/null +++ b/tools/autotag/templates/highlights/6.2.0.md @@ -0,0 +1,223 @@ + +The release notes provide a comprehensive summary of changes since the previous ROCm release. + +- [Release highlights](release-highlights) + +- [Operating system and hardware support changes](operating-system-and-hardware-support-changes) + +- [ROCm components versioning](rocm-components) + +- [Detailed component changes](detailed-component-changes) + +- [ROCm known issues](rocm-known-issues) + +- [ROCm upcoming changes](rocm-upcoming-changes) + +The [Compatibility matrix](https://rocm.docs.amd.com/en/latest/release/docs/6.2.0/compatibility/compatibility-matrix) +provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases. + +Release notes for previous ROCm releases are available in earlier versions of the documentation. +See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest/release/versions). + +## Release highlights + +This section introduces notable new features and improvements in ROCm 6.2. See the +[Detailed component changes](#detailed-component-changes) for individual component changes. + +### New components + +ROCm 6.2.0 introduces the following new components to the ROCm software stack. + +- **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads + running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line + or a GUI dashboard. For more information, see + [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest). + +- **Omnitrace** -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU. 
+ It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining + which function and line number are executing. For more information, see + [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/latest). + +- **rocPyDecode** -- A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries, + enabling function calling and data passing between the two languages. The `rocpydecode.so` library, a wrapper, uses + rocDecode APIs written primarily in C/C++ within Python. For more information, see + [rocPyDecode](https://rocm.docs.amd.com/projects/rocpydecode/en/latest). + +- **ROCprofiler-SDK** -- ROCprofiler-SDK is a profiling and tracing library for HIP and ROCm applications on AMD ROCm software + used to identify application performance bottlenecks and optimize their performance. The new APIs add restrictions for more + efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use. + ROCprofiler-SDK also provides a tool library to help you write your tool implementations. `rocprofv3` uses this tool library + to profile and trace applications for performance bottlenecks. Examples include API tracing, kernel tracing, and so on. + For more information, see [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest). + + ```{note} + ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change. + ``` + +### ROCm Offline Installer Creator introduced + +The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU +driver, or a combination of the two on a target system without network access. This new tool customizes +multiple unique configurations for use when installing ROCm on a target. 
Other notable features include: + +* A lightweight, easy-to-use user interface for configuring the creation of the installer + +* Support for multiple Linux distributions + +* Installer support for different ROCm releases and specific ROCm components + +* Optional driver or driver-only installer creation + +* Optional post-install preferences + +* Lightweight installer packages, which are unique to the preconfigured ROCm setup + +* Resolution and inclusion of dependency packages for offline installation + +For more information, see +[ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/rocm-install-on-linux/en/latest/install/rocm-offline-installer.html). + +### Math libraries default to Clang instead of HIPCC + +The default compiler used to build the math libraries on Linux changes from `hipcc` to `amdclang++`. +Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when +building the libraries. Applications using the libraries can continue to be compiled using `hipcc` or `amdclang++` as +described in [ROCm compiler reference](https://rocm.docs.amd.com/projects/llvm-project/en/latest/reference/rocmcc.html). +The math libraries can also be built with `hipcc` using any of the previously available methods (for example, the `CXX` +environment variable, the `CMAKE_CXX_COMPILER` CMake variable, and so on). This change shouldn't affect performance or +functionality. + +### Framework and library changes + +This section highlights updates to supported deep learning frameworks and notable third-party library optimizations. + +#### Additional PyTorch and TensorFlow support + +ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16. 
+ +See [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html) +and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/tensorflow-install.html) +for installation instructions. + +Refer to the +[Third-party support matrix](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html#deep-learning) +for a comprehensive list of third-party frameworks and libraries supported by ROCm. + +#### Optimized framework support for OpenXLA + +PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler +ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see +[Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html) +and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/tensorflow-install.html). + +#### PyTorch support for Autocast (automatic mixed precision) + +PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational +workloads and improve performance. Based on the magnitude of values, Autocast can substitute the +original `float32` linear layers and convolutions with their `float16` or `bfloat16` variants. For more information, see +[Automatic mixed precision](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/train-a-model#automatic-mixed-precision-amp). + +#### Memory savings for bitsandbytes model quantization + +The [ROCm-aware bitsandbytes library](https://github.com/ROCm/bitsandbytes) is a lightweight Python wrapper around HIP +custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
+ROCm 6.2.0 introduces the following bitsandbytes changes: + +- `Int8` matrix multiplication is enabled, and it includes the following functions: + - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization. + - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after matmul computation. + - `igemmlt` – new function for GEMM computation A*B^T. It uses + [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations. + - `dequant_mm` – dequantizes the output matrix to the original data type using scaling factors from vector-wise quantization. +- Blockwise quantization – input tensors are quantized for a fixed block size. +- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled. +- 8-bit and 32-bit optimizers are enabled. + +```{note} +These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and +features to run them. +``` + +For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/model-quantization.html). + +#### Improved vLLM support + +ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding +capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama. +ROCm 6.2.0 adds support for the following vLLM features: + +- MP: + + Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP, + use `--distributed-executor-backend=mp`. The default is in flux and depends on the commit. + +- FP8 KV cache: + + Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
+ The QUARK quantizer currently only supports Llama. + +- Triton Flash Attention: + + ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this + setting using the `VLLM_USE_FLASH_ATTN_TRITON=False` environment variable. + +- PyTorch TunableOp: + + Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later. + +For more information about enabling these features, see +[vLLM inference](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference). + +ROCm has a vLLM branch for experimental features. This includes performance improvements as well as accuracy and correctness testing. +These features include: + +- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used + in the model's execution. It currently supports only Llama because the QUARK quantizer supports only Llama. + +- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention + computation in large-scale models. This benefits all workloads in `FP16` configurations. + +To enable these experimental new features, see +[vLLM inference](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference). +Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines +all the accessible features, and the `vllm/Dockerfile.rocm` file can be used to build a Docker image. + +### Enhanced performance tuning on AMD Instinct accelerators + +ROCm is pretuned for high-performance computing workloads including large language models, generative AI, and scientific computing. +The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators.
It includes +detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these +accelerators for optimal performance. For more information, see +[AMD MI300X tuning guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) and +[AMD MI300A system optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300a.html). + +### Removed clang-ocl + +As of version 6.2, ROCm no longer provides the `clang-ocl` package. The project will be archived in the future. +See the [clang-ocl README](https://github.com/ROCm/clang-ocl). + +### ROCm documentation changes + +The documentation for the ROCm components has been reorganized and reformatted with a standard look and feel. This +improves the usability and readability of the documentation. For more information about the ROCm components, see +[What is ROCm?](https://rocm.docs.amd.com/en/latest/what-is-rocm.html). + +Since the release of ROCm 6.1, the documentation has added some key topics including: + +- [AMD Instinct MI300X workload tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html) +- [AMD Instinct MI300X system tuning guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html) +- [AMD Instinct MI300A system tuning guide](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300a.html) +- [Using ROCm for AI](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/index.html) +- [Using ROCm for HPC](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-hpc/index.html) +- [Fine-tuning LLMs and inference optimization](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/index.html) +- [LLVM reference documentation](https://rocm.docs.amd.com/projects/llvm-project/en/latest/) + +The following topics have been significantly improved, expanded, or both: + +- [HIP programming manual](https://rocm.docs.amd.com/projects/HIP/en/latest/) +-
[Compatibility matrix](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html) + +```{note} +All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the +[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html). +``` diff --git a/tools/autotag/templates/support/6.2.0.md b/tools/autotag/templates/support/6.2.0.md new file mode 100644 index 000000000..1585ef2b6 --- /dev/null +++ b/tools/autotag/templates/support/6.2.0.md @@ -0,0 +1,27 @@ + +## Operating system and hardware support changes + +ROCm 6.2.0 adds support for the following operating system and kernel versions. + +- Ubuntu 24.04 LTS (kernel: 6.8 [GA]) + +- RHEL 8.10 (kernel: 4.18.0-544) + +- SLES 15 SP6 (kernel: 6.4) + +ROCm 6.2.0 marks the end of support (EoS) for: + +- Ubuntu 22.04.3 + +- RHEL 9.2 + +- RHEL 8.8 + +- SLES 15 SP 4 + +- CentOS 7.9 + +ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]). + +See the [Compatibility matrix](https://rocm-stg.amd.com/en/docs/6.2.0/compatibility/compatibility-matrix.html) for an +overview of supported operating systems and hardware architectures. diff --git a/tools/autotag/templates/upcoming_changes/6.2.0.md b/tools/autotag/templates/upcoming_changes/6.2.0.md new file mode 100644 index 000000000..191951f81 --- /dev/null +++ b/tools/autotag/templates/upcoming_changes/6.2.0.md @@ -0,0 +1,79 @@ + +## ROCm known issues + +ROCm known issues are noted on [{fab}`github` GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known +issues related to individual components, review the [Detailed component changes](detailed-component-changes). + +### Default processor affinity behavior for helper threads + +Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. 
By default, ROCm +helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to +threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if +the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If +`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when +creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads +by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment, +batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing +the setting may be advantageous to performance. + +To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the +`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows: + +```{code} shell +export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 +``` + +To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable +as follows: + +``` shell +export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1 +``` + +Alternatively, unset the variable to restore the default behavior: + +``` shell + +unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG +``` + +If unsure of the default processor affinity settings for your environment, run the following command from the shell: + +``` shell + +bash -c "taskset -p \$\$" +``` + +### KFDTest failure on Instinct MI300X with Oracle Linux 8.9 + +The `KFDEvictTest.QueueTest` is failing on the MI300X platform during KFD (Kernel Fusion Driver) tests, causing the full +suite to not execute properly. This issue is suspected to be hardware-related.
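
As a rough illustration of the helper-thread affinity setting described under "Default processor affinity behavior for helper threads", the following Python sketch reports the cores in the current process's affinity mask and builds an environment for a child process with `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` set explicitly. The helper names `affinity_report` and `child_env` are hypothetical and not part of ROCm. Treating every value other than `0` like the default is an assumption, since only the unset, `0`, and `1` cases are described in these notes.

```python
import os


def affinity_report(env=None):
    """Summarize the affinity mask and the helper-thread policy implied by env."""
    env = os.environ if env is None else env
    # os.sched_getaffinity is Linux-only; elsewhere assume all CPUs are usable.
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = list(range(os.cpu_count() or 1))
    setting = env.get("HSA_OVERRIDE_CPU_AFFINITY_DEBUG")
    # Per the notes: unset or "1" means helper threads may use all cores,
    # while "0" means they inherit the parent's affinity mask.
    # Assumption: any other value is treated like the default here.
    return {
        "cores": cores,
        "setting": setting,
        "inherits_parent_mask": setting == "0",
    }


def child_env(honor_parent_mask, base=None):
    """Return a copy of the environment with the variable set explicitly."""
    new_env = dict(os.environ if base is None else base)
    new_env["HSA_OVERRIDE_CPU_AFFINITY_DEBUG"] = "0" if honor_parent_mask else "1"
    return new_env


if __name__ == "__main__":
    report = affinity_report()
    print(f"cores in mask: {len(report['cores'])}")
    print(f"helper threads inherit parent mask: {report['inherits_parent_mask']}")
```

One possible use is to pass `child_env(True)` as the `env` argument of `subprocess.run` when launching a ROCm application from a job script that has already restricted the mask with `taskset` or `numactl`, after checking that `report["cores"]` contains enough cores for the additional helper threads.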
+ +### Bandwidth limitation in gang and non-gang modes on Instinct MI300A + +Expected target peak non-gang performance (~60 GB/s) and target peak gang performance (~90 GB/s) are not achieved. Both gang +and non-gang performance are observed to be limited at 45 GB/s. + +This issue will be addressed in a future ROCm release. + +### rocm-llvm-alt + +ROCm provides an optional package -- `rocm-llvm-alt` -- that provides a closed-source compiler for +users interested in additional closed-source CPU optimizations. This feature is not functional in +the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an +LLVM consumer-producer mismatch, and the compilation will fail. There is no workaround that allows +use of the closed-source compiler. It is recommended to compile using the default open-source +compiler, which generates high-quality AMD CPU and AMD GPU code. + +## ROCm upcoming changes + +This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review +the [Detailed component changes](detailed-component-changes). + +### rocm-llvm-alt + +The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the +functionality provided by the closed-source compiler should transition to the open-source compiler. +Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by +the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler +has been removed*".