# ROCm 6.2.0 release notes

The release notes provide a comprehensive summary of changes since the previous ROCm release.

- [Release highlights](release-highlights)
- [Operating system and hardware support changes](operating-system-and-hardware-support-changes)
- [ROCm components versioning](rocm-components)
- [Detailed component changes](detailed-component-changes)
- [ROCm known issues](rocm-known-issues)
- [ROCm upcoming changes](rocm-upcoming-changes)

The [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.

Release notes for previous ROCm releases are available in earlier versions of the documentation. See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest/release/versions.html).

## Release highlights

This section introduces notable new features and improvements in ROCm 6.2. See the [Detailed component changes](#detailed-component-changes) for individual component changes.

### New components

ROCm 6.2.0 introduces the following new components to the ROCm software stack.

- **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line or a GUI dashboard. For more information, see [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest).
- **Omnitrace** -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU. It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining which function and line number are executing. For more information, see [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/latest).
- **rocPyDecode** -- A tool to access rocDecode APIs in Python.
  It connects Python and C/C++ libraries, enabling function calling and data passing between the two languages. The `rocpydecode.so` wrapper library makes the rocDecode APIs, which are written primarily in C/C++, available within Python. For more information, see [rocPyDecode](https://rocm.docs.amd.com/projects/rocPyDecode/en/latest).
- **ROCprofiler-SDK** -- A profiling and tracing library for HIP and ROCm applications, used to identify performance bottlenecks and optimize application performance. The new APIs add restrictions for more efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use. ROCprofiler-SDK also provides a tool library to help you write your tool implementations. `rocprofv3` uses this tool library to profile and trace applications for performance bottlenecks. Examples include API tracing, kernel tracing, and so on. For more information, see [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest).

  ```{note}
  ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
  ```

### ROCm Offline Installer Creator introduced

The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU driver, or a combination of the two on a target system without network access. This new tool customizes multiple unique configurations for use when installing ROCm on a target.

Other notable features include:

* A lightweight, easy-to-use user interface for configuring the creation of the installer
* Support for multiple Linux distributions
* Installer support for different ROCm releases and specific ROCm components
* Optional driver or driver-only installer creation
* Optional post-install preferences
* Lightweight installer packages, which are unique to the preconfigured ROCm setup
* Resolution and inclusion of dependency packages for offline installation

For more information, see [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/install/rocm-offline-installer.html).

### Math libraries default to Clang instead of HIPCC

The default compiler used to build the math libraries on Linux changes from `hipcc` to `amdclang++`. Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when building the libraries. Applications using the libraries can continue to be compiled using `hipcc` or `amdclang++` as described in [ROCm compiler reference](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/reference/rocmcc.html). The math libraries can also be built with `hipcc` using any of the previously available methods (for example, the `CXX` environment variable, the `CMAKE_CXX_COMPILER` CMake variable, and so on). This change shouldn't affect performance or functionality.

### Framework and library changes

This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.

#### Additional PyTorch and TensorFlow support

ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.
See [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html) and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html) for installation instructions.

Refer to the [Third-party support matrix](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/reference/3rd-party-support-matrix.html#deep-learning) for a comprehensive list of third-party frameworks and libraries supported by ROCm.

#### Optimized framework support for OpenXLA

PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html) and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html).

#### PyTorch support for Autocast (automatic mixed precision)

PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational workloads and improve performance. Based on the information about the magnitude of values, Autocast can substitute the original `float32` linear layers and convolutions with their `float16` or `bfloat16` variants. For more information, see [Automatic mixed precision](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/train-a-model.html#automatic-mixed-precision-amp).

#### Memory savings for bitsandbytes model quantization

The [ROCm-aware bitsandbytes library](https://github.com/ROCm/bitsandbytes) is a lightweight Python wrapper around HIP custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions.
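
To make the 8-bit quantization terminology concrete, the following is a minimal pure-Python sketch of blockwise absmax quantization: the input is split into fixed-size blocks, and each block is scaled by its own largest absolute value before rounding to an 8-bit code. This is an illustration only. It does not call the bitsandbytes API or reproduce its HIP kernels, the names `quantize_blockwise_sketch` and `dequantize_blockwise_sketch` are hypothetical, and the linear mapping to the range -127..127 is a simplifying assumption.

```python
def quantize_blockwise_sketch(values, block_size=4):
    """Quantize floats to signed 8-bit codes with one absmax scale per block.

    Hypothetical helper for illustration; not part of bitsandbytes.
    """
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Per-block scale; fall back to 1.0 so an all-zero block is safe.
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        # Map [-absmax, absmax] linearly onto the integer range [-127, 127].
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise_sketch(codes, scales, block_size=4):
    """Map 8-bit codes back to floats using the stored per-block scales."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]
```

Because every block carries its own scale, a single large value only reduces the precision of its own block rather than of the whole tensor, at the cost of storing one scale per block.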

ROCm 6.2.0 introduces the following bitsandbytes changes:

- `Int8` matrix multiplication is enabled, and it includes the following functions:
  - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
  - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after `matmul` computation.
  - `igemmlt` – new function for GEMM computation A*B^T. It uses [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.2.0/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations.
  - `dequant_mm` – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
- Blockwise quantization – input tensors are quantized for a fixed block size.
- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled.
- 8-bit and 32-bit optimizers are enabled.

```{note}
These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and features to run them.
```

For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/model-quantization.html).

#### Improved vLLM support

ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama.

ROCm 6.2.0 adds support for the following vLLM features:

- MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP, use `--distributed-executor-backend=mp`. The default varies by commit because it is still in flux.
- FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements. The QUARK quantizer currently only supports Llama.
- Triton Flash Attention: ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this setting using the `VLLM_USE_FLASH_ATTN_TRITON=False` environment variable.
- PyTorch TunableOp: Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.

For more information about enabling these features, see [vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).

ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing. These features include:

- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used in the model's execution. It only supports Llama because the QUARK quantizer currently only supports Llama.
- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention computation in large-scale models. This benefits all workloads in `FP16` configurations.

To enable these experimental new features, see [vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference). Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines all the accessible features, and the `vllm/Dockerfile.rocm` file can be used.

### Enhanced performance tuning on AMD Instinct accelerators

ROCm is pre-tuned for high-performance computing workloads including large language models, generative AI, and scientific computing. The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these accelerators for optimal performance.
For more information, see [AMD MI300X tuning guides](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/index.html) and [AMD MI300A system optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html).

### Removed clang-ocl

As of version 6.2, ROCm no longer provides the `clang-ocl` package. See the [clang-ocl README](https://github.com/ROCm/clang-ocl).

### ROCm documentation changes

The documentation for the ROCm components has been reorganized and reformatted to a standard look and feel. This improves the usability and readability of the documentation. For more information about the ROCm components, see [What is ROCm?](https://rocm.docs.amd.com/en/docs-6.2.0/what-is-rocm.html).

Since the release of ROCm 6.1, the documentation has added some key topics including:

- [AMD Instinct MI300X workload tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/workload.html)
- [AMD Instinct MI300X system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300A system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html)
- [Using ROCm for AI](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/index.html)
- [Using ROCm for HPC](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-hpc/index.html)
- [Fine-tuning LLMs and inference optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/index.html)
- [LLVM reference documentation](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/)

The following topics have been significantly improved, expanded, or both:

- [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-6.2.0/)
- [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html)

```{note}
All ROCm projects are open source and available on GitHub.
To contribute to ROCm documentation, see the [ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
```

## Operating system and hardware support changes

ROCm 6.2.0 adds support for the following operating system and kernel versions.

- Ubuntu 24.04 LTS (kernel: 6.8 [GA])
- RHEL 8.10 (kernel: 4.18.0-544)
- SLES 15 SP6 (kernel: 6.4)

ROCm 6.2.0 marks the end of support (EoS) for:

- Ubuntu 22.04.3
- RHEL 9.2
- RHEL 8.8
- SLES 15 SP4
- CentOS 7.9

ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).

See the [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) for an overview of supported operating systems and hardware architectures.

## ROCm components

The following table lists ROCm components and their individual versions for ROCm 6.2.0. Follow the links in the *Version* column to go to the detailed component changelogs.


| Category | Group | Name | Version |
|---|---|---|---|
| Libraries | Machine learning and computer vision | Composable Kernel | 1.1.0 ⇒ 1.1.0 |
| | | MIGraphX | 2.9 ⇒ 2.10 |
| | | MIOpen | 3.1.0 ⇒ 3.2.0 |
| | | MIVisionX | 2.5.0 ⇒ 3.0.0 |
| | | rocAL | 1.0.0 ⇒ 1.0.0 |
| | | rocDecode | 0.6.0 ⇒ 0.6.0 |
| | | rocPyDecode | 0.1.0 |
| | | RPP | 1.5.0 ⇒ 1.8.0 |
| | Communication | RCCL | 2.18.6 ⇒ 2.20.5 |
| | Math | hipBLAS | 2.1.0 ⇒ 2.2.0 |
| | | hipBLASLt | 0.7.0 ⇒ 0.8.0 |
| | | hipFFT | 1.0.14 ⇒ 1.0.15 |
| | | hipfort | 0.4.0 |
| | | hipRAND | 2.10.17 ⇒ 2.11.0 |
| | | hipSOLVER | 2.1.1 ⇒ 2.2.0 |
| | | hipSPARSE | 3.0.1 ⇒ 3.1.1 |
| | | hipSPARSELt | 0.2.0 ⇒ 0.2.1 |
| | | rocALUTION | 3.1.1 ⇒ 3.2.0 |
| | | rocBLAS | 4.1.2 ⇒ 4.2.0 |
| | | rocFFT | 1.0.27 ⇒ 1.0.28 |
| | | rocRAND | 3.0.1 ⇒ 3.1.0 |
| | | rocSOLVER | 3.25.0 ⇒ 3.26.0 |
| | | rocSPARSE | 3.1.2 ⇒ 3.2.0 |
| | | rocWMMA | 1.4.0 ⇒ 1.5.0 |
| | | Tensile | 4.40.0 ⇒ 4.41.0 |
| | Primitives | hipCUB | 3.1.0 ⇒ 3.2.0 |
| | | hipTensor | 1.2.0 ⇒ 1.3.0 |
| | | rocPRIM | 3.1.0 ⇒ 3.2.0 |
| | | rocThrust | 3.0.0 ⇒ 3.1.0 |
| Tools | System management | AMD SMI | 24.5.2 ⇒ 24.6.2 |
| | | rocminfo | 1.0.0 |
| | | ROCm Data Center Tool | 0.3.0 ⇒ 1.0.0 |
| | | ROCm SMI | 7.2.0 ⇒ 7.3.0 |
| | | ROCm Validation Suite | 1.0.0 ⇒ 1.0.0 |
| | Performance | Omniperf | 2.0.1 |
| | | Omnitrace | 1.11.2 |
| | | ROCm Bandwidth Test | 1.4.0 |
| | | ROCProfiler | 2.0.0 ⇒ 2.0.0 |
| | | ROCprofiler-SDK | 0.4.0 |
| | | ROCTracer | 4.1.0 |
| | Development | HIPIFY | 17.0.0 ⇒ 18.0.0 |
| | | ROCdbgapi | 0.71.0 ⇒ 0.76.0 |
| | | ROCm CMake | 0.12.0 ⇒ 0.13.0 |
| | | ROCm Debugger (ROCgdb) | 14.1 ⇒ 14.2 |
| | | ROCr Debug Agent | 2.0.3 |
| Compilers | | HIPCC | 1.0.0 ⇒ 1.1.1 |
| | | llvm-project | 17.0.0 ⇒ 18.0.0 |
| Runtimes | | HIP | 6.1 ⇒ 6.2.0 |
| | | ROCr Runtime | 1.13.0 ⇒ 1.14.0 |


## Detailed component changes

The following sections describe key changes to ROCm components.

### **AMD SMI** (24.6.2)

#### Changes

- Added the following functionality:
  - `amd-smi dmon` is now available as an alias to `amd-smi monitor`.
  - An optional process table under `amd-smi monitor -q`.
  - Handling to detect VMs with passthrough configurations in CLI tool.
  - Process Isolation and Clear SRAM functionality to the CLI tool for VMs.
- Added Ring Hang event.
- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`.
- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`.

#### Removals

- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported.
- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`.

#### Optimizations

- Updated CLI error strings to specify invalid device type queried.
- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest.
  - Added `amdsmi.h` and `amdsmi_interface.py`.
  - `amdsmi_clk_info_t` struct
  - Added `AMDSMI` prefix to multiple structures.
- Updated `dpm_policy` references to `soc_pstate`.
- Updated the `amdsmi_get_gpu_board_info()` `product_name` to fall back to the `pciids` file.
- `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`.
- Updated CLI voltage curve command output.

#### Resolved issues

- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format.
- `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`.
- Fixed a leftover mutex deadlock when running multiple instances of the CLI tool.
  When running `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either clearing the mutex in `/dev/shm` or rebooting the machine.

#### Known issues

- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work. They will be supported in a future release. See [issue #3500](https://github.com/ROCm/ROCm/issues/3500) on GitHub.

```{note}
See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md) on GitHub for more information.
```

### **Composable Kernel** (1.1.0)

#### Changes

- Added support for:
  - Permute scale for any dimension (#1198).
  - Combined elementwise op (#1217).
  - Multi D in grouped convolution backward weight (#1280).
  - K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280).
  - Large batch in grouped convolution forward (#1332).
- Added `CK_TILE` layernorm example (#1339).
- `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as ROCm backend.

#### Optimizations

- Support universal GEMM in grouped convolution forward (#1320).
- Optimizations for low M and N in grouped convolution backward weight (#1303).
- Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel.
- `FP8` GEMM performance optimization and tuning (#1384).
- Added FlashAttention backward pass performance optimization (#1397).

### **HIP** (6.2.0)

#### Changes

- Added the `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()` and `__any_sync()`. These take a 64-bit integer as an explicit mask argument.
  - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled unconditionally in a future HIP release.
- Added new HIP APIs:
  - `hipGetProcAddress` returns the pointer to driver function, corresponding to the defined driver function symbol.
  - `hipGetFuncBySymbol` returns the pointer to device entry function that matches entry function `symbolPtr`.
  - `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph.
  - `hipGraphInstantiateWithParams` creates an executable graph from a graph.
- Added a new flag `integrated` -- supported in device property.
  - The `integrated` flag is added to the struct `hipDeviceProp_t`. On an integrated APU system, the runtime driver detects this and sets the flag to `1`, in which case `hipDeviceGetAttribute` returns the value 1 for the `hipDeviceAttribute_t` enum value `hipDeviceAttributeIntegrated`, indicating an integrated GPU device.
- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include `.
- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`.

#### Resolved issues

- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime.
- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`.

#### Upcoming changes

- The `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release.

### **hipBLAS** (2.2.0)

#### Changes

* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments.
* Added a new ILP64 API for level 1 `_ex` functions.
* The `install.sh` script now invokes the `rmake.py` script. Made other various improvements to the build scripts.
* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages `rocblas-dev` and `rocsolver-dev`.
* Updated Linux AOCL dependency to release 4.2 `gcc` build.
* Updated Windows `vcpkg` dependencies to release 2024.02.14.

### **hipBLASLt** (0.8.0)

#### Changes

* Added extension APIs:
  * `hipblasltExtAMaxWithScale`.
  * `GemmTuning` extension parameter to set `wgm` by user.
* Added support for:
  * `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype.
  * `FP8`/`BF8` input, `FP32/FP16/BF16/F8/BF8` output (gfx94x platform only).
  * `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA.
* Added support for gfx110x.

#### Optimizations

* Improved library loading time.

### **HIPCC** (1.1.1)

#### Changes

* Split `hipcc` package into two packages for different hardware platforms.
* Cleaned up references to environment variables.
* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by default, instead of their Perl counterparts.
* Enabled function calls.
* Added support for generating packages for ROCm stack targeting static libraries.

#### Resolved issues

* Implemented numerous bug fixes and quality improvements.

### **hipCUB** (3.2.0)

#### Changes

* Added `DeviceCopy` function for parity with CUB.
* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as the new optional template argument for `WarpExchange`.
  * The potential values for the enum are `WARP_EXCHANGE_SMEM` and `WARP_EXCHANGE_SHUFFLE`.
  * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
  * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, but the `ItemsPerThread` must be a divisor of `WarpSize`.
* Added `tuple.hpp` which defines templates `hipcub::tuple`, `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
* Added new overloaded member functions to `BlockRadixSort` and `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type (`key_type`) can be sorted via these overloads, if an appropriate decomposer is passed.
  The decomposer has to implement `operator()(const key_type&)`, which returns a `hipcub::tuple` of references pointing to members of `key_type`.
* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of HIP graphs, with several exceptions:
  * `CachingDeviceAllocator`
  * `GridBarrierLifetime`
  * `DeviceSegmentedRadixSort`
  * `DeviceRunLengthEncode`

  Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs.

#### Removals

* `debug_synchronous` was deprecated in hipCUB 2.13.2 and no longer has any effect. With this release, passing `debug_synchronous` to the device functions results in a deprecation warning both at runtime and at compile time.
  * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
  * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.

#### Resolved issues

* Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. It now derives the accumulator type as the result of the binary operator.

### **hipFFT** (1.0.15)

#### Resolved issues

* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers.
* Prevented C++ exceptions leaking from public API functions.
* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs.

### **HIPIFY** (18.0.0)

#### Changes

- Added support for:
  - NVIDIA CUDA 12.4.1
  - cuDNN 9.1.1
  - LLVM 18.1.6
- Added full hipBLASLt support.

#### Resolved issues

- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object; affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`.

### **hipRAND** (2.11.0)

#### Changes

* Added support for setting generator output ordering in C and C++ API.
* `hiprandCreateGeneratorHost` dispatches to the host generator in the rocRAND backend instead of returning with `HIPRAND_STATUS_NOT_IMPLEMENTED`.
* Added options to create:
  * A host generator to the Fortran wrapper.
  * A host generator to the Python wrapper.
* The environment variable used for internal testing with HMM, previously named `ROCRAND_USE_HMM`, is now named `HIPRAND_USE_HMM`.
* Static library -- moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
* Device API documentation is improved in this version.

#### Removals

* Removed the option to build hipRAND as a submodule to rocRAND.
* Removed references to, and workarounds for, the deprecated `hcc`.
* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`. Use `ROCRAND_PATH` instead.

#### Resolved issues

* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`.

### **hipSOLVER** (2.2.0)

#### Changes

- Added compatibility-only functions:
  - `auxiliary`
    - `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions`
  - `getrf`
    - `hipsolverDnXgetrf_bufferSize`
    - `hipsolverDnXgetrf`
  - `getrs`
    - `hipsolverDnXgetrs`
- Added support for building on Ubuntu 24.04 and CBL-Mariner.
- Added `hip::host` to `roc::hipsolver` usage requirements.
- Added functions:
  - `syevdx`/`heevdx`
    - `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize`
    - `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx`
  - `sygvdx`/`hegvdx`
    - `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize`
    - `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx`
- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.
- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`.

#### Removals

- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients.

### **hipSPARSE** (3.1.1)

#### Changes

* Added the missing `hipsparseCscGet()` routine.
* All internal hipSPARSE functions now exist inside a namespace.
* Matched deprecations found in cuSPARSE 12.x.x when using cuSPARSE backend.
* Improved the user manual and contribution guidelines.

#### Resolved issues

* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior.

#### Known issues

* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API, and this extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and passed as `nullptr` because `hipsparseSpSM_solve()` does not use it internally.

### **hipSPARSELt** (0.2.1)

#### Optimizations

* Refined test cases.

### **hipTensor** (1.3.0)

#### Changes

* Added support for:
  * Tensor permutation of ranks of 2, 3, 4, 5, and 6
  * Tensor contraction of M6N6K6: M, N, K up to rank 6
* Added tests for:
  * Tensor permutation of ranks of 2, 3, 4, 5, and 6
  * Tensor contraction of M6N6K6: M, N, K up to rank 6
* YAML parsing to support sequential parameters ordering.
* Prefer `amd-llvm-devel` package before system LLVM library.
* Preferred compilers changed to `CC=amdclang` `CXX=amdclang++`.
* Updated actor-critic selection for new contraction kernel additions.
* Updated installation, programmer's guide, and API reference documentation.

#### Resolved issues

* Fixed LLVM parsing crash.
* Fixed memory consumption issue in complex kernels.
* Workaround implemented for compiler crash during debug build.
* Allow random modes ordering for tensor contractions.

### **llvm-project** (18.0.0)

#### Changes

* LLVM IR
  * The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address spaces.
  * Added `llvm.exp10` intrinsic.
* LLVM infrastructure
  * The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`.
* TableGen
  * Added constructs for debugging TableGen files:
    * `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793).
    * `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716).
* AArch64 backend
  * Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.
* AMDGPU backend
  * `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access.
  * Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics.
  * Implemented `llvm.get.rounding`.
* ARM backend
  * Added support for Cortex-M52 CPUs.
  * Added execute-only support for Armv6-M.
* RISC-V backend
  * The `Zfa` extension version was upgraded to 1.0 and is no longer experimental.
  * `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental.
  * Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`. See [Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification.
  * Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64.
  * The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`, `Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension versions were upgraded to 1.0 and are no longer experimental. However, the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions, `-menable-experimental-extensions` needs to be passed to Clang.
  * `-mcpu=sifive-p450` was added.
  * CodeGen of `RV32E` and `RV64E` is supported experimentally.
  * CodeGen of `ilp32e` and `lp64e` is supported experimentally.
* X86 backend
  * Added support for the RDMSRLIST and WRMSRLIST instructions.
  * Added support for the WRMSRNS instruction.
  * Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction.
  * Support ISA of CMPCCXADD.
  * Support ISA of AVX-IFMA.
  * Support ISA of AVX-VNNI-INT8.
  * Support ISA of AVX-NE-CONVERT.
  * `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported.
  * `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported.
  * `__builtin_unpredictable` (unpredictable metadata in LLVM IR) is handled by the X86 backend. The X86CmovConversion pass now respects this builtin and does not convert CMOVs to branches.
  * Add support for the PBNDKB instruction.
  * Support ISA of SHA512.
  * Support ISA of SM3.
  * Support ISA of SM4.
  * Support ISA of AVX-VNNI-INT16.
  * `-mcpu=graniterapids-d` is now supported.
  * The `i128` type now matches GCC and clang’s `__int128` type. This mainly benefits external projects such as Rust which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type matched and called into `libgcc` helper functions.
  * Support ISA of USER_MSR.
  * Support ISA of AVX10.1-256 and AVX10.1-512.
  * `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported.
  * `-mapxf` is supported.
* Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow 32-bit relocations or require 64-bit relocations to the global variable. * The medium code model’s code generation was audited to be more similar to the small code model where possible. * C API * Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions. * Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target ABI) and provides default values for unspecified settings. * Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and `LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions. * Added the following functions for manipulating operand bundles, as well as building call and invoke instructions that use operand bundles: * `LLVMBuildCallWithOperandBundles` * `LLVMBuildInvokeWithOperandBundles` * `LLVMCreateOperandBundle` * `LLVMDisposeOperandBundle` * `LLVMGetNumOperandBundles` * `LLVMGetOperandBundleAtIndex` * `LLVMGetNumOperandBundleArgs` * `LLVMGetOperandBundleArgAtIndex` * `LLVMGetOperandBundleTag` * Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flag. * CodeGen infrastructure * A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e. `-debug-only=isel-dump`). This new debug type can be filtered by function names using `-filter-print-funcs=`, the same flag used to filter IR dumps after each Pass. 
Note that the existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single function regardless of `-filter-print-funcs`’s values. * Metadata info * Added a new loop metadata `!{!"llvm.loop.align", i32 64}`. * LLVM tools * `llvm-symbolizer` now treats invalid input as an address for which source information is not found. * `llvm-readelf` now supports `--extra-sym-info` (`-X`) to display extra information (section name) when showing symbols. * `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now `{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers accordingly. * `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`. * `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers. * `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names. * `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only. * LLDB * Added the `SBType::FindDirectNestedType` function. It lets formatters quickly find a directly nested type when it’s known where to search for it, avoiding a more expensive global search via `SBTarget::FindFirstType`. * Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying functionality remains unchanged. * The `mte_ctrl` register can now be read from AArch64 Linux core files.
* LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2 (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation. * LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism of specifying DEBUGINFOD servers in the `DEBUGINFOD_URLS` environment variable is used by default. In addition, users can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`, overriding or adding to the environment variable. * When running on AArch64 Linux, `lldb-server` now provides register field information for the following registers: `cpsr`, `fpcr`, `fpsr`, `svcr` and `mte_ctrl`. * Sanitizers * HWASan now defaults to detecting use-after-scope bugs. #### Removals * LLVM IR * The constant expression variants of the following instructions have been removed: * `and` * `or` * `lshr` * `ashr` * `zext` * `sext` * `fptrunc` * `fpext` * `fptoui` * `fptosi` * `uitofp` * `sitofp` * RISC-V backend * XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with SiFive to define and document real extension names for individual CSRs and instructions. * Python bindings * The Python bindings have been removed. * C API * The following functions for creating constant expressions have been removed, because the underlying constant expressions are no longer supported.
Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which will constant fold the operands if possible and create an instruction otherwise: * `LLVMConstAnd` * `LLVMConstOr` * `LLVMConstLShr` * `LLVMConstAShr` * `LLVMConstZExt` * `LLVMConstSExt` * `LLVMConstZExtOrBitCast` * `LLVMConstSExtOrBitCast` * `LLVMConstIntCast` * `LLVMConstFPTrunc` * `LLVMConstFPExt` * `LLVMConstFPToUI` * `LLVMConstFPToSI` * `LLVMConstUIToFP` * `LLVMConstSIToFP` * `LLVMConstFPCast` * CodeGen infrastructure * `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets should use backwards frame index elimination instead. * `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging instead, which is preferred because it does not depend on accurate kill flags. * LLDB * `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable. * Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value indicating failure. #### Resolved issues * AArch64 backend * Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the exception of crypto. * Windows target * The LLVM filesystem class `UniqueID` and function `equivalent()` no longer determine that different path names for the same hard-linked file are equal. This is an intentional tradeoff in a bug fix, where the bug used to cause distinct files to be considered equivalent on some file systems.
This change fixed the GitHub issues [#61401](https://github.com/llvm/llvm-project/issues/61401) and [#22079](https://github.com/llvm/llvm-project/issues/22079).

#### Known issues

The compiler may incorrectly compile a program that uses the ``__shfl(var, srcLane, width)`` function when one of the parameters to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior.

```{note}
The ``-Wall`` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.
```

As a workaround, initialize the parameters to ``__shfl``. For example:

```{code-block} cpp
unsigned long istring = 0; // Initialize the input to __shfl
return __shfl(istring, 0, 64);
```

See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub.

### **MIGraphX** (2.10.0)

#### Changes

- Added support for ONNX Runtime MIGraphX EP on Windows.
- Added `FP8` Python API.
- Added examples for SD 2.1 and SDXL.
- Added support for BERT to Dynamic Batch.
- Added a `--test` flag in `migraphx-driver` to validate the installation.
- Added support for ONNX Operator: Einsum.
- Added `uint8` support in ONNX Operators.
- Added Split-k kernel configurations for performance improvements.
- Added fusion for group convolutions.
- Added rocMLIR conv3d support.
- Added rocgdb to the Dockerfile.
- Changed default location of libraries with release specific ABI changes.
- Reorganized documentation in GitHub.

#### Removals

- Removed the `--model` flag with `migraphx-driver`.

#### Optimizations

- Improved ONNX Model Zoo coverage.
- Reorganized `memcpys` with ONNX Runtime to improve performance.
- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.
- Improved MLIR kernel selection for multibroadcasted GEMMs.
- Improved details of the perf report.
- Enable mlir by default for GEMMs with small K.
- Allow specifying dot or convolution fusion for mlir with environmental flag.
- Improve performance on small reductions by doing multiple reduction per wavefront. - Add additional algebraic simplifications for mul-add-dot sequence of operations involving constants. - Use MLIR attention kernels in more cases. - Enables MIOpen and CK fusions for MI300 gfx arches. - Support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs. - Added Fusion of "contiguous + pointwise" and "layout + pointwise" operations which may result in performance gains in certain cases. - Added Fusion for "pointwise + layout" and "pointwise + contiguous" operations which may result in performance gains when using NHWC layout. - Added Fusion for "pointwise + concat" operation which may help in performance in certain cases. - Fixes a bug in "concat + pointwise" fusion where output shape memory layout wasn't maintained. - Simplifies "slice + concat" pattern in SDXL UNet. - Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if zero points values are zeros. - Improved inference performance by fusing Reduce to Broadcast. - Added additional information when printing the perf report. - Improve scalar fusions when not all strides are 0. - Added support for multi outputs in pointwise ops. - Improve reduction fusion with reshape operators. - Use the quantized output when an operator is used again. - Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all Hardware. #### Resolved issues - Fixed Super Resolution model verification failed with `FP16`. - Fixed confusing messages by suppressing them when compiling the model. - Fixed an issue causing the mod operator with `int8` and `int32` inputs. - Fixed an issue by preventing the spawning too many threads for constant propagation when parallel STL is not enabled. - Fixed a bug when running `migraphx-driver` with the `--run 1` option. - Fixed Layernorm accuracy: calculations in `FP32`. 
- Updated the Docker generator script to ROCm 6.1 to point at Jammy. - Fixed a floating point exception for `dim (-1)` in the reshape operator. - Fixed an issue with `int8` accuracy and models which were failing due to requiring a fourth bias input. - Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix. - Fixed the order of operations for `int8` quantization, which was causing inaccuracies and slowdowns. - Fixed an issue during compilation caused by the incorrect constructor being used at compile time. Removed list initializer of `prefix_scan_sum` which was causing issues during compilation. - Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control the number of threads used for parallel compilation. ### **MIOpen** (3.2.0) #### Changes - Added: - [Conv] bilinear (alpha beta) solvers. - [Conv] enable bf16 for ck-based solvers. - [Conv] Add split_k tuning to 2d wrw ck-based solver. - [MHA] graph API fp8 fwd. - [RNN] multi-stream as default solution. - Added TunaNetv2.0 for MI300. - Added Adam and AMP Adam optimizer. #### Resolved issues - Fixed a memory access fault caused by `GemmBwdRest`. - Fixed context configuration in `GetWorkSpaceSize`. - Added fixes to support huge tensors. #### Optimizations - Find: improved precision of benchmarking. ### **MIVisionX** (3.0.0) #### Changes - Added support for: - Advanced GPUs - PreEmphasis Filter augmentation in openVX extensions - Spectrogram augmentation in openVX extensions - Downmix and ToDecibels augmentations in openVX extensions - Resample augmentation and Operator overloading nodes in openVX extensions - NonSilentRegion and Slice augmentations in openVX extensions - Mel-Filter bank and Normalize augmentations in openVX extensions #### Removals - Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL). #### Resolved issues - Fixed issues with dependencies.
#### Known issues - MIVisionX package install requires manual prerequisites installation. ### **Omniperf** (2.0.1) #### Known issues - Omniperf reports an error when profiling an application that takes command-line arguments. As a workaround, create an intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This issue will be fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/omniperf/issues/347). - Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, set the environment variable: `export ROCPROF=rocprofv2`. - Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message: "*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.* *[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.* *Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.* *See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*" As a workaround, install these Python requirements manually: `pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`. See [issue #3498](https://github.com/ROCm/ROCm/issues/3498) on GitHub. ### **OpenMP** (17.0.0) #### Changes - Added basic experimental support for ``libc`` functions on the GPU via the LLVM C Library for GPUs. - Added minimal support for calling host functions from the device using the `libc` interface. - Added vendor agnostic OMPT callback support for OpenMP-based device offload. #### Removals - Removed the "old" device plugins along with support for the `remote` and `ve` plugins. #### Resolved issues - Fixed the implementation of `omp_get_wtime` for AMDGPU targets.
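
For context on the `omp_get_wtime` fix, the following is a minimal host-side sketch of how the function is typically used to time a region. It does not exercise the AMDGPU device path that the fix concerns. The helper names `wall_seconds` and `timed_sum` are hypothetical, and the `std::chrono` fallback is an assumption added so the sketch also builds without OpenMP enabled.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h> // omp_get_wtime
#endif

// Hypothetical helper: wall-clock time in seconds.
// Uses omp_get_wtime() when OpenMP is enabled (for example, with -fopenmp);
// otherwise falls back to std::chrono (an assumption for portability).
inline double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Hypothetical workload: sums a vector on the host and stores the
// elapsed wall-clock seconds in `elapsed`.
inline double timed_sum(const std::vector<double>& data, double& elapsed) {
    const double start = wall_seconds();
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
    }
    elapsed = wall_seconds() - start;
    assert(elapsed >= 0.0); // elapsed time should never be negative
    return sum;
}
```

Calling `timed_sum(values, elapsed)` returns the sum and leaves the measured duration in `elapsed`. Taking the difference of two `omp_get_wtime` readings, rather than using a single reading, is what makes the measurement independent of the timer's starting point.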
### **RCCL** (2.20.5) #### Changes - Added support for `fp8` and `rccl_bfloat8`. - Added support for using HIP contiguous memory. - Added ROC-TX for host-side profiling. - Added new rome model. - Added `fp16` and `fp8` cases to unit tests. - Added a new unit test for main kernel stack size. - Added the new `-n` option for `topo_expl` to override the number of nodes. - Improved debug messages of memory allocations. - Enabled static build. - Enabled compatibility with: - NCCL 2.20.5. - NCCL 2.19.4. - Tuned performance for some collective operations on MI300. - Enabled NVTX code in RCCL. - Replaced `rccl_bfloat16` with `hip_bfloat16`. - NPKit updates: - Warm-up iteration removal is no longer enabled by default; you now need to opt in. - Doubled the size of buffers to accommodate more channels. - Modified rings to be rail-optimized topology friendly. #### Resolved issues - Fixed a bug when configuring RCCL for only LL128 protocol. - Fixed scratch memory allocation after API change for MSCCL. ### **rocAL** (1.0.0) #### Changes - Added tests and samples. #### Removals - Removed CuPy from `setup.py`. #### Optimizations - Added setup and install updates. #### Resolved issues - Minor bug fixes. ### **rocALUTION** (3.2.0) #### Changes * Added new file I/O based on rocSPARSE I/O format. * Added `GetConvergenceHistory` for ItILU0 preconditioner. #### Removals * Deprecated the following: * `LocalMatrix::ReadFileCSR` * `LocalMatrix::WriteFileCSR` * `GlobalMatrix::ReadFileCSR` * `GlobalMatrix::WriteFileCSR` ### **rocBLAS** (4.2.0) #### Changes * Added ILP64 APIs (`_64` name suffix) with `int64_t` function arguments for Level 2 functions and Level 3 `trsm`, for both C and FORTRAN. * Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`. * Added Benchmark class for common timing code. * Added an environment variable `ROCBLAS_DEFAULT_ATOMICS_MODE` to set the default atomics mode during creation of `rocblas_handle`.
* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`. * Updated Linux AOCL dependency to release 4.2 gcc build. * Updated Windows vcpkg dependencies to release 2024.02.14. * Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40. #### Optimizations * Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator. #### Removals * Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next major release of rocBLAS. Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage. ### **ROCdbgapi** (0.76.0) #### Removals - Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`. The old names are still accessible but deprecated. ### **rocDecode** (0.6.0) #### Changes - Added full H.264 support and bug fixes. ### **rocFFT** (1.0.28) #### Changes * Randomly generated accuracy tests are now disabled by default. They can be enabled using the `--nrand` option (which defaults to 0). #### Optimizations * Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks are transformed locally, with global transposes to make remaining dimensions contiguous. ### **rocm-cmake** (0.13.0) #### Changes - `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds. - Package names are no longer pulled from `CPACK__PACKAGE_NAME`. - Runtime packages will no longer be generated for static builds. ### **ROCm Data Center Tool** (1.0.0) #### Changes - Added ROCProfiler `dmon` metrics. - Added new ECC metrics. - Added ROCm Validation Suite diagnostic command. 
- Fully migrated to AMD SMI. #### Removals - Removed RASLIB dependency and blobs. - Removed `rocm_smi_lib` dependency due to migration to AMD SMI. ### **ROCm Debugger (ROCgdb)** (14.2) #### Changes - Introduced the coremerge utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile. - Added support for generating and opening core files for heterogeneous processes. ### **ROCm SMI** (7.3.0) #### Changes - Added Partition ID API (`rsmi_dev_partition_id_get(..)`). #### Resolved issues - Fixed Partition ID CLI output. ```{note} See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md) on GitHub for more information. ``` ### **ROCm Validation Suite** (1.0.0) #### Changes * Added stress tests: * IET (power) stress test for MI300A. * IET (power transition) test for MI300X. * Added support: * GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output. * Trigonometric float and random integer matrix data initialization support. * Updated GST performance benchmark test for better numbers. ### **rocPRIM** (3.2.0) #### Changes * Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp. * The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter. * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan). This is the same behaviour as before this change. * Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element.
* Added new public APIs for deriving resulting type on device-only functions: * `rocprim::invoke_result` * `rocprim::invoke_result_t` * `rocprim::invoke_result_binary_op` * `rocprim::invoke_result_binary_op_t` * Added the new `rocprim::batch_copy` function. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy. * Added more test cases, to better cover supported data types. * Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`. To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple` of references to arithmetic types from the key. * Added `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate. It iterates over proxies that hold references to the underlying values, but only allows reading and writing if the predicate is `true`. It can be instantiated with: * `rocprim::make_predicate_iterator` * `rocprim::make_mask_iterator` * Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4; it can be a number between 0 and 32. * Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed. * Updated some tests to work with supported data types. #### Optimizations * Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`. * Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warp size is equal to the items per thread. #### Resolved issues * Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is larger than the logical warp size. The test suite has been updated with such cases.
* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`. * Fixed incorrect output type used in `device_adjacent_difference`. * Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types. * Fixed an issue with `device_adjacent_difference`. It now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage. * Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`. * Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64. #### Removals * Deprecated the internal header `detail/match_result_type.hpp`. * Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`. * Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name. ### **ROCProfiler** (2.0.0) #### Removals - Removed `pcsampler` sample code due to deprecation from version 2. ### **rocRAND** (3.1.0) #### Changes * Added `rocrand_create_generator_host`. * The following generators are supported: * `ROCRAND_RNG_PSEUDO_MRG31K3P` * `ROCRAND_RNG_PSEUDO_MRG32K3A` * `ROCRAND_RNG_PSEUDO_PHILOX4_32_10` * `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20` * `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20` * `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20` * `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20` * `ROCRAND_RNG_PSEUDO_XORWOW` * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32` * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64` * `ROCRAND_RNG_QUASI_SOBOL32` * `ROCRAND_RNG_QUASI_SOBOL64` * The host-side generators support multi-core processing. 
On Linux, this requires the TBB (Threading Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives). * If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread. * Added the option to create a host generator to the Python wrapper. * Added the option to create a host generator to the Fortran wrapper. * Added dynamic ordering. This ordering is free to rearrange the produced numbers, which can be specific to devices and distributions. It is implemented for: * XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry * Added support for using Clang as the host compiler for alternative platform compilation. * C++ wrapper: * Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`. * Added optional order parameter to the constructor of `mt19937_engine`. * Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator: * `rocrand_normal2` * `rocrand_normal_double2` * `rocrand_log_normal2` * `rocrand_log_normal_double2` * Added `rocrand_create_generator_host_blocking` which dispatches without stream semantics. * Added host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`. * Added offset and skipahead functionality to LFSR113 generator. * Added dynamic ordering for architecture `gfx1102`. * For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few things to be aware of: - Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph. - After the generator is created, you may call API functions to set its seed, offset, and order.
- After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph. - A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour. - API calls for the poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs. - For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp` * Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However consuming rocRAND is still possible from C++11 as public headers don't make use of the new features. * Building rocRAND should be faster on machines with multiple CPU cores as the library has been split to multiple compilation units. * C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`. * Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE` * Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking. #### Removals * Deprecated the following typedefs. Please use the unified `state_type` alias instead. * `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state` * `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state` * `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state` * `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state` * Deprecated the following internal headers: * `src/rng/distribution/distributions.hpp`. * `src/rng/device_engines.hpp`. * Removed references to and workarounds for deprecated hcc. * Removed support for HIP-CPU. #### Known Issues - `SOBOL64` and `SCRAMBLED_SOBOL64` generate poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release. 
### **ROCr Runtime** (1.14.0) #### Changes - Added PC sampling feature (experimental feature). ### **rocSOLVER** (3.26.0) #### Changes - Added 64-bit APIs for existing functions: - GETF2_64 (with `batched` and `strided_batched` versions) - GETRF_64 (with `batched` and `strided_batched` versions) - GETRS_64 (with `batched` and `strided_batched` versions) - Added gfx900 to default build targets. - Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection: - SYEVDX (with `batched` and `strided_batched` versions) - HEEVDX (with `batched` and `strided_batched` versions) - Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection: - SYGVDX (with `batched` and `strided_batched` versions) - HEGVDX (with `batched` and `strided_batched` versions) - Renamed install script arguments of the form `*_dir to *-path`. Arguments of the form `*_dir` remain functional for backwards compatibility. - Functions working with arrays of size n - 1 can now accept null pointers when n = 1. #### Optimizations - Improved performance of Cholesky factorization. - Improved performance of `splitlu` to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U. #### Resolved issues - Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues. ### **rocSPARSE** (3.2.0) #### Changes * Added a new Merge-Path algorithm to SpMM, supporting CSR format. * Added support for row order to SpSM. * Added rocsparseio I/O functionality to the library. * Added `rocsparse_set_identity_permutation`. * Adjusted rocSPARSE dependencies to related HIP packages. * Binary size has been reduced. * A namespace has been wrapped around internal rocSPARSE functions and kernels. 
* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be nullptr if `nnz` is 0. * gfx803 target has been removed from address sanitizer builds. #### Optimizations * SpMV adaptive and LRB algorithms have been further optimized on CSR format. * Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format. * Improved documentation and contribution guidelines. #### Resolved issues * Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`. ### **rocThrust** (3.1.0) #### Changes * Added changes from upstream CCCL/thrust 2.2.0. * Updated the contents of `system/hip` and `test` with the upstream changes. * Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`. * Updated to use `rocprim::device_adjacent_difference` for `adjacent_difference` API call. * Updated internal use of custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`. * Updated `adjacent_difference` to use `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, it uses `rocprim::adjacent_difference_inplace`. #### Known Issues * `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device. * Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs. ### **rocWMMA** (1.5.0) #### Changes * Added internal utilities for: * Element-wise vector transforms. * Cross-lane vector transforms. * Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16. * Added tests for new internal transforms. * Improved loading layouts by increasing vector width for fragments with `blockDim > 32`.
* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments.
* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary.
* Refactored entry-point of std library usage to improve hipRTC support.
* Updated installation, programmer's guide and API reference documentation.

#### Resolved issues

* Fixed the ordering of some header includes to improve portability.

### **RPP** (1.8.0)

#### Changes

- Prerequisites: ROCm install requires only `--usecase=rocm`.
- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory.
- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP.
- RPP Tensor Audio support on HOST for Spectrogram.
- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version.
- RPP Tensor Audio support on HOST for Mel Filter Bank.
- RPP Tensor Normalize ND support on HOST and HIP.

### **Tensile** (4.41.0)

#### Changes

- New tuning script to summarize rocBLAS log file
- New environment variable to test fixed grid size with Stream-K kernels
- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation))
- Support NT flag for global load and store for gfx94x
- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
- Add option for rotating buffers timing with cache eviction
- Add predicate for arithmetic intensity
- Add DirectToVgpr + packing for f8/f16 + TLU cases
- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
- 
Add test cases for DirectToVgpr + packing
- Add batch support for Stream-K kernels and new test cases
- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
- Support VectorWidthB (new parameter VectorWidthB)
- Support VectorWidth + non SourceSwap
- Add test cases for VectorWidthB, VectorWidth + non SourceSwap
- Add code owners file
- New environment variables to dynamically adjust number of CUs used in Stream-K
- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1))
- Add xf32 option to rocblas-bench input creator
- Update rocBLAS-bench-input-create script (added number of iterations based on performance, rotating buffer flag)
- Limit build threads based on CPUs/RAM available on system (for tests)
- Update required workspace size for Stream-K, skip kernel initialization when possible
- Use fallback libraries for archs without optimized logic
- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
- Remove OCL tests
- Disable HostLibraryTests
- Reduce extended test time by removing extra parameters in the test config files
- Disable InitAccVgprOpt for Stream-K
- Skip sgemm 64bit offset tests for gfx94x
- Skip DTV, DTL, LSU+MFMA tests for gfx908
- Increase extended test timeout to 720 min
- Update xfail test (1sum tests only failing on gfx90a)
- Update lib logic converter script
- Test limiting CI threads for only gfx11
- WGM related kernargs are removed if they are not needed (WGM=-1,0,1)
- Clean up unused old code, mostly related to old client
- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
- Update efficiency script for new architecture and xf32 datatype
- Re-enable negative values for
WorkGroupMapping (asm kernel only)
- Disable HW monitor for aquavanjaram941
- Pre-apply offsets for strided batch kernels
- Update tensile build with 16 threads

#### Optimizations

- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).
- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype).
- Reversed MFMA order in inner loop for odd outer iteration.
- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync).
- Increased the maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU).

#### Resolved issues

- Fixed `WorkspaceCheck` implementation when used in rocBLAS.
- Fixed Stream-K partials cache behavior.
- Fixed `MasterSolutionLibrary` indexing for multiple architecture build.
- Fixed memory allocation failure with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).
- Fixed BufferLoad=False with Stream-K.
- Fixed mismatch issue with `GlobalReadCoalesceGroup`.
- Fixed rocBLAS build failure on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]).
- Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).
- Fixed inaccurate vgpr allocation for ClusterLocalRead.
- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2.
- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of InnerUnroll (iui) is 0).
- Fixed HostLibraryTests on gfx942 and gfx941.
- Fixed LLVM crash issue.
- Fixed for newer Windows vcpkg msgpack and vcpkg version package name.
- Fixed an error with DisableKernelPieces + 32bit ShadowLimit.
- Ignore asm cap check for kernel arg preload for ROCm 6.0 and older.

## ROCm known issues

ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known issues related to individual components, review the [Detailed component changes](detailed-component-changes).

### Default processor affinity behavior for helper threads

Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to threads competing for available cores, which may result in suboptimal performance.

This behavior occurs by default if the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads by ensuring the affinity mask contains enough cores.

Depending on the affinity settings of the software environment, batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing the setting may be advantageous to performance.

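
As a rough illustration of checking that the affinity mask contains enough cores, the following sketch compares the number of cores the current shell may run on with the number installed, and then chooses a setting for `HSA_OVERRIDE_CPU_AFFINITY_DEBUG`. The `NEEDED_CORES` variable and its default of 4 are hypothetical placeholders, not ROCm settings, and the decision rule is only an assumption to adapt to your workload. It assumes a Linux system with GNU coreutils.

```shell
# Sketch only: NEEDED_CORES and its default are hypothetical, not part of ROCm.
allowed=$(nproc)          # cores permitted by the current affinity mask
installed=$(nproc --all)  # cores installed in the system
echo "allowed=${allowed} installed=${installed}"

# Assumed requirement: the application's own threads plus helper threads.
needed=${NEEDED_CORES:-4}

if [ "${allowed}" -ge "${needed}" ]; then
    # The mask looks large enough, so let helper threads inherit it.
    export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
else
    # Otherwise fall back to the default behavior described above.
    unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
fi
echo "HSA_OVERRIDE_CPU_AFFINITY_DEBUG=${HSA_OVERRIDE_CPU_AFFINITY_DEBUG:-unset}"
```

Note that `nproc` also honors `OMP_NUM_THREADS` and `OMP_THREAD_LIMIT` when they are set, so the reported count may differ from the affinity mask in environments that export those variables.
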

To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:

```{code} shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
```

To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:

``` shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
```

Or restore the default by unsetting the variable:

``` shell
unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
```

If unsure of the default processor affinity settings for your environment, run the following command from the shell to print the affinity mask of the current shell process:

``` shell
taskset -p $$
```

See [issue #3493](https://github.com/ROCm/ROCm/issues/3493) on GitHub.

### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver

AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions.

The AMDGPU driver shipped with ROCm interferes with the operation of the display card video output. On Dell systems, this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading the `amdgpu` driver modules.

The loss of video output impacts both terminal access when running in `runlevel 3` and GUI access when running in `runlevel 5`. Server functionality can still be accessed via SSH or other remote connection methods.

See [issue #3494](https://github.com/ROCm/ROCm/issues/3494) on GitHub.

### KFDTest failure on Instinct MI300X with Oracle Linux 8.9

`KFDEvictTest.QueueTest` fails on the MI300X platform during KFD (Kernel Fusion Driver) tests, which prevents the full suite from executing properly. This issue is suspected to be hardware-related.


See [issue #3495](https://github.com/ROCm/ROCm/issues/3495) on GitHub.

### Bandwidth limitation in gang and non-gang modes on Instinct MI300A

The expected target peak non-gang performance (~60 GB/s) and target peak gang performance (~90 GB/s) are not achieved. Both gang and non-gang performance are observed to be limited to 45 GB/s.

This issue will be addressed in a future ROCm release.

See [issue #3496](https://github.com/ROCm/ROCm/issues/3496) on GitHub.

### rocm-llvm-alt

ROCm provides an optional package -- `rocm-llvm-alt` -- that provides a closed-source compiler for users interested in additional closed-source CPU optimizations. This feature is not functional in the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows use of the closed-source compiler. It is recommended to compile using the default open-source compiler, which generates high-quality AMD CPU and AMD GPU code.

See [issue #3492](https://github.com/ROCm/ROCm/issues/3492) on GitHub.

## ROCm upcoming changes

This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review the [Detailed component changes](detailed-component-changes).

### rocm-llvm-alt

The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the functionality provided by the closed-source compiler should transition to the open-source compiler. Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler has been removed*".