# ROCm 6.2.0 release notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->

<!-- Disable lints since this is an auto-generated file. -->
<!-- markdownlint-disable blanks-around-headers -->
<!-- markdownlint-disable no-duplicate-header -->
<!-- markdownlint-disable no-blanks-blockquote -->
<!-- markdownlint-disable ul-indent -->
<!-- markdownlint-disable no-trailing-spaces -->

<!-- spellcheck-disable -->

The release notes provide a comprehensive summary of changes since the previous ROCm release.

- [Release highlights](#release-highlights)

- [Operating system and hardware support changes](#operating-system-and-hardware-support-changes)

- [ROCm components versioning](#rocm-components)

- [Detailed component changes](#detailed-component-changes)

- [ROCm known issues](#rocm-known-issues)

- [ROCm upcoming changes](#rocm-upcoming-changes)

The [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html)
provides an overview of operating system, hardware, ecosystem, and ROCm component support across ROCm releases.

Release notes for previous ROCm releases are available in earlier versions of the documentation.
See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest/release/versions.html).

## Release highlights

This section introduces notable new features and improvements in ROCm 6.2. See
[Detailed component changes](#detailed-component-changes) for changes to individual components.

### New components

ROCm 6.2.0 introduces the following new components to the ROCm software stack.

- **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads
  running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line
  or a GUI dashboard. For more information, see
  [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest).

- **Omnitrace** -- A multi-purpose analysis tool for profiling and tracing applications running on the CPU or the CPU and GPU.
  It supports dynamic binary instrumentation, call-stack sampling, causal profiling, and other features for determining
  which function and line number are executing. For more information, see
  [Omnitrace](https://rocm.docs.amd.com/projects/omnitrace/en/latest).

- **rocPyDecode** -- A tool to access rocDecode APIs in Python. It connects Python and C/C++ libraries,
  enabling function calling and data passing between the two languages. The `rocpydecode.so` wrapper library exposes the
  rocDecode APIs, which are written primarily in C/C++, within Python. For more information, see
  [rocPyDecode](https://rocm.docs.amd.com/projects/rocPyDecode/en/latest).

- **ROCprofiler-SDK** -- A profiling and tracing library for HIP and ROCm applications,
  used to identify application performance bottlenecks and optimize performance. The new APIs add restrictions for more
  efficient implementations and improved thread safety. A new window restriction specifies the services the tool can use.
  ROCprofiler-SDK also provides a tool library to help you write your own tool implementations. `rocprofv3` uses this tool library
  to profile and trace applications for performance bottlenecks; examples include API tracing and kernel tracing.
  For more information, see [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest).

  ```{note}
  ROCprofiler-SDK for ROCm 6.2.0 is a beta release and subject to change.
  ```

### ROCm Offline Installer Creator introduced

The new ROCm Offline Installer Creator creates an installation package for a preconfigured setup of ROCm, the AMDGPU
driver, or a combination of the two on a target system without network access. This new tool customizes
multiple unique configurations for use when installing ROCm on a target. Other notable features include:

* A lightweight, easy-to-use user interface for configuring the creation of the installer

* Support for multiple Linux distributions

* Installer support for different ROCm releases and specific ROCm components

* Optional driver or driver-only installer creation

* Optional post-install preferences

* Lightweight installer packages, which are unique to the preconfigured ROCm setup

* Resolution and inclusion of dependency packages for offline installation

For more information, see
[ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/install/rocm-offline-installer.html).

### Math libraries default to Clang instead of HIPCC

The default compiler used to build the math libraries on Linux has changed from `hipcc` to `amdclang++`.
Appropriate compiler flags are added to ensure these compilations build correctly. This change only applies when
building the libraries. Applications using the libraries can continue to be compiled using `hipcc` or `amdclang++` as
described in [ROCm compiler reference](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/reference/rocmcc.html).
The math libraries can also be built with `hipcc` using any of the previously available methods (for example, the `CXX`
environment variable, the `CMAKE_CXX_COMPILER` CMake variable, and so on). This change shouldn't affect performance or
functionality.

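As a minimal sketch of the compiler override described above, the following Python helper assembles a CMake configure command that selects the compiler through `CMAKE_CXX_COMPILER`, or alternatively prepares an environment with `CXX` set. The `/opt/rocm/bin` install prefix, the `build` directory name, and the helper names are assumptions for illustration; they are not taken from any ROCm library's build scripts.

```python
import os
from pathlib import PurePosixPath

# Assumed default ROCm install prefix; adjust for your system.
ROCM_BIN = PurePosixPath("/opt/rocm/bin")


def configure_command(source_dir, compiler="amdclang++", build_dir="build"):
    """Return a CMake configure command that pins the C++ compiler."""
    if compiler not in ("amdclang++", "hipcc"):
        raise ValueError(f"unexpected compiler: {compiler}")
    return [
        "cmake",
        "-S", str(source_dir),
        "-B", str(build_dir),
        f"-DCMAKE_CXX_COMPILER={ROCM_BIN / compiler}",
    ]


def environment_with_cxx(compiler="hipcc", base_env=None):
    """Return a copy of the environment with CXX set, as an alternative
    to passing CMAKE_CXX_COMPILER on the command line."""
    env = dict(os.environ if base_env is None else base_env)
    env["CXX"] = str(ROCM_BIN / compiler)
    return env


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(configure_command("rocBLAS", compiler="hipcc")))
```

The helper only builds the command; pass the result to `subprocess.run` (optionally with `env=environment_with_cxx()`) once the paths match your installation.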
### Framework and library changes

This section highlights updates to supported deep learning frameworks and notable third-party library optimizations.

#### Additional PyTorch and TensorFlow support

ROCm 6.2.0 supports PyTorch versions 2.2 and 2.3 and TensorFlow version 2.16.

See [Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html)
and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html)
for installation instructions.

Refer to the
[Third-party support matrix](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/reference/3rd-party-support-matrix.html#deep-learning)
for a comprehensive list of third-party frameworks and libraries supported by ROCm.

#### Optimized framework support for OpenXLA

PyTorch for ROCm and TensorFlow for ROCm now provide native support for OpenXLA. OpenXLA is an open-source ML compiler
ecosystem that enables developers to compile and optimize models from all leading ML frameworks. For more information, see
[Installing PyTorch for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/pytorch-install.html)
and [Installing TensorFlow for ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.0/how-to/3rd-party/tensorflow-install.html).

#### PyTorch support for Autocast (automatic mixed precision)

PyTorch now supports Autocast for recurrent neural networks (RNNs) on ROCm. This can help to reduce computational
workloads and improve performance. Based on information about the magnitude of values, Autocast can substitute the
original `float32` linear layers and convolutions with their `float16` or `bfloat16` variants. For more information, see
[Automatic mixed precision](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/train-a-model.html#automatic-mixed-precision-amp).

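To illustrate why the magnitude of values matters when choosing a reduced-precision type, the following standard-library sketch converts a value to `float16` and to `bfloat16`. It is not PyTorch code and does not reflect how Autocast is implemented. The helper names are hypothetical, and the `bfloat16` conversion truncates instead of rounding, which is a simplification.

```python
import struct


def to_float16(x):
    """Round x to IEEE half precision; return None if x is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return None  # the largest finite float16 value is 65504


def to_bfloat16(x):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    top = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top + b"\x00\x00")[0]


if __name__ == "__main__":
    for value in (0.1, 1024.5, 70000.0):
        print(value, to_float16(value), to_bfloat16(value))
```

In this sketch `float16` keeps more significant bits for small values, while `bfloat16` keeps the exponent range of `float32` and therefore still represents a value such as 70000.0 that `float16` cannot.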
#### Memory savings for bitsandbytes model quantization

The [ROCm-aware bitsandbytes library](https://github.com/ROCm/bitsandbytes) is a lightweight Python wrapper around HIP
custom functions, in particular 8-bit optimizers, matrix multiplication, and 8-bit and 4-bit quantization functions.
ROCm 6.2.0 introduces the following bitsandbytes changes:

- `Int8` matrix multiplication is enabled, and it includes the following functions:
  - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
  - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after `matmul` computation.
  - `igemmlt` – new function for GEMM computation A*B^T. It uses
    [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.2.0/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations.
  - `dequant_mm` – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
- Blockwise quantization – input tensors are quantized for a fixed block size.
- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled.
- 8-bit and 32-bit optimizers are enabled.

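The following is a minimal pure-Python sketch of the idea behind the `igemmlt` and `dequant_mm` steps listed above: quantize each row to 8-bit integers with one scaling factor per row, compute A*B^T on integers, then rescale the result. It is not the bitsandbytes implementation, which runs on the GPU. The function names and the absmax scaling scheme are assumptions chosen for illustration.

```python
def quantize_rows(matrix):
    """Vector-wise quantization: one absmax scale per row, values in [-127, 127]."""
    quantized, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row) or 1.0  # avoid dividing by zero
        scales.append(absmax / 127.0)
        quantized.append([round(v * 127.0 / absmax) for v in row])
    return quantized, scales


def int8_matmul_transposed(a_q, b_q):
    """Integer GEMM computing A * B^T (each output is a row-by-row dot product)."""
    return [[sum(x * y for x, y in zip(a_row, b_row)) for b_row in b_q]
            for a_row in a_q]


def dequantize(c_int, a_scales, b_scales):
    """Rescale the integer result using the per-row scales of A and B."""
    return [[c * sa * sb for c, sb in zip(c_row, b_scales)]
            for c_row, sa in zip(c_int, a_scales)]


if __name__ == "__main__":
    a = [[0.5, -1.0, 2.0], [4.0, 0.25, -0.75]]
    b = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
    a_q, a_s = quantize_rows(a)
    b_q, b_s = quantize_rows(b)
    approx = dequantize(int8_matmul_transposed(a_q, b_q), a_s, b_s)
    print(approx)  # close to the floating-point result of A * B^T
```

Because both operands are stored row-wise, computing A*B^T needs only row-by-row dot products, which is one reason this form is convenient for quantized GEMM.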
```{note}
These functions are included in bitsandbytes. They are not part of ROCm. However, ROCm 6.2.0 has enabled the fixes and
features to run them.
```

For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/model-quantization.html).

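Blockwise quantization, mentioned in the list of bitsandbytes changes, can be sketched in a few lines of standard-library Python: the input is split into fixed-size blocks and each block gets its own scaling factor, so an outlier only reduces the precision of its own block. The block size of 4, the absmax scheme, and the function names are illustrative assumptions, not the bitsandbytes API.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize a flat list to int8 codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # all-zero block
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Invert quantize_blockwise up to rounding error."""
    return [code / 127 * scales[i // block_size] for i, code in enumerate(codes)]


if __name__ == "__main__":
    data = [0.01, -0.02, 0.03, 0.015, 100.0, -50.0, 25.0, 12.5]
    codes, scales = quantize_blockwise(data)
    print(scales)                               # one scale per block
    print(dequantize_blockwise(codes, scales))  # approximately the input
```

With a single scale for the whole list, the small values in the first block would all round to zero; with one scale per block they keep roughly seven bits of precision.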
#### Improved vLLM support

ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding
capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama.
ROCm 6.2.0 adds support for the following vLLM features:

- MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP,
  use `--distributed-executor-backend=mp`. The default is in flux and depends on the vLLM commit.

- FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
  The QUARK quantizer currently only supports Llama.

- Triton Flash Attention:

  ROCm supports both Triton and Composable Kernel Flash Attention 2 in vLLM. The default is Triton, but you can change this
  setting using the `VLLM_USE_FLASH_ATTN_TRITON=False` environment variable.

- PyTorch TunableOp:

  Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.

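As a rough sketch of how the options in this list combine, the following Python helper assembles the environment variables and command line for a vLLM server launch. The `--distributed-executor-backend` and `--kv-cache-dtype` flags are upstream vLLM options and `PYTORCH_TUNABLEOP_ENABLED` is a PyTorch setting; the `VLLM_USE_FLASH_ATTN_TRITON` variable is taken from the text above. Since defaults are described as being in flux, check the spellings against the vLLM version you run. The helper name and the model name in the usage line are hypothetical.

```python
import os


def vllm_launch_config(model, use_triton_flash_attn=True, fp8_kv_cache=False,
                       tunable_op=False, base_env=None):
    """Return (env, argv) for launching the vLLM OpenAI-compatible server."""
    env = dict(os.environ if base_env is None else base_env)
    # Triton Flash Attention is the default; "False" selects Composable Kernel.
    env["VLLM_USE_FLASH_ATTN_TRITON"] = str(use_triton_flash_attn)
    if tunable_op:
        env["PYTORCH_TUNABLEOP_ENABLED"] = "1"  # needs PyTorch 2.3 or later

    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--distributed-executor-backend", "mp",  # MP instead of Ray
    ]
    if fp8_kv_cache:
        argv += ["--kv-cache-dtype", "fp8"]
    return env, argv


if __name__ == "__main__":
    # "my-org/my-llama-model" is a placeholder, not a real model identifier.
    env, argv = vllm_launch_config("my-org/my-llama-model", fp8_kv_cache=True)
    print(" ".join(argv))
```

Returning the environment and argument list separately keeps the sketch side-effect free; pass both to `subprocess.run(argv, env=env)` to start the server.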
For more information about enabling these features, see
[vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).

ROCm has a vLLM branch for experimental features. This includes performance improvements, accuracy, and correctness testing.
These features include:

- FP8 GEMMs: To improve the performance of FP8 quantization, work is underway on tuning the GEMM using the shapes used
  in the model's execution. It only supports Llama because the QUARK quantizer currently only supports Llama.

- Custom decode paged attention: Improves performance by efficiently managing memory and enabling faster attention
  computation in large-scale models. This benefits all workloads in `FP16` configurations.

To enable these experimental new features, see
[vLLM inference](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/llm-inference-frameworks.html#vllm-inference).
Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines
all the accessible features, and the `vllm/Dockerfile.rocm` file can be used to build a Docker image.

### Enhanced performance tuning on AMD Instinct accelerators

ROCm is pre-tuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
accelerators for optimal performance. For more information, see
[AMD MI300X tuning guides](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/index.html) and
[AMD MI300A system optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html).

### Removed clang-ocl

As of version 6.2, ROCm no longer provides the `clang-ocl` package.
See the [clang-ocl README](https://github.com/ROCm/clang-ocl).

### ROCm documentation changes

The documentation for the ROCm components has been reorganized and reformatted with a consistent look and feel. This
improves the usability and readability of the documentation. For more information about the ROCm components, see
[What is ROCm?](https://rocm.docs.amd.com/en/docs-6.2.0/what-is-rocm.html).

Since the release of ROCm 6.1, the documentation has added some key topics including:

- [AMD Instinct MI300X workload tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/workload.html)
- [AMD Instinct MI300X system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300A system tuning guide](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300a.html)
- [Using ROCm for AI](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-ai/index.html)
- [Using ROCm for HPC](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/rocm-for-hpc/index.html)
- [Fine-tuning LLMs and inference optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/index.html)
- [LLVM reference documentation](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0/)

The following topics have been significantly improved, expanded, or both:

- [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-6.2.0/)
- [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html)

```{note}
All ROCm projects are open source and available on GitHub. To contribute to ROCm documentation, see the
[ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
```

## Operating system and hardware support changes

ROCm 6.2.0 adds support for the following operating system and kernel versions.

- Ubuntu 24.04 LTS (kernel: 6.8 [GA])

- RHEL 8.10 (kernel: 4.18.0-544)

- SLES 15 SP6 (kernel: 6.4)

ROCm 6.2.0 marks the end of support (EoS) for:

- Ubuntu 22.04.3

- RHEL 9.2

- RHEL 8.8

- SLES 15 SP4

- CentOS 7.9

ROCm 6.2.0 has been tested against pre-release Ubuntu 22.04.5 (kernel: 6.5 [HWE]).

See the [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html) for an
overview of supported operating systems and hardware architectures.

## ROCm components

The following table lists ROCm components and their individual versions for ROCm 6.2.0. Follow the
links in the *Version* column to go to the detailed component changelogs.

<div class="pst-scrollable-table-container">
<table id="rocm-rn-components" class="table">
<thead>
<tr>
<th>Category</th>
<th>Group</th>
<th>Name</th>
<th>Version</th>
<th></th>
</tr>
</thead>
<colgroup>
<col span="1">
<col span="1">
</colgroup>
<tbody class="rocm-components-libs rocm-components-ml">
<tr>
<th rowspan="8">Libraries</th>
<th rowspan="8">Machine learning and computer vision</th>
<td><a href="https://rocm.docs.amd.com/projects/composable_kernel/en/docs-6.2.0">Composable Kernel</a>
</td>
<td>1.1.0 ⇒ <a href="#composable-kernel-1-1-0">1.1.0</a></td>
<td><a href="https://github.com/ROCm/composable_kernel/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/AMDMIGraphX/en/docs-6.2.0">MIGraphX</a></td>
<td>2.9 ⇒ <a href="#migraphx-2-10-0">2.10</a></td>
<td><a href="https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.2.0"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/MIOpen/en/docs-6.2.0">MIOpen</a></td>
<td>3.1.0 ⇒ <a href="#miopen-3-2-0">3.2.0</a></td>
<td><a href="https://github.com/ROCm/MIOpen/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/MIVisionX/en/docs-6.2.0">MIVisionX</a></td>
<td>2.5.0 ⇒ <a href="#mivisionx-3-0-0">3.0.0</a></td>
<td><a href="https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocAL/en/docs-6.2.0">rocAL</a></td>
<td>1.0.0 ⇒ <a href="#rocal-1-0-0">1.0.0</a></td>
<td><a href="https://github.com/ROCm/rocAL/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocDecode/en/docs-6.2.0">rocDecode</a></td>
<td>0.6.0 ⇒ <a href="#rocdecode-0-6-0">0.6.0</a></td>
<td><a href="https://github.com/ROCm/rocDecode/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocPyDecode/en/docs-6.2.0">rocPyDecode</a></td>
<td>0.1.0</td>
<td><a href="https://github.com/ROCm/rocPyDecode/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rpp/en/docs-6.2.0">RPP</a></td>
<td>1.5.0 ⇒ <a href="#rpp-1-8-0">1.8.0</a></td>
<td><a href="https://github.com/ROCm/rpp/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-libs rocm-components-communication">
<tr>
<th rowspan="1"></th>
<th rowspan="1">Communication</th>
<td><a href="https://rocm.docs.amd.com/projects/rccl/en/docs-6.2.0">RCCL</a></td>
<td>2.18.6 ⇒ <a href="#rccl-2-20-5">2.20.5</a></td>
<td><a href="https://github.com/ROCm/rccl/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-libs rocm-components-math tbody-reverse-zebra">
<tr>
<th rowspan="16"></th>
<th rowspan="16">Math</th>
<td><a href="https://rocm.docs.amd.com/projects/hipBLAS/en/docs-6.2.0">hipBLAS</a></td>
<td>2.1.0 ⇒ <a href="#hipblas-2-2-0">2.2.0</a></td>
<td><a href="https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.2.0">hipBLASLt</a></td>
<td>0.7.0 ⇒ <a href="#hipblaslt-0-8-0">0.8.0</a></td>
<td><a href="https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipFFT/en/docs-6.2.0">hipFFT</a></td>
<td>1.0.14 ⇒ <a href="#hipfft-1-0-15">1.0.15</a></td>
<td><a href="https://github.com/ROCm/hipFFT/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipfort/en/docs-6.2.0">hipfort</a></td>
<td>0.4.0</td>
<td><a href="https://github.com/ROCm/hipfort/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipRAND/en/docs-6.2.0">hipRAND</a></td>
<td>2.10.17 ⇒ <a href="#hiprand-2-11-0">2.11.0</a></td>
<td><a href="https://github.com/ROCm/hipRAND/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipSOLVER/en/docs-6.2.0">hipSOLVER</a></td>
<td>2.1.1 ⇒ <a href="#hipsolver-2-2-0">2.2.0</a></td>
<td><a href="https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipSPARSE/en/docs-6.2.0">hipSPARSE</a></td>
<td>3.0.1 ⇒ <a href="#hipsparse-3-1-1">3.1.1</a></td>
<td><a href="https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipSPARSELt/en/docs-6.2.0">hipSPARSELt</a></td>
<td>0.2.0 ⇒ <a href="#hipsparselt-0-2-1">0.2.1</a></td>
<td><a href="https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocALUTION/en/docs-6.2.0">rocALUTION</a></td>
<td>3.1.1 ⇒ <a href="#rocalution-3-2-0">3.2.0</a></td>
<td><a href="https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocBLAS/en/docs-6.2.0">rocBLAS</a></td>
<td>4.1.2 ⇒ <a href="#rocblas-4-2-0">4.2.0</a></td>
<td><a href="https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocFFT/en/docs-6.2.0">rocFFT</a></td>
<td>1.0.27 ⇒ <a href="#rocfft-1-0-28">1.0.28</a></td>
<td><a href="https://github.com/ROCm/rocFFT/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocRAND/en/docs-6.2.0">rocRAND</a></td>
<td>3.0.1 ⇒ <a href="#rocrand-3-1-0">3.1.0</a></td>
<td><a href="https://github.com/ROCm/rocRAND/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocSOLVER/en/docs-6.2.0">rocSOLVER</a></td>
<td>3.25.0 ⇒ <a href="#rocsolver-3-26-0">3.26.0</a></td>
<td><a href="https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocSPARSE/en/docs-6.2.0">rocSPARSE</a></td>
<td>3.1.2 ⇒ <a href="#rocsparse-3-2-0">3.2.0</a></td>
<td><a href="https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocWMMA/en/docs-6.2.0">rocWMMA</a></td>
<td>1.4.0 ⇒ <a href="#rocwmma-1-5-0">1.5.0</a></td>
<td><a href="https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://github.com/ROCm/tensile/">Tensile</a></td>
<td>4.40.0 ⇒ <a href="#tensile-4-41-0">4.41.0</a></td>
<td><a href="https://github.com/ROCm/tensile/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-libs rocm-components-primitives tbody-reverse-zebra">
<tr>
<th rowspan="4"></th>
<th rowspan="4">Primitives</th>
<td><a href="https://rocm.docs.amd.com/projects/hipCUB/en/docs-6.2.0">hipCUB</a></td>
<td>3.1.0 ⇒ <a href="#hipcub-3-2-0">3.2.0</a></td>
<td><a href="https://github.com/ROCm/hipCUB/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipTensor/en/docs-6.2.0">hipTensor</a></td>
<td>1.2.0 ⇒ <a href="#hiptensor-1-3-0">1.3.0</a></td>
<td><a href="https://github.com/ROCm/hipTensor/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocPRIM/en/docs-6.2.0">rocPRIM</a></td>
<td>3.1.0 ⇒ <a href="#rocprim-3-2-0">3.2.0</a></td>
<td><a href="https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocThrust/en/docs-6.2.0">rocThrust</a></td>
<td>3.0.0 ⇒ <a href="#rocthrust-3-1-0">3.1.0</a></td>
<td><a href="https://github.com/ROCm/rocThrust/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-tools rocm-components-system tbody-reverse-zebra">
<tr>
<th rowspan="6">Tools</th>
<th rowspan="6">System management</th>
<td><a href="https://rocm.docs.amd.com/projects/amdsmi/en/docs-6.2.0">AMD SMI</a></td>
<td>24.5.2 ⇒ <a href="#amd-smi-24-6-2">24.6.2</a></td>
<td><a href="https://github.com/ROCm/amdsmi/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocminfo/en/docs-6.2.0">rocminfo</a></td>
<td>1.0.0</td>
<td><a href="https://github.com/ROCm/rocminfo/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rdc/en/docs-6.2.0">ROCm Data Center Tool</a></td>
<td>0.3.0 ⇒ <a href="#rocm-data-center-tool-1-0-0">1.0.0</a></td>
<td><a href="https://github.com/ROCm/rdc/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocm_smi_lib/en/docs-6.2.0">ROCm SMI</a></td>
<td>7.2.0 ⇒ <a href="#rocm-smi-7-3-0">7.3.0</a></td>
<td><a href="https://github.com/ROCm/rocm_smi_lib/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/docs-6.2.0">ROCm Validation Suite</a></td>
<td>1.0.0 ⇒ <a href="#rocm-validation-suite-1-0-0">1.0.0</a></td>
<td><a href="https://github.com/ROCm/ROCmValidationSuite/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-tools rocm-components-perf">
<tr>
<th rowspan="6"></th>
<th rowspan="6">Performance</th>
<td><a href="https://rocm.docs.amd.com/projects/omniperf/en/docs-6.2.0">Omniperf</a></td>
<td><a href="#omniperf-2-0-1">2.0.1</a></td>
<td><a href="https://github.com/ROCm/omniperf/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/omnitrace/en/docs-6.2.0">Omnitrace</a></td>
<td>1.11.2</td>
<td><a href="https://github.com/ROCm/omnitrace/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocm_bandwidth_test/en/docs-6.2.0">ROCm Bandwidth
Test</a></td>
<td>1.4.0</td>
<td><a href="https://github.com/ROCm/rocm_bandwidth_test/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocprofiler/en/docs-6.2.0/">ROCProfiler</a></td>
<td>2.0.0 ⇒ <a href="#rocprofiler-2-0-0">2.0.0</a></td>
<td><a href="https://github.com/ROCm/ROCProfiler/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-6.2.0">ROCprofiler-SDK</a></td>
<td>0.4.0</td>
<td><a href="https://github.com/ROCm/rocprofiler-sdk/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/roctracer/en/docs-6.2.0/">ROCTracer</a></td>
<td>4.1.0</td>
<td><a href="https://github.com/ROCm/ROCTracer/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-tools rocm-components-dev">
<tr>
<th rowspan="5"></th>
<th rowspan="5">Development</th>
<td><a href="https://rocm.docs.amd.com/projects/HIPIFY/en/docs-6.2.0/">HIPIFY</a></td>
<td>17.0.0 ⇒ <a href="#hipify-18-0-0">18.0.0</a></td>
<td><a href="https://github.com/ROCm/HIPIFY/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCdbgapi/en/docs-6.2.0">ROCdbgapi</a></td>
<td>0.71.0 ⇒ <a href="#rocdbgapi-0-76-0">0.76.0</a></td>
<td><a href="https://github.com/ROCm/ROCdbgapi/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCmCMakeBuildTools/en/docs-6.2.0/">ROCm CMake</a></td>
<td>0.12.0 ⇒ <a href="#rocm-cmake-0-13-0">0.13.0</a></td>
<td><a href="https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCgdb/en/docs-6.2.0">ROCm Debugger (ROCgdb)</a>
</td>
<td>14.1 ⇒ <a href="#rocm-debugger-rocgdb-14-2">14.2</a></td>
<td><a href="https://github.com/ROCm/ROCgdb/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocr_debug_agent/en/docs-6.2.0">ROCr Debug Agent</a>
</td>
<td>2.0.3</td>
<td><a href="https://github.com/ROCm/rocr_debug_agent/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-compilers tbody-reverse-zebra">
<tr>
<th rowspan="2" colspan="2">Compilers</th>
<td><a href="https://rocm.docs.amd.com/projects/HIPCC/en/docs-6.2.0">HIPCC</a></td>
<td>1.0.0 ⇒ <a href="#hipcc-1-1-1">1.1.1</a></td>
<td><a href="https://github.com/ROCm/llvm-project/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.2.0">llvm-project</a></td>
<td>17.0.0 ⇒ <a href="#llvm-project-18-0-0">18.0.0</a></td>
<td><a href="https://github.com/ROCm/llvm-project/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-runtimes tbody-reverse-zebra">
<tr>
<th rowspan="2" colspan="2">Runtimes</th>
<td><a href="https://rocm.docs.amd.com/projects/HIP/en/docs-6.2.0">HIP</a></td>
<td>6.1 ⇒ <a href="#hip-6-2-0">6.2.0</a></td>
<td><a href="https://github.com/ROCm/HIP/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCR-Runtime/en/docs-6.2.0">ROCr Runtime</a></td>
<td>1.13.0 ⇒ <a href="#rocr-runtime-1-14-0">1.14.0</a></td>
<td><a href="https://github.com/ROCm/ROCR-Runtime/releases/tag/rocm-6.2.0"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
</table>
</div>

## Detailed component changes
|
||
|
||
The following sections describe key changes to ROCm components.
|
||
|
||
### **AMD SMI** (24.6.2)
|
||
|
||
#### Changes
|
||
|
||
- Added the following functionality:
|
||
- `amd-smi dmon` is now available as an alias to `amd-smi monitor`.
|
||
- An optional process table under `amd-smi monitor -q`.
|
||
- Handling to detect VMs with passthrough configurations in CLI tool.
|
||
- Process Isolation and Clear SRAM functionality to the CLI tool for VMs.
|
||
- Added Ring Hang event.
|
||
- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`.
|
||
- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`.
|
||
|
||
#### Removals
|
||
|
||
- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported.
|
||
- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`.
|
||
|
||
#### Optimizations
|
||
|
||
- Updated CLI error strings to specify invalid device type queried.
|
||
- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest.
|
||
- Added `amdsmi.h` and `amdsmi_interface.py`.
|
||
- `amdsmi_clk_info_t` struct
|
||
- Added `AMDSMI` prefix to multiple structures.
|
||
- Updated `dpm_policy` references to `soc_pstate`.
|
||
- Updated `amdsmi_get_gpu_board_info()` `product_name` to fall back to the `pciids` file.
|
||
- `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`.
|
||
- Updated CLI voltage curve command output.
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format.
|
||
- `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
|
||
- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`.
|
||
- Fixed a leftover mutex deadlock when running multiple instances of the CLI tool. When running
`amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any
other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either a
clear of the mutex in `/dev/shm` or a reboot of the machine.
|
||
|
||
#### Known issues
|
||
|
||
- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work.
|
||
They will be supported in a future release.
|
||
|
||
See [issue #3500](https://github.com/ROCm/ROCm/issues/3500) on GitHub.
|
||
|
||
```{note}
See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md)
on GitHub for more information.
```
|
||
|
||
### **Composable Kernel** (1.1.0)
|
||
|
||
#### Changes
|
||
|
||
- Added support for:
|
||
- Permute scale for any dimension (#1198).
|
||
- Combined elementwise op (#1217).
|
||
- Multi D in grouped convolution backward weight (#1280).
|
||
- K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280).
|
||
- Large batch in grouped convolution forward (#1332).
|
||
- Added `CK_TILE` layernorm example (#1339).
|
||
- `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as ROCm backend.
|
||
|
||
#### Optimizations
|
||
|
||
- Support universal GEMM in grouped convolution forward (#1320).
|
||
- Optimizations for low M and N in grouped convolution backward weight (#1303).
|
||
- Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel.
|
||
- `FP8` GEMM performance optimization and tuning (#1384).
|
||
- Added FlashAttention backward pass performance optimization (#1397).
|
||
|
||
### **HIP** (6.2.0)
|
||
|
||
#### Changes
|
||
|
||
- Added the `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()`. These take
  a 64-bit integer as an explicit mask argument.
  - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled
    unconditionally in a future HIP release.
|
||
|
||
- Added new HIP APIs:
|
||
- `hipGetProcAddress` returns the pointer to driver function, corresponding to the defined driver function symbol.
|
||
- `hipGetFuncBySymbol` returns the pointer to device entry function that matches entry function `symbolPtr`.
|
||
- `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph.
|
||
- `hipGraphInstantiateWithParams` creates an executable graph from a graph.
|
||
|
||
- Added a new flag `integrated` -- supported in device property.

  - The `integrated` flag is added to the struct `hipDeviceProp_t`. On an integrated APU system, the runtime driver
    detects and sets this flag to `1`. In that case, `hipDeviceGetAttribute` returns the value `1` for the
    `hipDeviceAttribute_t` enum value `hipDeviceAttributeIntegrated`, indicating an integrated GPU device.
|
||
|
||
- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include <hip/hip_fp8.h>`.
|
||
|
||
- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`.
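
The `_sync()` builtins listed above take a 64-bit integer as an explicit mask argument. The following host-side sketch is not HIP device code and does not call the HIP runtime. It only emulates, under assumed semantics, how a 64-bit mask selects the lanes that take part in an all/any vote. The helper names `first_lanes_mask`, `emulate_all_sync`, and `emulate_any_sync` are hypothetical and exist only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: build a mask with the lowest `count` lanes set.
// A 64-bit type is used so that up to 64 lanes can be addressed.
inline std::uint64_t first_lanes_mask(unsigned count) {
    assert(count <= 64);
    return count == 64 ? ~std::uint64_t{0}
                       : ((std::uint64_t{1} << count) - 1);
}

// Host-side emulation (assumed semantics): true if every lane selected by
// `mask` has a true predicate.
inline bool emulate_all_sync(std::uint64_t mask,
                             const std::array<bool, 64>& pred) {
    for (unsigned lane = 0; lane < 64; ++lane) {
        if (((mask >> lane) & 1u) && !pred[lane]) return false;
    }
    return true;
}

// Host-side emulation (assumed semantics): true if at least one lane
// selected by `mask` has a true predicate.
inline bool emulate_any_sync(std::uint64_t mask,
                             const std::array<bool, 64>& pred) {
    for (unsigned lane = 0; lane < 64; ++lane) {
        if (((mask >> lane) & 1u) && pred[lane]) return true;
    }
    return false;
}
```

In device code, the corresponding builtins would only be available in HIP 6.2 when `HIP_ENABLE_WARP_SYNC_BUILTINS` is defined, as noted above.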
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime.
|
||
- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`.
|
||
|
||
#### Upcoming changes
|
||
|
||
- The `_sync()` version of crosslane builtins such as `__shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release.
|
||
|
||
### **hipBLAS** (2.2.0)
|
||
|
||
#### Changes
|
||
|
||
* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments.
|
||
* Added a new ILP64 API for level 1 `_ex` functions.
|
||
|
||
* The `install.sh` script now invokes the `rmake.py` script. Made other various improvements to the build scripts.
|
||
* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages
|
||
`rocblas-dev` and `rocsolver-dev`.
|
||
* Updated Linux AOCL dependency to release 4.2 `gcc` build.
|
||
* Updated Windows `vcpkg` dependencies to release 2024.02.14.
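
The ILP64 additions above exist because 32-bit integer arguments limit the size of the problems an interface can describe. The sketch below does not call hipBLAS. It is a plain host-side check, with a hypothetical helper name `fits_lp64_interface`, that shows when a strided vector's element count or largest element offset no longer fits in a 32-bit integer, which is the situation the `_64` functions with `int64_t` arguments are meant to cover. It assumes values small enough that the 64-bit multiplication itself does not overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of hipBLAS: returns true when both the element
// count `n` and the largest element offset of a vector with stride `inc`
// fit in a 32-bit signed integer.
inline bool fits_lp64_interface(std::int64_t n, std::int64_t inc) {
    assert(n >= 0 && inc > 0);
    const std::int64_t int32_max = std::numeric_limits<std::int32_t>::max();
    if (n == 0) return true;
    const std::int64_t last_offset = (n - 1) * inc;
    return n <= int32_max && last_offset <= int32_max;
}
```

When this check fails, a caller would need 64-bit arguments to describe the vector at all.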
|
||
|
||
### **hipBLASLt** (0.8.0)
|
||
|
||
#### Changes
|
||
|
||
* Added extension APIs:
|
||
* `hipblasltExtAMaxWithScale`.
|
||
* `GemmTuning` extension parameter to set `wgm` by user.
|
||
* Added support for:
|
||
* `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype.
|
||
* `FP8`/`BF8` input, `FP32`/`FP16`/`BF16`/`FP8`/`BF8` output (gfx94x platform only).
|
||
* `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA.
|
||
* Added support for gfx110x.
|
||
|
||
#### Optimizations
|
||
|
||
* Improved library loading time.
|
||
|
||
### **HIPCC** (1.1.1)
|
||
|
||
#### Changes
|
||
|
||
* Split `hipcc` package into two packages for different hardware platforms.
|
||
|
||
* Cleaned up references to environment variables.
|
||
|
||
* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by
|
||
default, instead of their Perl counterparts.
|
||
|
||
* Enabled function calls.
|
||
|
||
* Added support for generating packages for ROCm stack targeting static libraries.
|
||
|
||
#### Resolved issues
|
||
|
||
* Implemented numerous bug fixes and quality improvements.
|
||
|
||
### **hipCUB** (3.2.0)
|
||
|
||
#### Changes
|
||
|
||
* Added `DeviceCopy` function for parity with CUB.
|
||
* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as
|
||
the new optional template argument for `WarpExchange`.
|
||
* The potential values for the enum are `WARP_EXCHANGE_SMEM` and
|
||
`WARP_EXCHANGE_SHUFFLE`.
|
||
* `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while
|
||
`WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
|
||
* `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory,
|
||
but the `ItemsPerThread` must be a divisor of `WarpSize`.
|
||
* Added `tuple.hpp` which defines templates `hipcub::tuple`,
|
||
`hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
|
||
* Added new overloaded member functions to `BlockRadixSort` and
|
||
`DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type
|
||
(`key_type`) can be sorted via these overloads, if an appropriate decomposer
|
||
is passed. The decomposer has to implement `operator()(const key_type&)`, which
|
||
returns a `hipcub::tuple` of references pointing to members of `key_type`.
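
The decomposer idea described above can be sketched in plain C++. This is an illustration under stated assumptions, not hipCUB code: `std::tuple` stands in for `hipcub::tuple` so that the example compiles without ROCm, the key type `custom_key` and the functor `custom_key_decomposer` are hypothetical, and the key is taken by non-const reference here so that the returned references are mutable. The exact signature that hipCUB expects should be checked against its documentation.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical custom key type, for illustration only.
struct custom_key {
    int   major;
    float minor;
};

// Decomposer sketch: the call operator returns a tuple of references to the
// members that make up the sort key. std::tuple is an assumed stand-in for
// hipcub::tuple.
struct custom_key_decomposer {
    std::tuple<int&, float&> operator()(custom_key& key) const {
        return std::tuple<int&, float&>{key.major, key.minor};
    }
};
```

Because the tuple holds references, a sorting routine can read and reorder the key's members in place without copying the whole key.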
|
||
|
||
* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of
|
||
HIP graphs, with several exceptions:
|
||
* `CachingDeviceAllocator`
|
||
* `GridBarrierLifetime`
|
||
* `DeviceSegmentedRadixSort`
|
||
* `DeviceRunLengthEncode`
|
||
Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs.
|
||
|
||
#### Removals
|
||
|
||
* `debug_synchronous` was deprecated in hipCUB-2.13.2 and no longer has any effect. With this release, passing `debug_synchronous`
to the device functions results in a deprecation warning both at runtime and at compile time.
|
||
* The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time
|
||
by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
|
||
* The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB.
|
||
It now derives the accumulator type as the result of the binary operator.
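
Deriving the accumulator type from the binary operator's result can be illustrated on the host. The sketch below is not the rocPRIM or hipCUB implementation. The alias `accumulator_t`, the functor `widening_plus`, and the function `inclusive_scan_ref` are hypothetical names, and the choice to evaluate the operator on two input-typed arguments is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Accumulator type taken from the result of applying the operator to two
// values of the input type (assumed rule, for illustration).
template <class BinaryOp, class T>
using accumulator_t =
    decltype(std::declval<BinaryOp>()(std::declval<T>(), std::declval<T>()));

// Hypothetical operator that widens its arguments to 32 bits before adding.
struct widening_plus {
    std::int32_t operator()(std::int32_t a, std::int32_t b) const {
        return a + b;
    }
};

// Reference inclusive scan on the host that accumulates in accumulator_t.
template <class T, class BinaryOp>
std::vector<accumulator_t<BinaryOp, T>>
inclusive_scan_ref(const std::vector<T>& in, BinaryOp op) {
    using acc_t = accumulator_t<BinaryOp, T>;
    std::vector<acc_t> out;
    out.reserve(in.size());
    if (in.empty()) return out;
    acc_t acc = in[0];
    out.push_back(acc);
    for (std::size_t i = 1; i < in.size(); ++i) {
        acc = op(acc, in[i]);
        out.push_back(acc);
    }
    return out;
}
```

With 8-bit inputs and this operator, the running sum is held in a 32-bit accumulator, so partial sums above 255 are not truncated.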
|
||
|
||
### **hipFFT** (1.0.15)
|
||
|
||
#### Resolved issues
|
||
|
||
* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers.
|
||
* Prevented C++ exceptions leaking from public API functions.
|
||
* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs.
|
||
|
||
|
||
### **HIPIFY** (18.0.0)
|
||
|
||
#### Changes
|
||
|
||
- Added support for:
|
||
- NVIDIA CUDA 12.4.1
|
||
- cuDNN 9.1.1
|
||
- LLVM 18.1.6
|
||
- Added full hipBLASLt support.
|
||
|
||
#### Resolved issues
|
||
|
||
- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object;
|
||
affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`.
|
||
|
||
### **hipRAND** (2.11.0)
|
||
|
||
#### Changes
|
||
|
||
* Added support for setting generator output ordering in C and C++ API.
|
||
* `hiprandCreateGeneratorHost` dispatches to the host generator in the rocRAND backend instead of returning
`HIPRAND_STATUS_NOT_IMPLEMENTED`.
|
||
* Added options to create:
|
||
* A host generator to the Fortran wrapper.
|
||
* A host generator to the Python wrapper.
|
||
* The environment variable used for internal testing with HMM, previously named `ROCRAND_USE_HMM`, is now named
`HIPRAND_USE_HMM`.
|
||
* Static library -- moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
|
||
* Device API documentation is improved in this version.
|
||
|
||
#### Removals
|
||
|
||
* Removed the option to build hipRAND as a submodule to rocRAND.
|
||
* Removed references to, and workarounds for, the deprecated `hcc`.
|
||
* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`.
|
||
Use `ROCRAND_PATH` instead.
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`.
|
||
|
||
### **hipSOLVER** (2.2.0)
|
||
|
||
#### Changes
|
||
|
||
- Added compatibility-only functions:
|
||
- `auxiliary`
|
||
- `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions`
|
||
- `getrf`
|
||
- `hipsolverDnXgetrf_bufferSize`
|
||
- `hipsolverDnXgetrf`
|
||
- `getrs`
|
||
- `hipsolverDnXgetrs`
|
||
- Added support for building on Ubuntu 24.04 and CBL-Mariner.
|
||
- Added `hip::host` to `roc::hipsolver` usage requirements.
|
||
- Added functions
|
||
- `syevdx`/`heevdx`
|
||
- `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize`
|
||
- `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx`
|
||
- `sygvdx`/`hegvdx`
|
||
- `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize`
|
||
- `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx`
|
||
|
||
- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.
|
||
- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`.
|
||
|
||
#### Removals
|
||
|
||
- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients.
|
||
|
||
### **hipSPARSE** (3.1.1)
|
||
|
||
#### Changes
|
||
|
||
* Added the missing `hipsparseCscGet()` routine.
|
||
|
||
* All internal hipSPARSE functions now exist inside a namespace.
|
||
* Match deprecations found in cuSPARSE 12.x.x when using cuSPARSE backend.
|
||
* Improved the user manual and contribution guidelines.
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior.
|
||
|
||
#### Known issues
|
||
|
||
* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API,
and this extra external buffer parameter will be removed in a future release. For now, this extra parameter can be
ignored and passed as `nullptr` because it is unused internally by `hipsparseSpSM_solve()`.
|
||
|
||
### **hipSPARSELt** (0.2.1)
|
||
|
||
#### Optimizations
|
||
|
||
* Refined test cases.
|
||
|
||
### **hipTensor** (1.3.0)
|
||
|
||
#### Changes
|
||
|
||
* Added support for:
|
||
* Tensor permutation of ranks of 2, 3, 4, 5, and 6
|
||
* Tensor contraction of M6N6K6: M, N, K up to rank 6
|
||
* Added tests for:
|
||
* Tensor permutation of ranks of 2, 3, 4, 5, and 6
|
||
* Tensor contraction of M6N6K6: M, N, K up to rank 6
|
||
* YAML parsing to support sequential parameters ordering.
|
||
* Prefer `amd-llvm-devel` package before system LLVM library.
|
||
* Preferred compilers changed to `CC=amdclang` `CXX=amdclang++`.
|
||
* Updated actor-critic selection for new contraction kernel additions.
|
||
* Updated installation, programmer's guide, and API reference documentation.
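
As a reference for what a tensor permutation of a given rank computes, the following host-side sketch permutes the dimensions of a dense tensor. It is not the hipTensor API. The function name `permute_ref`, the row-major layout, and the convention that output dimension `i` takes its extent from input dimension `perm[i]` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host reference: permute the dimensions of a dense row-major tensor.
std::vector<float> permute_ref(const std::vector<float>& in,
                               const std::vector<std::size_t>& dims,
                               const std::vector<std::size_t>& perm) {
    const std::size_t rank = dims.size();
    assert(perm.size() == rank);

    // Row-major strides of the input tensor.
    std::vector<std::size_t> in_stride(rank, 1);
    for (std::size_t i = rank; i-- > 1;) {
        in_stride[i - 1] = in_stride[i] * dims[i];
    }

    // Output dimension i takes its extent from input dimension perm[i].
    std::vector<std::size_t> out_dims(rank);
    std::size_t total = 1;
    for (std::size_t i = 0; i < rank; ++i) {
        assert(perm[i] < rank);
        out_dims[i] = dims[perm[i]];
        total *= out_dims[i];
    }
    assert(total == in.size());

    std::vector<float> out(total);
    for (std::size_t lin = 0; lin < total; ++lin) {
        // Decode the output linear index and map it to an input offset.
        std::size_t rem = lin;
        std::size_t src = 0;
        for (std::size_t i = rank; i-- > 0;) {
            src += (rem % out_dims[i]) * in_stride[perm[i]];
            rem /= out_dims[i];
        }
        out[lin] = in[src];
    }
    return out;
}
```

For rank 2 with `perm = {1, 0}`, this reduces to a matrix transpose. The same loop handles the higher ranks listed above.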
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed LLVM parsing crash.
|
||
* Fixed memory consumption issue in complex kernels.
|
||
* Workaround implemented for compiler crash during debug build.
|
||
* Allow random modes ordering for tensor contractions.
|
||
|
||
### **llvm-project** (18.0.0)
|
||
|
||
#### Changes
|
||
|
||
* LLVM IR
|
||
|
||
* The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address
|
||
spaces.
|
||
|
||
* Added `llvm.exp10` intrinsic.
|
||
|
||
* LLVM infrastructure
|
||
|
||
* The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`.
|
||
|
||
* TableGen
|
||
|
||
* Added constructs for debugging TableGen files:
|
||
|
||
* `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793).
|
||
|
||
* `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716).
|
||
|
||
* AArch64 backend
|
||
|
||
* Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs.
|
||
|
||
* AMDGPU backend
|
||
|
||
* `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access.
|
||
|
||
* Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics.
|
||
|
||
* Implemented `llvm.get.rounding`.
|
||
|
||
* ARM backend
|
||
|
||
* Added support for Cortex-M52 CPUs.
|
||
|
||
* Added execute-only support for Armv6-M.
|
||
|
||
* RISC-V backend
|
||
|
||
* The `Zfa` extension version was upgraded to 1.0 and is no longer experimental.
|
||
|
||
* `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental.
|
||
|
||
* Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`. See
|
||
[Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/src/c-api.adoc#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification.
|
||
|
||
* Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64.
|
||
|
||
* The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`,
|
||
`Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension version was upgraded to 1.0 and is no longer experimental. However,
|
||
the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions,
|
||
`-menable-experimental-extensions` needs to be passed to Clang.
|
||
|
||
* `-mcpu=sifive-p450` was added.
|
||
|
||
* CodeGen of `RV32E` and `RV64E` is supported experimentally.
|
||
|
||
* CodeGen of `ilp32e` and `lp64e` is supported experimentally.
|
||
|
||
* X86 backend
|
||
|
||
* Added support for the RDMSRLIST and WRMSRLIST instructions.
|
||
|
||
* Added support for the WRMSRNS instruction.
|
||
|
||
* Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction.
|
||
|
||
* Support ISA of CMPCCXADD.
|
||
|
||
* Support ISA of AVX-IFMA.
|
||
|
||
* Support ISA of AVX-VNNI-INT8.
|
||
|
||
* Support ISA of AVX-NE-CONVERT.
|
||
|
||
* `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported.
|
||
|
||
* `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported.
|
||
|
||
* `__builtin_unpredictable` (unpredictable metadata in LLVM IR) is now handled by the X86 backend. The `X86CmovConversion` pass now
respects this builtin and does not convert CMOVs to branches.
|
||
|
||
* Add support for the PBNDKB instruction.
|
||
|
||
* Support ISA of SHA512.
|
||
|
||
* Support ISA of SM3.
|
||
|
||
* Support ISA of SM4.
|
||
|
||
* Support ISA of AVX-VNNI-INT16.
|
||
|
||
* `-mcpu=graniterapids-d` is now supported.
|
||
|
||
* The `i128` type now matches GCC and clang’s `__int128` type. This mainly benefits external projects such as Rust
|
||
which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type
|
||
matched and called into `libgcc` helper functions.
|
||
|
||
* Support ISA of USER_MSR.
|
||
|
||
* Support ISA of AVX10.1-256 and AVX10.1-512.
|
||
|
||
* `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported.
|
||
|
||
* `-mapxf` is supported.
|
||
|
||
* Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow
|
||
32-bit relocations or require 64-bit relocations to the global variable.
|
||
|
||
* The medium code model’s code generation was audited to be more similar to the small code model where possible.
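
The `__builtin_unpredictable` item in the X86 list above concerns source code like the following sketch. The macro `SKETCH_UNPREDICTABLE` and the function `select_min` are hypothetical names. The macro falls back to the plain condition on compilers that do not provide the Clang builtin, so the example stays portable. Whether a conditional move is actually emitted depends on the compiler, target, and optimization level.

```cpp
#include <cassert>

// Use the Clang builtin when it is available; otherwise pass the condition
// through unchanged.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_unpredictable)
#    define SKETCH_UNPREDICTABLE(cond) __builtin_unpredictable(cond)
#  endif
#endif
#ifndef SKETCH_UNPREDICTABLE
#  define SKETCH_UNPREDICTABLE(cond) (cond)
#endif

// Select the smaller value. The comparison is marked as hard to predict,
// which hints that a branchless select is preferable to a branch.
inline int select_min(int a, int b) {
    return SKETCH_UNPREDICTABLE(a < b) ? a : b;
}
```

The hint does not change the result of the function. It only influences how the compiler lowers the selection.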
|
||
|
||
* C API
|
||
|
||
* Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions.
|
||
|
||
* Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an
|
||
alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target
|
||
ABI) and provides default values for unspecified settings.
|
||
|
||
* Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and
|
||
`LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions.
|
||
|
||
* Added the following functions for manipulating operand bundles, as well as building call and invoke instructions
|
||
that use operand bundles:
|
||
|
||
* `LLVMBuildCallWithOperandBundles`
|
||
|
||
* `LLVMBuildInvokeWithOperandBundles`
|
||
|
||
* `LLVMCreateOperandBundle`
|
||
|
||
* `LLVMDisposeOperandBundle`
|
||
|
||
* `LLVMGetNumOperandBundles`
|
||
|
||
* `LLVMGetOperandBundleAtIndex`
|
||
|
||
* `LLVMGetNumOperandBundleArgs`
|
||
|
||
* `LLVMGetOperandBundleArgAtIndex`
|
||
|
||
* `LLVMGetOperandBundleTag`
|
||
|
||
* Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an
|
||
instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flags.
|
||
|
||
* CodeGen infrastructure
|
||
|
||
* A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e.
|
||
`-debug-only=isel-dump`). This new debug type can be filtered by function names using
|
||
`-filter-print-funcs=<function names>`, the same flag used to filter IR dumps after each Pass. Note that the
|
||
existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single
|
||
function regardless of `-filter-print-funcs`’s values.
|
||
|
||
* Metadata info
|
||
|
||
* Added a new loop metadata `!{!"llvm.loop.align", i32 64}`.
|
||
|
||
* LLVM tools
|
||
|
||
* `llvm-symbolizer` now treats invalid input as an address for which source information is not found.
|
||
|
||
* `llvm-readelf` now supports `--extra-sym-info` (-X) to display extra information (section name) when showing
|
||
symbols.
|
||
|
||
* `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each
|
||
object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now
|
||
`{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each
|
||
object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers
|
||
accordingly.
|
||
|
||
* `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`.
|
||
|
||
* `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers.
|
||
|
||
* `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names.
|
||
|
||
* `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only.
|
||
|
||
* LLDB
|
||
|
||
* The `SBType::FindDirectNestedType` function is added. It is useful for formatters to quickly find a directly nested type
when it is known where to search for it, avoiding a more expensive global search via `SBTarget::FindFirstType`.
|
||
|
||
* Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying
|
||
functionality remains unchanged.
|
||
|
||
* The `mte_ctrl` register can now be read from AArch64 Linux core files.
|
||
|
||
* LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2
|
||
(SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.
|
||
|
||
* LLDB now supports symbol and binary acquisition automatically using the debuginfod protocol. The standard mechanism
of specifying debuginfod servers in the `DEBUGINFOD_URLS` environment variable is used by default. In addition, users
can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`,
overriding or adding to the environment variable.
|
||
|
||
* When running on AArch64 Linux, `lldb-server` now provides register field information for the following registers:
|
||
`cpsr`, `fpcr`, `fpsr`, `svcr` and `mte_ctrl`.
|
||
|
||
* Sanitizers
|
||
|
||
* HWASan now defaults to detecting use-after-scope bugs.
|
||
|
||
#### Removals
|
||
|
||
* LLVM IR
|
||
|
||
* The constant expression variants of the following instructions have been removed:
|
||
|
||
* `and`
|
||
|
||
* `or`
|
||
|
||
* `lshr`
|
||
|
||
* `ashr`
|
||
|
||
* `zext`
|
||
|
||
* `sext`
|
||
|
||
* `fptrunc`
|
||
|
||
* `fpext`
|
||
|
||
* `fptoui`
|
||
|
||
* `fptosi`
|
||
|
||
* `uitofp`
|
||
|
||
* `sitofp`
|
||
|
||
* RISC-V backend
|
||
|
||
* XSfcie extension and SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and
|
||
instructions were part of “SiFive Custom Instruction Extension”. The LLVM project needs to work with
|
||
SiFive to define and document real extension names for individual CSRs and instructions.
|
||
|
||
* Python bindings
|
||
|
||
* The Python bindings have been removed.
|
||
|
||
* C API
|
||
|
||
* The following functions for creating constant expressions have been removed, because the underlying constant
|
||
expressions are no longer supported. Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which
|
||
will constant fold the operands if possible and create an instruction otherwise:
|
||
|
||
* `LLVMConstAnd`
|
||
|
||
* `LLVMConstOr`
|
||
|
||
* `LLVMConstLShr`
|
||
|
||
* `LLVMConstAShr`
|
||
|
||
* `LLVMConstZExt`
|
||
|
||
* `LLVMConstSExt`
|
||
|
||
* `LLVMConstZExtOrBitCast`
|
||
|
||
* `LLVMConstSExtOrBitCast`
|
||
|
||
* `LLVMConstIntCast`
|
||
|
||
* `LLVMConstFPTrunc`
|
||
|
||
* `LLVMConstFPExt`
|
||
|
||
* `LLVMConstFPToUI`
|
||
|
||
* `LLVMConstFPToSI`
|
||
|
||
* `LLVMConstUIToFP`
|
||
|
||
* `LLVMConstSIToFP`
|
||
|
||
* `LLVMConstFPCast`
|
||
|
||
* CodeGen infrastructure
|
||
|
||
* `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets
|
||
should use backwards frame index elimination instead.
|
||
|
||
* `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging
|
||
instead, which is preferred because it does not depend on accurate kill flags.
|
||
|
||
* LLDB
|
||
|
||
* `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable.
|
||
|
||
* Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value
|
||
indicating failure.
|
||
|
||
#### Resolved issues
|
||
|
||
* AArch64 backend
|
||
|
||
* Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto
|
||
options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for
|
||
this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the
|
||
exception of crypto.
|
||
|
||
* Windows target
|
||
|
||
* The LLVM filesystem class `UniqueID` and function `equivalent()` no longer determine that distinct path
names for the same hard linked file are equal. This is an intentional tradeoff in a bug fix, where the bug
used to cause distinct files to be considered equivalent on some file systems. This change fixed the GitHub issues
[#61401](https://github.com/llvm/llvm-project/issues/61401) and [#22079](https://github.com/llvm/llvm-project/issues/22079).
|
||
|
||
#### Known issues
|
||
|
||
The compiler may incorrectly compile a program that uses the
|
||
``__shfl(var, srcLane, width)`` function when one of the parameters to
|
||
the function is undefined along some path to the function. For most functions,
|
||
uninitialized inputs cause undefined behavior.
|
||
|
||
```{note}
The ``-Wall`` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.
```
|
||
|
||
As a workaround, initialize the parameters to ``__shfl``. For example:
|
||
|
||
```{code-block} cpp
unsigned long istring = 0; // Initialize the input to __shfl
return __shfl(istring, 0, 64);
```
|
||
|
||
See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub.
|
||
|
||
### **MIGraphX** (2.10.0)
|
||
|
||
#### Changes
|
||
|
||
- Added support for ONNX Runtime MIGraphX EP on Windows.
|
||
- Added `FP8` Python API.
|
||
- Added examples for SD 2.1 and SDXL.
|
||
- Added support for BERT to Dynamic Batch.
|
||
- Added a `--test` flag in `migraphx-driver` to validate the installation.
|
||
- Added support for ONNX Operator: Einsum.
|
||
- Added `uint8` support in ONNX Operators.
|
||
- Added Split-k kernel configurations for performance improvements.
|
||
- Added fusion for group convolutions.
|
||
- Added rocMLIR conv3d support.
|
||
- Added rocgdb to the Dockerfile.
|
||
- Changed default location of libraries with release specific ABI changes.
|
||
- Reorganized documentation in GitHub.
|
||
|
||
#### Removals
|
||
|
||
- Removed the `--model` flag with `migraphx-driver`.
|
||
|
||
#### Optimizations
|
||
|
||
- Improved ONNX Model Zoo coverage.
|
||
- Reorganized `memcpys` with ONNX Runtime to improve performance.
|
||
- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.
|
||
- Improved MLIR kernel selection for multibroadcasted GEMMs.
|
||
- Improved details of the perf report.
|
||
- Enable mlir by default for GEMMs with small K.
|
||
- Allow specifying dot or convolution fusion for MLIR with an environment flag.
|
||
- Improve performance on small reductions by doing multiple reductions per wavefront.
|
||
- Add additional algebraic simplifications for mul-add-dot sequence of operations involving constants.
|
||
- Use MLIR attention kernels in more cases.
|
||
- Enabled MIOpen and CK fusions for MI300 gfx architectures.
|
||
- Support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs.
|
||
- Added Fusion of "contiguous + pointwise" and "layout + pointwise" operations which may result in performance gains in certain cases.
|
||
- Added Fusion for "pointwise + layout" and "pointwise + contiguous" operations which may result in performance gains when using NHWC layout.
|
||
- Added Fusion for "pointwise + concat" operation which may help in performance in certain cases.
|
||
- Fixes a bug in "concat + pointwise" fusion where output shape memory layout wasn't maintained.
|
||
- Simplifies "slice + concat" pattern in SDXL UNet.
|
||
- Removed ZeroPoint/Shift in QuantizeLinear or DeQuantizeLinear ops if zero points values are zeros.
|
||
- Improved inference performance by fusing Reduce to Broadcast.
|
||
- Added additional information when printing the perf report.
|
||
- Improve scalar fusions when not all strides are 0.
|
||
- Added support for multi outputs in pointwise ops.
|
||
- Improve reduction fusion with reshape operators.
|
||
- Use the quantized output when an operator is used again.
|
||
- Enabled Split-k GEMM perf configs for rocMLIR based GEMM kernels for better performance on all Hardware.
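
The zero-point removal mentioned in the list above relies on the fact that adding or subtracting a zero point of `0` has no effect. The sketch below is not MIGraphX code. It uses the commonly documented linear quantization formula for `int8` with round-to-nearest-even under the default floating-point rounding mode, and the function names `quantize_linear`, `quantize_linear_no_zp`, and `dequantize_linear` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// q = saturate(round(x / scale) + zero_point), saturated to the int8 range.
inline std::int8_t quantize_linear(float x, float scale, int zero_point) {
    assert(scale > 0.0f);
    const int q = static_cast<int>(std::nearbyint(x / scale)) + zero_point;
    return static_cast<std::int8_t>(std::min(127, std::max(-128, q)));
}

// Same computation with the zero-point addition removed. It is equivalent
// to quantize_linear(x, scale, 0).
inline std::int8_t quantize_linear_no_zp(float x, float scale) {
    assert(scale > 0.0f);
    const int q = static_cast<int>(std::nearbyint(x / scale));
    return static_cast<std::int8_t>(std::min(127, std::max(-128, q)));
}

// x = scale * (q - zero_point)
inline float dequantize_linear(std::int8_t q, float scale, int zero_point) {
    return scale * static_cast<float>(static_cast<int>(q) - zero_point);
}
```

When every zero point is `0`, the variant without the zero-point term produces the same values with one fewer operation per element.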
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed a Super Resolution model verification failure with `FP16`.
|
||
- Fixed confusing messages by suppressing them when compiling the model.
|
||
- Fixed an issue with the mod operator with `int8` and `int32` inputs.
|
||
- Fixed an issue by preventing the spawning of too many threads for constant propagation when parallel STL is not enabled.
|
||
- Fixed a bug when running `migraphx-driver` with the `--run 1` option.
|
||
- Fixed Layernorm accuracy by performing calculations in `FP32`.
|
||
- Updated the Docker generator script to ROCm 6.1 to point at Jammy.
|
||
- Fixed a floating point exception for `dim (-1)` in the reshape operator.
|
||
- Fixed issue with `int8` accuracy and models which were failing due to requiring a fourth bias input.
|
||
- Fixed missing inputs not previously handled for quantized bias for the weights, and data values of the input matrix.
|
||
- Fixed the order of operations for `int8` quantization, which was causing inaccuracies and slowdowns.
|
||
- Fixed an issue during compilation caused by the incorrect constructor being used at compile time.
  Removed the list initializer of `prefix_scan_sum`, which was causing issues during compilation.
|
||
- Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control the number of threads used for parallel compilation.
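
As a minimal sketch of how the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag might be used, the snippet below caps parallel compilation at four threads before invoking `migraphx-driver`. The value `4` and the model file name `model.onnx` are placeholders chosen for illustration, not recommendations from MIGraphX.

```shell
# Cap the number of threads MIGraphX uses for parallel compilation.
# The value 4 is an arbitrary example; pick one that suits your machine.
export MIGRAPHX_GPU_COMPILE_PARALLEL=4

# model.onnx is a hypothetical model file used only for illustration.
MODEL="model.onnx"
if command -v migraphx-driver >/dev/null 2>&1 && [ -f "$MODEL" ]; then
    migraphx-driver compile "$MODEL"
else
    echo "migraphx-driver or $MODEL not found; skipping compilation" >&2
fi
```

Because the variable is exported, it applies to every MIGraphX process started from the same shell session.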
|
||
|
||
### **MIOpen** (3.2.0)
|
||
|
||
#### Changes
|
||
|
||
- Added:
|
||
- [Conv] bilinear (alpha beta) solvers.
|
||
- [Conv] enable bf16 for ck-based solvers.
|
||
- [Conv] Add split_k tuning to 2d wrw ck-based solver.
|
||
- [MHA] graph API fp8 fwd.
|
||
- [RNN] multi-stream as default solution.
|
||
- Added TunaNetv2.0 for MI300.
|
||
- Added Adam and AMP Adam optimizer.
|
||
|
||
#### Resolved issues
|
||
|
||
- Memory access fault caused by `GemmBwdRest`.
|
||
- Context configuration in `GetWorkSpaceSize`.
|
||
- Fixes to support huge tensors.
|
||
|
||
#### Optimizations
|
||
|
||
- Find: improved precision of benchmarking.
|
||
|
||
### **MIVisionX** (3.0.0)
|
||
|
||
#### Changes
|
||
|
||
- Added support for:
|
||
- Advanced GPUs
|
||
- PreEmphasis Filter augmentation in openVX extensions
|
||
- Spectrogram augmentation in openVX extensions
|
||
- Downmix and ToDecibels augmentations in openVX extensions
|
||
- Resample augmentation and Operator overloading nodes in openVX extensions
|
||
- NonSilentRegion and Slice augmentations in openVX extensions
|
||
- Mel-Filter bank and Normalize augmentations in openVX extensions
|
||
|
||
#### Removals
|
||
|
||
- Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL).
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed issues with dependencies.
|
||
|
||
#### Known issues
|
||
|
||
- MIVisionX package install requires manual prerequisites installation.
|
||
|
||
### **Omniperf** (2.0.1)
|
||
|
||
#### Known issues
|
||
|
||
- Omniperf reports an error when profiling an application that takes command-line arguments. As a workaround, create an
  intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This
  issue will be fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/omniperf/issues/347).
|
||
|
||
- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error:
|
||
"*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the
|
||
environment variable `export ROCPROF=rocprofv2`.
|
||
|
||
- Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message:
|
||
|
||
"*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.*
|
||
|
||
*[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.*
|
||
|
||
*Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.*
|
||
|
||
*See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*"
|
||
|
||
As a workaround, install these Python requirements manually: `pip install -r /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`.
|
||
|
||
See [issue #3498](https://github.com/ROCm/ROCm/issues/3498) on GitHub.
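
The Omniperf workarounds above could be combined along the following lines. This is a sketch under stated assumptions: `./my_app` and its arguments, the wrapper name `run_app.sh`, and the workload name `my_workload` are hypothetical, and the `omniperf profile -n <name> -- <command>` form is assumed to match your installed Omniperf version.

```shell
# 1. Wrap the application and its arguments in an intermediary script.
#    ./my_app and its arguments are hypothetical; replace them with your own.
cat > run_app.sh <<'EOF'
#!/bin/sh
exec ./my_app --size 1024 --iterations 10
EOF
chmod +x run_app.sh

# 2. On Instinct MI300 accelerators, select rocprofv2 as the profiler.
export ROCPROF=rocprofv2

# 3. Install Omniperf's Python dependencies if the requirements file exists.
REQ_FILE="/opt/rocm-6.2.0/libexec/omniperf/requirements.txt"
if [ -f "$REQ_FILE" ]; then
    python3 -m pip install -r "$REQ_FILE"
fi

# 4. Profile the wrapper script instead of the application itself.
#    (Assumed invocation; "my_workload" is a hypothetical workload name.)
if command -v omniperf >/dev/null 2>&1; then
    omniperf profile -n my_workload -- ./run_app.sh
fi
```

Keeping the arguments inside the wrapper means the Omniperf command line itself never carries application arguments, which is the condition the first known issue describes.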
|
||
|
||
### **OpenMP** (17.0.0)
|
||
|
||
#### Changes
|
||
|
||
- Added basic experimental support for `libc` functions on the GPU via the LLVM C Library for GPUs.
|
||
- Added minimal support for calling host functions from the device using the
|
||
`libc` interface.
|
||
- Added vendor agnostic OMPT callback support for OpenMP-based device offload.
|
||
|
||
#### Removals
|
||
|
||
- Removed the "old" device plugins along with support for the `remote` and
|
||
`ve` plugins.
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed the implementation of `omp_get_wtime` for AMDGPU targets.
|
||
|
||
### **RCCL** (2.20.5)
|
||
|
||
#### Changes
|
||
|
||
- Added support for `fp8` and `rccl_bfloat8`.
|
||
- Added support for using HIP contiguous memory.
|
||
- Added ROC-TX for host-side profiling.
|
||
- Added new rome model.
|
||
- Added `fp16` and `fp8` cases to unit tests.
|
||
- Added a new unit test for main kernel stack size.
|
||
- Added the new `-n` option for `topo_expl` to override the number of nodes.
|
||
- Improved debug messages of memory allocations.
|
||
- Enabled static build.
|
||
- Enabled compatibility with:
|
||
- NCCL 2.20.5.
|
||
- NCCL 2.19.4.
|
||
- Performance tuning for some collective operations on MI300.
|
||
- Enabled NVTX code in RCCL.
|
||
- Replaced `rccl_bfloat16` with `hip_bfloat16`.
|
||
- NPKit updates:
|
||
- Warm-up iteration removal is no longer applied by default; you now need to opt in.
|
||
- Doubled the size of buffers to accommodate more channels.
|
||
- Modified rings to be rail-optimized topology friendly.
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed a bug when configuring RCCL for only LL128 protocol.
|
||
- Fixed scratch memory allocation after API change for MSCCL.
|
||
|
||
### **rocAL** (1.0.0)
|
||
|
||
#### Changes
|
||
|
||
- Added tests and samples.
|
||
|
||
#### Removals
|
||
|
||
- Removed CuPy from `setup.py`.
|
||
|
||
|
||
#### Optimizations
|
||
|
||
- Added setup and install updates.
|
||
|
||
#### Resolved issues
|
||
|
||
- Minor bug fixes.
|
||
|
||
### **rocALUTION** (3.2.0)
|
||
|
||
#### Changes
|
||
|
||
* Added new file I/O based on rocSPARSE I/O format.
|
||
* Added `GetConvergenceHistory` for ItILU0 preconditioner.
|
||
|
||
#### Removals
|
||
|
||
* Deprecated the following:
|
||
* `LocalMatrix::ReadFileCSR`
|
||
* `LocalMatrix::WriteFileCSR`
|
||
* `GlobalMatrix::ReadFileCSR`
|
||
* `GlobalMatrix::WriteFileCSR`
|
||
|
||
### **rocBLAS** (4.2.0)
|
||
|
||
#### Changes
|
||
|
||
* Added an ILP64 API for Level 2 functions and Level 3 `trsm`, for both C and Fortran (`_64` name suffix), with `int64_t` function arguments.
|
||
* Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`.
|
||
* Added Benchmark class for common timing code.
|
||
* Added an environment variable, `ROCBLAS_DEFAULT_ATOMICS_MODE`, to set the default atomics mode during creation of `rocblas_handle`.
|
||
* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`.
|
||
|
||
* Updated Linux AOCL dependency to release 4.2 gcc build.
|
||
* Updated Windows vcpkg dependencies to release 2024.02.14.
|
||
* Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40.
|
||
|
||
#### Optimizations
|
||
|
||
* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator.
|
||
|
||
#### Removals
|
||
|
||
* Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next
|
||
major release of rocBLAS. Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage.
|
||
|
||
### **ROCdbgapi** (0.76.0)
|
||
|
||
#### Removals
|
||
- Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`.
|
||
The old names are still accessible but deprecated.
|
||
|
||
### **rocDecode** (0.6.0)
|
||
|
||
#### Changes
|
||
|
||
- Added full H.264 support and bug fixes.
|
||
|
||
### **rocFFT** (1.0.28)
|
||
|
||
#### Changes
|
||
|
||
* Randomly generated accuracy tests are now disabled by default. They can be enabled using
|
||
the `--nrand` option (which defaults to 0).
|
||
|
||
#### Optimizations
|
||
|
||
* Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks
|
||
are transformed locally, with global transposes to make remaining dimensions contiguous.
|
||
|
||
### **rocm-cmake** (0.13.0)
|
||
|
||
#### Changes
|
||
|
||
- `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds.
|
||
- Package names are no longer pulled from `CPACK_<GEN>_PACKAGE_NAME`.
|
||
- Runtime packages will no longer be generated for static builds.
|
||
|
||
### **ROCm Data Center Tool** (1.0.0)
|
||
|
||
#### Changes
|
||
|
||
- Added ROCProfiler `dmon` metrics.
|
||
- Added new ECC metrics.
|
||
- Added ROCm Validation Suite diagnostic command.
|
||
- Fully migrated to AMD SMI.
|
||
|
||
#### Removals
|
||
|
||
- Removed RASLIB dependency and blobs.
|
||
- Removed `rocm_smi_lib` dependency due to migration to AMD SMI.
|
||
|
||
### **ROCm Debugger (ROCgdb)** (14.2)
|
||
|
||
#### Changes
|
||
|
||
- Introduced the `coremerge` utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile.
|
||
- Added support for generating and opening core files for heterogeneous processes.
|
||
|
||
### **ROCm SMI** (7.3.0)
|
||
|
||
#### Changes
|
||
|
||
- Added Partition ID API (`rsmi_dev_partition_id_get(..)`).
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed Partition ID CLI output.
|
||
|
||
```{note}
See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md)
on GitHub for more information.
```
|
||
|
||
### **ROCm Validation Suite** (1.0.0)
|
||
|
||
#### Changes
|
||
|
||
* Added stress tests:
|
||
|
||
* IET (power) stress test for MI300A.
|
||
|
||
* IET (power transition) test for MI300X.
|
||
|
||
* Added support:
|
||
|
||
* GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output.
|
||
|
||
* Trigonometric float and random integer matrix data initialization support.
|
||
|
||
* Updated GST performance benchmark test for better numbers.
|
||
|
||
### **rocPRIM** (3.2.0)
|
||
|
||
#### Changes
|
||
|
||
* Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
|
||
* The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter.
|
||
* The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan).
|
||
This is the same behaviour as before this change.
|
||
* Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element.
|
||
* Added new public APIs for deriving resulting type on device-only functions:
|
||
* `rocprim::invoke_result`
|
||
* `rocprim::invoke_result_t`
|
||
* `rocprim::invoke_result_binary_op`
|
||
* `rocprim::invoke_result_binary_op_t`
|
||
* Added the new `rocprim::batch_copy` function. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
|
||
* Added more test cases, to better cover supported data types.
|
||
* Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
  To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
  of references to arithmetic types from the key.
|
||
* Added `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
|
||
It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is `true`.
|
||
It can be instantiated with:
|
||
* `rocprim::make_predicate_iterator`
|
||
* `rocprim::make_mask_iterator`
|
||
* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4; it can be a number between 0 and 32.
|
||
* Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
|
||
* Updated some tests to work with supported data types.
|
||
|
||
#### Optimizations
|
||
|
||
* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
|
||
* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` for when the warp size is equal to the items per thread.
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is
|
||
larger than the logical warp size. The test suite has been updated with such cases.
|
||
* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`.
|
||
* Fixed incorrect output type used in `device_adjacent_difference`.
|
||
* Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types.
|
||
* Fixed an issue with `device_adjacent_difference`. It now considers both the
|
||
input and the output type for selecting the appropriate kernel launch config.
|
||
Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
|
||
* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`.
|
||
* Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64.
|
||
|
||
#### Removals
|
||
|
||
* Deprecated the internal header `detail/match_result_type.hpp`.
|
||
* Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`.
|
||
* Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name.
|
||
|
||
### **ROCProfiler** (2.0.0)
|
||
|
||
#### Removals
|
||
|
||
- Removed `pcsampler` sample code due to deprecation from version 2.
|
||
|
||
### **rocRAND** (3.1.0)
|
||
|
||
#### Changes
|
||
|
||
* Added `rocrand_create_generator_host`.
|
||
* The following generators are supported:
|
||
* `ROCRAND_RNG_PSEUDO_MRG31K3P`
|
||
* `ROCRAND_RNG_PSEUDO_MRG32K3A`
|
||
* `ROCRAND_RNG_PSEUDO_PHILOX4_32_10`
|
||
* `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20`
|
||
* `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20`
|
||
* `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20`
|
||
* `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20`
|
||
* `ROCRAND_RNG_PSEUDO_XORWOW`
|
||
* `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32`
|
||
* `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64`
|
||
* `ROCRAND_RNG_QUASI_SOBOL32`
|
||
* `ROCRAND_RNG_QUASI_SOBOL64`
|
||
* The host-side generators support multi-core processing. On Linux, this requires the TBB (Threading Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives).
|
||
* If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread.
|
||
* Added the option to create a host generator to the Python wrapper.
|
||
* Added the option to create a host generator to the Fortran wrapper.
|
||
* Added dynamic ordering. This ordering is free to rearrange the produced numbers,
|
||
which can be specific to devices and distributions. It is implemented for:
|
||
* XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry
|
||
* Added support for using Clang as the host compiler for alternative platform compilation.
|
||
* C++ wrapper:
|
||
* Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`.
|
||
* Added optional order parameter to the constructor of `mt19937_engine`.
|
||
* Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator:
|
||
* `rocrand_normal2`
|
||
* `rocrand_normal_double2`
|
||
* `rocrand_log_normal2`
|
||
* `rocrand_log_normal_double2`
|
||
* Added `rocrand_create_generator_host_blocking` which dispatches without stream semantics.
|
||
* Added host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`.
|
||
* Added offset and skipahead functionality to LFSR113 generator.
|
||
* Added dynamic ordering for architecture `gfx1102`.
|
||
|
||
* For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few
|
||
things to be aware of:
|
||
- Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph.
|
||
- After the generator is created, you may call API functions to set its seed, offset, and order.
|
||
- After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph.
|
||
- A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour.
|
||
- API calls for the Poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs.
|
||
- For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp`.
|
||
* Building rocRAND now requires a C++17-capable compiler, as the internal library sources now require it. However, consuming rocRAND is still possible from C++11, as the public headers don't make use of the new features.
|
||
* Building rocRAND should be faster on machines with multiple CPU cores as the library has been
|
||
split to multiple compilation units.
|
||
* C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`.
|
||
* Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE`.
|
||
* Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
|
||
|
||
#### Removals
|
||
|
||
* Deprecated the following typedefs. Please use the unified `state_type` alias instead.
|
||
* `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state`
|
||
* `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state`
|
||
* `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state`
|
||
* `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state`
|
||
* Deprecated the following internal headers:
|
||
* `src/rng/distribution/distributions.hpp`.
|
||
* `src/rng/device_engines.hpp`.
|
||
* Removed references to and workarounds for deprecated hcc.
|
||
* Removed support for HIP-CPU.
|
||
|
||
#### Known Issues
|
||
|
||
- `SOBOL64` and `SCRAMBLED_SOBOL64` generate Poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release.
|
||
|
||
### **ROCr Runtime** (1.14.0)
|
||
|
||
#### Changes
|
||
|
||
- Added PC sampling feature (experimental feature).
|
||
|
||
### **rocSOLVER** (3.26.0)
|
||
|
||
#### Changes
|
||
|
||
- Added 64-bit APIs for existing functions:
|
||
- GETF2_64 (with `batched` and `strided_batched` versions)
|
||
- GETRF_64 (with `batched` and `strided_batched` versions)
|
||
- GETRS_64 (with `batched` and `strided_batched` versions)
|
||
- Added gfx900 to default build targets.
|
||
- Added partial eigenvalue decomposition routines for symmetric/Hermitian matrices using Divide & Conquer and Bisection:
|
||
- SYEVDX (with `batched` and `strided_batched` versions)
|
||
- HEEVDX (with `batched` and `strided_batched` versions)
|
||
- Added partial generalized symmetric/Hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection:
|
||
- SYGVDX (with `batched` and `strided_batched` versions)
|
||
- HEGVDX (with `batched` and `strided_batched` versions)
|
||
- Renamed install script arguments of the form `*_dir` to `*-path`. Arguments of the form `*_dir` remain functional for
|
||
backwards compatibility.
|
||
- Functions working with arrays of size n - 1 can now accept null pointers when n = 1.
|
||
|
||
#### Optimizations
|
||
|
||
- Improved performance of Cholesky factorization.
|
||
- Improved performance of `splitlu` to extract the L and U triangular matrices from the result of sparse factorization matrix M, where M = (L - eye) + U.
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues.
|
||
|
||
### **rocSPARSE** (3.2.0)
|
||
|
||
#### Changes
|
||
|
||
* Added a new Merge-Path algorithm to SpMM, supporting CSR format.
|
||
* Added support for row order to SpSM.
|
||
* Added rocsparseio I/O functionality to the library.
|
||
* Added `rocsparse_set_identity_permutation`.
|
||
|
||
* Adjusted rocSPARSE dependencies to related HIP packages.
|
||
* Binary size has been reduced.
|
||
* A namespace has been wrapped around internal rocSPARSE functions and kernels.
|
||
* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be nullptr if `nnz` is 0.
|
||
* gfx803 target has been removed from address sanitizer builds.
|
||
|
||
#### Optimizations
|
||
|
||
* SpMV adaptive and LRB algorithms have been further optimized on CSR format.
|
||
* Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format.
|
||
* Improved documentation and contribution guidelines.
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`.
|
||
|
||
### **rocThrust** (3.1.0)
|
||
|
||
#### Changes
|
||
|
||
* Added changes from upstream CCCL/thrust 2.2.0.
|
||
* Updated the contents of `system/hip` and `test` with the upstream changes.
|
||
* Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`.
|
||
* Updated to use `rocprim::device_adjacent_difference` for `adjacent_difference` API call.
|
||
* Updated internal use of custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`.
|
||
* Updated `adjacent_difference` to make use of `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, it uses `rocprim::adjacent_difference_inplace`.
|
||
|
||
#### Known Issues
|
||
|
||
* `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device.
|
||
* Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs.
|
||
|
||
### **rocWMMA** (1.5.0)
|
||
|
||
#### Changes
|
||
|
||
* Added internal utilities for:
|
||
* Element-wise vector transforms.
|
||
* Cross-lane vector transforms.
|
||
* Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16.
|
||
* Added tests for new internal transforms.
|
||
|
||
* Improved loading layouts by increasing vector width for fragments with `blockDim > 32`.
|
||
* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments.
|
||
* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary.
|
||
* Refactored entry-point of std library usage to improve hipRTC support.
|
||
* Updated installation, programmer's guide and API reference documentation.
|
||
|
||
#### Resolved issues
|
||
|
||
* Fixed the ordering of some header includes to improve portability.
|
||
|
||
### **RPP** (1.8.0)
|
||
|
||
#### Changes
|
||
|
||
- Prerequisites - ROCm install requires only `--usecase=rocm`.
|
||
- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory.
|
||
- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP.
|
||
- RPP Tensor Audio support on HOST for Spectrogram.
|
||
- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version.
|
||
- RPP Tensor Audio support on HOST for Mel Filter Bank.
|
||
- RPP Tensor Normalize ND support on HOST and `HIP`.
|
||
|
||
### **Tensile** (4.41.0)
|
||
|
||
#### Changes
|
||
|
||
- New tuning script to summarize rocBLAS log file
|
||
- New environment variable to test fixed grid size with Stream-K kernels
|
||
- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
|
||
- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
|
||
- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation))
|
||
- Support NT flag for global load and store for gfx94x
|
||
- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
|
||
- Add option for rotating buffers timing with cache eviction
|
||
- Add predicate for arithmetic intensity
|
||
- Add DirectToVgpr + packing for f8/f16 + TLU cases
|
||
- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
|
||
- Add test cases for DirectToVgpr + packing
|
||
- Add batch support for Stream-K kernels and new test cases
|
||
- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
|
||
- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
|
||
- Support VectorWidthB (new parameter VectorWidthB)
|
||
- Support VectorWidth + non SourceSwap
|
||
- Add test cases for VectorWidthB, VectorWidth + non SourceSwap
|
||
- Add code owners file
|
||
- New environment variables to dynamically adjust number of CUs used in Stream-K
|
||
- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1))
|
||
- Add xf32 option to rocblas-bench input creator
|
||
|
||
- Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
|
||
- Limit build threads based on CPUs/RAM available on system (for tests)
|
||
- Update required workspace size for Stream-K, skip kernel initialization when possible
|
||
- Use fallback libraries for archs without optimized logic
|
||
- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
|
||
- Remove OCL tests
|
||
- Disable HostLibraryTests
|
||
- Reduce extended test time by removing extra parameters in the test config files
|
||
- Disable InitAccVgprOpt for Stream-K
|
||
- Skip sgemm 64bit offset tests for gfx94x
|
||
- Skip DTV, DTL, LSU+MFMA tests for gfx908
|
||
- Increase extended test timeout to 720 min
|
||
- Update xfail test (1sum tests only failing on gfx90a)
|
||
- Update lib logic convertor script
|
||
- Test limiting CI threads for only gfx11
|
||
- wGM related kernargs are removed if they are not needed (WGM=-1,0,1)
|
||
- Cleanup on unused old code, mostly related to old client
|
||
- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
|
||
- Update efficiency script for new architecture and xf32 datatype
|
||
- Re-enable negative values for WorkGroupMapping (asm kernel only)
|
||
- Disable HW monitor for aquavanjaram941
|
||
- Pre-apply offsets for strided batch kernels
|
||
- Update tensile build with 16 threads
|
||
|
||
#### Optimizations
|
||
|
||
- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).
|
||
- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
|
||
- Reversed MFMA order in inner loop for odd outer iteration
|
||
- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
|
||
- Enhanced maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)
|
||
|
||
#### Resolved issues
|
||
|
||
- Fixed `WorkspaceCheck` implementation when used in rocBLAS.
|
||
- Fixed Stream-K partials cache behavior.
|
||
- Fixed `MasterSolutionLibrary` indexing for multiple architecture build.
|
||
- Fixed memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).
|
||
- Fixed BufferLoad=False with Stream-K.
|
||
- Fixed mismatch issue with `GlobalReadCoalesceGroup`.
|
||
- Fixed rocBLAS build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]).
|
||
- Fixed for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).
|
||
- Fixed inaccurate vgpr allocation for ClusterLocalRead.
|
||
- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2.
|
||
- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0).
|
||
- Fixed HostLibraryTests on gfx942 and gfx941.
|
||
- Fixed LLVM crash issue.
|
||
- Fixed for newer Windows vcpkg msgpack and vcpkg version package name.
|
||
- Fixed an error with DisableKernelPieces + 32bit ShadowLimit.
|
||
- Ignore asm cap check for kernel arg preload for rocm6.0 and older.
|
||
|
||
## ROCm known issues
|
||
|
||
ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known
|
||
issues related to individual components, review the [Detailed component changes](detailed-component-changes).
|
||
|
||
### Default processor affinity behavior for helper threads
|
||
|
||
Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm
|
||
helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to
|
||
threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if
|
||
the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If
|
||
`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when
|
||
creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads
|
||
by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment,
|
||
batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing
|
||
the setting may be advantageous to performance.
|
||
|
||
To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the
|
||
`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:
|
||
|
||
```{code} shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
```
|
||
|
||
To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable
|
||
as follows:
|
||
|
||
``` shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
```
|
||
|
||
Or the default:
|
||
|
||
``` shell
unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
```
|
||
|
||
If you are unsure of the processor affinity mask in effect for your environment, run the following command from the
shell:

``` shell
# Print the affinity mask that a child shell inherits from the current environment
bash -c "taskset -p \$\$"
```
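As a rough illustration of the point about the mask containing enough cores, the following sketch compares the number of
cores in the current affinity mask with the number of cores installed. It assumes GNU coreutils `nproc` is available.
The function name `affinity_summary` and the application path `./my_app` are hypothetical and not part of ROCm.

```shell
# Print how many cores the current process may use versus how many are installed.
# `nproc` honors the affinity mask set by taskset, numactl, or a batch system;
# `nproc --all` ignores it.
affinity_summary() {
    echo "allowed=$(nproc) total=$(nproc --all)"
}

affinity_summary

# Hypothetical launch: pin the parent process to cores 0-7 and let the ROCr helper
# threads inherit that mask. Uncomment and replace ./my_app with a real application.
# HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 taskset -c 0-7 ./my_app
```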
See [issue #3493](https://github.com/ROCm/ROCm/issues/3493) on GitHub.
### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver

Servers that combine AMD Instinct MI300-series accelerators with a third-party display GPU, such as the Matrox G200, can
lose video output. The issue was reproduced on a Dell PowerEdge XE9680 server. Servers from other vendors that use Matrox
G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is also present in older ROCm versions.

The AMDGPU driver shipped with ROCm interferes with the video output of the display card. On Dell systems,
this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading
the `amdgpu` driver modules. The loss of video output affects both terminal access when running in `runlevel 3` and GUI
access when running in `runlevel 5`. Server functionality can still be accessed via SSH or other remote connection methods.

See [issue #3494](https://github.com/ROCm/ROCm/issues/3494) on GitHub.
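To check over SSH whether a server matches the conditions described in this issue, a minimal diagnostic sketch such as
the following may help. It assumes a Linux system that exposes `/proc/modules`. The helper name `is_module_loaded` is
hypothetical and not part of ROCm. The sketch only reports state and is not a workaround.

```shell
# Return 0 if the named kernel module appears in /proc/modules, non-zero otherwise.
# Note: is_module_loaded is an illustrative helper, not a ROCm tool.
is_module_loaded() {
    grep -q "^$1 " /proc/modules 2>/dev/null
}

if is_module_loaded amdgpu; then
    echo "amdgpu driver is loaded"
else
    echo "amdgpu driver is not loaded"
fi

# List display controllers (for example, a Matrox G200) if lspci is installed.
if command -v lspci >/dev/null 2>&1; then
    lspci | grep -i -E 'vga|display' || true
fi
```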
### KFDTest failure on Instinct MI300X with Oracle Linux 8.9

The `KFDEvictTest.QueueTest` test fails on the MI300X platform during KFD (Kernel Fusion Driver) tests, which prevents
the full suite from executing properly. This issue is suspected to be hardware-related.

See [issue #3495](https://github.com/ROCm/ROCm/issues/3495) on GitHub.
### Bandwidth limitation in gang and non-gang modes on Instinct MI300A

The target peak bandwidth of ~60 GB/s in non-gang mode and ~90 GB/s in gang mode is not achieved. Observed bandwidth is
limited to 45 GB/s in both modes, roughly 75% and 50% of the respective targets.

This issue will be addressed in a future ROCm release.

See [issue #3496](https://github.com/ROCm/ROCm/issues/3496) on GitHub.
### rocm-llvm-alt

ROCm provides an optional package -- `rocm-llvm-alt` -- that contains a closed-source compiler for
users interested in additional closed-source CPU optimizations. This feature is not functional in
the ROCm 6.2.0 release. Attempts to invoke the closed-source compiler fail with an
LLVM consumer-producer mismatch. There is no workaround that allows
use of the closed-source compiler. Compile with the default open-source
compiler instead, which generates high-quality AMD CPU and AMD GPU code.

See [issue #3492](https://github.com/ROCm/ROCm/issues/3492) on GitHub.
## ROCm upcoming changes

This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review
the [Detailed component changes](detailed-component-changes).
### rocm-llvm-alt

The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the
functionality provided by the closed-source compiler should transition to the open-source compiler.
Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by
the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler
has been removed*".
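To find out whether the optional package is present on a system before it is removed, a sketch along the following lines
could be used. It assumes a Debian-based (`dpkg-query`) or RPM-based (`rpm`) distribution. The helper name
`pkg_installed` is hypothetical and not part of ROCm.

```shell
# Return 0 if the given package is installed, non-zero if it is not,
# and 2 if no supported package manager is found.
pkg_installed() {
    if command -v dpkg-query >/dev/null 2>&1; then
        dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q "install ok installed"
    elif command -v rpm >/dev/null 2>&1; then
        rpm -q "$1" >/dev/null 2>&1
    else
        return 2
    fi
}

if pkg_installed rocm-llvm-alt; then
    echo "rocm-llvm-alt is installed; plan to move builds to the default open-source compiler"
else
    echo "rocm-llvm-alt is not installed"
fi
```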