Add links to GH issues in 6.2.1 release notes (#3769)

* add MAD page

* link to GitHub issues in release notes known issues

* update templates for 6.2.1

* Revert "add MAD page"

This reverts commit 9cce72bba3.

* update wordlist for spellcheck linter

* add rccl note

* update rocal version change heading to be more obvious

* make rocal note more specific

* fix missing space

* fix capitalization
Peter Park
2024-09-20 19:41:48 -04:00
committed by GitHub
parent 16de13162e
commit 1e0d3da98c
6 changed files with 76 additions and 98 deletions


@@ -53,6 +53,7 @@ CSC
CSE
CSV
CSn
CTest
CTests
CU
CUDA
@@ -387,6 +388,7 @@ UAC
UC
UCC
UCX
UE
UIF
UMC
USM
@@ -653,6 +655,7 @@ quasirandom
queueing
rccl
rdc
rdma
reStructuredText
redirections
refactorization


@@ -24,9 +24,12 @@ See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest
The following are notable new features and improvements in ROCm 6.2.1. For changes to individual components, see [Detailed component changes](#detailed-component-changes).
### rocAL version change
### rocAL major version change
The version of rocAL has been updated to 2.0.0. Applications built using rocAL 1.0.0 must be recompiled to work with rocAL 2.0.0. See [the rocAL detailed changes](#rocal-2-0-0) for more information.
The new version of rocAL introduces many new features, but does not modify any of the existing public API functions. However, the version number was incremented from 1.3 to 2.0.
Applications linked to version 1.3 must be recompiled to link against version 2.0.
See [the rocAL detailed changes](#rocal-2-0-0) for more information.
### New support for FBGEMM (Facebook General Matrix Multiplication)
@@ -140,7 +143,7 @@ Click the component's updated version to go to a detailed list of its changes. C
<th rowspan="1"></th>
<th rowspan="1">Communication</th>
<td><a href="https://rocm.docs.amd.com/projects/rccl/en/docs-6.2.1">RCCL</a></td>
<td>2.20.5</td>
<td>2.20.5&nbsp;&Rightarrow;&nbsp;<a href="#rccl-2-20-5">2.20.5</a></td>
<td><a href="https://github.com/ROCm/rccl/releases/tag/rocm-6.2.1"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
@@ -457,15 +460,30 @@ The following sections describe key changes to ROCm components.
### **Omnitrace** (1.11.2)
#### Known Issues
#### Known issues
* Perfetto can no longer open Omnitrace proto files. Loading Perfetto trace output `.proto` files in the latest version of `ui.perfetto.dev` can result in a dialog with the message, "Oops, something went wrong! Please file a bug." The information in the dialog will refer to an "Unknown field type." The workaround is to open the files with the previous version of the Perfetto UI found at [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/).
Perfetto can no longer open Omnitrace proto files. Loading Perfetto trace output `.proto` files in the latest version of `ui.perfetto.dev` can result in a dialog with the message, "Oops, something went wrong! Please file a bug." The information in the dialog will refer to an "Unknown field type." The workaround is to open the files with the previous version of the Perfetto UI found at [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/).
See [issue #3767](https://github.com/ROCm/ROCm/issues/3767) on GitHub.
### **RCCL** (2.20.5)
#### Known issues
On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, GPUDirect RDMA is disabled, which impacts multi-node RCCL performance.
This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer.
Older RCCL versions are also impacted.
This issue will be addressed in a future ROCm release.
See [issue #3772](https://github.com/ROCm/ROCm/issues/3772) on GitHub.
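As a rough way to tell whether a node falls in the affected kernel range, the running kernel release can be compared against 6.8.0. This is a minimal sketch, not an official ROCm or RCCL check: `kernel_at_least_6_8` is a hypothetical helper name, and the comparison assumes GNU `sort -V` is available and that the release string looks like `x.y.z` with an optional `-suffix`.

```shell
#!/usr/bin/env bash
# Sketch: report whether a kernel release string is 6.8.0 or newer.
# kernel_at_least_6_8 is a hypothetical helper, not part of ROCm or RCCL.
kernel_at_least_6_8() {
    local release="${1:-$(uname -r)}"   # for example "6.8.0-45-generic"
    local base="${release%%-*}"         # drop the distribution suffix
    local lowest
    # sort -V orders version strings; if 6.8.0 sorts first (or ties),
    # the running kernel is 6.8.0 or newer.
    lowest="$(printf '%s\n' "6.8.0" "$base" | sort -V | head -n 1)"
    [ "$lowest" = "6.8.0" ]
}

if kernel_at_least_6_8; then
    echo "Kernel $(uname -r) is 6.8.0 or newer: GPUDirect RDMA may be disabled."
else
    echo "Kernel $(uname -r) is older than 6.8.0."
fi
```

The check only looks at the kernel version. It does not confirm that GPUDirect RDMA is actually disabled on a given system.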
### **rocAL** (2.0.0)
#### Changes
* Version updated from 1.0.0 to 2.0.0. Applications built using rocAL 1.0.0 must be recompiled to work with rocAL 2.0.0.
* The new version of rocAL introduces many new features, but does not modify any of the existing public API functions. However, the version number was incremented from 1.3 to 2.0.
Applications linked to version 1.3 must be recompiled to link against version 2.0.
* Added development and test packages.
* Added C++ rocAL audio unit test and Python script to run and compare the outputs.
* Added Python support for audio decoders.
@@ -540,6 +558,8 @@ this state. Additionally, error logging might fail in these situations, hinderin
This issue is under investigation and will be resolved in a future ROCm release.
See [issue #3766](https://github.com/ROCm/ROCm/issues/3766) on GitHub.
## ROCm upcoming changes
The following changes to the ROCm software stack are anticipated for future releases.


@@ -1 +1,34 @@
### Highlights will go here
### rocAL major version change
The new version of rocAL introduces many new features, but does not modify any of the existing public API functions. However, the version number was incremented from 1.3 to 2.0.
Applications linked to version 1.3 must be recompiled to link against version 2.0.
See [the rocAL detailed changes](#rocal-2-0-0) for more information.
### New support for FBGEMM (Facebook General Matrix Multiplication)
As of ROCm 6.2.1, ROCm supports Facebook General Matrix Multiplication (FBGEMM) and the related FBGEMM_GPU library.
FBGEMM is a low-precision, high-performance CPU kernel library for convolution and matrix multiplication. It is used for server-side inference and as a back end for PyTorch quantized operators. FBGEMM_GPU includes a collection of PyTorch GPU operator libraries for training and inference. For more information, see the ROCm [Model acceleration libraries guide](https://rocm.docs.amd.com/en/6.2.1/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.html)
and [PyTorch's FBGEMM GitHub repository](https://github.com/pytorch/FBGEMM).
### ROCm Offline Installer Creator changes
The [ROCm Offline Installer Creator 6.2.1](https://rocm.docs.amd.com/projects/install-on-linux/en/6.2.1/install/rocm-offline-installer.html) introduces several new features and improvements including:
* Logging support for create and install logs
* More stringent checks for Linux versions and distributions
* Updated prerequisite repositories
* Fixed CTest issues
### ROCm documentation changes
There have been no changes to supported hardware or operating systems from ROCm 6.2.0 to ROCm 6.2.1.
* The Programming Model Reference and Understanding the Programming Model topics in HIP have been consolidated into one topic,
[HIP programming model (conceptual)](https://rocm.docs.amd.com/projects/HIP/en/6.2.1/understand/programming_model.html).
* The [HIP virtual memory management](https://rocm.docs.amd.com/projects/HIP/en/6.2.1/how-to/virtual_memory.html) and [HIP virtual memory management API](https://rocm.docs.amd.com/projects/HIP/en/6.2.1/reference/virtual_memory_reference.html) topics have been added.
```{note}
The ROCm documentation, like all ROCm projects, is open source and available on GitHub. To contribute to ROCm documentation, see the [ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
```


@@ -0,0 +1,9 @@
### Instinct MI300X GPU recovery failure on uncorrectable errors
For the AMD Instinct MI300X accelerator, GPU recovery resets triggered by uncorrectable errors (UE) might not complete
successfully, which can result in the system being left in an undefined state. A system reboot is needed to recover from
this state. Additionally, error logging might fail in these situations, hindering diagnostics.
This issue is under investigation and will be resolved in a future ROCm release.
See [issue #3766](https://github.com/ROCm/ROCm/issues/3766) on GitHub.


@@ -1,2 +1,2 @@
## Operating system and hardware support changes
There are no changes to supported hardware or operating systems from ROCm 6.2.0 to ROCm 6.2.1.


@@ -1,94 +1,7 @@
### Default processor affinity behavior for helper threads
Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm
helper threads are spawned on all available cores, ignoring the parent thread's processor affinity. This can lead to
threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if
the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If
`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when
creating helper threads. The parent's affinity mask should then be set to account for the presence of additional threads
by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment,
batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing
the setting may be advantageous to performance.
To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the
`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:
```{code} shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
```
To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable
as follows:
``` shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
```
Or the default:
``` shell
unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
```
If unsure of the default processor affinity settings for your environment, run the following command from the shell:
``` shell
bash -c "taskset -p \$\$"
```
See [issue #3493](https://github.com/ROCm/ROCm/issues/3493) on GitHub.
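The two settings above can be combined when launching a program: restrict the parent process to a set of cores with `taskset`, and set `HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0` so that helper threads honor that mask. The following is a minimal sketch under those assumptions. `run_with_parent_affinity` is a hypothetical wrapper name, and the demonstration reuses the shell's current core list instead of a real ROCm application.

```shell
#!/usr/bin/env bash
# Sketch: launch a command pinned to a core list, with ROCm helper threads
# asked to honor the parent's affinity mask.
# run_with_parent_affinity is a hypothetical helper, not a ROCm tool.
run_with_parent_affinity() {
    local cores="$1"   # core list in taskset -c syntax, for example "0-7"
    shift
    # The variable is set only in the environment of the launched command.
    HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0 taskset -c "$cores" "$@"
}

# Demonstration: reuse the cores this shell is already allowed to run on,
# then show what the child process sees.
current_cores="$(taskset -cp $$ | sed 's/.*: //')"
run_with_parent_affinity "$current_cores" \
    sh -c 'echo "HSA_OVERRIDE_CPU_AFFINITY_DEBUG=$HSA_OVERRIDE_CPU_AFFINITY_DEBUG"; taskset -cp $$'
```

In practice, the `sh -c '...'` part would be replaced by the application command line, and the core list should contain enough cores for both the application threads and the helper threads.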
### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver
AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video
output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox
G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions.
The AMDGPU driver shipped with ROCm interferes with the operation of the display card video output. On Dell systems,
this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading
the `amdgpu` driver modules. Video output impacts both terminal access when running in `runlevel 3` and GUI access when
running in `runlevel 5`. Server functionality can still be accessed via SSH or other remote connection methods.
See [issue #3494](https://github.com/ROCm/ROCm/issues/3494) on GitHub.
### KFDTest failure on Instinct MI300X with Oracle Linux 8.9
The `KFDEvictTest.QueueTest` fails on the MI300X platform during KFD (Kernel Fusion Driver) tests, which prevents the full
suite from executing properly. This issue is suspected to be hardware-related.
See [issue #3495](https://github.com/ROCm/ROCm/issues/3495) on GitHub.
### Bandwidth limitation in gang and non-gang modes on Instinct MI300A
Expected target peak non-gang performance (~60 GB/s) and target peak gang performance (~90 GB/s) are not achieved. Both gang
and non-gang performance are observed to be limited to 45 GB/s.
This issue will be addressed in a future ROCm release.
See [issue #3496](https://github.com/ROCm/ROCm/issues/3496) on GitHub.
### rocm-llvm-alt
ROCm provides an optional package -- `rocm-llvm-alt` -- that provides a closed-source compiler for
users interested in additional closed-source CPU optimizations. This feature is not functional in
the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an
LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows
use of the closed-source compiler. It is recommended to compile using the default open-source
compiler, which generates high-quality AMD CPU and AMD GPU code.
The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the functionality provided by the closed-source compiler should transition to the open-source compiler. Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler has been removed*".
See [issue #3492](https://github.com/ROCm/ROCm/issues/3492) on GitHub.
### rccl-rdma-sharp-plugins
## ROCm upcoming changes
This section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review
the [Detailed component changes](#detailed-component-changes).
### rocm-llvm-alt
The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the
functionality provided by the closed-source compiler should transition to the open-source compiler.
Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by
the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler
has been removed*".
The RCCL plugin package, `rccl-rdma-sharp-plugins`, will be removed in an upcoming ROCm release.