From 1e0d3da98cb0e3652e382335c4af74f2f3e89035 Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Fri, 20 Sep 2024 19:41:48 -0400
Subject: [PATCH] Add links to GH issues in 6.2.1 release notes (#3769)

* add MAD page

* link to GitHub issues in release notes known issues

* update templates for 6.2.1

* Revert "add MAD page"

This reverts commit 9cce72bba306286c7eb317d592645d4e0e1b27aa.

* update wordlist for spellcheck linter

* add rccl note

* update rocal version change heading to be more obvious

* make rocal note more specific

* fix missing space

* fix capitalization
---
 .wordlist.txt                                 |  3 +
 RELEASE.md                                    | 32 +++++--
 tools/autotag/templates/highlights/6.2.1.md   | 35 ++++++-
 tools/autotag/templates/known_issues/6.2.1.md |  9 ++
 tools/autotag/templates/support/6.2.1.md      |  2 +-
 .../templates/upcoming_changes/6.2.1.md       | 93 +------------------
 6 files changed, 76 insertions(+), 98 deletions(-)
 create mode 100644 tools/autotag/templates/known_issues/6.2.1.md

diff --git a/.wordlist.txt b/.wordlist.txt
index 85b7b8b2c..e748ca6a7 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -53,6 +53,7 @@ CSC
 CSE
 CSV
 CSn
+CTest
 CTests
 CU
 CUDA
@@ -387,6 +388,7 @@ UAC
 UC
 UCC
 UCX
+UE
 UIF
 UMC
 USM
@@ -653,6 +655,7 @@ quasirandom
 queueing
 rccl
 rdc
+rdma
 reStructuredText
 redirections
 refactorization
diff --git a/RELEASE.md b/RELEASE.md
index a0b32695d..d832dc8eb 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -24,9 +24,12 @@ See the [ROCm documentation release history](https://rocm.docs.amd.com/en/latest
 
 The following are notable new features and improvements in ROCm 6.2.1. For changes to individual components, see [Detailed component changes](#detailed-component-changes).
 
-### rocAL version change
+### rocAL major version change
 
-The version of rocAL has been updated to 2.0.0. Applications built using rocAL 1.0.0 must be recompiled to work with rocAL 2.0.0. See [the rocAL detailed changes](#rocal-2-0-0) for more information.
+The new version of rocAL introduces many new features, but does not modify any of the existing public API functions. However, the version number was incremented from 1.3 to 2.0.
+Applications linked to version 1.3 must be recompiled to link against version 2.0.
+
+See [the rocAL detailed changes](#rocal-2-0-0) for more information.
 
 ### New support for FBGEMM (Facebook General Matrix Multiplication)
 
@@ -140,7 +143,7 @@ Click the component's updated version to go to a detailed list of its changes. C
 Communication RCCL
- 2.20.5
+ 2.20.5 ⇒ 2.20.5
@@ -457,15 +460,30 @@ The following sections describe key changes to ROCm components.
 
 ### **Omnitrace** (1.11.2)
 
-#### Known Issues
+#### Known issues
 
-* Perfetto can no longer open Omnitrace proto files. Loading Perfetto trace output `.proto` files in the latest version of `ui.perfetto.dev` can result in a dialog with the message, "Oops, something went wrong! Please file a bug." The information in the dialog will refer to an "Unknown field type." The workaround is to open the files with the previous version of the Perfetto UI found at [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/).
+Perfetto can no longer open Omnitrace proto files. Loading Perfetto trace output `.proto` files in the latest version of `ui.perfetto.dev` can result in a dialog with the message, "Oops, something went wrong! Please file a bug." The information in the dialog will refer to an "Unknown field type." The workaround is to open the files with the previous version of the Perfetto UI found at [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/).
+
+See [issue #3767](https://github.com/ROCm/ROCm/issues/3767) on GitHub.
+
+### **RCCL** (2.20.5)
+
+#### Known issues
+
+On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, GPUDirect RDMA is disabled and impacts multi-node RCCL performance.
+This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer.
+Older RCCL versions are also impacted.
+
+This issue will be addressed in a future ROCm release.
+
+See [issue #3772](https://github.com/ROCm/ROCm/issues/3772) on GitHub.
 
 ### **rocAL** (2.0.0)
 
 #### Changes
 
-* Version updated from 1.0.0 to 2.0.0. Applications built using rocAL 1.0.0 must be recompiled to work with rocAL 2.0.0.
+* The new version of rocAL introduces many new features, but does not modify any of the existing public API functions.However, the version number was incremented from 1.3 to 2.0.
+ Applications linked to version 1.3 must be recompiled to link against version 2.0.
 * Added development and test packages.
 * Added C++ rocAL audio unit test and Python script to run and compare the outputs.
 * Added Python support for audio decoders.
@@ -540,6 +558,8 @@ this state. Additionally, error logging might fail in these situations, hinderin
 
 This issue is under investigation and will be resolved in a future ROCm release.
 
+See [issue #3766](https://github.com/ROCm/ROCm/issues/3766) on GitHub.
+
 ## ROCm upcoming changes
 
 The following changes to the ROCm software stack are anticipated for future releases.
diff --git a/tools/autotag/templates/highlights/6.2.1.md b/tools/autotag/templates/highlights/6.2.1.md
index 7dbfa87e8..fd8d465bc 100644
--- a/tools/autotag/templates/highlights/6.2.1.md
+++ b/tools/autotag/templates/highlights/6.2.1.md
@@ -1 +1,34 @@
-### Highlights will go here
\ No newline at end of file
+### rocAL major version change
+
+The new version of rocAL introduces many new features, but does not modify any of the existing public API functions.However, the version number was incremented from 1.3 to 2.0.
+Applications linked to version 1.3 must be recompiled to link against version 2.0.
+
+See [the rocAL detailed changes](#rocal-2-0-0) for more information.
+
+### New support for FBGEMM (Facebook General Matrix Multiplication)
+
+As of ROCm 6.2.1, ROCm supports Facebook General Matrix Multiplication (FBGEMM) and the related FBGEMM_GPU library.
+
+FBGEMM is a low-precision, high-performance CPU kernel library for convolution and matrix multiplication. It is used for server-side inference and as a back end for PyTorch quantized operators. FBGEMM_GPU includes a collection of PyTorch GPU operator libraries for training and inference. For more information, see the ROCm [Model acceleration libraries guide](https://rocm.docs.amd.com/en/6.2.1/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.html)
+and [PyTorch's FBGEMM GitHub repository](https://github.com/pytorch/FBGEMM).
+
+### ROCm Offline Installer Creator changes
+
+The [ROCm Offline Installer Creator 6.2.1](https://rocm.docs.amd.com/projects/install-on-linux/en/6.2.1/install/rocm-offline-installer.html) introduces several new features and improvements including:
+
+* Logging support for create and install logs
+* More stringent checks for Linux versions and distributions
+* Updated prerequisite repositories
+* Fixed CTest issues
+
+### ROCm documentation changes
+
+There have been no changes to supported hardware or operating systems from ROCm 6.2.0 to ROCm 6.2.1.
+
+* The Programming Model Reference and Understanding the Programming Model topics in HIP have been consolidated into one topic,
+[HIP programming model (conceptual)](https://rocm.docs.amd.com/projects/HIP/en/6.2.1/understand/programming_model.html).
+* The [HIP virtual memory management](https://rocm.docs.amd.com/projects/HIP/en/6.2.1/how-to/virtual_memory.html) and [HIP virtual memory management API](https://rocm.docs.amd.com/projects/HIP/en/6.2.1/reference/virtual_memory_reference.html) topics have been added.
+
+```{note}
+The ROCm documentation, like all ROCm projects, is open source and available on GitHub. To contribute to ROCm documentation, see the [ROCm documentation contribution guidelines](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
+```
diff --git a/tools/autotag/templates/known_issues/6.2.1.md b/tools/autotag/templates/known_issues/6.2.1.md
new file mode 100644
index 000000000..c5e50019d
--- /dev/null
+++ b/tools/autotag/templates/known_issues/6.2.1.md
@@ -0,0 +1,9 @@
+### Instinct MI300X GPU recovery failure on uncorrectable errors
+
+For the AMD Instinct MI300X accelerator, GPU recovery resets triggered by uncorrectable errors (UE) might not complete
+successfully, which can result in the system being left in an undefined state. A system reboot is needed to recover from
+this state. Additionally, error logging might fail in these situations, hindering diagnostics.
+
+This issue is under investigation and will be resolved in a future ROCm release.
+
+See [issue #3766](https://github.com/ROCm/ROCm/issues/3766) on GitHub.
diff --git a/tools/autotag/templates/support/6.2.1.md b/tools/autotag/templates/support/6.2.1.md
index c0656b030..fead5f1cb 100644
--- a/tools/autotag/templates/support/6.2.1.md
+++ b/tools/autotag/templates/support/6.2.1.md
@@ -1,2 +1,2 @@
-## Operating system and hardware support changes
+There are no changes to supported hardware or operating systems from ROCm 6.2.0 to ROCm 6.2.1.
 
diff --git a/tools/autotag/templates/upcoming_changes/6.2.1.md b/tools/autotag/templates/upcoming_changes/6.2.1.md
index cfe998369..8cce16c79 100644
--- a/tools/autotag/templates/upcoming_changes/6.2.1.md
+++ b/tools/autotag/templates/upcoming_changes/6.2.1.md
@@ -1,94 +1,7 @@
-### Default processor affinity behavior for helper threads
-
-Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores. By default, ROCm
-helper threads are spawned on all available cores, ignoring the parent thread’s processor affinity. This can lead to
-threads competing for available cores, which may result in suboptimal performance. This behavior occurs by default if
-the environment variable `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is not set or is set to `1`. If
-`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` is set to `0`, the ROCr runtime uses the parent process's core affinity mask when
-creating helper threads. The parent’s affinity mask should then be set to account for the presence of additional threads
-by ensuring the affinity mask contains enough cores. Depending on the affinity settings of the software environment,
-batch system, launch commands like `numactl`/`taskset`, or explicit mask manipulation by the application itself, changing
-the setting may be advantageous to performance.
-
-To ensure the parent's core affinity mask is honored by the ROCm helper threads, set the
-`HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable as follows:
-
-```{code} shell
-export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
-```
-
-To ensure ROCm helper threads run on all available cores, set the `HSA_OVERRIDE_CPU_AFFINITY_DEBUG` environment variable
-as follows:
-
-``` shell
-export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=1
-```
-
-Or the default:
-
-``` shell
-
-unset HSA_OVERRIDE_CPU_AFFINITY_DEBUG
-```
-
-If unsure of the default processor affinity settings for your environment, run the following command from the shell:
-
-``` shell
-
-bash -c "echo taskset -p \$\$"
-```
-
-See [issue #3493](https://github.com/ROCm/ROCm/issues/3493) on GitHub.
-
-### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver
-
-AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video
-output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox
-G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions.
-
-The AMDGPU driver shipped with ROCm interferes with the operation of the display card video output. On Dell systems,
-this includes both the local video output and remote access via iDRAC. The display appears blank (black) after loading
-the `amdgpu` driver modules. Video output impacts both terminal access when running in `runlevel 3` and GUI access when
-running in `runlevel 5`. Server functionality can still be accessed via SSH or other remote connection methods.
-
-See [issue #3494](https://github.com/ROCm/ROCm/issues/3494) on GitHub.
-
-### KFDTest failure on Instinct MI300X with Oracle Linux 8.9
-
-The `KFDEvictTest.QueueTest` is failing on the MI300X platform during KFD (Kernel Fusion Driver) tests, causing the full
-suite to not execute properly. This issue is suspected to be hardware-related.
-
-See [issue #3495](https://github.com/ROCm/ROCm/issues/3495) on GitHub.
-
-### Bandwidth limitation in gang and non-gang modes on Instinct MI300A
-
-Expected target peak non-gang performance (~60GB/s) and target peak gang performance (~90GB/s) are not achieved. Both gang
-and non-gang performance are observed to be limited at 45GB/s.
-
-This issue will be addressed in a future ROCm release.
-
-See [issue #3496](https://github.com/ROCm/ROCm/issues/3496) on GitHub.
-
 ### rocm-llvm-alt
 
-ROCm provides an optional package -- `rocm-llvm-alt` -- that provides a closed-source compiler for
-users interested in additional closed-source CPU optimizations. This feature is not functional in
-the ROCm 6.2.0 release. Users who attempt to invoke the closed-source compiler will experience an
-LLVM consumer-producer mismatch and the compilation will fail. There is no workaround that allows
-use of the closed-source compiler. It is recommended to compile using the default open-source
-compiler, which generates high-quality AMD CPU and AMD GPU code.
+The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the functionality provided by the closed-source compiler should transition to the open-source compiler. Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler has been removed*".
 
-See [issue #3492](https://github.com/ROCm/ROCm/issues/3492) on GitHub.
+### rccl-rdma-sharp-plugins
 
-## ROCm upcoming changes
-
-The section notes upcoming changes to the ROCm software stack. For upcoming changes related to individual components, review
-the [Detailed component changes](detailed-component-changes).
-
-### rocm-llvm-alt
-
-The `rocm-llvm-alt` package will be removed in an upcoming release. Users relying on the
-functionality provided by the closed-source compiler should transition to the open-source compiler.
-Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by
-the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler
-has been removed*".
+The RCCL plugin package, `rccl-rdma-sharp-plugins`, will be removed in an upcoming ROCm release.