From 9679a84a8bbd5cb58cac3b3eefc5677b6fc8f2b0 Mon Sep 17 00:00:00 2001 From: Peter Park Date: Mon, 3 Jun 2024 05:51:38 -0700 Subject: [PATCH] Add components, known issues, and fixed issues to 6.1.2 RN / CL (#87) * Regenerate changelog * Add component changelogs and known issue Fix RELEASE.md headings Update pub datestamp for 6.1.2 Add AMDSMI and ROCm SMI to 6.1.2 template Add rccl and rocBLAS Update intro blurb and headings Add ROCm SMI fix Add missed heading to AMDSMI Update datestamp and release version number Update version and release number Add known issue re: MI300X error detection Words Add issue link Rm GitHub issue link Move known issue down Update ki wording Remove "this issue has been investigated ... " from known issue Fix changelog h1 --- CHANGELOG.md | 65 +++++++++++++--- RELEASE.md | 78 +++++++++++++++---- docs/conf.py | 8 +- tools/autotag/templates/rocm_changes/6.1.2.md | 40 ++++++++-- 4 files changed, 151 insertions(+), 40 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index cecf46294..89ea9c76d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,7 +17,7 @@ This page contains the changelog for AMD ROCm™ Software. ## ROCm 6.1.2 -ROCm 6.1.2 includes improvements to AMD SMI commands and output metrics, and extends support within the rocDecode library. +ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries. ### AMD SMI @@ -25,8 +25,6 @@ AMD SMI for ROCm 6.1.2 #### Additions -* Added macros that were in amdsmi.h to the amdsmi Python library (amdsmi_interface.py). -* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. * Added process isolation and clean shader APIs and CLI commands. * `amdsmi_get_gpu_process_isolation()` * `amdsmi_set_gpu_process_isolation()` @@ -36,19 +34,16 @@ AMD SMI for ROCm 6.1.2 #### Optimizations * Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command. 
-* Updated the CLI voltage curve command output to split the frequency and voltage output by curve point, if applicable. -* Updated `amdsmi_get_gpu_board_info()` to have larger structure sizes for `amdsmi_board_info_t`. -* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. -* Updated `amismi_get_power_cap_info` to return values in uW instead of W. #### Changes +* Updated `amdsmi_get_power_cap_info` to return values in uW instead of W. +* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. * Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks. #### Fixes * `amdsmi_get_gpu_board_info()` no longer returns junk character strings. -* Fixed the parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`. * `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices. * Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices. * Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls. @@ -61,6 +56,35 @@ AMD SMI for ROCm 6.1.2 See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information. ``` +### HIPCC + +HIPCC for ROCm 6.1.2 + +#### Changes + +* **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`. +* **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. 
This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly. +* **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`. + +### ROCm SMI + +ROCm SMI for ROCm 6.1.2 + +#### Additions + +* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. + +#### Fixes + +* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. +* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware. + +### Known issue with error detection on MI300X + +During poison consumption testing, the injection of uncorrectable errors will not generate an interrupt to the driver, +resulting in undetected errors. This can result in reliability and recovery issues on MI300X accelerator-based +setups. + ### Library changes in ROCm 6.1.2 | Library | Version | @@ -80,7 +104,7 @@ See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6. | MIVisionX | [2.5.0](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.1.2) | | rccl | [2.18.6](https://github.com/ROCm/rccl/releases/tag/rocm-6.1.2) | | rocALUTION | [3.1.1](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.1.2) | -| rocBLAS | [4.1.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) | +| rocBLAS | 4.1.0 ⇒ [4.1.2](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) | | rocDecode | 0.5.0 ⇒ [0.6.0](https://github.com/ROCm/rocDecode/releases/tag/rocm-6.1.2) | | rocFFT | [1.0.27](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.1.2) | | rocm-cmake | [0.12.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.1.2) | @@ -93,6 +117,26 @@ See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6. 
| rpp | [1.5.0](https://github.com/ROCm/rpp/releases/tag/rocm-6.1.2) | | Tensile | [4.40.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.2) | +#### RCCL + +RCCL 2.18.6 for ROCm 6.1.2 + +##### Changes + +* Reduced `NCCL_TOPO_MAX_NODES` to limit stack usage and avoid stack overflow. + +#### rocBLAS + +rocBLAS 4.1.2 for ROCm 6.1.2 + +##### Optimizations + +* Tuned BBS TN and TT operations on the CDNA3 architecture. + +##### Fixes + +* Fixed an issue related to obtaining solutions for BF16 TT operations. + #### rocDecode rocDecode 0.6.0 for ROCm 6.1.2 @@ -103,7 +147,7 @@ rocDecode 0.6.0 for ROCm 6.1.2 ##### Optimizations -* Updated error checking in the rocDecode-setup.py script. +* Updated error checking in the `rocDecode-setup.py` script. ##### Changes @@ -178,7 +222,6 @@ HIPCC for ROCm 6.1.1 * **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`. * **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly. - * **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`. ### ROCm SMI diff --git a/RELEASE.md b/RELEASE.md index 43645cbc7..7af13a810 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -11,7 +11,7 @@ -ROCm 6.1.2 includes improvements to AMD SMI commands and output metrics, and extends support within the rocDecode library. 
+ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries. ### AMD SMI @@ -19,8 +19,6 @@ AMD SMI for ROCm 6.1.2 #### Additions -* Added macros that were in amdsmi.h to the amdsmi Python library (amdsmi_interface.py). -* Added the ring hang event to possible events in the `amdsmi_evt_notification_type_t` enum. * Added process isolation and clean shader APIs and CLI commands. * `amdsmi_get_gpu_process_isolation()` * `amdsmi_set_gpu_process_isolation()` @@ -30,19 +28,16 @@ AMD SMI for ROCm 6.1.2 #### Optimizations * Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command. -* Updated the CLI voltage curve command output to split the frequency and voltage output by curve point if applicable. -* Updated `amdsmi_get_gpu_board_info()` to have larger structure sizes for `amdsmi_board_info_t`. -* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. -* Updated `amismi_get_power_cap_info` to return values in uW instead of W. #### Changes +* Updated `amdsmi_get_power_cap_info` to return values in uW instead of W. +* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. * Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks. #### Fixes * `amdsmi_get_gpu_board_info()` no longer returns junk character strings. -* Fixed the parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`. * `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices. * Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices. * Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls. 
@@ -55,7 +50,36 @@ AMD SMI for ROCm 6.1.2 See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information. ``` -### Library changes in ROCm 6.1.2 +### HIPCC + +HIPCC for ROCm 6.1.2 + +#### Changes + +* **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`. +* **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly. +* **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`. + +### ROCm SMI + +ROCm SMI for ROCm 6.1.2 + +#### Additions + +* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. + +#### Fixes + +* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. +* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware. + +### Known issue with error detection on MI300X + +During poison consumption testing, the injection of uncorrectable errors will not generate an interrupt to the driver, +resulting in undetected errors. This can result in reliability and recovery issues on MI300X accelerator-based +setups. + +## Library changes in ROCm 6.1.2 | Library | Version | |---------|---------| @@ -74,7 +98,7 @@ See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6. 
| MIVisionX | [2.5.0](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.1.2) | | rccl | [2.18.6](https://github.com/ROCm/rccl/releases/tag/rocm-6.1.2) | | rocALUTION | [3.1.1](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.1.2) | -| rocBLAS | [4.1.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) | +| rocBLAS | 4.1.0 ⇒ [4.1.2](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) | | rocDecode | 0.5.0 ⇒ [0.6.0](https://github.com/ROCm/rocDecode/releases/tag/rocm-6.1.2) | | rocFFT | [1.0.27](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.1.2) | | rocm-cmake | [0.12.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.1.2) | @@ -87,28 +111,48 @@ See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6. | rpp | [1.5.0](https://github.com/ROCm/rpp/releases/tag/rocm-6.1.2) | | Tensile | [4.40.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.2) | -#### rocDecode +### RCCL + +RCCL 2.18.6 for ROCm 6.1.2 + +#### Changes + +* Reduced `NCCL_TOPO_MAX_NODES` to limit stack usage and avoid stack overflow. + +### rocBLAS + +rocBLAS 4.1.2 for ROCm 6.1.2 + +#### Optimizations + +* Tuned BBS TN and TT operations on the CDNA3 architecture. + +#### Fixes + +* Fixed an issue related to obtaining solutions for BF16 TT operations. + +### rocDecode rocDecode 0.6.0 for ROCm 6.1.2 -##### Additions +#### Additions * Added support for FFmpeg v5.x. -##### Optimizations +#### Optimizations -* Updated error checking in the rocDecode-setup.py script. +* Updated error checking in the `rocDecode-setup.py` script. -##### Changes +#### Changes * Updated core dependencies. * Updated to support the use of public LibVA headers. -##### Fixes +#### Fixes * Fixed some package dependencies. 
-##### Tested configurations +#### Tested configurations * Linux * Ubuntu 20.04 and 22.04 diff --git a/docs/conf.py b/docs/conf.py index da53afeb7..0089a06c9 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -38,8 +38,8 @@ latex_elements = { project = "ROCm Documentation" author = "Advanced Micro Devices, Inc." copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." -version = "6.0.1" -release = "6.0.1" +version = "6.1.2" +release = "6.1.2" setting_all_article_info = True all_article_info_os = ["linux", "windows"] all_article_info_author = "" @@ -49,12 +49,12 @@ article_pages = [ { "file":"about/release-notes", "os":["linux", "windows"], - "date":"2024-01-31" + "date":"2024-06-04" }, { "file":"about/changelog", "os":["linux", "windows"], - "date":"2024-01-31" + "date":"2024-06-04" }, {"file":"install/windows/install-quick", "os":["windows"]}, diff --git a/tools/autotag/templates/rocm_changes/6.1.2.md b/tools/autotag/templates/rocm_changes/6.1.2.md index 649e9d942..f7d5441b2 100644 --- a/tools/autotag/templates/rocm_changes/6.1.2.md +++ b/tools/autotag/templates/rocm_changes/6.1.2.md @@ -1,5 +1,5 @@ -ROCm 6.1.2 includes improvements to AMD SMI commands and output metrics, and extends support within the rocDecode library. +ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries. ### AMD SMI @@ -7,8 +7,6 @@ AMD SMI for ROCm 6.1.2 #### Additions -* Added macros that were in amdsmi.h to the amdsmi Python library (amdsmi_interface.py). -* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. * Added process isolation and clean shader APIs and CLI commands. * `amdsmi_get_gpu_process_isolation()` * `amdsmi_set_gpu_process_isolation()` @@ -18,19 +16,16 @@ AMD SMI for ROCm 6.1.2 #### Optimizations * Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command. -* Updated the CLI voltage curve command output to split the frequency and voltage output by curve point, if applicable. 
-* Updated `amdsmi_get_gpu_board_info()` to have larger structure sizes for `amdsmi_board_info_t`. -* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. -* Updated `amismi_get_power_cap_info` to return values in uW instead of W. #### Changes +* Updated `amdsmi_get_power_cap_info` to return values in uW instead of W. +* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`. * Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks. #### Fixes * `amdsmi_get_gpu_board_info()` no longer returns junk character strings. -* Fixed the parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`. * `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices. * Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices. * Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls. @@ -42,3 +37,32 @@ AMD SMI for ROCm 6.1.2 ```{note} See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information. ``` + +### HIPCC + +HIPCC for ROCm 6.1.2 + +#### Changes + +* **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`. +* **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. 
It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly. +* **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`. + +### ROCm SMI + +ROCm SMI for ROCm 6.1.2 + +#### Additions + +* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. + +#### Fixes + +* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. +* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware. + +### Known issue with error detection on MI300X + +During poison consumption testing, the injection of uncorrectable errors will not generate an interrupt to the driver, +resulting in undetected errors. This can result in reliability and recovery issues on MI300X accelerator-based +setups.
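
Note on the AMD SMI unit change: the changelog entries in this patch say the power cap query now returns values in uW instead of W, so code that previously treated the result as watts may need a conversion step. The following is a minimal sketch of that arithmetic only. It does not call AMD SMI; `microwatts_to_watts` and `reported_cap_uw` are hypothetical names chosen for illustration, and the 300 W figure is an invented sample value, not a reading from any device.

```python
# Illustrative only: plain unit conversion, no AMD SMI calls are made.
MICROWATTS_PER_WATT = 1_000_000


def microwatts_to_watts(value_uw: int) -> float:
    """Convert a power value in microwatts (uW) to watts (W)."""
    return value_uw / MICROWATTS_PER_WATT


# Hypothetical sample: a 300 W power cap as it would appear in microwatts.
reported_cap_uw = 300_000_000
print(f"{microwatts_to_watts(reported_cap_uw):.1f} W")
```

Keeping the conversion in one small helper makes it easier to audit callers that assumed watts before this release.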