diff --git a/CHANGELOG.md b/CHANGELOG.md
index b379f383b..976222a27 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,189 @@
 This page is a historical overview of changes made to ROCm components. This consolidated changelog documents key modifications and improvements across different versions of the ROCm software stack and its components.
+## ROCm 7.1.1
+
+See the [ROCm 7.1.1 release notes](https://rocm.docs.amd.com/en/latest/about/release-notes.html#rocm-7-1-1-release-notes)
+for a complete overview of this release.
+
+### **AMD SMI** (26.2.0)
+
+#### Added
+
+- Caching for repeated ASIC information calls.
+  - The cache added to `amdsmi_get_gpu_asic_info` improves performance by avoiding redundant hardware queries.
+  - The cache stores ASIC info for each GPU device with a configurable duration, defaulting to 10 seconds. Use the `AMDSMI_ASIC_INFO_CACHE_MS` environment variable to configure the cache duration for `amdsmi_get_gpu_asic_info` API calls.
+
+- Support for GPU partition metrics.
+  - Provides support for `xcp_metrics` v1.0 and extends support for v1.1 (dynamic metrics).
+  - Added `amdsmi_get_gpu_partition_metrics_info`, which provides per-XCP (partition) metrics.
+
+- Support for displaying newer VRAM memory types in `amd-smi static --vram`.
+  - The `amdsmi_get_gpu_vram_info()` API now supports detecting DDR5, LPDDR4, LPDDR5, and HBM3E memory types.
+
+#### Changed
+
+- Updated the `amd-smi static --numa` socket affinity data structure. It now displays CPU affinity information in both hexadecimal bitmask format and expanded CPU core ranges, replacing the previous simplified socket enumeration approach.
+
+#### Resolved Issues
+
+- Fixed incorrect topology weight calculations.
+  - Out-of-bounds writes caused corruption in the weights field.
+
+- Fixed `amd-smi event` not respecting the Linux timeout command.
+
+- Fixed an issue where `amdsmi_get_power_info` returned `AMDSMI_STATUS_API_FAILED`.
+  - VMs were incorrectly reporting `AMDSMI_STATUS_API_FAILED` when the power cap could not be retrieved within the `amdsmi_get_power_info` API.
+  - The API now returns `N/A` or `UINT_MAX` for values that cannot be retrieved, instead of failing entirely.
+
+- Fixed output for `amd-smi xgmi -l --json`.
+
+### **Composable Kernel** (1.1.0)
+
+#### Upcoming changes
+
+* Composable Kernel will adopt C++20 features in an upcoming ROCm release, updating the minimum compiler requirement to C++20. Ensure that your development environment meets this requirement to facilitate a seamless transition.
+
+### **HIP** (7.1.1)
+
+#### Added
+
+* Support for the `hipHostRegisterIoMemory` flag in `hipHostRegister`, used to register I/O memory with the HIP runtime so the GPU can access it.
+
+#### Resolved issues
+
+* Incorrect Compute Unit (CU) mask in logging. HIP runtime now correctly sets the field width for the output print operation. When logging is enabled via the environment variable `AMD_LOG_LEVEL`, the runtime logs the accurate CU mask.
+* A segmentation fault occurred when the dynamic queue management mechanism was enabled. HIP runtime now ensures GPU queues aren't NULL during marker submission, preventing crashes and improving robustness.
+* An error encountered on HIP tear-down after device reset in certain applications, due to accessing stale memory objects. HIP runtime now properly releases memory associated with host calls, ensuring reliable device resets.
+* A race condition occurred in certain graph-related applications when pending asynchronous signal handlers referenced device memory that had already been released, leading to memory corruption. HIP runtime now uses a reference counting strategy to manage access to device objects in asynchronous event handlers, ensuring safe and reliable memory usage.
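The time-bounded caching that AMD SMI adds for ASIC information can be illustrated with a short, self-contained Python sketch. This is not the amdsmi implementation — the function names and cache layout here are hypothetical — it only shows the idea of a per-device TTL cache whose duration comes from an environment variable such as `AMDSMI_ASIC_INFO_CACHE_MS`:

```python
import os
import time

# Hypothetical sketch of a per-device TTL cache, in the spirit of the
# AMDSMI_ASIC_INFO_CACHE_MS behavior described above. Not the real amdsmi code.
_CACHE_MS = int(os.environ.get("AMDSMI_ASIC_INFO_CACHE_MS", "10000"))  # default 10 s
_cache = {}  # device_id -> (timestamp, value)

def query_hardware(device_id):
    """Stand-in for an expensive hardware query."""
    return {"device": device_id, "market_name": "example-gpu"}

def get_asic_info(device_id):
    """Return cached ASIC info if it is younger than the configured TTL."""
    now = time.monotonic()
    hit = _cache.get(device_id)
    if hit is not None and (now - hit[0]) * 1000.0 < _CACHE_MS:
        return hit[1]                      # fresh entry: skip the hardware query
    value = query_hardware(device_id)      # stale or missing: re-query and store
    _cache[device_id] = (now, value)
    return value

first = get_asic_info(0)
second = get_asic_info(0)                  # served from the cache within the TTL
print(first is second)                     # True while the entry is fresh
```

With the default of 10,000 ms, repeated calls inside the window return the cached object instead of re-querying the hardware, which is the redundant-query cost the changelog entry describes avoiding.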
+
+### **MIGraphX** (2.14.0)
+
+#### Resolved issues
+
+* Fixed an error that occurred when running `make check` on systems with a gfx1201 GPU [(#4397)](https://github.com/ROCm/AMDMIGraphX/pull/4397).
+
+### **RCCL** (2.27.7)
+
+#### Resolved Issues
+
+* Fixed a single-node data corruption issue in MSCCL on the AMD Instinct MI350X and MI355X GPUs for the LL protocol. This previously affected about two percent of the runs for single-node `AllReduce` with inputs smaller than 512 KiB.
+
+### **rocBLAS** (5.1.1)
+
+#### Changed
+
+* By default, rocBLAS will not use stream order allocation for its internal workspace. To enable this behavior, set the `ROCBLAS_STREAM_ORDER_ALLOC` environment variable.
+
+### **ROCm Bandwidth Test** (2.6.0)
+
+#### Fixed
+
+- Test failure with the error message `Cannot make canonical path`.
+- Healthcheck test failure with a segmentation fault on gfx942.
+- Segmentation fault observed in `schmoo` and `one2all` when executed on an `sgpu` setup.
+
+#### Known Issues
+
+- The `rocm-bandwidth-test` folder fails to be removed after driver uninstallation:
+  * After running `amdgpu-uninstall`, the `rocm-bandwidth-test` folder and package are still present.
+  * Workaround: Remove the package manually using:
+    ```
+    sudo apt-get remove -y rocm-bandwidth-test
+    ```
+
+### **ROCm Compute Profiler** (3.3.1)
+
+#### Added
+
+* Support for PC sampling of multi-kernel applications.
+  * PC sampling output instructions are displayed with the name of the kernel to which each instruction belongs.
+  * Single-kernel selection is supported so that the PC samples of the selected kernel can be displayed.
+
+#### Changed
+
+* Roofline analysis now runs on GPU 0 by default instead of all GPUs.
+
+#### Optimized
+
+* Improved roofline benchmarking by updating the `flops_benchmark` calculation.
+
+* Improved standalone roofline plots in profile mode (PDF output) and analyze mode (CLI and GUI visual plots):
+  * Fixed the peak MFMA/VALU lines being cut off.
+  * Cleaned up the overlapping roofline numeric values by moving them into the side legend.
+  * Added an AI points chart with respective values, cache level, and compute/memory bound status.
+  * Added full kernel names to the symbol chart.
+
+#### Resolved issues
+
+* Resolved existing issues to improve stability.
+
+### **ROCm Systems Profiler** (1.2.1)
+
+#### Resolved issues
+
+- Fixed an issue where OpenMP Tools (OMPT) events, GPU performance counters, VA-API, MPI, and host events failed to be collected in the `rocpd` output.
+
+### **ROCm Validation Suite** (1.3.0)
+
+#### Added
+
+* Support for different test levels with the `-r` option for MI3XX GPUs.
+* Set compute type for DGEMM operations on MI350X and MI355X.
+
+### **rocSHMEM** (3.0.0)
+
+#### Added
+
+* Allowed IPC, RO, and GDA backends to be selected at runtime.
+* GDA conduit for different NIC vendors:
+  * Broadcom BNXT\_RE (Thor 2)
+  * Mellanox MLX5 (IB and RoCE ConnectX-7)
+* New APIs:
+  * `rocshmem_get_device_ctx`
+
+#### Changed
+
+* The following APIs have been deprecated:
+  * `rocshmem_wg_init`
+  * `rocshmem_wg_finalize`
+  * `rocshmem_wg_init_thread`
+
+* `rocshmem_ptr` can now return a non-null pointer to a shared memory region when the IPC transport is available to reach that region. Previously, it would return a null pointer.
+* `ROCSHMEM_RO_DISABLE_IPC` has been renamed to `ROCSHMEM_DISABLE_MIXED_IPC`.
+  - This environment variable wasn't documented in earlier releases. It's now documented.
+
+#### Removed
+
+* rocSHMEM no longer requires rocPRIM and rocThrust as dependencies.
+* Removed the MPI compile-time dependency.
+
+#### Known issues
+
+* Only a subset of rocSHMEM APIs are implemented for the GDA conduit.
+
+### **rocWMMA** (2.1.0)
+
+#### Added
+
+* More unit tests to increase code coverage.
+
+#### Changed
+
+* Increased the compile timeout and improved visualization in `math-ci`.
+
+#### Removed
+
+* Absolute paths from the `RPATH` of sample and test binary files.
+
+#### Resolved Issues
+
+* Fixed issues caused by HIP changes:
+  * Removed the `.data` member from `HIP_vector_type`.
+  * The broadcast constructor now writes only to the first vector element.
+* Fixed a bug related to `int32_t` usage in `hipRTC_gemm` for gfx942, caused by breaking changes in HIP.
+* Replaced `#pragma unroll` with `static for` to fix a bug caused by the upgraded compiler, which no longer supports using `#pragma unroll` with template parameter indices.
+* Corrected test predicates for `BLK` and `VW` cooperative kernels.
+* Modified `compute_utils.sh` in `build-infra` to ensure rocWMMA is built with the gfx1151 target for ROCm 7.0 and beyond.
+
## ROCm 7.1.0

See the [ROCm 7.1.0 release notes](https://rocm.docs.amd.com/en/docs-7.1.0/about/release-notes.html#rocm-7-1-0-release-notes)

diff --git a/RELEASE.md b/RELEASE.md
index 5f897ad33..d1f920632 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -10,7 +10,7 @@

-# ROCm 7.1.0 release notes
+# ROCm 7.1.1 release notes

The release notes provide a summary of notable changes since the previous ROCm release.

@@ -37,36 +37,39 @@ documentation to verify compatibility and system requirements.

## Release highlights

-The following are notable new features and improvements in ROCm 7.1.0. For changes to individual components, see
+The following are notable new features and improvements in ROCm 7.1.1. For changes to individual components, see
[Detailed component changes](#detailed-component-changes).

### Supported hardware, operating system, and virtualization changes

-ROCm 7.1.0 extends the operating system support for the following AMD hardware:
+ROCm 7.1.1 adds support for the following operating systems and kernel versions:

-* AMD Instinct MI325X adds support for RHEL 10.0, SLES15 SP7, Debian 13, Debian 12, Oracle Linux 10, and Oracle Linux 9.
-* AMD Instinct MI100 adds support for SLES 15 SP7.
+* RHEL 10.1 (kernel: 6.12.0-124)
-For more information about supported:
+* RHEL 9.7 (kernel: 5.14.0-611)
-* AMD hardware, see [Supported GPUs (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-gpus).
+ROCm 7.1.1 extends Debian 13 support for AMD Instinct MI355X and MI350X GPUs.
-* Operating systems, see [Supported operating systems](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-operating-systems) and [ROCm installation for Linux](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/).
+For more information on:
+
+* AMD hardware, see [Supported GPUs (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus).
+
+* Operating systems, see [Supported operating systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) and [ROCm installation for Linux](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/).

#### Virtualization support

-ROCm 7.1.0 adds Guest OS support for RHEL 10.0 in KVM SR-IOV for AMD Instinct MI355X and MI350X GPUs.
+ROCm 7.1.1 adds Ubuntu 24.04 as a Guest OS in KVM SR-IOV for AMD Instinct MI300X GPUs.

-For more information, see [Virtualization Support](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#virtualization-support).
+For more information, see [Virtualization Support](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#virtualization-support).
### User space, driver, and firmware dependent changes

-The software for AMD Datacenter GPU products requires maintaining a hardware
+The software for AMD data center GPU products requires maintaining a hardware
and software stack with interdependencies between the GPU and baseboard
firmware, AMD GPU drivers, and the ROCm user space software.
| @@ -93,37 +96,35 @@ firmware, AMD GPU drivers, and the ROCm user space software. } | ||||||||
|---|---|---|---|---|---|---|---|---|
| ROCm 7.1.0 | +ROCm 7.1.1 | MI355X |
- 01.25.15.04 (or later) - 01.25.13.09 + 01.25.16.03 (or later) + 01.25.15.04 |
- 30.20.0 + |
+ 30.20.1 + 30.20.0 30.10.2 30.10.1 30.10 |
- 8.5.0.K | +8.6.0.K | |
| MI350X |
- 01.25.15.04 (or later) - 01.25.13.09 + 01.25.16.03 (or later) + 01.25.15.04 |
- 30.20.0 - 30.10.2 - 30.10.1 - 30.10 |
||||||
| MI325X | +MI325X[2] |
- 01.25.05.01 + 01.25.05.02 (or later)[1] 01.25.04.02 |
-
- 30.20.0 + | 30.20.1 + 30.20.0[2] 30.10.2 30.10.1 30.10 @@ -136,34 +137,33 @@ firmware, AMD GPU drivers, and the ROCm user space software. | 01.25.05.00 (or later)[1] 01.25.03.12 |
+ 30.20.1 30.20.0 30.10.2 30.10.1 30.10 6.4.z where z (0–3) - 6.3.y where y (0–3) - 6.2.x where x (1–4) + 6.3.y where y (1–3) |
- 8.5.0.K | +8.6.0.K |
| MI300A | -BKC 26 - BKC 25 |
+ BKC 26 | Not Applicable | |||||
| MI250X | -IFWI 47 | +IFWI 47 (or later) | ||||||
| MI250 | -MU3 w/ IFWI 73 | +MU5 w/ IFWI 75 (or later) | ||||||
| MI210 | -MU3 w/ IFWI 73 | -8.5.0.K | +MU5 w/ IFWI 75 (or later) | +8.6.0.K | ||||
| MI100 | @@ -173,200 +173,111 @@ firmware, AMD GPU drivers, and the ROCm user space software.
[1]: PLDM bundle 01.25.05.00 will be available by November 2025.
+[1]: PLDM bundles 01.25.05.02 and 01.25.05.00 will be available by the end of November 2025.
+[2]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU Driver (amdgpu) 30.20.0.
-#### AMD SMI improvement: Set power cap
+#### AMD Instinct MI355X GPU resiliency improvement

-AMD Instinct MI300X now supports setting a power cap in 1VF. The system is designed to select the lowest power cap value from those provided by the host, VM, and Advanced Platform Management Link (APML). This feature provides enhanced control over power management in virtualized environments, particularly in single-VM configurations. By allowing the VM to set a power cap, you can optimize power usage and efficiency for your specific needs. This feature requires PLDM bundle 01.25.05.00 (or later) firmware.
+Multimedia Engine Reset is now supported by AMD GPU Driver (amdgpu) 30.20.1 for AMD Instinct MI355X GPUs. This finer-grained GPU resiliency feature allows recovery from faults related to VCN or JPEG without requiring a full GPU reset, thereby improving system stability and fault tolerance. Note that VCN queue reset functionality requires PLDM bundle 01.25.16.03 (or later) firmware.

-#### Virtualization update for AMD Instinct MI350 Series GPUs
+### GEMM kernel selection improvement

-* Enabled SPX/NPS1 support for multi-tenant (1VM, 2VM, 4VM, and 8VM). This feature depends on PLDM bundle 01.25.15.04.
+GEMM kernel selection efficiency has been improved using Origami. This results in improved out-of-the-box performance of GEMM functions for hipBLASLt and rocBLAS, as well as a reduced need for tuning. This change reduces selection time, increases selection accuracy, and adds Origami libraries for all GEMM problem types on AMD Instinct MI350X GPUs.

-* Enabled CPX/NPS2 support (1VF/OAM). This feature depends on PLDM bundle 01.25.15.04. (Technical preview)
+### Performance improvement in CK/AITER fused-attn

-* Enabled DPX/NPS2 support (1VF/OAM). This feature depends on PLDM bundle 01.25.15.04.
+Padding is now supported in native CK/AITER fused-attn mode, reducing the overall runtime.
Previously, the Transformer Engine (TE) had to remove padding before processing and reapply it afterward as a workaround, which added runtime overhead. With this update, TE can now pass padded input directly to CK/AITER and receive padded output, eliminating the need for that workaround.

-* Enabled Guest OS support for RHEL 10 and RHEL 9.6. This feature depends on PLDM bundle 01.25.15.04.
+### AI model support update

-### HIP runtime compatibility improvements
+ROCm 7.1.1 updates the support for the following AI models:

-ROCm 7.1.0 improves the compatibility between the HIP runtime and NVIDIA CUDA.
+* [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/index) is now supported on gfx1201.
+* [Microsoft Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) is now supported on gfx1201.
+* [Qwen QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) is now supported on gfx1201.
+* [Google Gemma 3 27B](https://huggingface.co/google/gemma-3-27b-it) is now supported on gfx1100.

-* New HIP APIs added for:
+### ROCm Data Science updates

-  * Memory management: `hipMemsetD2D8`, `hipMemsetD2D8Async`, `hipMemsetD2D16`, `hipMemsetD2D16Async`, `hipMemsetD2D32`, `hipMemsetD2D32Async`, `hipMemcpyBatchAsync`, `hipMemcpy3DBatchAsync`, `hipMemcpy3DPeer`, `hipMemcpy3DPeerAsync`, `hipMemPrefetchAsync_v2`, and `hipMemAdvise_v2`.
-  * Module Management:`hipModuleGetFunctionCoun` and `hipModuleLoadFatBinary`
-  * Stream Management: `hipStreamSetAttribute`, `hipStreamGetAttribute`, and `hipStreamGetId`
-  * Device Management: `hipSetValidDevices`
-  * Driver Entry Point Access: `hipGetDriverEntryPoint`
-* HIP runtime now supports nested tile partitioning within cooperative groups, matching CUDA functionality.
-* Improved HIP module loading latency.
+ROCm Data Science Toolkit (ROCm-DS) is a comprehensive open-source software collection designed to accelerate data science and machine learning workloads on AMD GPUs.
In November 2025, ROCm-DS transitioned from early access (EA) to general availability (GA). -For detailed enhancements and updates refer to the [HIP Changelog](#hip-7-1-0). - -### hipBLASLt: Kernel optimizations and model support enhancements - -hipBLASLt introduces several performance and model compatibility improvements for AMD Instinct GPUs: - -* TF32 kernel optimization for AMD Instinct MI355X GPUs to enhance training and inference efficiency. -* FP32 kernel optimization for AMD Instinct MI350X GPUs, improving precision-based workloads. -* Llama 2 70B model support fix for AMD Instinct MI350X GPUs: Removed incorrect kernel to ensure accurate and stable execution. -* For AMD Instinct MI350X GPUs, added multiple high-performance kernels optimized for `FP16` and `BF16` data types, enhancing heuristic-based execution. -* FP8 low-precision data type operations on AMD Instinct MI350X GPUs. This update adds FP8 support for the Instinct MI350X using the hipBLASLt low-precision data type functionality. -* Mixtral-8x7b model optimization for AMD Instinct MI325X GPUs. - -### hipSPARSELt: SpMM performance improvements - -hipSPARSELt introduces significant performance enhancements for structured sparsity matrix multiplication (SpMM) on AMD Instinct MI300X GPUs: - -* New feature support -- Enabled multiple buffer single kernel execution for SpMM, improving efficiency in Split-K method scenarios. -* Kernel optimization -- Added multiple high-performance kernels optimized for `FP16` and `BF16` data types, enhancing heuristic-based execution. -* Tuning efficiency -- Improved the tuning process for SpMM kernels, resulting in better runtime adaptability and performance. - -### rocAL: Enhancements for vision transformer model training - -ROCm 7.1.0 introduces new capabilities in rocAL to support training of Vision Transformer (ViT) models: - -* Added support for CropResize augmentation and the CIFAR10 dataloader, commonly used in ViT training workflows. 
-* These updates enable seamless integration of rocAL into open-source PyTorch Vision Transformer models. - -This enhancement improves preprocessing efficiency and simplifies the setup of data pipelines for ViT-based deep learning applications. - -### RCCL: AMD Instinct MI350 Series enhancements - -* Optimized performance for select collective operations. -* Enhanced single-node performance on AMD Instinct MI350 GPUs. -* Achieved higher throughput with increased XGMI speed. -* Verified compatibility with NCCL 2.27.7. -* Improved efficiency for the All Gather collective. - -### ROCm Compute Profiler updates - -ROCm Compute Profiler has the following enhancements: - -* Single‑Pass Counter Collection feature has been added and can be used by adding the `set` filtering option to the profile. It allows profiling kernels in a single pass using a predefined metric set, reducing profiling overhead and session time. For more information, see [Filtering options](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/docs-7.1.0/how-to/profile/mode.html#filtering-options). -* Dynamic process attachment feature has been added. It allows starting or stopping profiling on a running application without restarting, enabling flexible analysis for long‑running jobs. For more information, see [Dynamic process attachment in ROCm Compute Profiler](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/docs-7.1.0/how-to/live_attach_detach.html). -* Enhanced TUI Experience feature has been added. It allows interactive exploration of metrics with descriptions and views of high‑level compute and memory throughput panels for quick insights. For more information, see [Text-based User Interface (TUI) analysis](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/docs-7.1.0/how-to/analyze/tui.html). - -### ROCm Systems Profiler updates - -ROCm Systems Profiler has the following enhancements: - -* Validated JAX AI and PyTorch AI frameworks. 
-* Transitioned to using AMD SMI by default, instead of ROCm SMI to ensure the best support for the latest AMD GPUs. -* Integrated with ROCm Profiling Data (rocpd), enabling profiling results to be stored in a SQLite3 database. This provides a structured and efficient foundation for in-depth analysis and post-processing. For more information, see [ROCm Profiling Data (rocpd) output](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/docs-7.1.0/how-to/understanding-rocprof-sys-output.html#rocm-profiling-data-rocpd-output). -* Ability to generate an aggregated report for multi-processes has been added. For more information, see [Generating performance summary using rocpd](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.1.0/how-to/using-rocpd-output-format.html#generating-performance-summary-using-rocpd). -* Support for OpenMP (Open Multi-Processing) in Fortran has been added. - -### ROCprofiler-SDK updates - -ROCprofiler-SDK and `rocprofv3` include the following enhancements: - -* Dynamic process attachment feature has been added. This feature in ROCprofiler-SDK and `rocprofv3` allows dynamic profiling of a running GPU application by attaching to its process ID (PID), rather than launching the application through the profiler itself. This allows real-time data collection without interrupting execution, making it ideal for profiling long-running, containerized, or multiprocess workloads. For more details, see [Dynamic process attachment](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.1.0/how-to/using-rocprofv3-process-attachment.html) for `rocprofv3` and [Implementing Process Attachment Tools](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.1.0/api-reference/process_attachment.html) for `ROCprofiler-SDK`. -* Scratch-memory trace information has been added to the Perfetto output in `rocprofv3`, enhancing visibility into memory usage during profiling. 
Additionally, derived metrics and the required counters have been successfully integrated for gfx12XX Series GPUs, enabling users to collect performance counters through `rocprofv3` on these platforms. -* Host-trap (software-based) PC sampling is now available on RDNA4 architecture-based gfx12XX Series GPUs. It uses the kernel threads to interrupt GPU waves and capture PC states. For more details, see [Using PC sampling](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.1.0/how-to/using-pc-sampling.html). -* Real-time clock support has been added to the thread trace in `rocprofv3` for thread trace alignment on gfx9xx GPUs, enabling high-resolution clock computation and better synchronization across shader engines. -* `MultiKernelDispatch` thread trace support is now available across all ASICs, allowing users to profile multiple kernel dispatches within a single thread trace session. This enhances the timeline accuracy and enables deeper analysis of concurrent GPU workloads. -* Stability and robustness of the `rocpd` output format for `rocprofv3` has been improved. For details, see [Using rocpd output format](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.1.0/how-to/using-rocpd-output-format.html). -* Ability to generate an aggregated report for multi-processes has been added. For more information, see [Generating performance summary using rocpd](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.1.0/how-to/using-rocpd-output-format.html#generating-performance-summary-using-rocpd). - -### ROCm Data Center tool: Enhanced CPU metrics - -The ROCm Data Center tool (RDC) hardware monitoring capabilities have been expanded by integrating the new `AMDSMI` API. This enhancement enables more comprehensive visibility into CPU performance and topology. - -### RPP: New hue and saturation augmentations - -RPP adds support for hue and saturation augmentations in the ROCm -Performance Primitives (RPP) library. 
These enhancements are available for both -HIP and HOST backends and support multiple data types — ``U8``, ``F16``, -``F32``, and ``I8`` — with layout toggle variants for NCHW and NHWC. - -### TensileLite: Enhanced SpMM kernel tuning efficiency - -Optimized the tuning workflow for the SpMM kernel, resulting in improved performance and streamlined configuration. - -### Device-side assertion support and atomic metadata control in Clang - -ROCm 7.1.0 introduces two key compiler enhancements: - -* Device-compatible assertions: A ``__device__`` version of - ``std::__glibcxx_assert_fail()`` has been added to enable the use of ``std::array`` and - other libstdc++ features in device code. This resolves previous compilation - failures caused by non-constexpr host assertions being invoked from device - contexts. - -* Clang atomic metadata attribute: The new ``[[clang::atomic]]`` statement - attribute allows fine-grained control over how atomic operations are lowered in - LLVM IR. Users can specify memory types (for example, ``remote_memory``, - ``fine_grained_memory``) and floating-point behavior (``ignore_denormal_mode``) to - optimize performance without compromising correctness. These attributes can - override global compiler flags on a per-block basis, improving atomic operation - efficiency on architectures like AMDGPU. - -### Model optimization for AMD Instinct MI300X GPUs - -Kernel optimization for Flash Attention and Paged Attention models on AMD Instinct MI300X GPUs. +This GA release marks a significant milestone for ROCm-DS as hipDF and hipMM transition to production status. Additionally, it introduces two new production components: hipRAFT and hipVS. For more information, see [AMD ROCm-DS documentation](https://rocm.docs.amd.com/projects/rocm-ds/en/latest/). ### Deep learning and AI framework updates ROCm provides a comprehensive ecosystem for deep learning development. 
For more information, see [Deep learning frameworks for ROCm](https://rocm.docs.amd.com/en/docs-7.1.0/how-to/deep-learning-rocm.html) and the [Compatibility
-matrix](../../docs/compatibility/compatibility-matrix.rst) for the complete list of Deep learning and AI framework versions tested for compatibility with ROCm.
-
-#### PyTorch
-
-Torch-MIGraphX integrates the AMD graph inference engine with the PyTorch ecosystem. It provides a `mgx_module` object that may be invoked in the same manner as any other torch module, but utilizes the MIGraphX inference engine internally. Although Torch-MIGraphX has been available in previous releases, installable WHL files are now officially published.
+matrix](../../docs/compatibility/compatibility-matrix.rst) for the complete list of deep learning and AI framework versions tested for compatibility with ROCm. As of November 2025, AMD ROCm has officially updated support for the following deep learning and AI frameworks:

#### JAX

-* JAX customers can now use Llama-2 with JAX efficiently.
-* The latest public JAX repo is {fab}`github` [rocm-jax](https://github.com/ROCm/rocm-jax/tree/master).
+Users of the JAX deep learning framework can now use Llama-2 efficiently. For more information, see [JAX compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/jax-compatibility.html).

-#### TensorFlow
-ROCm 7.1.0 enables support for TensorFlow 2.20.0.
+#### Deep Graph Library (DGL)

-#### ONNX Runtime
+Deep Graph Library [(DGL)](https://www.dgl.ai/) is an easy-to-use, high-performance, and scalable Python package for deep learning on graphs. DGL is framework agnostic, meaning that if a deep graph model is a component in an end-to-end application, the rest of the logic can be implemented using PyTorch. It's supported on ROCm 7.0.0, ROCm 6.4.3, and ROCm 6.4.0. For more information, see [DGL compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/dgl-compatibility.html).
-The latest ONNX Runtime version (ONNX RT 1.23.1) is supported by the MIGraphX Execution Provider.
+#### llama.cpp
+
+llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup. It's supported on ROCm 7.0.0 and ROCm 6.4.x. For more information, see [llama.cpp compatibility](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html).

### ROCm Offline Installer Creator updates
-
-The ROCm Offline Installer Creator 7.1.0 includes the following features and improvements:
-* Added support for creating an offline installer for RHEL 8.10, 9.4, 9.6, and 10.0, where the kernel version of the target OS differs from the host OS creating the installer.
-
-* Fixes an issue in the Debian 13 Docker that prevented users from creating a driver install package using the default Docker kernel driver.
+The ROCm Offline Installer Creator 7.1.1 includes the following features and improvements:
+* Added support for RHEL 9.7 and 10.1.
+* Added support for creating an offline installer for SLES 15.7, where the kernel version of the target OS differs from the host OS creating the installer.

See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/install/rocm-offline-installer.html) for more information.

### ROCm Runfile Installer updates

-The ROCm Runfile Installer 7.1.0 fixes warnings that occurred with rocm-examples testing.
+The ROCm Runfile Installer 7.1.1 includes the following features and improvements:
+
+* Added support for RHEL 9.7 and 10.1.
+* Fixed an issue where, after dependency installation, some dependencies were still marked as uninstalled.
+* Fixed an issue where the AMDGPU driver install would fail when multiple kernels were installed.
+* Performance improvements for the RHEL/Oracle Linux dependency install.
For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/install/rocm-runfile-installer.html).

-### End of Support for ROCm Execution Provider (ROCm-EP)
-
-ROCm 7.1.0 marks the End of Support (EOS) for ROCm Execution Provider (ROCm-EP). ROCm 7.0.2 was the last official AMD-supported distribution of ROCm-EP. Refer to this [Pull Request](https://github.com/microsoft/onnxruntime/pull/25181) for more information. Migrate your applications to use the [MIGraphX Execution Provider](https://onnxruntime.ai/docs/execution-providers/MIGraphX-ExecutionProvider.html#migraphx-execution-provider).
-
### ROCm documentation updates

ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.

-* [Tutorials for AI developers](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/) have been expanded with the following two new tutorials:
-  * Pretraining tutorial: [Speculative decoding draft model with SpecForge](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/SpecForge_SGlang.html)
-  * GPU development and optimization tutorial: [Quark MXFP4 quantization for vLLM](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/gpu_dev_optimize/mxfp4_quantization_quark_vllm.html)
+* The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/latest/) has been enhanced with new GPU programming pattern tutorials. These tutorials address common GPU challenges, including memory coherence, race conditions, and data transfer overhead. They provide practical, performance-oriented examples for real-world applications in machine learning, scientific computing, and image processing.
  The following tutorials have been added:
-  For more information about the changes, see [Changelog for the AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/changelog.html).
+  * **Two-dimensional kernels**: Efficient matrix and image processing with optimized thread mapping and memory access.
+  * **Stencil operations**: Implementing spatially dependent computations for image filtering and physics simulations.
+  * **Atomic operations**: Managing concurrent memory access safely for tasks such as histogram generation.
+  * **Multi-kernel programming**: Coordinating multiple GPU kernels for complex iterative algorithms such as graph traversal.
+  * **CPU-GPU cooperative computing**: Balancing workloads between CPU and GPU for hybrid algorithms such as K-means clustering.
-* ROCm components support a wide range of environment variables that can be used for testing, logging, debugging, experimental features, and more. The [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-7.1.0/reference/env-variables.html) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/docs-7.1.0/api-reference/env-variables.html) components have been updated with new environment variable content.
+* [Tutorials for AI developers](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/) have been expanded with the following two new pretraining tutorials:
+  * [Pretraining with TorchTitan](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/torchtitan_deepseek.html)
+  * [Training a model with Primus](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/training_with_primus.html)
+
+  For more information about the changes, see the [Changelog for the AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/changelog.html).
-* The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.1.0/index.html) introduces a new tutorial that shows you how to transform your GPU applications from repeated direction to choreographed performance with HIP graphs. HIP graphs model dependencies between operations as nodes and edges on a diagram. Each node in the graph represents an operation, and each edge represents a dependency between two nodes. For more information, see [HIP graphs](https://rocm.docs.amd.com/projects/HIP/en/docs-7.1.0/how-to/hip_runtime_api/hipgraph.html#how-to-hip-graph) and [HIP Graph API Tutorial](https://rocm.docs.amd.com/projects/HIP/en/docs-7.1.0/tutorial/graph_api.html).
+* The [ROCm examples repository](https://github.com/ROCm/rocm-examples) has been expanded with examples for the following ROCm components:
+  * [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/)
+  * [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/latest/)
+  * [hipSPARSELt](https://rocm.docs.amd.com/projects/hipSPARSELt/en/latest/)
+  * [hipTensor](https://rocm.docs.amd.com/projects/hipTensor/en/latest/)
+  * [rocALUTION](https://rocm.docs.amd.com/projects/rocALUTION/en/latest/)
+  * [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/)
+  * [rocWMMA](https://rocm.docs.amd.com/projects/rocWMMA/en/latest/)
+
+  Usage examples are now available for the following performance analysis tools:
+
+  * [ROCm Compute Profiler](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/index.html)
+  * [ROCm Systems Profiler](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html)
+  * [rocprofv3](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-rocprofv3.html)
+
+  The complete source code for the [HIP Graph Tutorial](https://rocm.docs.amd.com/projects/HIP/en/latest/tutorial/graph_api.html) is also available as part of the ROCm examples.
 ## ROCm components
 
-The following table lists the versions of ROCm components for ROCm 7.1.0, including any version
-changes from 7.0.2 to 7.1.0. Click the component's updated version to go to a list of its changes.
+The following table lists the versions of ROCm components for ROCm 7.1.1, including any version
+changes from 7.1.0 to 7.1.1. Click the component's updated version to go to a list of its changes.
 Click {fab}`github` to go to the component's source code on GitHub.
@@ -395,42 +306,42 @@ Click {fab}`github` to go to the component's source code on GitHub.