diff --git a/CHANGELOG.md b/CHANGELOG.md
index 24c3424bf..2c961cbcf 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -140,6 +140,11 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
 * ``hipblasLtMatmul()`` now returns an error when the workspace size is insufficient, rather than causing a segmentation fault.
+#### Optimized
+
+* `TF32` kernel optimization for the AMD Instinct MI355X GPU to enhance training and inference efficiency.
+* Meta Model optimization for the AMD Instinct MI350X GPU to enable better performance across transformer-based models.
+
 #### Resolved issues
 * Fixed incorrect results when using ldd and ldc dimension parameters with some solutions.
@@ -368,9 +373,10 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
 * The MSCCL++ feature is now disabled by default. The `--disable-mscclpp` build flag is replaced with `--enable-mscclpp` in the `rccl/install.sh` script.
 * Compatibility with NCCL 2.27.7.
-#### Optimized
-
-* Improved small message performance for `alltoall` by enabling and optimizing batched P2P operations.
+### Optimized
+* Enabled and optimized batched P2P operations to improve small message performance for `AllToAll` and `AllGather`.
+* Optimized channel count selection to improve efficiency for small-to-medium message sizes in `ReduceScatter`.
+* Changed code inlining to improve latency for small message sizes for `AllReduce`, `AllGather`, and `ReduceScatter`.
 #### Known issues
diff --git a/RELEASE.md b/RELEASE.md
index e40271e32..bd07cd745 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -255,7 +255,7 @@ Optimized the tuning workflow for the SpMM kernel, resulting in improved perform
 ROCm Compute Profiler has the following enhancements:
 * Single‑Pass Counter Collection feature has been added. It allows profiling kernels in a single pass using a predefined metric set, reducing profiling overhead and session time.
-* Dynamic Attach/Detach feature has been added. It allows starting or stopping profiling on a running application without restarting, enabling flexible analysis for long‑running jobs.
+* Live Attach/Detach feature has been added. It allows starting or stopping profiling on a running application without restarting, enabling flexible analysis for long‑running jobs.
 * Enhanced TUI Experience feature has been added. It allows for interactive exploration of metrics with descriptions and view high‑level compute and memory throughput panels for quick insights.
 ### ROCm Systems Profiler updates
@@ -266,6 +266,17 @@ ROCm Systems Profiler has the following enhancements:
 * Transitioned to using AMD SMI by default, instead of ROCm SMI.
 * Integrated with ROCm Profiling Data (rocpd), enabling profiling results to be stored in a SQLite3 database. This provides a structured and efficient foundation for in-depth analysis and post-processing.
+### ROCprofiler-SDK updates
+
+ROCprofiler-SDK and `rocprofv3` include the following enhancements:
+
+* Dynamic process attachment feature has been added. This feature in ROCprofiler-SDK and `rocprofv3` allows dynamic profiling of a running GPU application by attaching to its process ID (PID), rather than launching the application through the profiler itself. This allows real-time data collection without interrupting execution, making it ideal for profiling long-running, containerized, or multiprocess workloads. For more details, refer to the [Dynamic process attachment](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-rocprofv3-process-attachment.html) documentation for `rocprofv3` and [Implementing Process Attachment Tools](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/api-reference/process_attachment.html) for `ROCprofiler-SDK`.
+* Scratch-memory trace information has been added to the Perfetto output in `rocprofv3`, enhancing visibility into memory usage during profiling. Additionally, derived metrics and the required counters have been successfully integrated for gfx12xx series GPUs, enabling users to collect performance counters through `rocprofv3` on these platforms.
+* Host-trap (software-based) PC sampling is now available on RDNA4 architecture-based gfx12xx series GPUs. It uses kernel threads to interrupt GPU waves and capture PC states. For more details, see [Using PC sampling](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-pc-sampling.html).
+* Real-time clock support has been added to the thread trace in `rocprofv3` for thread trace alignment on gfx9xx GPUs, enabling high-resolution clock computation and better synchronization across shader engines.
+* `MultiKernelDispatch` thread trace support is now available across all ASICs, allowing users to profile multiple kernel dispatches within a single thread trace session. This enhances timeline accuracy and enables deeper analysis of concurrent GPU workloads.
+* Stability and robustness of the `rocpd` output format for `rocprofv3` have been improved. For details, see [Using rocpd output format](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-rocpd-output-format.html).
+
 ### Device-side assertion support and atomic metadata control in Clang
 ROCm 7.1.0 introduces two key compiler enhancements:
@@ -578,7 +589,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
 ROCprofiler-SDK
- 1.0.0
+ 1.0.0 ⇒ 1.0.0
@@ -795,6 +806,11 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
 * ``hipblasLtMatmul()`` now returns an error when the workspace size is insufficient, rather than causing a segmentation fault.
+#### Optimized
+
+* `TF32` kernel optimization for the AMD Instinct MI355X GPU to enhance training and inference efficiency.
+* Meta Model optimization for the AMD Instinct MI350X GPU to enable better performance across transformer-based models.
+
 #### Resolved issues
 * Fixed incorrect results when using ldd and ldc dimension parameters with some solutions.
@@ -1024,8 +1040,9 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
 * Compatibility with NCCL 2.27.7.
 #### Optimized
-
-* Improved small message performance for `alltoall` by enabling and optimizing batched P2P operations.
+* Enabled and optimized batched P2P operations to improve small message performance for `AllToAll` and `AllGather`.
+* Optimized channel count selection to improve efficiency for small-to-medium message sizes in `ReduceScatter`.
+* Changed code inlining to improve latency for small message sizes for `AllReduce`, `AllGather`, and `ReduceScatter`.
 #### Known issues
@@ -1141,7 +1158,7 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
 ### **ROCm Compute Profiler** (3.3.0)
 #### Added
-* Live attach/detach feature that allows coupling with a workload process, without controlling its start or end.
+* Live Attach/Detach feature that allows coupling with a workload process, without controlling its start or end.
   * Use '--attach-pid' to specify the target process ID.
   * Use '--attach-duration-msec' to specify time duration.
 * `rocpd` choice for `--format-rocprof-output` option in profile mode.
@@ -1293,7 +1310,22 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
 #### Removed
-* `rocprofv2` doesn't support gfx12. For gfx12, use `rocprofv3` tool.
+* `rocprofv2` doesn't support gfx12xx series GPUs. For gfx12xx series GPUs, use the `rocprofv3` tool.
+
+### **ROCprofiler-SDK** (1.0.0)
+
+#### Added
+* Dynamic process attachment: ROCprofiler-SDK and `rocprofv3` now facilitate dynamic profiling of a running GPU application by attaching to its process ID (PID), rather than launching the application through the profiler itself.
+* Scratch-memory trace information to the Perfetto output in `rocprofv3`.
+* New capabilities to the thread trace support in `rocprofv3`:
+  * Real-time clock support for thread trace alignment on the gfx9 architecture. This enables high-resolution clock computation and better synchronization across shader engines.
+  * `MultiKernelDispatch` thread trace support is now available across all ASICs.
+* Documentation for dynamic process attachment.
+* Documentation for `rocpd` summaries.
+
+
+#### Optimized
+* Improved the stability and robustness of the `rocpd` output.
 ### **rocPyDecode** (0.7.0)