mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-09 14:48:06 -05:00
rocBLAS and HipBLASLt known issue added 7.1.0 (#5634)
* rocBLAS and HipBLASLt known issue added * Title warning fixed * Jeff's feedback added * Leo's feedback incorporated * Minor feedback * MI325X PLDM udpate * Leo's feedback added * PyTorch profiling issue added * Changelog synced * JAX section removed * Ram's feedback added
This commit is contained in:
18
CHANGELOG.md
18
CHANGELOG.md
@@ -48,10 +48,6 @@ for a complete overview of this release.
|
||||
|
||||
* Fixed certain output in `amd-smi monitor` when GPUs are partitioned. It fixes the issue with amd-smi monitor such as: `amd-smi monitor -Vqt`, `amd-smi monitor -g 0 -Vqt -w 1`, and `amd-smi monitor -Vqt --file /tmp/test1`. These commands will now be able to display as normal in partitioned GPU scenarios.
|
||||
|
||||
```{note}
|
||||
See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.1/CHANGELOG.md) for details, examples, and in-depth descriptions.
|
||||
```
|
||||
|
||||
### **Composable Kernel** (1.1.0)
|
||||
|
||||
#### Added
|
||||
@@ -493,7 +489,7 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
|
||||
* Enabled `TCP_TCP_LATENCY` counter and associated counter for all GPUs except MI300.
|
||||
* Interactive metric descriptions in TUI analyze mode.
|
||||
* You can now left click on any metric cell to view detailed descriptions in the dedicated `METRIC DESCRIPTION` tab.
|
||||
* Support for analysis report output as a sqlite database using ``--output-format db`` analysis mode option.
|
||||
* Support for analysis report output as a SQLite database using ``--output-format db`` analysis mode option.
|
||||
* `Compute Throughput` panel to TUI's `High Level Analysis` category with the following metrics: VALU FLOPs, VALU IOPs, MFMA FLOPs (F8), MFMA FLOPs (BF16), MFMA FLOPs (F16), MFMA FLOPs (F32), MFMA FLOPs (F64), MFMA FLOPs (F6F4) (in gfx950), MFMA IOPs (Int8), SALU Utilization, VALU Utilization, MFMA Utilization, VMEM Utilization, Branch Utilization, IPC
|
||||
|
||||
* `Memory Throughput` panel to TUI's `High Level Analysis` category with the following metrics: vL1D Cache BW, vL1D Cache Utilization, Theoretical LDS Bandwidth, LDS Utilization, L2 Cache BW, L2 Cache Utilization, L2-Fabric Read BW, L2-Fabric Write BW, sL1D Cache BW, L1I BW, Address Processing Unit Busy, Data-Return Busy, L1I-L2 Bandwidth, sL1D-L2 BW
|
||||
@@ -579,7 +575,7 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
|
||||
* MI300A/X L2-Fabric 64B read counter may display negative values - The rocprof-compute metric 17.6.1 (Read 64B) can report negative values due to incorrect calculation when TCC_BUBBLE_sum + TCC_EA0_RDREQ_32B_sum exceeds TCC_EA0_RDREQ_sum.
|
||||
* A workaround has been implemented using max(0, calculated_value) to prevent negative display values while the root cause is under investigation.
|
||||
* The profile mode crashes when `--format-rocprof-output json` is selected.
|
||||
* As a workaround, this option should either not be provided or should be set to `csv` instead of `json`. This issue does not affect the profiling results since both `csv` and `json` output formats lead to the same profiling data.
|
||||
* As a workaround, this option should either not be provided or should be set to `csv` instead of `json`. This issue does not affect the profiling results since both `csv` and `json` output formats lead to the same profiling data.
|
||||
|
||||
### **ROCm Data Center Tool** (1.2.0)
|
||||
|
||||
@@ -620,6 +616,14 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
|
||||
- Updated PAPI module to v7.2.0b2.
|
||||
- ROCprofiler-SDK is now used for tracing OMPT API calls.
|
||||
|
||||
#### Known issues
|
||||
|
||||
* PyTorch and other Python applications might fail to profile device activities when it is unable to find the libraries in the default linker path. As a workaround, you need to explicitly add the library path to ``LD_LIBRARY_PATH``. For PyTorch use:
|
||||
|
||||
```
|
||||
export LD_LIBRARY_PATH=:/opt/venv/lib/python3.10/site-packages/torch/lib:$LD_LIBRARY_PATH
|
||||
```
|
||||
|
||||
### **rocPRIM** (4.1.0)
|
||||
|
||||
#### Added
|
||||
@@ -699,7 +703,7 @@ As of ROCm 7.0, the internal error state is cleared on each call to `hipGetLastE
|
||||
|
||||
#### Added
|
||||
|
||||
* Hybrid computation support for existing routines: STEQR
|
||||
* Hybrid computation support for existing STEQR routines.
|
||||
|
||||
#### Optimized
|
||||
|
||||
|
||||
31
RELEASE.md
31
RELEASE.md
@@ -53,6 +53,10 @@ For more information about supported:
|
||||
|
||||
* Operating systems, see [Supported operating systems](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-operating-systems) and [ROCm installation for Linux](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/).
|
||||
|
||||
```{note}
|
||||
Starting ROCm 7.1.0, Upstream Inter-Process Communication (IPC) works with Checkpoint Restore in User space (CRIU) feature, but it requires the most up-to-date kernel and CRIU plugin.
|
||||
```
|
||||
|
||||
#### Virtualization support
|
||||
|
||||
ROCm 7.1.0 adds Guest OS support for RHEL 10.0 in KVM SR-IOV for AMD Instinct MI355X and MI350X GPUs.
|
||||
@@ -119,8 +123,7 @@ firmware, AMD GPU drivers, and the ROCm user space software.
|
||||
<tr>
|
||||
<td>MI325X</td>
|
||||
<td>
|
||||
01.25.05.01<br>
|
||||
01.25.04.02
|
||||
01.25.04.02<a href="#footnote2"><sup>[2]</sup></a>
|
||||
</td>
|
||||
<td>
|
||||
30.20.0<br>
|
||||
@@ -174,6 +177,7 @@ firmware, AMD GPU drivers, and the ROCm user space software.
|
||||
</div>
|
||||
|
||||
<p id="footnote1">[1]: PLDM bundle 01.25.05.00 will be available by November 2025.</p>
|
||||
<p id="footnote2">[2]: If using KVM SR-IOV, it’s recommended not to use AMD GPU Driver (amdgpu) 30.20.0 with PLDM bundle 01.25.04.02.</p>
|
||||
|
||||
#### AMD SMI improvement: Set power cap
|
||||
|
||||
@@ -317,11 +321,6 @@ matrix](../../docs/compatibility/compatibility-matrix.rst) for the complete list
|
||||
|
||||
Torch-MIGraphX integrates the AMD graph inference engine with the PyTorch ecosystem. It provides a `mgx_module` object that may be invoked in the same manner as any other torch module, but utilizes the MIGraphX inference engine internally. Although Torch-MIGraphX has been available in previous releases, installable WHL files are now officially published.
|
||||
|
||||
#### JAX
|
||||
|
||||
* JAX customers can now use Llama-2 with JAX efficiently.
|
||||
* The latest public JAX repo is {fab}`github` [rocm-jax](https://github.com/ROCm/rocm-jax/tree/master).
|
||||
|
||||
#### TensorFlow
|
||||
ROCm 7.1.0 enables support for TensorFlow 2.20.0.
|
||||
|
||||
@@ -1181,7 +1180,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
|
||||
* Enabled `TCP_TCP_LATENCY` counter and associated counter for all GPUs except MI300.
|
||||
* Interactive metric descriptions in TUI analyze mode.
|
||||
* You can now left click on any metric cell to view detailed descriptions in the dedicated `METRIC DESCRIPTION` tab.
|
||||
* Support for analysis report output as a sqlite database using ``--output-format db`` analysis mode option.
|
||||
* Support for analysis report output as a SQLite database using ``--output-format db`` analysis mode option.
|
||||
* `Compute Throughput` panel to TUI's `High Level Analysis` category with the following metrics: VALU FLOPs, VALU IOPs, MFMA FLOPs (F8), MFMA FLOPs (BF16), MFMA FLOPs (F16), MFMA FLOPs (F32), MFMA FLOPs (F64), MFMA FLOPs (F6F4) (in gfx950), MFMA IOPs (Int8), SALU Utilization, VALU Utilization, MFMA Utilization, VMEM Utilization, Branch Utilization, IPC
|
||||
|
||||
* `Memory Throughput` panel to TUI's `High Level Analysis` category with the following metrics: vL1D Cache BW, vL1D Cache Utilization, Theoretical LDS Bandwidth, LDS Utilization, L2 Cache BW, L2 Cache Utilization, L2-Fabric Read BW, L2-Fabric Write BW, sL1D Cache BW, L1I BW, Address Processing Unit Busy, Data-Return Busy, L1I-L2 Bandwidth, sL1D-L2 BW
|
||||
@@ -1308,6 +1307,14 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
|
||||
- Updated PAPI module to v7.2.0b2.
|
||||
- ROCprofiler-SDK is now used for tracing OMPT API calls.
|
||||
|
||||
#### Known issues
|
||||
|
||||
* PyTorch and other Python applications might fail to profile device activities when it is unable to find the libraries in the default linker path. As a workaround, you need to explicitly add the library path to ``LD_LIBRARY_PATH``. For PyTorch use:
|
||||
|
||||
```
|
||||
export LD_LIBRARY_PATH=:/opt/venv/lib/python3.10/site-packages/torch/lib:$LD_LIBRARY_PATH
|
||||
```
|
||||
|
||||
### **rocPRIM** (4.1.0)
|
||||
|
||||
#### Added
|
||||
@@ -1498,6 +1505,14 @@ ROCgdb might fail when running the `step-schedlock-spurious-waves.exp` test case
|
||||
|
||||
Due to a missing `rocm-core` dependency from the ROCm Bandwidth Test, you can't cleanly uninstall ROCm Bandwidth Test using the `amdgpu-install` script. As a workaround, uninstall ROCm Bandwidth Test manually, using the native package managers. For more information, see [Installation via native package manager](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-methods/package-manager-index.html). The issue will be fixed in a future ROCm release. See [GitHub issue #5611](https://github.com/ROCm/ROCm/issues/5611).
|
||||
|
||||
### OpenBLAS runtime dependency for hipblastlt-test and hipblaslt-bench
|
||||
|
||||
Running `hipblaslt-test` or `hipblaslt-bench` without installing the OpenBLAS development package results in the following error:
|
||||
```
|
||||
libopenblas.so.0: cannot open shared object file: No such file or directory
|
||||
```
|
||||
As a workaround, first install `libopenblas-dev` or `libopenblas-deve`, depending on the package manager used. The issue will be fixed in a future ROCm release.
|
||||
|
||||
## ROCm resolved issues
|
||||
|
||||
The following are previously known issues resolved in this release. For resolved issues related to
|
||||
|
||||
@@ -136,7 +136,7 @@ The following section maps supported data types and GPU-accelerated TensorFlow
|
||||
features to their minimum supported ROCm and TensorFlow versions.
|
||||
|
||||
Data types
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
-----------------
|
||||
|
||||
The data type of a tensor is specified using the ``dtype`` attribute or
|
||||
argument, and TensorFlow supports a wide range of data types for different use
|
||||
@@ -254,7 +254,7 @@ are as follows:
|
||||
- 1.7
|
||||
|
||||
Features
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
-----------------
|
||||
|
||||
This table provides an overview of key features in TensorFlow and their
|
||||
availability in ROCm.
|
||||
@@ -346,7 +346,7 @@ availability in ROCm.
|
||||
- 1.9.2
|
||||
|
||||
Distributed library features
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
-------------------------------------
|
||||
|
||||
Enables developers to scale computations across multiple devices on a single machine or
|
||||
across multiple machines.
|
||||
|
||||
Reference in New Issue
Block a user