Remove duplicate entry for Tensile

Sam Wu
2024-04-16 13:27:04 -06:00
committed by Sam Wu
parent 2ea7ac694e
commit 6d7daee9af

@@ -60,7 +60,6 @@ environments where legacy `DT_RPATH` is the preferred form of linking (instead o
do **not** recommend trying to install both sets of packages.
```
### Library changes in ROCm 6.1.0
| Library | Version |
@@ -319,53 +318,6 @@ rocWMMA 1.4.0 for ROCm 6.1.0
* Built all tests in large code model
* Removed inefficient branching in layout loop unrolling
#### Tensile
Tensile 4.40.0 for ROCm 6.1.0
##### Additions
* New `DisableKernelPieces` values to invalidate local read, local write, and global read
* Added Stream-K kernel generation, including two-tile Stream-K algorithm by setting `StreamK=3`
* New feature to allow testing Stream-K grid multipliers
* Added debug output to check occupancy for Stream-K
* Added reject condition for FractionalLoad + DepthU!=power of 2
* New `TENSILE_DB` debugging value to dump the common kernel parameters
* Added predicate for APU libs
* New parameter (`ClusterLocalRead`) to turn on/off wider local read opt for `TileMajorLDS`
* New parameter (`ExtraLatencyForLR`) to add extra interval between local read and wait
* New logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
* Added initialization type and general batched options to the `rocblas-bench` input creator script
##### Optimizations
* Enabled `MFMA` + `LocalSplitU=4` for `MT16x16`
* Enabled (`DirectToVgpr` + `MI4x4`) and supported skinny MacroTile
* Optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU
loop, wider global load depending on array size, and parallel reduction depending on array size
* Auto LdsPad calculation for `TileMajorLds` + `MI16x16`
* Auto LdsPad calculation for `UnrollMajorLds` + `MI16x16` + `VectorWidth`
##### Changes
* Cleared `hipErrorNotFound` error since it is an expected part of the search
* Modified hipCC search path for Linux
* Changed PCI ID from 32-bit to 64-bit for ROCm SMI HW monitor
* Split `LdsBlockSizePerPad` into `LdsBlockSizePerPadA` and `LdsBlockSizePerPadB` so that LBSPP can be specified separately for A and B
* Changed the default value of `LdsPadA`, `LdsPadB`, `LdsBlockSizePerPadA`, and `LdsBlockSizePerPadB` from 0 to -1
* Updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
* Replaced `std::regex` with `fnmatch()`/`PathMatchSpec` as a workaround for a known `std::regex` stack
overflow bug
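
The `std::regex` replacement listed above can be illustrated with a minimal sketch of shell-style glob matching through POSIX `fnmatch()`. The helper name `globMatch` and the sample patterns are hypothetical and are not taken from the Tensile sources. The sketch only shows the general shape of the call, assuming a POSIX platform (on Windows, `PathMatchSpec` plays the corresponding role).

```cpp
#include <fnmatch.h> // POSIX glob matching; assumed available (Linux)
#include <string>

// Hypothetical helper, not part of Tensile: returns true when `name`
// matches the shell-style glob `pattern`.
bool globMatch(const std::string& pattern, const std::string& name)
{
    // fnmatch() returns 0 on a match and FNM_NOMATCH when there is none.
    return fnmatch(pattern.c_str(), name.c_str(), 0) == 0;
}
```

A glob matcher like this covers only wildcard patterns (`*`, `?`, bracket sets), so it is a substitute for `std::regex` only where the patterns in use do not need full regular-expression syntax.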
##### Fixes
* hipCC compile append flag `parallel-jobs=4`
* Race condition in Stream-K that appeared with large grids and small sizes
* Mismatch issue with `LdsPad` + `LdsBlockSizePerPad!=0` and `TailLoop`
* Mismatch issue with `LdsPad` + `LdsBlockSizePerPad!=0` and `SplitLds`
* Incorrect reject condition check for `DirectToLds` + `LdsBlockSizePerPad=-1` case
* Small fix for `LdsPad` optimization (`LdsElement` calculation)
#### hipBLAS
hipBLAS 2.1.0 for ROCm 6.1.0
@@ -405,7 +357,6 @@ hipTensor 1.2.0 for ROCm 6.1.0
* Fixed bug in contraction calculation with data type F32
#### hipBLASLt
hipBLASLt 0.7.0 for ROCm 6.1.0
@@ -882,7 +833,6 @@ Tensile 4.40.0 for ROCm 6.1.0
##### Known issue
* In a future release, the ROCm Validation Suite P2P Benchmark and Qualification Tool (PBQT) tests will be optimized to meet the target bandwidth requirements for MI300X.
#### MI200 SR-IOV
@@ -890,7 +840,6 @@ Tensile 4.40.0 for ROCm 6.1.0
* Multimedia applications may encounter compilation errors in the MI200 Single Root Input/Output Virtualization (SR-IOV) environment. This is because MI200 SR-IOV does not currently support multimedia applications.
### AMD MI300A RAS
#### Fixed defect
@@ -901,7 +850,6 @@ Tensile 4.40.0 for ROCm 6.1.0
This issue is resolved in the ROCm 6.1 release, and users will no longer encounter the GFX correctable error (CE) and uncorrectable error (UE) failures.
## ROCm 6.0.2
The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU