Release notes
This page contains the release notes for AMD ROCm™ Software.
ROCm 6.1.2
ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries.
OS support
ROCm 6.1.2 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).
AMD SMI
AMD SMI for ROCm 6.1.2
Additions
- Added process isolation and clean shader APIs and CLI commands.
  - `amdsmi_get_gpu_process_isolation()`
  - `amdsmi_set_gpu_process_isolation()`
  - `amdsmi_set_gpu_clear_sram_data()`
- Added the `MIN_POWER` metric to the output provided by `amd-smi static --limit`.
Optimizations
- Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command.
Changes
- Updated `amdsmi_get_power_cap_info` to return values in uW instead of W.
- Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`.
- Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks.
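Because the power cap values are now reported in uW rather than W, callers that display or compare them in watts need a conversion step. The following is a minimal sketch of that conversion in plain Python. It does not call the AMD SMI library; the helper name and the sample value are hypothetical and only illustrate the unit change.

```python
# Minimal sketch: convert a power value reported in microwatts (uW) to watts (W).
# The helper name and the sample reading below are hypothetical illustrations;
# they are not part of the AMD SMI API.

MICROWATTS_PER_WATT = 1_000_000


def microwatts_to_watts(value_uw: int) -> float:
    """Return the given microwatt value expressed in watts."""
    return value_uw / MICROWATTS_PER_WATT


if __name__ == "__main__":
    sample_power_cap_uw = 203_000_000  # hypothetical reading in uW
    print(f"{microwatts_to_watts(sample_power_cap_uw):.1f} W")
```

Code that previously treated the returned values as watts would otherwise be off by a factor of one million, so it is worth checking every consumer of these values after upgrading.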
Fixes
- `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
- `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices.
- Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices.
- Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls.
Removals
- Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release.
See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information.
HIPCC
HIPCC for ROCm 6.1.2
Changes
- Upcoming: a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`.
- Upcoming: a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly.
- Upcoming: a subsequent release will remove `hipcc.pl` and `hipconfig.pl`.
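For build scripts that will need the Perl-script behavior once the compiled binaries become the default, the `HIP_USE_PERL_SCRIPTS` variable can be set for a single child process rather than globally. The sketch below only prepares such an environment; the helper name is hypothetical, and the variable has an effect only in the future release described above.

```python
# Minimal sketch: build a process environment with HIP_USE_PERL_SCRIPTS=1.
# The helper name is a hypothetical illustration; only the variable name and
# its value come from the release notes.
import os


def env_with_perl_scripts(base_env=None):
    """Return a copy of the environment with HIP_USE_PERL_SCRIPTS set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_USE_PERL_SCRIPTS"] = "1"
    return env


if __name__ == "__main__":
    env = env_with_perl_scripts()
    print("HIP_USE_PERL_SCRIPTS =", env["HIP_USE_PERL_SCRIPTS"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking `hipcc` or `hipconfig`, which keeps the override local to that one invocation.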
ROCm SMI
ROCm SMI for ROCm 6.1.2
Additions
- Added the ring hang event to the `amdsmi_evt_notification_type_t` enum.
Fixes
- Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on GitHub.
- Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware.
Library changes in ROCm 6.1.2
| Category | Group | Name | Version | Repository |
|---|---|---|---|---|
| Libraries | Machine Learning and Computer Vision | composable_kernel | 0.2.0 | ROCm/composable_kernel |
| | | AMDMIGraphX | 2.9 | ROCm/AMDMIGraphX |
| | | MIOpen | 3.1.0 | ROCm/MIOpen |
| | | MIVisionX | 2.5.0 | ROCm/MIVisionX |
| | | rpp | 1.5.0 | ROCm/rpp |
| | Communication | rccl | 2.18.6 | ROCm/rccl |
| | | hipBLAS | 2.1.0 | ROCm/hipBLAS |
| | | hipBLASLt | 0.7.0 | ROCm/hipBLASLt |
| | | hipFFT | 1.0.14 | ROCm/hipFFT |
| | | hipRAND | 2.10.17 | ROCm/hipRAND |
| | | hipSOLVER | 2.1.1 | ROCm/hipSOLVER |
| | | hipSPARSE | 3.0.1 | ROCm/hipSPARSE |
| | | hipSPARSELt | 0.2.0 | ROCm/hipSPARSELt |
| | | rocALUTION | 3.1.1 | ROCm/rocALUTION |
| | | rocBLAS | 4.1.2 | ROCm/rocBLAS |
| | | rocFFT | 1.0.27 | ROCm/rocFFT |
| | | rocRAND | 3.0.1 | ROCm/rocRAND |
| | | rocSOLVER | 3.25.0 | ROCm/rocSOLVER |
| | | rocSPARSE | 3.1.2 | ROCm/rocSPARSE |
| | | rocWMMA | 1.4.0 | ROCm/rocWMMA |
| | | Tensile | 4.40.0 | ROCm/Tensile |
| | Primitives | hipCUB | 3.1.0 | ROCm/hipCUB |
| | | hipTensor | 1.2.0 | ROCm/hipTensor |
| | | rocPRIM | 3.1.0 | ROCm/rocPRIM |
| | | rocThrust | 3.0.1 | ROCm/rocThrust |
| | | ROCdbgapi | 0.71.0 | ROCm/ROCdbgapi |
| | | rocm-cmake | 0.12.0 | ROCm/rocm-cmake |
AMDMIGraphX
MIGraphX 2.9 for ROCm 6.1.2
Additions
- Added FP8 support
- Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch
- Added support for the `Hardmax`, `DynamicQuantizeLinear`, `Qlinearconcat`, `Unique`, `QLinearAveragePool`, `QLinearSigmoid`, `QLinearLeakyRelu`, `QLinearMul`, `IsInf` operators
- Created web site examples for `Whisper`, `Llama-2`, and `Stable Diffusion 2.1`
- Created examples of using the ONNX Runtime MIGraphX Execution Provider with the `InceptionV3` and `Resnet50` models
- Updated operators to support ONNX Opset 19
- Enabled fuse_pointwise and fuse_reduce in the driver
- Added support for dot-(mul)-softmax-dot offloads to MLIR
- Added Blas auto-tuning for GEMMs
- Added dynamic shape support for the multinomial operator
- Added fp16 to accuracy checker
- Added initial code for running on Windows OS
Optimizations
- Improved the output of the `migraphx-driver` command
- Documentation now shows all environment variables
- Updates needed for general stride support
- Enabled Asymmetric Quantization
- Added ScatterND unsupported reduction modes
- Rewrote softmax for better performance
- General improvement to how quantization is performed to support INT8
- Used problem_cache for gemm tuning
- Improved performance by always using rocMLIR for quantized convolution
- Improved group convolutions by using rocMLIR
- Improved accuracy of fp16 models
- ScatterElements unsupported reduction
- Added concat fusions
- Improved INT8 support to include UINT8
- Allowed reshape ops between dq and quant_op
- Improved dpp reductions on navi
- Made the accuracy checker print the whole final buffer
- Added support for handling dynamic Slice and ConstantOfShape ONNX operators
- Added support for the dilations attribute to Pooling ops
- Added layout attribute support for LSTM operator
- Improved performance by removing contiguous for reshapes
- Handled all slice input variations
- Added scales attribute parse in upsample for older opset versions
- Added support for uneven Split operations
- Improved unit testing to run in python virtual environments
Fixes
- Fixed outstanding issues in autogenerated documentation
- Update model zoo paths for examples
- Fixed promote_literals_test by using additional if condition
- Fixed export API symbols from dynamic library
- Fixed bug in pad operator from dimension reduction
- Fixed the use of LD to embed files and enabled it by default when building shared libraries on Linux
- Fixed `get_version()`
- Fixed Round operator inaccuracy
- Fixed wrong size check when axes not present for slice
- Set the .SO version correctly
Changes
- Cleanup LSTM and RNN activation functions
- Placed gemm_pointwise at a higher priority than layernorm_pointwise
- Updated README to mention the need to include GPU_TARGETS when building MIGraphX
Removals
- Removed unused device kernels from Gather and Pad operators
- Removed int8x4 format
composable_kernel
CK 0.2.0 for ROCm 6.1.2
Fixes
- Fixed a bug in 6-dimensional kernels (#555)
- Fixed a test case failure with grouped convolution backward weight (#524)
Optimizations
- Improved the performance of the normalization kernel
Additions
- New CMake flags:
  - "DL_KERNELS" -- Must be set to "ON" in order to build the gemm_dl and batched_gemm_multi_d_dl instances
  - "DTYPES" -- Can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build an instance of the specified data types
  - "INSTANCES_ONLY" -- Only builds CK library and instances without tests, examples, or profiler
- New feature: if GPU_TARGETS is not set in the CMake command line, CK will be built for all targets supported by the compiler
- Support for MI300A/MI300X
- Support for AMD RDNA 3
- New user tutorial (#563)
- Additional instances for irregular GEMM sizes (#560)
- New inter-wave consumer-producer programming model for GEMM kernels (#310)
- GEMM with support for multiple elementwise fusions (multi-D) (#534)
- Multi-embeddings support (#542)
- AMD RDNA 3 blockwise GEMM and real GEMM support (#541)
- AMD RDNA grouped convolution backward weight support (#505)
- MaxPool and AvgPool forward (#815); MaxPool backward (#750)
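The CMake flags listed above combine on one configure command line. The following sketch assembles such a command as an argument list and rejects `DTYPES` entries outside the documented set. The helper name and the source directory are hypothetical, and passing `ON` for `INSTANCES_ONLY` is an assumption; the notes only state the `ON` value for `DL_KERNELS`.

```python
# Minimal sketch: assemble a CMake configure command for composable_kernel
# using the flags named in the release notes. The helper name and the source
# directory are hypothetical; "ON" for INSTANCES_ONLY is an assumption.

SUPPORTED_DTYPES = ("fp64", "fp32", "fp16", "fp8", "bf16", "int8")


def ck_cmake_args(source_dir, dtypes=(), dl_kernels=False,
                  instances_only=False, gpu_targets=()):
    """Return the cmake argument list for the requested options."""
    unknown = [d for d in dtypes if d not in SUPPORTED_DTYPES]
    if unknown:
        raise ValueError(f"unsupported DTYPES entries: {unknown}")

    args = ["cmake", source_dir]
    if dtypes:
        # DTYPES takes a semicolon-separated subset of the supported types.
        args.append("-DDTYPES=" + ";".join(dtypes))
    if dl_kernels:
        args.append("-DDL_KERNELS=ON")
    if instances_only:
        args.append("-DINSTANCES_ONLY=ON")
    if gpu_targets:
        # When GPU_TARGETS is omitted, CK builds for all supported targets.
        args.append("-DGPU_TARGETS=" + ";".join(gpu_targets))
    return args


if __name__ == "__main__":
    print(ck_cmake_args("..", dtypes=("fp16", "fp32"), instances_only=True))
```

Returning a list rather than a single string avoids shell quoting problems with the semicolons in `DTYPES` when the command is passed to `subprocess.run`.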
Changes
None
hipBLAS
hipBLAS 2.1.0 for ROCm 6.1.2
Additions
- New build option to automatically use hipconfig --platform to determine HIP platform
- Level 1 functions have additional ILP64 API for both C and Fortran (`_64` name suffix) with int64_t function arguments
- New functions hipblasGetMathMode and hipblasSetMathMode
Deprecations
- USE_CUDA build option; use HIP_PLATFORM=amd or HIP_PLATFORM=nvidia to override hipconfig
Changes
- Some Level 2 function argument names have changed from `m` to `n` to match legacy BLAS; there was no change in implementation.
- Updated client code to use YAML-based testing
- Renamed `.doxygen` and `.sphinx` folders to `doxygen` and `sphinx`, respectively
- Added CMake support for documentation
hipBLASLt
hipBLASLt 0.7.0 for ROCm 6.1.2
Additions
- Added `hipblasltExtSoftmax` extension API
- Added `hipblasltExtLayerNorm` extension API
- Added `hipblasltExtAMax` extension API
- Added `GemmTuning` extension parameter to set split-k by user
- Support for mixed-precision datatype: fp16/fp8 in with fp16 out
Deprecations
- algoGetHeuristic() ext API for GroupGemm will be deprecated in a future release of hipBLASLt
hipCUB
hipCUB 3.1.0 for ROCm 6.1.2
Changed
- CUB backend references CUB and Thrust version 2.1.0.
- Updated `HIPCUB_HOST_WARP_THREADS` macro definition to match `host_warp_size` changes from rocPRIM 3.0.
- Implemented `__int128_t` and `__uint128_t` support for radix_sort.
Fixed
- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.
Added
- Added interface `DeviceMemcpy::Batched` for batched memcpy from rocPRIM and CUB.
hipFFT
hipFFT 1.0.14 for ROCm 6.1.2
Changes
- When building hipFFT from source, rocFFT code no longer needs to be initialized as a git submodule.
Fixes
- Fixed error when creating length-1 plans.
hipRAND
hipRAND 2.10.17 for ROCm 6.1.2
Fixes
- Fixed benchmark and unit test builds on Windows
hipSOLVER
hipSOLVER 2.1.1 for ROCm 6.1.2
Changed
- `BUILD_WITH_SPARSE` now defaults to OFF on Windows.
Fixed
- Fixed benchmark client build when `BUILD_WITH_SPARSE` is OFF.
hipSPARSE
hipSPARSE 3.0.1 for ROCm 6.1.2
Fixes
- Fixes to the build chain
hipSPARSELt
hipSPARSELt 0.2.0 for ROCm 6.1.2
Added
- Added support for the case where Matrix B is a structured sparsity matrix.
hipTensor
hipTensor 1.2.0 for ROCm 6.1.2
Additions
- API support for permutation of rank 4 tensors: f16 and f32
- New datatype support in contractions of rank 4: f16, bf16, complex f32, complex f64
- Added scale and bilinear contraction samples and tests for new supported data types
- Added permutation samples and tests for f16, f32 types
Fixes
- Fixed bug in contraction calculation with data type f32
MIOpen
MIOpen 3.1.0 for ROCm 6.1.2
Added
- CK-based 2d/3d convolution solvers to support nchw/ncdhw layout
- Fused solver for Fwd Convolution with Residual, Bias and activation
- AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver
- Forward, backward data and backward weight convolution solver with fp8/bfp8
- check for packed tensors for convolution solvers
- Integrate CK's layer norm
- Combine gtests into single binary
Fixed
- Fixed backward passes bwd/wrw for CK group conv 3d
- Fixed out-of-bounds memory access: ConvOclDirectFwdGen
- Fixed build failure due to hipRTC
Changed
- Standardize workspace abstraction
- Use split CK libraries
Removed
- clamping to MAX from CastTensor used in Bwd and WrW convolution
MIVisionX
MIVisionX for ROCm 6.1.2
Additions
- CTest: Tests for install verification
- Hardware support updates
- Doxygen support for API documentation
Optimizations
- CMakeList Cleanup
- Readme
Changes
- rocAL: PyBind Link to prebuilt library
- PyBind11
- RapidJSON
- Setup Updates
- RPP - Use package install
- Dockerfiles: Updates & bugfix
- CuPy - No longer installed with setup.py
Fixes
- rocAL bug fix and updates
Tested Configurations
- Windows 10/11
- Linux distribution
  - Ubuntu - 20.04/22.04
  - CentOS - 7/8
  - RHEL - 8/9
  - SLES - 15-SP4
- ROCm: rocm-core - 5.7.0.50700-6
- miopen-hip - 2.20.0.50700-63
- MIGraphX - 2.7.0.50700-63
- Protobuf - V3.12.4
- OpenCV - 4.6.0
- RPP - 1.5.0
- FFMPEG - n4.4.2
- Dependencies for all preceding packages
- MIVisionX setup script - V2.6.1
Known Issues
- OpenCV 4.X support for some applications is missing
- MIVisionX package install requires manual prerequisites installation
rccl
RCCL 2.18.6 for ROCm 6.1.2
Changed
- Reduced NCCL_TOPO_MAX_NODES to limit stack usage and avoid overflow
rocALUTION
rocALUTION 3.1.1 for ROCm 6.1.2
Additions
- `TripleMatrixProduct` functionality for `GlobalMatrix`
- Multi-Node/GPU support for `UA-AMG`, `SA-AMG` and `RS-AMG`
- Iterative ILU0 preconditioner `ItILU0`
- Iterative triangular solve, selectable via `SolverDecr` class
Deprecations
- `LocalMatrix::AMGConnect`
- `LocalMatrix::AMGAggregate`
- `LocalMatrix::AMGPMISAggregate`
- `LocalMatrix::AMGSmoothedAggregation`
- `LocalMatrix::AMGAggregation`
- `PairwiseAMG`
Known Issues
- `PairwiseAMG` does not currently support matrix sizes that exceed int32 range
- `PairwiseAMG` might fail building the hierarchy on certain input matrices
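One way to guard against the first known issue is to check the problem dimensions before choosing `PairwiseAMG`. The sketch below is plain Python and does not call rocALUTION. It assumes that "matrix sizes" covers the row count, column count, and number of nonzeros, which the notes do not spell out, and the function name is hypothetical.

```python
# Minimal sketch: check whether matrix dimensions fit in the int32 range.
# Assumption: "matrix sizes" means rows, columns, and nonzero count.
# The function name is a hypothetical illustration, not a rocALUTION API.

INT32_MAX = 2**31 - 1


def fits_int32(num_rows: int, num_cols: int, num_nonzeros: int) -> bool:
    """Return True when every dimension is representable as a signed int32."""
    return max(num_rows, num_cols, num_nonzeros) <= INT32_MAX


if __name__ == "__main__":
    print(fits_int32(1_000_000, 1_000_000, 5_000_000))
    print(fits_int32(1_000_000, 1_000_000, 3_000_000_000))
```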
rocBLAS
rocBLAS 4.1.2 for ROCm 6.1.2
Fixes
- Fixes BF16 TT get_solutions
Optimizations
- Tune gfx942 BBS TN, TT
ROCdbgapi
rocm-dbgapi 0.71.0 for ROCm 6.1.2
Added
- Add support for gfx940, gfx941 and gfx942 architectures.
rocFFT
rocFFT 1.0.27 for ROCm 6.1.2
Fixes
- Fixed kernel launch failure on execute of very large odd-length real-complex transforms.
Additions
- Enable multi-gpu testing on systems without direct GPU-interconnects
rocm-cmake
rocm-cmake 0.12.0 for ROCm 6.1.2
Changed
- ROCMSphinxDoc: Allow separate source and config directories.
- ROCMCreatePackage: Allow additional `PROVIDES` on header-only packages.
- ROCMInstallTargets: Don't install executable targets by default for ASAN builds.
- ROCMTest: Add RPATH for installed tests.
- Finalize rename to ROCmCMakeBuildTools
Fixed
- ROCMClangTidy: Fixed invalid list index.
- Test failures when ROCM_CMAKE_GENERATOR is empty.
rocPRIM
rocPRIM 3.1.0 for ROCm 6.1.2
Additions
- New primitive: `block_run_length_decode`
- New primitive: `batch_memcpy`
Changes
- Renamed:
  - `scan_config_v2` to `scan_config`
  - `scan_by_key_config_v2` to `scan_by_key_config`
  - `radix_sort_config_v2` to `radix_sort_config`
  - `reduce_by_key_config_v2` to `reduce_by_key_config`
- Removed support for custom config types for device algorithms
- `host_warp_size()` was moved into `rocprim/device/config_types.hpp`; it now uses either `device_id` or a `stream` parameter to query the proper device and a `device_id` out parameter
  - The return type is `hipError_t`
- Added support for `__int128_t` in `device_radix_sort` and `block_radix_sort`
- Improved the performance of `match_any`, and `block_histogram` which uses it
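Code that refers to the `_v2` config names needs updating after the renames listed above. The following sketch applies those renames to a source string with a whole-word regular expression. The helper name and the sample line are hypothetical; this is one possible migration aid, not a tool shipped with rocPRIM.

```python
# Minimal sketch: rewrite the renamed rocPRIM config identifiers in source text.
# The mapping comes from the release notes; the helper name and the sample
# input are hypothetical illustrations.
import re

RENAMES = {
    "scan_config_v2": "scan_config",
    "scan_by_key_config_v2": "scan_by_key_config",
    "radix_sort_config_v2": "radix_sort_config",
    "reduce_by_key_config_v2": "reduce_by_key_config",
}

# \b on both sides keeps identifiers such as "my_scan_config_v2" untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def apply_renames(source: str) -> str:
    """Return source with every old config name replaced by its new name."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


if __name__ == "__main__":
    sample = "// uses rocprim::scan_config_v2 and rocprim::radix_sort_config_v2"
    print(apply_renames(sample))
```

A text-level rewrite like this does not understand C++ scoping, so the result should still be reviewed and compiled before it is committed.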
Deprecations
- Removed `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config`, and `radix_sort_config`
Fixes
- Build issues with `rmake.py` on Windows when using VS 2017 15.8 or later (due to a breaking fix with extended aligned storage)
rocRAND
rocRAND 3.0.1 for ROCm 6.1.2
Fixes
- Implemented workaround for regressions in XORWOW and LFSR on MI200
rocSOLVER
rocSOLVER 3.25.0 for ROCm 6.1.2
Added
- Eigensolver routines for symmetric/hermitian matrices using Divide & Conquer and Jacobi algorithm:
  - SYEVDJ (with batched and strided_batched versions)
  - HEEVDJ (with batched and strided_batched versions)
- Generalized symmetric/hermitian-definite eigensolvers using Divide & Conquer and Jacobi algorithm:
  - SYGVDJ (with batched and strided_batched versions)
  - HEGVDJ (with batched and strided_batched versions)
Changed
- Relaxed array length requirements for GESVDX with `rocblas_srange_index`.
Removed
- Removed gfx803 and gfx900 from default build targets.
Fixed
- Corrected singular vector normalization in BDSVDX and GESVDX
- Fixed potential memory access fault in STEIN, SYEVX/HEEVX, SYGVX/HEGVX, BDSVDX and GESVDX
rocSPARSE
rocSPARSE 3.1.2 for ROCm 6.1.2
Additions
- New LRB algorithm to SpMV, supporting CSR format
- rocBLAS is now an optional dependency for SDDMM algorithms
- Additional verbose output for `csrgemm` and `bsrgemm`
Optimizations
- Triangular solve with multiple rhs (SpSM, csrsm, ...) now calls SpSV, csrsv, etc. when nrhs equals 1
- Improved user manual section Installation and Building for Linux and Windows
- Improved SpMV in CSR format on MI300
rocThrust
rocThrust 3.0.1 for ROCm 6.1.2
Fixes
- Ported a fix from thrust 2.2 that ensures `thrust::optional` is trivially copyable.
rocWMMA
rocWMMA 1.4.0 for ROCm 6.1.2
Additions
- Added bf16 support for hipRTC sample
Changes
- Changed Clang C++ version to C++17
- Updated rocwmma_coop API
- Linked rocWMMA to hiprtc
Fixes
- Fixed compile/runtime arch checks
- Built all tests in large code model
- Removed inefficient branching in layout loop unrolling
rpp
rpp for ROCm 6.1.2
Changes
- Prerequisites
Tested Configurations
- Linux distribution
  - Ubuntu - 20.04/22.04
  - CentOS - 7
  - RHEL - 8/9
- ROCm: rocm-core - 5.5.0.50500-63
- Clang - Version 5.0.1 and above
- CMake - Version 3.22.3
- IEEE 754-based half-precision floating-point library - Version 1.12.0
Tensile
Tensile 4.40.0 for ROCm 6.1.2
Additions
- new DisableKernelPieces values to invalidate local read, local write, and global read
- stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
- feature to allow testing stream-k grid multipliers
- debug output to check occupancy for Stream-K
- reject condition for FractionalLoad + DepthU!=power of 2
- new TENSILE_DB debugging value to dump the common kernel parameters
- predicate for APU libs
- new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
- new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
- new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
- initialization type and general batched options to the rocblas-bench input creator script
Optimizations
- enabled MFMA + LocalSplitU=4 for MT16x16
- enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
- optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
- auto LdsPad calculation for TileMajorLds + MI16x16
- auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
Changes
- cleared hipErrorNotFound error since it is an expected part of the search
- modified hipcc search path for Linux
- changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
- changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
- changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
- updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
- Replaced std::regex with fnmatch()/PathMatchSpec as a workaround to std::regex stack overflow known bug
Fixes
- hipcc compile append flag parallel-jobs=4
- race condition in Stream-K that appeared with large grids and small sizes
- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
- incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
- small fix for LdsPad optimization (LdsElement calculation)