ROCm/RELEASE.md
Sam Wu 8311130829 New template for changelog (#100)
2024-07-03 09:17:21 -06:00


Release notes

This page contains the release notes for AMD ROCm™ Software.


ROCm 6.1.2

ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries.

OS support

ROCm 6.1.2 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).

AMD SMI

AMD SMI for ROCm 6.1.2

Additions

  • Added process isolation and clean shader APIs and CLI commands.
    • amdsmi_get_gpu_process_isolation()
    • amdsmi_set_gpu_process_isolation()
    • amdsmi_set_gpu_clear_sram_data()
  • Added the MIN_POWER metric to output provided by amd-smi static --limit.

Optimizations

  • Updated the amd-smi monitor --pcie output to prevent delays with the monitor command.

Changes

  • Updated amdsmi_get_power_cap_info to return values in uW instead of W.
  • Updated Python library return types for amdsmi_get_gpu_memory_reserved_pages and amdsmi_get_gpu_bad_page_info.
  • Updated the output of amd-smi metric --ecc-blocks to show counters available from blocks.
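
Because amdsmi_get_power_cap_info now reports microwatts, code that treated the returned values as watts needs a unit conversion. The following is a minimal sketch of that conversion, assuming the values are available as a plain dictionary of integers. The helper names and the dictionary keys are illustrative placeholders, not documented AMD SMI field names.

```python
# Sketch only: converts power-cap readings from microwatts (uW) to watts (W).
# The dictionary keys used in the demo are hypothetical; check the AMD SMI
# documentation for the fields your version actually returns.

MICROWATTS_PER_WATT = 1_000_000


def microwatts_to_watts(value_uw: int) -> float:
    """Convert a single microwatt reading to watts."""
    return value_uw / MICROWATTS_PER_WATT


def power_caps_in_watts(caps_uw: dict) -> dict:
    """Return a new dict with every microwatt value converted to watts."""
    return {name: microwatts_to_watts(value) for name, value in caps_uw.items()}


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {"power_cap": 203_000_000, "min_power_cap": 0}
    print(power_caps_in_watts(sample))
```

Keeping the conversion in one place makes it easier to audit callers that were written against the earlier watt-based values.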

Fixes

  • amdsmi_get_gpu_board_info() no longer returns junk character strings.
  • amd-smi metric --power now correctly details power output for RDNA3, RDNA2, and MI1x devices.
  • Fixed the amdsmitstReadWrite.TestPowerCapReadWrite test for RDNA3, RDNA2, and MI100 devices.
  • Fixed an issue with the amdsmi_get_gpu_memory_reserved_pages and amdsmi_get_gpu_bad_page_info Python interface calls.

Removals

  • Removed the amdsmi_get_gpu_process_info API from the Python library. It was removed from the C library in an earlier release.

See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information.

HIPCC

HIPCC for ROCm 6.1.2

Changes

  • Upcoming: a future release will enable use of compiled binaries hipcc.bin and hipconfig.bin by default. No action is needed by users; you may continue calling high-level Perl scripts hipcc and hipconfig. hipcc.bin and hipconfig.bin will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke hipcc.pl and hipconfig.pl, set the HIP_USE_PERL_SCRIPTS environment variable to 1.
  • Upcoming: a subsequent release will remove high-level Perl scripts hipcc and hipconfig. That release will also remove the HIP_USE_PERL_SCRIPTS environment variable and rename hipcc.bin and hipconfig.bin to hipcc and hipconfig, respectively. No action is needed by users. To revert to the previous behavior, invoke hipcc.pl and hipconfig.pl explicitly.
  • Upcoming: a subsequent release will remove hipcc.pl and hipconfig.pl.
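
For build tooling that launches hipcc from Python, the HIP_USE_PERL_SCRIPTS fallback described above can be applied per process rather than globally. This is a minimal sketch under that assumption. The helper name hipcc_environment is hypothetical, and the variable only has an effect in releases that still honor it.

```python
import os


def hipcc_environment(use_perl_scripts: bool, base_env=None) -> dict:
    """Return an environment mapping for launching hipcc.

    When use_perl_scripts is True, HIP_USE_PERL_SCRIPTS is set to "1", which
    the notes above describe as the way to fall back to hipcc.pl and
    hipconfig.pl. Otherwise the variable is removed if it was present.
    """
    # Copy so the caller's environment (or os.environ) is never modified.
    env = dict(os.environ if base_env is None else base_env)
    if use_perl_scripts:
        env["HIP_USE_PERL_SCRIPTS"] = "1"
    else:
        env.pop("HIP_USE_PERL_SCRIPTS", None)
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when invoking hipcc, which leaves the parent process environment untouched.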

ROCm SMI

ROCm SMI for ROCm 6.1.2

Additions

  • Added the ring hang event to the amdsmi_evt_notification_type_t enum.

Fixes

  • Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on GitHub.
  • Fixed the parsing of pp_od_clk_voltage in get_od_clk_volt_info to work better with MI-series hardware.

Library changes in ROCm 6.1.2

| Category | Group | Name | Version | Repository |
|---|---|---|---|---|
| Libraries | Machine Learning and Computer Vision | composable_kernel | 0.2.0 | ROCm/composable_kernel |
| | | AMDMIGraphX | 2.9 | ROCm/AMDMIGraphX |
| | | MIOpen | 3.1.0 | ROCm/MIOpen |
| | | MIVisionX | 2.5.0 | ROCm/MIVisionX |
| | | rpp | 1.5.0 | ROCm/rpp |
| | Communication | rccl | 2.18.6 | ROCm/rccl |
| | | hipBLAS | 2.1.0 | ROCm/hipBLAS |
| | | hipBLASLt | 0.7.0 | ROCm/hipBLASLt |
| | | hipFFT | 1.0.14 | ROCm/hipFFT |
| | | hipRAND | 2.10.17 | ROCm/hipRAND |
| | | hipSOLVER | 2.1.1 | ROCm/hipSOLVER |
| | | hipSPARSE | 3.0.1 | ROCm/hipSPARSE |
| | | hipSPARSELt | 0.2.0 | ROCm/hipSPARSELt |
| | | rocALUTION | 3.1.1 | ROCm/rocALUTION |
| | | rocBLAS | 4.1.2 | ROCm/rocBLAS |
| | | rocFFT | 1.0.27 | ROCm/rocFFT |
| | | rocRAND | 3.0.1 | ROCm/rocRAND |
| | | rocSOLVER | 3.25.0 | ROCm/rocSOLVER |
| | | rocSPARSE | 3.1.2 | ROCm/rocSPARSE |
| | | rocWMMA | 1.4.0 | ROCm/rocWMMA |
| | | Tensile | 4.40.0 | ROCm/Tensile |
| | Primitives | hipCUB | 3.1.0 | ROCm/hipCUB |
| | | hipTensor | 1.2.0 | ROCm/hipTensor |
| | | rocPRIM | 3.1.0 | ROCm/rocPRIM |
| | | rocThrust | 3.0.1 | ROCm/rocThrust |
| | | ROCdbgapi | 0.71.0 | ROCm/ROCdbgapi |
| | | rocm-cmake | 0.12.0 | ROCm/rocm-cmake |

AMDMIGraphX

MIGraphX 2.9 for ROCm 6.1.2

Additions
  • Added FP8 support
  • Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch
  • Added support for the Hardmax, DynamicQuantizeLinear, QLinearConcat, Unique, QLinearAveragePool, QLinearSigmoid, QLinearLeakyRelu, QLinearMul, and IsInf operators
  • Created web site examples for Whisper, Llama-2, and Stable Diffusion 2.1
  • Created examples of using the ONNX Runtime MIGraphX Execution Provider with the InceptionV3 and Resnet50 models
  • Updated operators to support ONNX Opset 19
  • Enabled fuse_pointwise and fuse_reduce in the driver
  • Added support for dot-(mul)-softmax-dot offloads to MLIR
  • Added Blas auto-tuning for GEMMs
  • Added dynamic shape support for the multinomial operator
  • Added fp16 to accuracy checker
  • Added initial code for running on Windows OS
Optimizations
  • Improved the output of the migraphx-driver command
  • Documentation now shows all environment variables
  • Updates needed for general stride support
  • Enabled Asymmetric Quantization
  • Added ScatterND unsupported reduction modes
  • Rewrote softmax for better performance
  • General improvement to how quantization is performed to support INT8
  • Used problem_cache for gemm tuning
  • Improved performance by always using rocMLIR for quantized convolution
  • Improved group convolutions by using rocMLIR
  • Improved accuracy of fp16 models
  • ScatterElements unsupported reduction
  • Added concat fusions
  • Improved INT8 support to include UINT8
  • Allowed reshape ops between dq and quant_op
  • Improved DPP reductions on Navi
  • Made the accuracy checker print the whole final buffer
  • Added support for handling dynamic Slice and ConstantOfShape ONNX operators
  • Added support for the dilations attribute to Pooling ops
  • Added layout attribute support for the LSTM operator
  • Improved performance by removing contiguous for reshapes
  • Handled all slice input variations
  • Added parsing of the scales attribute in Upsample for older opset versions
  • Added support for uneven Split operations
  • Improved unit testing to run in python virtual environments
Fixes
  • Fixed outstanding issues in autogenerated documentation
  • Updated model zoo paths for examples
  • Fixed promote_literals_test by using an additional if condition
  • Fixed the export of API symbols from the dynamic library
  • Fixed a bug in the pad operator from dimension reduction
  • Fixed the use of LD to embed files and enabled it by default when building shared libraries on Linux
  • Fixed get_version()
  • Fixed Round operator inaccuracy
  • Fixed wrong size check when axes not present for slice
  • Set the .SO version correctly
Changes
  • Cleaned up LSTM and RNN activation functions
  • Placed gemm_pointwise at a higher priority than layernorm_pointwise
  • Updated README to mention the need to include GPU_TARGETS when building MIGraphX
Removals
  • Removed unused device kernels from Gather and Pad operators
  • Removed int8x4 format

composable_kernel

CK 0.2.0 for ROCm 6.1.2

Fixes
  • Fixed a bug in 6-dimensional kernels (#555)
  • Fixed a test case failure with grouped convolution backward weight (#524)
Optimizations
  • Improved the performance of the normalization kernel
Additions
  • New CMake flags:
    • "DL_KERNELS" -- Must be set to "ON" in order to build the gemm_dl and batched_gemm_multi_d_dl instances
    • "DTYPES" -- Can be set to any subset of "fp64;fp32;fp16;fp8;bf16;int8" to build an instance of the specified data types
    • "INSTANCES_ONLY" -- Only builds CK library and instances without tests, examples, or profiler
  • New feature: if GPU_TARGETS is not set in the CMake command line, CK will be built for all targets supported by the compiler
  • Support for MI300A/MI300X
  • Support for AMD RDNA 3
  • New user tutorial (#563)
  • Additional instances for irregular GEMM sizes (#560)
  • New inter-wave consumer-producer programming model for GEMM kernels (#310)
  • GEMM with support for multiple elementwise fusions (multi-D) (#534)
  • Multi-embeddings support (#542)
  • AMD RDNA 3 blockwise GEMM and real GEMM support (#541)
  • AMD RDNA grouped convolution backward weight support (#505)
  • MaxPool and AvgPool forward (#815); MaxPool backward (#750)
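
The new CMake flags listed above combine naturally in a scripted configure step. The following is a minimal sketch of assembling them, assuming the flags are passed as ordinary -D cache definitions. The helper name ck_cmake_args is hypothetical, and using ON as the value for INSTANCES_ONLY is an assumption, since the notes only give an explicit value for DL_KERNELS.

```python
# Sketch only: assembles CMake -D arguments for the CK flags described above.
# Passing INSTANCES_ONLY=ON is an assumption; the notes do not state its value.

ALLOWED_DTYPES = ("fp64", "fp32", "fp16", "fp8", "bf16", "int8")


def ck_cmake_args(dtypes=(), dl_kernels=False, instances_only=False, gpu_targets=None):
    """Build a list of -D arguments for a CK CMake configure step."""
    unknown = [d for d in dtypes if d not in ALLOWED_DTYPES]
    if unknown:
        raise ValueError(f"unsupported DTYPES entries: {unknown}")

    args = []
    if dtypes:
        # CMake lists are semicolon-separated.
        args.append("-DDTYPES=" + ";".join(dtypes))
    if dl_kernels:
        args.append("-DDL_KERNELS=ON")
    if instances_only:
        args.append("-DINSTANCES_ONLY=ON")
    if gpu_targets:
        # When omitted, the notes say CK is built for every target the
        # compiler supports.
        args.append("-DGPU_TARGETS=" + ";".join(gpu_targets))
    return args
```

Returning a list rather than a single string avoids shell quoting problems with the semicolons when the arguments are handed to subprocess.run.
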
Changes

None

hipBLAS

hipBLAS 2.1.0 for ROCm 6.1.2

Additions
  • New build option to automatically use hipconfig --platform to determine HIP platform
  • Level 1 functions have additional ILP64 API for both C and Fortran (_64 name suffix) with int64_t function arguments
  • New functions hipblasGetMathMode and hipblasSetMathMode
Deprecations
  • USE_CUDA build option; use HIP_PLATFORM=amd or HIP_PLATFORM=nvidia to override hipconfig
Changes
  • Some Level 2 function argument names have changed from m to n to match legacy BLAS; there was no change in implementation.
  • Updated client code to use YAML-based testing
  • Renamed .doxygen and .sphinx folders to doxygen and sphinx, respectively
  • Added CMake support for documentation

hipBLASLt

hipBLASLt 0.7.0 for ROCm 6.1.2

Additions
  • Added hipblasltExtSoftmax extension API
  • Added hipblasltExtLayerNorm extension API
  • Added hipblasltExtAMax extension API
  • Added the GemmTuning extension parameter so that users can set split-k
  • Support for mixed-precision datatypes: fp16/fp8 input with fp16 output
Deprecations
  • algoGetHeuristic() ext API for GroupGemm will be deprecated in a future release of hipBLASLt

hipCUB

hipCUB 3.1.0 for ROCm 6.1.2

Changed
  • CUB backend references CUB and Thrust version 2.1.0.
  • Updated HIPCUB_HOST_WARP_THREADS macro definition to match host_warp_size changes from rocPRIM 3.0.
  • Implemented __int128_t and __uint128_t support for radix_sort.
Fixed
  • Fixed build issues with rmake.py on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.
Added
  • Added interface DeviceMemcpy::Batched for batched memcpy from rocPRIM and CUB.

hipFFT

hipFFT 1.0.14 for ROCm 6.1.2

Changes
  • When building hipFFT from source, rocFFT code no longer needs to be initialized as a git submodule.
Fixes
  • Fixed error when creating length-1 plans.

hipRAND

hipRAND 2.10.17 for ROCm 6.1.2

Fixes
  • Fixed benchmark and unit test builds on Windows

hipSOLVER

hipSOLVER 2.1.1 for ROCm 6.1.2

Changed
  • BUILD_WITH_SPARSE now defaults to OFF on Windows.
Fixed
  • Fixed benchmark client build when BUILD_WITH_SPARSE is OFF.

hipSPARSE

hipSPARSE 3.0.1 for ROCm 6.1.2

Fixes
  • Fixes to the build chain

hipSPARSELt

hipSPARSELt 0.2.0 for ROCm 6.1.2

Added
  • Added support for the case where matrix B is a structured sparsity matrix.

hipTensor

hipTensor 1.2.0 for ROCm 6.1.2

Additions
  • API support for permutation of rank 4 tensors: f16 and f32
  • New datatype support in contractions of rank 4: f16, bf16, complex f32, complex f64
  • Added scale and bilinear contraction samples and tests for new supported data types
  • Added permutation samples and tests for f16, f32 types
Fixes
  • Fixed bug in contraction calculation with data type f32

MIOpen

MIOpen 3.1.0 for ROCm 6.1.2

Added
  • CK-based 2d/3d convolution solvers to support nchw/ncdhw layout
  • Fused solver for Fwd Convolution with Residual, Bias and activation
  • AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver
  • Forward, backward data and backward weight convolution solver with fp8/bfp8
  • Check for packed tensors in convolution solvers
  • Integrated CK's layer norm
  • Combined gtests into a single binary
Fixed
  • Fixed backward passes (bwd/wrw) for CK group conv 3d
  • Fixed out-of-bounds memory access in ConvOclDirectFwdGen
  • Fixed build failure due to hipRTC
Changed
  • Standardized workspace abstraction
  • Now uses split CK libraries
Removed
  • Clamping to MAX from CastTensor used in Bwd and WrW convolution

MIVisionX

MIVisionX 2.5.0 for ROCm 6.1.2

Additions
  • CTest: Tests for install verification
  • Hardware support updates
  • Doxygen support for API documentation
Optimizations
  • CMakeList Cleanup
  • Readme
Changes
  • rocAL: PyBind Link to prebuilt library
    • PyBind11
    • RapidJSON
  • Setup Updates
  • RPP - Use package install
  • Dockerfiles: Updates & bugfix
  • CuPy - No longer installed with setup.py
Fixes
  • rocAL bug fix and updates
Tested Configurations
  • Windows 10 / 11
  • Linux distribution
    • Ubuntu - 20.04 / 22.04
    • CentOS - 7 / 8
    • RHEL - 8 / 9
    • SLES - 15-SP4
  • ROCm: rocm-core - 5.7.0.50700-6
  • miopen-hip - 2.20.0.50700-63
  • MIGraphX - 2.7.0.50700-63
  • Protobuf - V3.12.4
  • OpenCV - 4.6.0
  • RPP - [1.5.0]
  • FFMPEG - n4.4.2
  • Dependencies for all preceding packages
  • MIVisionX setup script - V2.6.1
Known Issues
  • OpenCV 4.X support for some applications is missing
  • MIVisionX package install requires manual prerequisites installation

rccl

RCCL 2.18.6 for ROCm 6.1.2

Changed
  • Reduced NCCL_TOPO_MAX_NODES to limit stack usage and avoid overflow

rocALUTION

rocALUTION 3.1.1 for ROCm 6.1.2

Additions
  • TripleMatrixProduct functionality for GlobalMatrix
  • Multi-Node/GPU support for UA-AMG, SA-AMG and RS-AMG
  • Iterative ILU0 preconditioner ItILU0
  • Iterative triangular solve, selectable via SolverDecr class
Deprecations
  • LocalMatrix::AMGConnect
  • LocalMatrix::AMGAggregate
  • LocalMatrix::AMGPMISAggregate
  • LocalMatrix::AMGSmoothedAggregation
  • LocalMatrix::AMGAggregation
  • PairwiseAMG
Known Issues
  • PairwiseAMG does not currently support matrix sizes that exceed int32 range
  • PairwiseAMG might fail building the hierarchy on certain input matrices

rocBLAS

rocBLAS 4.1.2 for ROCm 6.1.2

Fixes
  • Fixed BF16 TT get_solutions
Optimizations
  • Tuned gfx942 BBS TN, TT

ROCdbgapi

rocm-dbgapi 0.71.0 for ROCm 6.1.2

Added
  • Added support for gfx940, gfx941 and gfx942 architectures.

rocFFT

rocFFT 1.0.27 for ROCm 6.1.2

Fixes
  • Fixed kernel launch failure on execute of very large odd-length real-complex transforms.
Additions
  • Enabled multi-GPU testing on systems without direct GPU interconnects

rocm-cmake

rocm-cmake 0.12.0 for ROCm 6.1.2

Changed
  • ROCMSphinxDoc: Allow separate source and config directories.
  • ROCMCreatePackage: Allow additional PROVIDES on header-only packages.
  • ROCMInstallTargets: Don't install executable targets by default for ASAN builds.
  • ROCMTest: Add RPATH for installed tests.
  • Finalize rename to ROCmCMakeBuildTools
Fixed
  • ROCMClangTidy: Fixed invalid list index.
  • Test failures when ROCM_CMAKE_GENERATOR is empty.

rocPRIM

rocPRIM 3.1.0 for ROCm 6.1.2

Additions
  • New primitive: block_run_length_decode
  • New primitive: batch_memcpy
Changes
  • Renamed:
    • scan_config_v2 to scan_config
    • scan_by_key_config_v2 to scan_by_key_config
    • radix_sort_config_v2 to radix_sort_config
    • reduce_by_key_config_v2 to reduce_by_key_config
  • Removed support for custom config types for device algorithms
  • host_warp_size() was moved into rocprim/device/config_types.hpp; it now uses either device_id or a stream parameter to query the proper device and a device_id out parameter
    • The return type is hipError_t
  • Added support for __int128_t in device_radix_sort and block_radix_sort
  • Improved the performance of match_any, and block_histogram which uses it
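
Code that still refers to the _v2 config names needs the identifiers updated. The following is a minimal sketch of a text-level migration helper built from the rename list above. The function name migrate_config_names is hypothetical, and the helper only rewrites identifiers. It does not address the removal of custom config types, which may need manual changes.

```python
import re

# Old name -> new name, taken from the rename list above.
RENAMES = {
    "scan_config_v2": "scan_config",
    "scan_by_key_config_v2": "scan_by_key_config",
    "radix_sort_config_v2": "radix_sort_config",
    "reduce_by_key_config_v2": "reduce_by_key_config",
}

# \b restricts matches to whole identifiers, so a longer name such as
# "my_scan_config_v2_helper" is left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_config_names(source: str) -> str:
    """Replace the old *_v2 config identifiers in source with the new names."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)
```

Running the helper over a copy of the sources and reviewing the resulting diff is safer than editing files in place.
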
Deprecations
  • Removed reduce_by_key_config, MatchAny, scan_config, scan_by_key_config, and radix_sort_config
Fixes
  • Build issues with rmake.py on Windows when using VS 2017 15.8 or later (due to a breaking fix with extended aligned storage)

rocRAND

rocRAND 3.0.1 for ROCm 6.1.2

Fixes
  • Implemented workaround for regressions in XORWOW and LFSR on MI200

rocSOLVER

rocSOLVER 3.25.0 for ROCm 6.1.2

Added
  • Eigensolver routines for symmetric/hermitian matrices using Divide & Conquer and Jacobi algorithm:
    • SYEVDJ (with batched and strided_batched versions)
    • HEEVDJ (with batched and strided_batched versions)
  • Generalized symmetric/hermitian-definite eigensolvers using Divide & Conquer and Jacobi algorithm:
    • SYGVDJ (with batched and strided_batched versions)
    • HEGVDJ (with batched and strided_batched versions)
Changed
  • Relaxed array length requirements for GESVDX with rocblas_srange_index.
Removed
  • Removed gfx803 and gfx900 from default build targets.
Fixed
  • Corrected singular vector normalization in BDSVDX and GESVDX
  • Fixed potential memory access fault in STEIN, SYEVX/HEEVX, SYGVX/HEGVX, BDSVDX and GESVDX

rocSPARSE

rocSPARSE 3.1.2 for ROCm 6.1.2

Additions
  • New LRB algorithm to SpMV, supporting CSR format
  • rocBLAS is now an optional dependency for SDDMM algorithms
  • Additional verbose output for csrgemm and bsrgemm
Optimizations
  • Triangular solve with multiple rhs (SpSM, csrsm, ...) now calls SpSV, csrsv, etc. when nrhs equals 1
  • Improved user manual section Installation and Building for Linux and Windows
  • Improved SpMV in CSR format on MI300

rocThrust

rocThrust 3.0.1 for ROCm 6.1.2

Fixes
  • Ported a fix from thrust 2.2 that ensures thrust::optional is trivially copyable.

rocWMMA

rocWMMA 1.4.0 for ROCm 6.1.2

Additions
  • Added bf16 support for hipRTC sample
Changes
  • Changed Clang C++ version to C++17
  • Updated rocwmma_coop API
  • Linked rocWMMA to hiprtc
Fixes
  • Fixed compile/runtime arch checks
  • Built all tests in large code model
  • Removed inefficient branching in layout loop unrolling

rpp

rpp 1.5.0 for ROCm 6.1.2

Changes
  • Prerequisites
Tested Configurations
  • Linux distribution
    • Ubuntu - 20.04 / 22.04
    • CentOS - 7
    • RHEL - 8/9
  • ROCm: rocm-core - 5.5.0.50500-63
  • Clang - Version 5.0.1 and above
  • CMake - Version 3.22.3
  • IEEE 754-based half-precision floating-point library - Version 1.12.0

Tensile

Tensile 4.40.0 for ROCm 6.1.2

Additions
  • new DisableKernelPieces values to invalidate local read, local write, and global read
  • stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
  • feature to allow testing stream-k grid multipliers
  • debug output to check occupancy for Stream-K
  • reject condition for FractionalLoad + DepthU!=power of 2
  • new TENSILE_DB debugging value to dump the common kernel parameters
  • predicate for APU libs
  • new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
  • new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
  • new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
  • initialization type and general batched options to the rocblas-bench input creator script
Optimizations
  • enabled MFMA + LocalSplitU=4 for MT16x16
  • enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
  • optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
  • auto LdsPad calculation for TileMajorLds + MI16x16
  • auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
Changes
  • cleared hipErrorNotFound error since it is an expected part of the search
  • modified hipcc search path for Linux
  • changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
  • changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
  • changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
  • updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
  • replaced std::regex with fnmatch()/PathMatchSpec as a workaround for a known std::regex stack overflow bug
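
For readers unfamiliar with fnmatch-style matching, it uses shell wildcards such as * and ? instead of a full regular expression engine. The following sketch illustrates the idea with Python's standard fnmatch module. It is not Tensile's code, the file names are made up, and Python's matcher differs in some details from the C fnmatch() and Windows PathMatchSpec functions.

```python
from fnmatch import fnmatchcase


def matching_names(names, pattern):
    """Return the entries of names that match a shell-style wildcard pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = ["gfx90a.yaml", "gfx942.yaml", "README.md"]
    print(matching_names(files, "gfx*.yaml"))
```
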
Fixes
  • hipcc compile append flag parallel-jobs=4
  • race condition in Stream-K that appeared with large grids and small sizes
  • mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
  • mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
  • incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
  • small fix for LdsPad optimization (LdsElement calculation)