mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-09 06:38:00 -05:00
Merge branch 'develop' into swraw/docs
@@ -98,6 +98,7 @@ DIMM
|
||||
DKMS
|
||||
DL
|
||||
DMA
|
||||
DOMContentLoaded
|
||||
DNN
|
||||
DNNL
|
||||
DPM
|
||||
@@ -552,6 +553,7 @@ ZenDNN
|
||||
accuracies
|
||||
activations
|
||||
addr
|
||||
addEventListener
|
||||
ade
|
||||
ai
|
||||
alloc
|
||||
@@ -585,6 +587,7 @@ boson
|
||||
bosons
|
||||
br
|
||||
BrainFloat
|
||||
btn
|
||||
buildable
|
||||
bursty
|
||||
bzip
|
||||
@@ -596,6 +599,7 @@ centric
|
||||
changelog
|
||||
checkpointing
|
||||
chiplet
|
||||
classList
|
||||
cmake
|
||||
cmd
|
||||
coalescable
|
||||
@@ -607,6 +611,7 @@ composable
|
||||
concretization
|
||||
config
|
||||
conformant
|
||||
const
|
||||
constructible
|
||||
convolutional
|
||||
convolves
|
||||
@@ -670,6 +675,7 @@ exascale
|
||||
executables
|
||||
ffmpeg
|
||||
filesystem
|
||||
forEach
|
||||
fortran
|
||||
fp
|
||||
framebuffer
|
||||
@@ -678,6 +684,7 @@ galb
|
||||
gcc
|
||||
gdb
|
||||
gemm
|
||||
getAttribute
|
||||
gfortran
|
||||
gfx
|
||||
githooks
|
||||
@@ -829,6 +836,8 @@ recommenders
|
||||
quantile
|
||||
quantizer
|
||||
quasirandom
|
||||
querySelector
|
||||
querySelectorAll
|
||||
queueing
|
||||
qwen
|
||||
radeon
|
||||
@@ -891,9 +900,11 @@ scalability
|
||||
scalable
|
||||
scipy
|
||||
seealso
|
||||
selectedTag
|
||||
sendmsg
|
||||
seqs
|
||||
serializers
|
||||
setAttribute
|
||||
sglang
|
||||
shader
|
||||
sharding
|
||||
@@ -920,6 +931,7 @@ symlink
|
||||
symlinks
|
||||
sys
|
||||
tabindex
|
||||
targetContainer
|
||||
td
|
||||
tensorfloat
|
||||
th
|
||||
|
||||
81
CHANGELOG.md
@@ -21,7 +21,7 @@ for a complete overview of this release.
|
||||
|
||||
* Default command:
|
||||
|
||||
A default view has been added. The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through laternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`.
|
||||
A default view has been added. The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through alternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`.
|
||||
|
||||
* Support for GPU metrics 1.8:
|
||||
- Added new fields for `amdsmi_gpu_xcp_metrics_t` including:
|
||||
@@ -30,7 +30,7 @@ for a complete overview of this release.
|
||||
- Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts
|
||||
- Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to be below application clocks.
|
||||
- Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see above new violation metrics).
|
||||
- Increased available JPEG engines to 40. Current ASICs may not support all 40. These are indicated as `UINT16_MAX` or `N/A` in CLI.
|
||||
- Increased available JPEG engines to 40. Current ASICs might not support all 40. These are indicated as `UINT16_MAX` or `N/A` in CLI.
|
||||
|
||||
* Bad page threshold count.
|
||||
- Added `amdsmi_get_gpu_bad_page_threshold` to Python API and CLI; root/sudo permissions required to display the count.
|
||||
@@ -99,32 +99,32 @@ for a complete overview of this release.
|
||||
|
||||
#### Removed
|
||||
|
||||
- Removed unnecessary API, `amdsmi_free_name_value_pairs()`
|
||||
- Unnecessary API, `amdsmi_free_name_value_pairs()`
|
||||
- This API is only used internally to free up memory from the Python interface and does not need to be
|
||||
exposed to the user.
|
||||
|
||||
- Removed unused definitions:
|
||||
- Unused definitions:
|
||||
- `AMDSMI_MAX_NAME`, `AMDSMI_256_LENGTH`, `AMDSMI_MAX_DATE_LENGTH`, `MAX_AMDSMI_NAME_LENGTH`, `AMDSMI_LIB_VERSION_YEAR`,
|
||||
`AMDSMI_DEFAULT_VARIANT`, `AMDSMI_MAX_NUM_POWER_PROFILES`, `AMDSMI_MAX_DRIVER_VERSION_LENGTH`.
|
||||
|
||||
- Removed unused member `year` in struct `amdsmi_version_t`.
|
||||
- Unused member `year` in struct `amdsmi_version_t`.
|
||||
|
||||
- Removed `amdsmi_io_link_type_t` and replaced with `amdsmi_link_type_t`.
|
||||
- `amdsmi_io_link_type_t`, which is replaced with `amdsmi_link_type_t`.
|
||||
- `amdsmi_io_link_type_t` is no longer needed as `amdsmi_link_type_t` is sufficient.
|
||||
- `amdsmi_link_type_t` enum has changed.
|
||||
- This change will also affect `amdsmi_link_metrics_t`, where the link_type field changes from `amdsmi_io_link_type_t` to `amdsmi_link_type_t`.
|
||||
|
||||
- Removed `amdsmi_get_power_info_v2()`.
|
||||
- `amdsmi_get_power_info_v2()`.
|
||||
- The ``amdsmi_get_power_info()`` has been unified and the v2 function is no longer needed or used.
|
||||
|
||||
- Removed `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`.
|
||||
- `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`.
|
||||
|
||||
- The `amdsmi_get_gpu_vram_info` now provides vendor names as a string.
|
||||
- `amdsmi_vram_vendor_type_t` enum structure is removed.
|
||||
- `amdsmi_vram_info_t` member named `amdsmi_vram_vendor_type_t` is changed to a character string.
|
||||
- `amdsmi_get_gpu_vram_info` now no longer requires decoding the vendor name as an enum.
|
||||
|
||||
- Removed backwards compatibility for `amdsmi_get_gpu_metrics_info()`'s,`jpeg_activity`and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy`.
|
||||
- Backwards compatibility for the `jpeg_activity` and `vcn_activity` fields of `amdsmi_get_gpu_metrics_info()`. Use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy` instead.
|
||||
- Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available.
|
||||
- Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion about which field to use. By removing backward compatibility, it is easier to identify the relevant field.
|
||||
- The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`.
|
||||
@@ -203,7 +203,7 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
|
||||
- `num_threads` Total number of threads in the group. The legacy API size is alias.
|
||||
- `__reduce_add_sync`, `__reduce_min_sync`, and `__reduce_max_sync` functions added for arithmetic reduction across lanes of a warp, and `__reduce_and_sync`, `__reduce_or_sync`, and `__reduce_xor_sync`
|
||||
functions added for logical reduction. For details, see [Warp cross-lane functions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-cross-lane-functions).
|
||||
* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as the following. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html).
|
||||
* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as follows. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html).
|
||||
- Data types for `FP4`/`FP6`/`FP8`.
|
||||
- HIP APIs for `FP4`/`FP6`/`FP8`, which are compatible with corresponding CUDA APIs.
|
||||
- HIP Extensions APIs for microscaling formats, which are supported on AMD GPUs.
|
||||
@@ -220,7 +220,7 @@ functions added for logical reduction. For details, see [Warp cross-lane functio
|
||||
|
||||
#### Changed
|
||||
* Some unsupported GPUs such as gfx9, gfx8 and gfx7 are deprecated on Microsoft Windows.
|
||||
* Removal of Beta warnings in HIP Graph APIs
|
||||
* Removal of beta warnings in HIP Graph APIs
|
||||
All beta warnings in usage of HIP Graph APIs are removed; the APIs are now officially and fully supported.
|
||||
* Behavior changes
|
||||
- `hipGetLastError` now returns the error code which is the last actual error caught in the current thread during the application execution.
|
||||
@@ -421,7 +421,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Added
|
||||
|
||||
* Added a new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. Without compression, in some cases, the generated binary may become so large symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default.
|
||||
* Added a new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. Without compression, in some cases, the generated binary may become so large that symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default.
|
||||
* Added single pass operators in `agent/single_pass_scan_operators.hpp` which contains the following API:
|
||||
* `BlockScanRunningPrefixOp`
|
||||
* `ScanTileStatus`
|
||||
@@ -437,7 +437,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Removed
|
||||
|
||||
* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you would like to build for these architectures, please specify them explicitly in the `AMDGPU_TARGETS` cmake option.
|
||||
* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you want to build for these architectures, specify them explicitly in the `AMDGPU_TARGETS` cmake option.
|
||||
* Deprecated `hipcub::AsmThreadLoad` is removed, use `hipcub::ThreadLoad` instead.
|
||||
* Deprecated `hipcub::AsmThreadStore` is removed, use `hipcub::ThreadStore` instead.
|
||||
* Deprecated `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed.
|
||||
@@ -587,7 +587,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
* Added element-wise binary operation support.
|
||||
* Added element-wise trinary operation support.
|
||||
* Added support for new GPU target gfx950.
|
||||
* Added support for GPU target gfx950.
|
||||
* Added dynamic unary and binary operator support for element-wise operations and permutation.
|
||||
* Added a CMake check for `f8` datatype availability.
|
||||
* Added `hiptensorDestroyOperationDescriptor` to free all resources related to the provided descriptor.
|
||||
@@ -629,7 +629,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
#### Added
|
||||
|
||||
* Added the compiler `-gsplit-dwarf` option to enable the generation of a separate debug information file at compile time. When used, separate debug information files are generated for the host and for each offload architecture. For additional information, see [DebugFission](https://gcc.gnu.org/wiki/DebugFission).
|
||||
* Added `llvm-flang`, AMD's next generation Fortran compiler is a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub.
|
||||
* Added `llvm-flang`, AMD's next-generation Fortran compiler. It's a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub.
|
||||
* Added Comgr support for an in-memory virtual file system (VFS) for storing temporary files generated during intermediate compilation steps to improve performance in the device library link step.
|
||||
* Added compiler support for a new target-specific builtin `__builtin_amdgcn_processor_is` for late or deferred queries of the current target processor, and `__builtin_amdgcn_is_invocable` to determine the current target processor's ability to invoke a particular builtin.
|
||||
* Added HIPIFY support for NVIDIA CUDA 12.9.1 APIs. Added support for all new device and host APIs, including FP4, FP6, and FP128, and support for the corresponding ROCm HIP equivalents.
|
||||
@@ -761,11 +761,11 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Known issues
|
||||
|
||||
* Installation on CentOS/RedHat/SLES requires the manual installation of the `FFMPEG` & `OpenCV` dev packages.
|
||||
* Installation on RHEL and SLES requires the manual installation of the `FFMPEG` and `OpenCV` dev packages.
|
||||
|
||||
#### Upcoming changes
|
||||
|
||||
* Optimized audio augmentations support for VX_RPP
|
||||
* Optimized audio augmentations support for VX_RPP.
|
||||
|
||||
### **RCCL** (2.26.6)
|
||||
|
||||
@@ -813,7 +813,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Known issues
|
||||
* Package installation on SLES requires manually installing `TurboJPEG`.
|
||||
* Package installation on CentOS, RedHat, and SLES requires manually installing the `FFMPEG Dev` package.
|
||||
* Package installation on RHEL and SLES requires manually installing the `FFMPEG Dev` package.
|
||||
|
||||
#### Upcoming changes
|
||||
|
||||
@@ -993,7 +993,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
* Individual `plugins`: The `plugins` (shared libraries) are available at: `/opt/rocm/lib/rocm_bandwidth_test/plugins/`
|
||||
|
||||
```{note}
|
||||
Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/rocm-rel-7.0/README.md) file for details about the new options and outputs.
|
||||
Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/amd-mainline/README.md) file for details about the new options and outputs.
|
||||
```
|
||||
|
||||
#### Changed
|
||||
@@ -1002,7 +1002,7 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
|
||||
#### Removed
|
||||
|
||||
- The old CLI, parameters, and switches used.
|
||||
- The old CLI, parameters, and switches.
|
||||
|
||||
### **ROCm Compute Profiler** (3.2.3)
|
||||
|
||||
@@ -1051,8 +1051,6 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
|
||||
* Support for Roofline plot on CLI (single run) analysis.
|
||||
|
||||
* Roofline support for RHEL 10 OS.
|
||||
|
||||
* `FP4` and `FP6` data types have been added for roofline profiling on AMD Instinct MI350 series.
|
||||
|
||||
##### rocprofv3 support
|
||||
@@ -1121,6 +1119,8 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
|
||||
* Memory chart on ROCm Compute Profiler CLI might look corrupted if the CLI width is too narrow.
|
||||
|
||||
* Roofline feature is currently not functional on Azure Linux 3.0 and Debian 12.
|
||||
|
||||
#### Upcoming changes
|
||||
|
||||
* ``rocprof v1/v2/v3`` interfaces will be removed in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. Using ``rocprof v1/v2/v3`` interfaces will trigger a deprecation warning.
|
||||
@@ -1166,7 +1166,7 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
#### Removed
|
||||
|
||||
- Removed backwards compatibility for `rsmi_dev_gpu_metrics_info_get()`'s `jpeg_activity` and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` and `xcp_stats.vcn_busy`.
|
||||
- Backwards compability is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available.
|
||||
- Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available.
|
||||
- Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion for users about which field to use. By removing backward compatibility, it is easier to identify the relevant field.
|
||||
- The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`.
|
||||
|
||||
@@ -1225,7 +1225,6 @@ See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/rele
|
||||
* Added new optimization to the backend for `device_transform` when the input and output are pointers.
|
||||
* Added `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers.
|
||||
* Added `rocprim:device_transform` for n-ary transform operations API with as input `n` number of iterators inside a `rocprim::tuple`.
|
||||
* Added gfx950 support.
|
||||
* Added `rocprim::key_value_pair::operator==`.
|
||||
* Added the `rocprim::unrolled_copy` thread function to copy multiple items inside a thread.
|
||||
* Added the `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`.
|
||||
@@ -1242,12 +1241,12 @@ See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/rele
|
||||
|
||||
#### Changed
|
||||
|
||||
* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits` respectively.
|
||||
* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits`, respectively.
|
||||
* Marked the initialisation constructor of `rocprim::reverse_iterator<Iter>` `explicit`, use `rocprim::make_reverse_iterator`.
|
||||
* Merged `radix_key_codec` into type_traits system.
|
||||
* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, and renamed the original `type_traits.hpp` to `type_traits_functions.hpp`.
|
||||
* The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change.
|
||||
The previous default accumulator types could lead to situations in which unexpected overflow occured, such as when the input or inital type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing:
|
||||
The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing:
|
||||
|
||||
* `rocprim::inclusive_scan`
|
||||
* Previous default: `class AccType = typename std::iterator_traits<InputIterator>::value_type>`
|
||||
@@ -1262,7 +1261,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
* Previous default: `class AccType = detail::input_type_t<InitValueType>>`
|
||||
* Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, rocprim::detail::input_type_t<InitValueType>>`
|
||||
* Undeprecated internal `detail::raw_storage`.
|
||||
* A new version of `rocprim::thread_load` and `rocprim::thread_store` replace the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result.
|
||||
* A new version of `rocprim::thread_load` and `rocprim::thread_store` replaces the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result.
|
||||
* Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better.
|
||||
* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocprim::symbol`, letting the user link multiple libraries built with different versions of rocPRIM.
|
||||
|
||||
@@ -1287,7 +1286,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
* `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead.
|
||||
* Removed the deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead.
|
||||
* Removed `rocprim/detail/radix_sort.hpp`, functionality can now be found in `rocprim/thread/radix_key_codec.hpp`.
|
||||
* Removed C++14 support, only C++17 is supported.
|
||||
* Removed C++14 support. Only C++17 is supported.
|
||||
* Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed:
|
||||
* `rocprim::device_warp_size()`
|
||||
* For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory.
|
||||
@@ -1311,7 +1310,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Known issues
|
||||
|
||||
* * When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs.
|
||||
* When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs.
|
||||
|
||||
### **ROCprofiler-SDK** (1.0.0)
|
||||
|
||||
@@ -1551,7 +1550,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Resolved issues
|
||||
|
||||
* Fixed an issue with internal calls to unqualified `distance()` which would be ambigious due to also visibile implementation through ADL.
|
||||
* Fixed an issue with internal calls to unqualified `distance()` which would be ambiguous due to the visible implementation through ADL.
|
||||
|
||||
#### Known issues
|
||||
|
||||
@@ -1565,10 +1564,10 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Added
|
||||
|
||||
* Added internal register layout transforms to support interleaved MMA layouts.
|
||||
* Added support for the gfx950 target.
|
||||
* Added mixed input `BF8`/`FP8` types for MMA support.
|
||||
* Added fragment scheduler API objects to embed thread block cooperation properties in fragments.
|
||||
* Internal register layout transforms to support interleaved MMA layouts.
|
||||
* Support for the gfx950 target.
|
||||
* Mixed input `BF8`/`FP8` types for MMA support.
|
||||
* Fragment scheduler API objects to embed thread block cooperation properties in fragments.
|
||||
|
||||
#### Changed
|
||||
|
||||
@@ -1582,9 +1581,9 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Removed
|
||||
|
||||
* Removed support for the gfx940 and gfx941 targets.
|
||||
* Removed the rocWMMA cooperative API.
|
||||
* Removed wave count template parameters from transforms APIs.
|
||||
* Support for the gfx940 and gfx941 targets.
|
||||
* The rocWMMA cooperative API.
|
||||
* Wave count template parameters from transforms APIs.
|
||||
|
||||
#### Optimized
|
||||
|
||||
@@ -1611,7 +1610,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
* Handle creation and destruction APIs have been consolidated. Use `rppCreate()` for handle initialization and `rppDestroy()` for handle destruction.
|
||||
* The `logical_operations` function category has been renamed to `bitwise_operations`.
|
||||
* TurboJPEG package installation enabled for RPP Test Suite with `sudo apt-get install libturbojpeg0-dev`. Instructions have been updated in utilities/test_suite/README.md.
|
||||
* The `swap_channels` augmentation has been changed to `channel_permute`. `channel_permute` now also accepts a new argument, `permutationTensor` (pointer to a unsigned int tensor) that provides the permutation order to swap the RGB channels of each input image in the batch in any order:
|
||||
* The `swap_channels` augmentation has been changed to `channel_permute`. `channel_permute` now also accepts a new argument, `permutationTensor` (pointer to an unsigned int tensor), that provides the permutation order to swap the RGB channels of each input image in the batch in any order:
|
||||
|
||||
`RppStatus rppt_swap_channels_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, rppHandle_t rppHandle);`
|
||||
|
||||
@@ -1626,7 +1625,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Resolved issues
|
||||
|
||||
* Test package - debian packages will install required dependencies.
|
||||
* Test package - Debian packages will install required dependencies.
|
||||
|
||||
### **Tensile** (4.44.0)
|
||||
|
||||
@@ -1636,7 +1635,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
- Added code object compression via bundling.
|
||||
- Added support for non-default HIP SDK installations on Windows.
|
||||
- Added master solution library documentation.
|
||||
- Added compiler version dependent assembler and architecture capabilities.
|
||||
- Added compiler version-dependent assembler and architecture capabilities.
|
||||
- Added documentation from GitHub Wiki to ROCm docs.
|
||||
|
||||
#### Changed
|
||||
@@ -1659,7 +1658,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
- Fixed configure time path not being invoked at build.
|
||||
- Fixed find_package for msgpack to work with versions 5 and 6.
|
||||
- Fixed rhel9 testing.
|
||||
- Fixed RHEL 9 testing.
|
||||
- Fixed gfx908 builds.
|
||||
- Fixed the 'argument list too long' error.
|
||||
- Fixed version typo in 6.3 changelog.
|
||||
|
||||
242
RELEASE.md
@@ -45,9 +45,7 @@ ROCm 7.0.0 adds support for [AMD Instinct MI355X](https://www.amd.com/en/product
|
||||
ROCm 7.0.0 adds support for the following operating systems and kernel versions:
|
||||
|
||||
* Ubuntu 24.04.3 (kernel: 6.8 [GA], 6.14 [HWE])
|
||||
* RHEL 10 (kernel: 6.12.0-55)
|
||||
* Oracle Linux 10 (kernel: 6.12.0 UEK)
|
||||
* Rocky 9 (kernel: 5.14.0-570)
|
||||
* Rocky Linux 9 (kernel: 5.14.0-570)
|
||||
|
||||
ROCm 7.0.0 marks the end of support (EoS) for Ubuntu 24.04.2 (kernel: 6.8 [GA], 6.11 [HWE]) and SLES 15 SP6.
|
||||
|
||||
@@ -65,10 +63,22 @@ All KVM-based SR-IOV supported configurations require the GIM SR-IOV driver vers
|
||||
|
||||
### Deep learning and AI framework updates
|
||||
|
||||
ROCm 7.0 introduces several newly supported versions of Deep learning and AI frameworks. For more information, see [Deep learning frameworks for ROCm](https://rocm.docs.amd.com/en/latest/how-to/deep-learning-rocm.html) and the [Compatibility
|
||||
ROCm provides a comprehensive ecosystem for deep learning development. For more information, see [Deep learning frameworks for ROCm](https://rocm.docs.amd.com/en/latest/how-to/deep-learning-rocm.html) and the [Compatibility
|
||||
matrix](../../docs/compatibility/compatibility-matrix.rst) for the complete list of Deep learning and AI framework versions tested for compatibility with ROCm.
|
||||
|
||||
#### PyTorch
|
||||
#### New frameworks
|
||||
|
||||
AMD ROCm has officially added support for the following Deep learning and AI frameworks:
|
||||
|
||||
* Ray is a unified framework for scaling AI and Python applications from your laptop to a full cluster, without changing your code. Ray consists of a core distributed runtime and a set of AI libraries for simplifying machine learning computations. It is currently supported on ROCm 6.4.1. For more information, see [Ray compatibility](https://advanced-micro-devices-rocm-internal--500.com.readthedocs.build/en/500/compatibility/ml-compatibility/ray-compatibility.html).
|
||||
|
||||
* llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup. It is currently supported on ROCm 6.4.0. For more information, see [llama.cpp compatibility](https://advanced-micro-devices-rocm-internal--500.com.readthedocs.build/en/500/compatibility/ml-compatibility/llama-cpp-compatibility.html).
|
||||
|
||||
#### Updated framework support
|
||||
|
||||
ROCm 7.0 introduces several newly supported versions of Deep learning and AI frameworks:
|
||||
|
||||
##### PyTorch
|
||||
|
||||
ROCm 7.0 enables the following PyTorch features:
|
||||
|
||||
@@ -77,11 +87,11 @@ ROCm 7.0 enables the following PyTorch features:
|
||||
* Compilation of Python C++ extensions using ``amdclang++``.
|
||||
* Support for channels-last NHWC format for convolutions via MIOpen.
|
||||
|
||||
#### JAX
|
||||
##### JAX
|
||||
|
||||
ROCm 7.0 enables support for JAX 0.6.0.
|
||||
|
||||
#### Megatron-LM
|
||||
##### Megatron-LM
|
||||
|
||||
Megatron-LM for ROCm now supports:
|
||||
|
||||
@@ -91,26 +101,26 @@ Megatron-LM for ROCm now supports:
|
||||
|
||||
* Fused_bias_swiglu kernel.
|
||||
|
||||
#### TensorFlow
|
||||
##### TensorFlow
|
||||
|
||||
ROCm 7.0 enables support for TensorFlow 2.19.1.
|
||||
|
||||
#### ONNX Runtime
|
||||
##### ONNX Runtime
|
||||
|
||||
ROCm 7.0 enables support for ONNX Runtime 1.22.1.
|
||||
ROCm 7.0 enables support for ONNX Runtime 1.22.0.
|
||||
|
||||
#### vLLM
|
||||
##### vLLM
|
||||
|
||||
* Support for Open Compute Project (OCP) `FP8` data type.
|
||||
* `FP4` precision for Llama 3.1 405B.
|
||||
|
||||
#### Triton
|
||||
##### Triton
|
||||
|
||||
ROCm 7.0 enables support for Triton 3.3.0.
|
||||
|
||||
### Instinct Driver/ROCm packaging separation
|
||||
|
||||
The Instinct Driver is now distributed separately from the ROCm software stack and is stored under in its own location ``/amdgpu/`` in the package repository at [repo.radeon.com](https://repo.radeon.com/amdgpu/). The first release is designated as Instinct Driver version 30.10. See [ROCm Gets Modular: Meet the Instinct Datacenter GPU Driver](https://rocm.blogs.amd.com/ecosystems-and-partners/instinct-gpu-driver/README.html) for more information.
|
||||
The Instinct Driver is now distributed separately from the ROCm software stack and is stored in its own location, ``/amdgpu/``, in the package repository at [repo.radeon.com](https://repo.radeon.com/amdgpu/). The first release is designated as Instinct Driver version 30.10. See the [ROCm Gets Modular: Meet the Instinct Datacenter GPU Driver](https://rocm.blogs.amd.com/ecosystems-and-partners/instinct-gpu-driver/README.html) blog and [User and kernel-space support matrix](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/user-kernel-space-compat-matrix.html) for more information.
|
||||
|
||||
[AMD SMI](https://github.com/ROCm/amdsmi) continues to stay with the ROCm software stack under the ROCm organization repository.
|
||||
|
||||
@@ -127,11 +137,11 @@ The HIP runtime now includes support for:
|
||||
* `constexpr` operators for `FP16` and `BF16`.
|
||||
* `__syncwarp` operation.
|
||||
* The `_sync()` version of crosslane builtins such as `shfl_sync()` are enabled by default. These can be disabled by setting the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`.
|
||||
* Added warp level primitives: `__syncwarp` and reduce intrinsics (e.g. `__reduce_add_sync()`).
|
||||
* Added warp level primitives: `__syncwarp` and reduce intrinsics (for example, `__reduce_add_sync()`).
|
||||
* Extended fine grained system memory pool.
|
||||
* A new HIP runtime attribute exposes the number of compute dies (chiplets, XCC) available on a given GPU. Developers can query this attribute through the `hipDeviceGetAttribute` API to make the best use of cache locality in a kernel and to optimize the kernel launch grid layout for better performance.
|
||||
|
||||
In addition, the HIP runtime includes functional improvements, which improves functionality, runtime performance, and user experience. For more information, see [HIP changelog](#hip-7-0-0) below.
|
||||
Additionally, the HIP runtime includes functional improvements, which improve functionality, runtime performance, and the user experience. For more information, see [HIP changelog](#hip-7-0-0) below.
|
||||
|
||||
### Compiler changes and improvements
|
||||
|
||||
@@ -152,11 +162,11 @@ Key compiler enhancements include:
|
||||
* Added a new target-specific builtin ``__builtin_amdgcn_is_invocable``, enabling fine-grained, per-builtin feature availability.
|
||||
* The compiler driver now uses parallel code generation by default when compiling using full LTO (including when using the `-fgpu-rdc` option) for HIP. This divides the optimized LLVM IR module into roughly equal partitions before instruction selection and lowering, which can help improve build times.
|
||||
|
||||
Each kernel in the linked LTO module can be put in a separate partition, and any non-inlined function it depends on may be copied alongside it. Thus, while parallel code generation can improve build time, it can duplicate non-inlined, non-kernel functions across multiple partitions, potentially increasing the binary size of the final object file.
|
||||
Each kernel in the linked LTO module can be put in a separate partition, and any non-inlined function it depends on can be copied alongside it. Thus, while parallel code generation can improve build time, it can duplicate non-inlined, non-kernel functions across multiple partitions, potentially increasing the binary size of the final object file.
|
||||
|
||||
* Compiler option `-flto-partitions=<num>`:
|
||||
|
||||
Equivalent to the `--lto-partitions=<num>` LLD option. Controls the number of partitions used for parallel code generation when using full LTO (including when using `-fgpu-rdc`). The number of partitions must be greater than 0, and a value of 1 disables the feature. The default value is 8.
|
||||
Equivalent to the `--lto-partitions=<num>` LLD option. Controls the number of partitions used for parallel code generation when using full LTO (including when using `-fgpu-rdc`). The number of partitions must be greater than 0, and a value of 1 turns off the feature. The default value is 8.
|
||||
|
||||
Developers are encouraged to experiment with different numbers of partitions using the `-flto-partitions` Clang command line option. Recommended values are 1 to 16 partitions, with especially large projects containing many kernels potentially benefiting from up to 64 partitions. It is not recommended to use a value greater than the number of threads on the machine. Smaller projects, or those containing only a few kernels, might not benefit at all from partitioning and might even experience a slight increase in build time due to the small overhead of analyzing and partitioning the modules.
|
||||
|
||||
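The partition guidance above can be turned into a small helper. The following is a minimal sketch, not part of the compiler or of ROCm: the function names and the rule of capping the partition count at the kernel count are assumptions made for illustration. Only the `-flto-partitions=<num>` option, the 1 to 16 range, the upper bound of 64 for very large projects, and the advice not to exceed the machine's thread count come from the text.

```python
import os


def suggest_lto_partitions(kernel_count, cpu_threads=None, large_project=False):
    """Pick a starting value for -flto-partitions (hypothetical heuristic).

    The result never exceeds the number of hardware threads, stays within
    1 to 16 (or up to 64 for especially large projects), and is never
    larger than the number of kernels, since extra partitions would have
    nothing to hold (this last cap is an assumption, not a documented rule).
    """
    threads = cpu_threads or os.cpu_count() or 1
    upper_bound = 64 if large_project else 16
    return max(1, min(kernel_count, threads, upper_bound))


def lto_partition_flag(partitions):
    """Format the Clang option named in the release notes."""
    if partitions < 1:
        raise ValueError("the number of partitions must be greater than 0")
    return f"-flto-partitions={partitions}"


if __name__ == "__main__":
    # Example: a project with 40 kernels built on the current machine.
    print(lto_partition_flag(suggest_lto_partitions(kernel_count=40)))
```

Treat the suggested value only as a starting point; the release notes recommend experimenting with different partition counts, because small projects might see no benefit at all.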
@@ -169,11 +179,10 @@ Key compiler enhancements include:
|
||||
|
||||
#### New data type support
|
||||
|
||||
MX-compliant data types bring microscaling support to ROCm. For more information, see the [OCP Microscaling (MX) Formats Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf). The ROCm 7.0 enables functional support for MX data types `FP4`, `FP6`, and `FP8` on AMD Instinct MI350 series accelerators in these ROCm libraries:
|
||||
MX-compliant data types bring microscaling support to ROCm. For more information, see the [OCP Microscaling (MX) Formats Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf). ROCm 7.0 enables functional support for MX data types `FP4`, `FP6`, and `FP8` on AMD Instinct MI350 series accelerators in these ROCm libraries:
|
||||
|
||||
* Composable Kernel (`FP4`, `FP6`, and `FP8` only)
|
||||
* hipBLASLt
|
||||
* MIGraphX (`FP4` only)
|
||||
|
||||
The following libraries are updated to support the Open Compute Project (OCP) floating-point `FP8` format on MI350 series accelerators instead of the NANOO `FP8` format:
|
||||
|
||||
@@ -183,8 +192,6 @@ The following libraries are updated to support the Open Compute Project (OCP) fl
|
||||
* MIGraphX
|
||||
* rocWMMA
|
||||
|
||||
MIGraphX now also supports `BF16`.
|
||||
|
||||
For more information about data types, see [Data types and precision support](https://rocm.docs.amd.com/en/latest/reference/precision-support.html).
|
||||
|
||||
#### hipBLASLt improvement
|
||||
@@ -193,10 +200,12 @@ GEMM performance has been improved for `FP8`, `FP16`, `BF16`, and `FP32` data ty
|
||||
|
||||
For more information about hipBLASLt changes, see the [hipBLASLt changelog](#hipblaslt-1-0-0) below.
|
||||
|
||||
#### MIGraphX support
|
||||
#### MIGraphX improvements
|
||||
|
||||
* Support for OCP `FP8` on AMD Instinct MI350X and MI355X accelerators.
|
||||
* Support for PyTorch 2.7 via Torch-MIGraphX.
|
||||
* Improved performance of Generative AI models
|
||||
* Added additional MSFT Contrib Operators for improved ONNX Runtime Experience
|
||||
|
||||
For more information about MIGraphX changes, see the [MIGraphX changelog](#migraphx-2-13-0) below.
|
||||
|
||||
@@ -217,7 +226,7 @@ have been refined for improved usability. See the [AMD SMI changelog](#amd-smi-2
|
||||
|
||||
#### ROCgdb
|
||||
|
||||
The MX data types now support `FP4`, `FP6`, and `FP8`.
|
||||
The micro-scaling (MX) data types now support `FP4`, `FP6`, and `FP8`.
|
||||
|
||||
See the [ROCgdb changelog](#rocgdb-16-3) for more details.
|
||||
|
||||
@@ -225,11 +234,14 @@ See the [ROCgdb changelog](#rocgdb-16-3) for more details.
|
||||
|
||||
ROCm Compute Profiler includes the following key changes:
|
||||
|
||||
* MX data types support: `FP4`, `FP6`, and `FP8`.
|
||||
* AMD Instinct MI355X and MI350X performance counters: CPC, SPI, SQ, TA/TD/TCP, and TCC.
|
||||
* Enhanced roofline analysis with support for `INT8`, `INT32`, `FP8`, `FP16`, and `BF16` data types.
|
||||
* Roofline distinction for `FP32` and `FP64` data types.
|
||||
* Selective kernel profiling.
|
||||
* Interactive command line with a Textual User Interface (TUI) has been added to analyze mode. For more details, see [TUI analysis](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/amd-staging/how-to/analyze/tui.html).
|
||||
* Support added for advanced data types: `FP4` and `FP6`
|
||||
* Support for AMD Instinct MI355X and MI350X with addition of performance counters: CPC, SPI, SQ, TA/TD/TCP, and TCC.
|
||||
* Roofline enhancement added for AMD Instinct MI350 series.
|
||||
* Improved support for Selective Kernel profiling.
|
||||
* Program Counter (PC) sampling (Software-based) feature has been enabled for AMD Instinct MI200, MI300X, MI350X, and MI355X accelerators. This feature helps in GPU profiling to understand code execution patterns and hotspots during GPU kernel execution. For more details, see [Using PC sampling in ROCm Compute Profiler](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/amd-staging/how-to/pc_sampling.html).
|
||||
* Program Counter (PC) sampling (Hardware-based, Stochastic) feature has been enabled for AMD Instinct MI300X, MI350, and MI355X accelerators.
|
||||
* Docker files have been added to package the application and dependencies into a single portable and executable standalone binary file.
|
||||
|
||||
See the [ROCm Compute Profiler changelog](#rocm-compute-profiler-3-2-3) for more details.
|
||||
|
||||
@@ -241,14 +253,14 @@ The ROCm Data Center tool (RDC) streamlines the administration of AMD GPUs in cl
|
||||
|
||||
ROCm Systems Profiler includes the following key changes:
|
||||
|
||||
* Trace support for computer vision APIs: H264, H265, AV1, VP9, and JPEG.
|
||||
* Trace support for computer vision engine activity.
|
||||
* OpenMP for C++ language and kernel activity support.
|
||||
* Improved profiling support for Computer Vision workloads through rocDecode and rocJPEG API tracing and engine activity sampling.
|
||||
* Network profiling support has been added to AMD Instinct MI300X, MI350X, and MI355X.
|
||||
* Improved profiling of the communication layer with RCCL and MPI API tracing.
|
||||
|
||||
See the [ROCm Systems Profiler changelog](#rocm-systems-profiler-1-1-0) for more details.
|
||||
|
||||
#### ROCm Validation Suite
|
||||
AMD Instinct MI355X and MI350X accelerator support in the IET (Integrated Execution Test), GST (GPU Stress Test), and Babel (memory bandwidth test) modules.
|
||||
In ROCm 7.0, ROCm Validation Suite includes support for the AMD Instinct MI355X and MI350X accelerators in the IET (Integrated Execution Test), GST (GPU Stress Test), and Babel (memory bandwidth test) modules.
|
||||
|
||||
See the [ROCm Validation Suite changelog](#rocm-validation-suite-1-2-0) for more details.
|
||||
|
||||
@@ -260,7 +272,7 @@ See the [ROCm Validation Suite changelog](#rocm-validation-suite-1-2-0) for more
|
||||
* ROCprofiler-SDK adds support for AMD Instinct MI350X and MI355X accelerators.
|
||||
* The stochastic and host-trap PC sampling support has been added for all AMD Instinct MI300 and MI350 series accelerators, which
|
||||
provides information particularly useful for understanding stalls during kernel execution.
|
||||
* The added support for tracing events surfaced by AMD's Kernel Fusion Driver (KFD) captures low level driver routines involved in mapping, invalidation, and migration of data between CPU and GPU memories. Such events are central to the support for [Unified Memory](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/memory_management/unified_memory.html) on AMD systems. Tracing of KFD events helps to detect performance problems arising from excessive data migration.
|
||||
* The added support for tracing events surfaced by AMD's Kernel Fusion Driver (KFD) captures low-level driver routines involved in mapping, invalidation, and migration of data between CPU and GPU memories. Such events are central to the support for [Unified Memory](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/memory_management/unified_memory.html) on AMD systems. Tracing of KFD events helps to detect performance problems arising from excessive data migration.
|
||||
* New APIs are added for profiling applications using thread traces (beta)
|
||||
which facilitates profiling wavefronts at the instruction timing level.
|
||||
|
||||
@@ -282,8 +294,8 @@ See the [ROCprofiler-SDK changelog](#rocprofiler-sdk-1-0-0) for more details.
|
||||
|
||||
The ROCm Offline Installer Creator 7.0.0 includes the following features and improvements:
|
||||
|
||||
* Added support for RHEL 10.0, Oracle 10.0, and Rocky 9.6.
|
||||
* Added support for the new graphics repo structure for graphics/mesa related packages.
|
||||
* Added support for Rocky Linux 9.6.
|
||||
* Added support for the new graphics repo structure for graphics/Mesa related packages.
|
||||
* Improvements to kernel header version matching for AMDGPU driver installation.
|
||||
* Added support for creating an offline installer when the kernel version of the target operating system differs from the operating system of the host creating the installer (for Ubuntu 22.04 and 24.04 only).
|
||||
|
||||
@@ -293,7 +305,7 @@ See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-
|
||||
|
||||
The ROCm Runfile Installer 7.0.0 adds the following features and improvements:
|
||||
|
||||
* Added support for RHEL 10.0, Oracle 10.0, and Rocky 9.6.
|
||||
* Added support for Rocky Linux 9.6.
|
||||
* Added `untar` mode for the `.run` file to allow extraction of ROCm to a given directory, similar to a normal tarball.
|
||||
* Added an RVS test script.
|
||||
* Fixes to the rocm-examples test script.
|
||||
@@ -324,9 +336,6 @@ ROCm documentation continues to be updated to provide clearer and more comprehen
|
||||
|
||||
* [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/develop/reference/env-variables.html)
|
||||
* [hipSPARSELt](https://rocm.docs.amd.com/projects/hipSPARSELt/en/develop/reference/env-variables.html)
|
||||
* [MIVisionX](https://rocm.docs.amd.com/projects/MIVisionX/en/develop/reference/MIVisionX-env-variables.html)
|
||||
* [MIOpen](https://rocm.docs.amd.com/projects/MIOpen/en/develop/reference/env_variables.html)
|
||||
* [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/develop/reference/env-variables.html)
|
||||
* [ROCm Performance Primitives (RPP)](https://rocm.docs.amd.com/projects/rpp/en/develop/reference/rpp-env-variables.html)
|
||||
* [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/develop/reference/env_variables.html)
|
||||
* [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/develop/reference/env_variables.html)
|
||||
@@ -372,7 +381,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/MIOpen/en/docs-6.4.3/index.html">MIOpen</a></td>
|
||||
<td>3.4.0 ⇒ <a href="#miopen-3-5-0">3.5.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/MIOpen"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/miopen"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/MIVisionX/en/docs-6.4.3/index.html">MIVisionX</a></td>
|
||||
@@ -425,17 +434,17 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<th rowspan="16">Math</th>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipBLAS/en/docs-6.4.3/index.html">hipBLAS</a></td>
|
||||
<td>2.4.0 ⇒ <a href="#hipblas-3-0-0">3.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipBLAS"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblas"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.4.3/index.html">hipBLASLt</a></td>
|
||||
<td>0.12.1 ⇒ <a href="#hipblaslt-1-0-0">1.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipBLASLt"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipFFT/en/docs-6.4.3/index.html">hipFFT</a></td>
|
||||
<td>1.0.18 ⇒ <a href="#hipfft-1-0-20">1.0.20</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipFFT"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipfft"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipfort/en/docs-6.4.3/index.html">hipfort</a></td>
|
||||
@@ -445,7 +454,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipRAND/en/docs-6.4.3/index.html">hipRAND</a></td>
|
||||
<td>2.12.0 ⇒ <a href="#hiprand-3-0-0">3.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipRAND"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hiprand"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipSOLVER/en/docs-6.4.3/index.html">hipSOLVER</a></td>
|
||||
@@ -455,12 +464,12 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipSPARSE/en/docs-6.4.3/index.html">hipSPARSE</a></td>
|
||||
<td>3.2.0 ⇒ <a href="#hipsparse-4-0-1">4.0.1</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipSPARSE"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipsparse"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipSPARSELt/en/docs-6.4.3/index.html">hipSPARSELt</a></td>
|
||||
<td>0.2.3 ⇒ <a href="#hipsparselt-0-2-4">0.2.4</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipSPARSELt"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipsparselt"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocALUTION/en/docs-6.4.3/index.html">rocALUTION</a></td>
|
||||
@@ -470,17 +479,17 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocBLAS/en/docs-6.4.3/index.html">rocBLAS</a></td>
|
||||
<td>4.4.1 ⇒ <a href="#rocblas-5-0-0">5.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/rocBLAS"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocblas"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocFFT/en/docs-6.4.3/index.html">rocFFT</a></td>
|
||||
<td>1.0.32 ⇒ <a href="#rocfft-1-0-34">1.0.34</a></td>
|
||||
<td><a href="https://github.com/ROCm/rocFFT"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocfft"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocRAND/en/docs-6.4.3/index.html">rocRAND</a></td>
|
||||
<td>3.3.0 ⇒ <a href="#rocrand-4-0-0">4.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/rocRAND"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocrand"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocSOLVER/en/docs-6.4.3/index.html">rocSOLVER</a></td>
|
||||
@@ -490,7 +499,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocSPARSE/en/docs-6.4.3/index.html">rocSPARSE</a></td>
|
||||
<td>3.4.0 ⇒ <a href="#rocsparse-4-0-2">4.0.2</a></td>
|
||||
<td><a href="https://github.com/ROCm/rocSPARSE"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocsparse"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocWMMA/en/docs-6.4.3/index.html">rocWMMA</a></td>
|
||||
@@ -500,7 +509,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/Tensile/en/docs-6.4.3/src/index.html">Tensile</a></td>
|
||||
<td>4.43.0 ⇒ <a href="#tensile-4-44-0">4.44.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/Tensile"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/shared/tensile"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
<tbody class="rocm-components-libs rocm-components-primitives tbody-reverse-zebra">
|
||||
@@ -509,7 +518,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<th rowspan="4">Primitives</th>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipCUB/en/docs-6.4.3/index.html">hipCUB</a></td>
|
||||
<td>3.4.0 ⇒ <a href="#hipcub-4-0-0">4.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/hipCUB"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipcub"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/hipTensor/en/docs-6.4.3/index.html">hipTensor</a></td>
|
||||
@@ -519,12 +528,12 @@ Click {fab}`github` to go to the component's source code on GitHub.
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocPRIM/en/docs-6.4.3/index.html">rocPRIM</a></td>
|
||||
<td>3.4.1 ⇒ <a href="#rocprim-4-0-0">4.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/rocPRIM"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocprim"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td><a href="https://rocm.docs.amd.com/projects/rocThrust/en/docs-6.4.3/index.html">rocThrust</a></td>
|
||||
<td>3.3.0 ⇒ <a href="#rocthrust-4-0-0">4.0.0</a></td>
|
||||
<td><a href="https://github.com/ROCm/rocThrust"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocthrust"><i class="fab fa-github fa-lg"></i></a></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
<tbody class="rocm-components-tools rocm-components-system tbody-reverse-zebra">
|
||||
@@ -684,7 +693,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
|
||||
|
||||
* Default command:
|
||||
|
||||
A default view has been added. The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through laternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`.
|
||||
A default view has been added. The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through alternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`.
|
||||
|
||||
* Support for GPU metrics 1.8:
|
||||
- Added new fields for `amdsmi_gpu_xcp_metrics_t` including:
|
||||
@@ -693,7 +702,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
|
||||
- Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts
|
||||
- Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to be below application clocks.
|
||||
- Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see the new violation metrics above).
|
||||
- Increased available JPEG engines to 40. Current ASICs may not support all 40. These are indicated as `UINT16_MAX` or `N/A` in CLI.
|
||||
- Increased available JPEG engines to 40. Current ASICs might not support all 40. These are indicated as `UINT16_MAX` or `N/A` in CLI.
|
||||
|
||||
* Bad page threshold count.
|
||||
- Added `amdsmi_get_gpu_bad_page_threshold` to Python API and CLI; root/sudo permissions required to display the count.
|
||||
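The default command described above lends itself to scripting. The following is a minimal sketch that builds the command lines quoted in the changelog and, when the tool is installed, parses the JSON output. The helper names are hypothetical, and the layout of the JSON document is not described in this changelog, so the sketch treats it as opaque data.

```python
import json
import shutil
import subprocess


def amd_smi_default_cmd(output_format=None):
    """Return the argv for the default view.

    output_format is None for the plain view (`amd-smi` with no arguments),
    or "json" / "csv" for `amd-smi default --json` / `amd-smi default --csv`.
    """
    if output_format is None:
        return ["amd-smi"]
    if output_format not in ("json", "csv"):
        raise ValueError("output_format must be None, 'json', or 'csv'")
    return ["amd-smi", "default", f"--{output_format}"]


def read_default_view():
    """Run the default view as JSON, or return None if amd-smi is missing."""
    if shutil.which("amd-smi") is None:
        return None
    result = subprocess.run(
        amd_smi_default_cmd("json"),
        capture_output=True,
        text=True,
        check=True,
    )
    # The schema is an unknown here, so the parsed value is returned as-is.
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(" ".join(amd_smi_default_cmd("json")))
    print(read_default_view())
```

Keeping the command construction separate from the subprocess call makes the sketch usable on machines without a GPU, for example when generating command lines for a remote host.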
@@ -762,32 +771,32 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
|
||||
|
||||
#### Removed
|
||||
|
||||
- Removed unnecessary API, `amdsmi_free_name_value_pairs()`
|
||||
- Unnecessary API, `amdsmi_free_name_value_pairs()`
|
||||
- This API is only used internally to free up memory from the Python interface and does not need to be
|
||||
exposed to the user.
|
||||
|
||||
- Removed unused definitions:
|
||||
- Unused definitions:
|
||||
- `AMDSMI_MAX_NAME`, `AMDSMI_256_LENGTH`, `AMDSMI_MAX_DATE_LENGTH`, `MAX_AMDSMI_NAME_LENGTH`, `AMDSMI_LIB_VERSION_YEAR`,
|
||||
`AMDSMI_DEFAULT_VARIANT`, `AMDSMI_MAX_NUM_POWER_PROFILES`, `AMDSMI_MAX_DRIVER_VERSION_LENGTH`.
|
||||
|
||||
- Removed unused member `year` in struct `amdsmi_version_t`.
|
||||
- Unused member `year` in struct `amdsmi_version_t`.
|
||||
|
||||
- Removed `amdsmi_io_link_type_t` and replaced with `amdsmi_link_type_t`.
|
||||
- `amdsmi_io_link_type_t`, which is replaced with `amdsmi_link_type_t`.
|
||||
- `amdsmi_io_link_type_t` is no longer needed as `amdsmi_link_type_t` is sufficient.
|
||||
- `amdsmi_link_type_t` enum has changed.
|
||||
- This change will also affect `amdsmi_link_metrics_t`, where the `link_type` field changes from `amdsmi_io_link_type_t` to `amdsmi_link_type_t`.
|
||||
|
||||
- Removed `amdsmi_get_power_info_v2()`.
|
||||
- `amdsmi_get_power_info_v2()`.
|
||||
- The ``amdsmi_get_power_info()`` has been unified and the v2 function is no longer needed or used.
|
||||
|
||||
- Removed `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`.
|
||||
- `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`.
|
||||
|
||||
- The `amdsmi_get_gpu_vram_info` now provides vendor names as a string.
|
||||
- `amdsmi_vram_vendor_type_t` enum structure is removed.
|
||||
- `amdsmi_vram_info_t` member named `amdsmi_vram_vendor_type_t` is changed to a character string.
|
||||
- `amdsmi_get_gpu_vram_info` no longer requires decoding the vendor name as an enum.
|
||||
|
||||
- Removed backwards compatibility for `amdsmi_get_gpu_metrics_info()`'s,`jpeg_activity`and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy`.
|
||||
- Backwards compatibility for the `jpeg_activity` and `vcn_activity` fields of `amdsmi_get_gpu_metrics_info()`. Use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy` instead.
|
||||
- Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available.
|
||||
- Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion about which field to use. By removing backward compatibility, it is easier to identify the relevant field.
|
||||
- The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`.
|
||||
@@ -866,16 +875,17 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
|
||||
- `num_threads`: total number of threads in the group. The legacy API `size` is an alias.
|
||||
- `__reduce_add_sync`, `__reduce_min_sync`, and `__reduce_max_sync` functions added for arithmetic reduction across lanes of a warp, and `__reduce_and_sync`, `__reduce_or_sync`, and `__reduce_xor_sync`
|
||||
functions added for logical reduction. For details, see [Warp cross-lane functions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-cross-lane-functions).
|
||||
* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as the following. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html).
|
||||
* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as follows. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html).
|
||||
- Data types for `FP4`/`FP6`/`FP8`.
|
||||
- HIP APIs for `FP4`/`FP6`/`FP8`, which are compatible with corresponding CUDA APIs.
|
||||
- HIP Extensions APIs for microscaling formats, which are supported on AMD GPUs.
|
||||
* New `wptr` and `rptr` values in `ClPrint`, for better logging in dispatch barrier methods.
|
||||
* New debug mask, to print precise code object information for logging.
|
||||
* The `_sync()` version of crosslane builtins such as `shfl_sync()` are enabled by default. These can be disabled by setting the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`.
|
||||
* Added `constexpr` operators for `fp16`/`bf16`.
|
||||
* Added warp level primitives: `__syncwarp` and reduce intrinsics (e.g. `__reduce_add_sync()`)
|
||||
* Extended fine grained system memory pool.
|
||||
* Support for the following flags in APIs, which now allow uncached memory allocation:
|
||||
- `hipExtHostRegisterUncached`, used in `hipHostRegister`.
|
||||
- `hipHostMallocUncached` and `hipHostAllocUncached`, used in `hipHostMalloc` and `hipHostAlloc`.
|
||||
* `num_threads`: total number of threads in the group. The legacy API `size` is an alias.
|
||||
* Added PCI CHIP ID information as the device attribute.
|
||||
* Added new test applications for OCP data types `FP4`/`FP6`/`FP8`.
|
||||
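To make the reduce intrinsics listed above more concrete, the following is a host-side model of what a masked warp reduction computes. It is a minimal sketch in plain Python, not HIP device code: the function `warp_reduce_sync` is hypothetical, and it assumes that only lanes whose bit is set in the mask contribute and that every participating lane receives the same result. The changelog describes the `and`/`or`/`xor` variants as logical reductions; the sketch models them with bitwise operators, which is also an assumption. See the linked warp cross-lane functions documentation for the authoritative semantics.

```python
from functools import reduce
import operator


def warp_reduce_sync(mask, lane_values, op):
    """Model a masked reduction across the lanes of one warp.

    mask        -- integer bitmask; bit i set means lane i participates
    lane_values -- per-lane input values, indexed by lane number
    op          -- binary function such as operator.add, min, or operator.xor
    """
    participating = [
        value for lane, value in enumerate(lane_values) if (mask >> lane) & 1
    ]
    if not participating:
        raise ValueError("mask selects no lanes")
    return reduce(op, participating)


if __name__ == "__main__":
    lanes = [1, 2, 3, 4, 5, 6, 7, 8]
    # Analogue of __reduce_add_sync over lanes 0 to 3.
    print(warp_reduce_sync(0b00001111, lanes, operator.add))
    # Analogue of __reduce_min_sync over lanes 4 to 7.
    print(warp_reduce_sync(0b11110000, lanes, min))
    # Analogue of __reduce_xor_sync over lanes 0 to 3.
    print(warp_reduce_sync(0b00001111, lanes, operator.xor))
```

A model like this can serve as a reference when checking the output of a kernel that uses the intrinsics, since the expected value for any mask can be computed on the host.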
@@ -883,8 +893,10 @@ functions added for logical reduction. For details, see [Warp cross-lane functio
|
||||
|
||||
#### Changed
|
||||
* Some unsupported GPUs such as gfx9, gfx8 and gfx7 are deprecated on Microsoft Windows.
|
||||
* Removal of Beta warnings in HIP Graph APIs
|
||||
* Removal of beta warnings in HIP Graph APIs
|
||||
All beta warnings in the usage of HIP Graph APIs are removed; the APIs are now officially and fully supported.
|
||||
* `warpSize` has changed.
|
||||
In order to match the CUDA specification, the `warpSize` variable is no longer `constexpr`. In general, this should be a transparent change; however, if an application was using `warpSize` as a compile-time constant, it will have to be updated to handle the new definition. For more information, see the discussion of `warpSize` within the [HIP C++ language extensions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize).
|
||||
* Behavior changes
|
||||
- `hipGetLastError` now returns the error code which is the last actual error caught in the current thread during the application execution.
|
||||
- Additional input parameter validation checks are added for cooperative groups in the `hipLaunchCooperativeKernelMultiDevice` and `hipLaunchCooperativeKernel` functions.
|
||||
@@ -986,9 +998,6 @@ In order to match the CUDA runtime behavior more closely, HIP APIs with streams
|
||||
- Event Management Related APIs
|
||||
* `hipEventRecord`
|
||||
* `hipEventRecordWithFlags`
|
||||
* `warpSize` Change
|
||||
|
||||
In order to match the CUDA specification, the `warpSize` variable is no longer `constexpr`. In general, this should be a transparent change; however, if an application was using `warpSize` as a compile-time constant, it will have to be updated to handle the new definition. For more information, see either the discussion of `warpSize` within the [HIP C++ language extensions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize).
|
||||
|
||||
#### Optimized
|
||||
|
||||
@@ -1008,7 +1017,6 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
Developers can now use the environment variable `HSA_SCRATCH_SINGLE_LIMIT_ASYNC` to change the default allocation size with expected scratch limit in ROCR runtime. On top of it, this value can also be overwritten programmatically in the application using the HIP API `hipDeviceSetLimit(hipExtLimitScratchCurrent, value)` to reset the scratch limit value.
|
||||
* HIP runtime now enables peer-to-peer (P2P) memory copies to utilize all available SDMA engines, rather than being limited to a single engine. It also selects the best engine first to give optimal bandwidth.
|
||||
* Improved launch latency for `D2D` copies and `memset` on MI300 series.
|
||||
* Memory manager was implemented to improve the efficiency of memory usage and speed-up memory allocation/free in memory pools.
|
||||
* Introduced a threshold to handle the command submission patch to the GPU device(s), considering the synchronization with CPU, for performance improvement.
|
||||
|
||||
#### Resolved issues
|
||||
@@ -1024,6 +1032,11 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
* A numerical error/corruption found in PyTorch during graph replay. HIP runtime fixed the input sizes of kernel launch dimensions in `hipExtModuleLaunchKernel` for the execution of hipGraph capture.
|
||||
* A crash during kernel execution in a customer application. The structure of kernel arguments was updated to include the size of the kernel arguments, and HIP runtime now validates the structured arguments before launching the kernel.
|
||||
|
||||
#### Known issues
|
||||
|
||||
* `hipLaunchHostFunc` returns an error during stream capture. Any application using `hipLaunchHostFunc` might fail to capture graphs during stream capture and receive `hipErrorStreamCaptureUnsupported` instead.
|
||||
* Compilation failure in kernels via hipRTC when using the `std=c++11` option.
|
||||
|
||||
### **hipBLAS** (3.0.0)
|
||||
|
||||
#### Added
|
||||
@@ -1084,7 +1097,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Added
|
||||
|
||||
* Added a new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. Without compression, in some cases, the generated binary may become so large symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default.
|
||||
* Added a new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. Without compression, in some cases, the generated binary may become so large that symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default.
|
||||
* Added single pass operators in `agent/single_pass_scan_operators.hpp` which contains the following API:
|
||||
* `BlockScanRunningPrefixOp`
|
||||
* `ScanTileStatus`
|
||||
@@ -1100,7 +1113,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Removed
|
||||
|
||||
* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you would like to build for these architectures, please specify them explicitly in the `AMDGPU_TARGETS` cmake option.
|
||||
* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you want to build for these architectures, specify them explicitly in the `AMDGPU_TARGETS` cmake option.
|
||||
* Deprecated `hipcub::AsmThreadLoad` is removed, use `hipcub::ThreadLoad` instead.
|
||||
* Deprecated `hipcub::AsmThreadStore` is removed, use `hipcub::ThreadStore` instead.
|
||||
* Deprecated `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed.
|
||||
@@ -1250,7 +1263,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
* Added element-wise binary operation support.
|
||||
* Added element-wise trinary operation support.
|
||||
* Added support for new GPU target gfx950.
|
||||
* Added support for GPU target gfx950.
|
||||
* Added dynamic unary and binary operator support for element-wise operations and permutation.
|
||||
* Added a CMake check for `f8` datatype availability.
|
||||
* Added `hiptensorDestroyOperationDescriptor` to free all resources related to the provided descriptor.
|
||||
@@ -1292,7 +1305,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
#### Added
|
||||
|
||||
* Added the compiler `-gsplit-dwarf` option to enable the generation of separate debug information file at compile time. When used, separate debug information files are generated for host and for each offload architecture. For additional information, see [DebugFission](https://gcc.gnu.org/wiki/DebugFission).
|
||||
* Added `llvm-flang`, AMD's next generation Fortran compiler is a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub.
|
||||
* Added `llvm-flang`, AMD's next-generation Fortran compiler. It's a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub.
|
||||
* Added Comgr support for an in-memory virtual file system (VFS) for storing temporary files generated during intermediate compilation steps to improve performance in the device library link step.
|
||||
* Added compiler support of a new target-specific builtin `__builtin_amdgcn_processor_is` for late or deferred queries of the current target processor, and `__builtin_amdgcn_is_invocable` to determine the current target processor ability to invoke a particular builtin.
|
||||
* Added HIPIFY support for NVIDIA CUDA 12.9.1 APIs. Added support for all new device and host APIs, including FP4, FP6, and FP128, and support for the corresponding ROCm HIP equivalents.
|
||||
@@ -1424,11 +1437,11 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Known issues
|
||||
|
||||
* Installation on CentOS/RedHat/SLES requires the manual installation of the `FFMPEG` & `OpenCV` dev packages.
|
||||
* Installation on RHEL and SLES requires the manual installation of the `FFMPEG` and `OpenCV` dev packages.
|
||||
|
||||
#### Upcoming changes
|
||||
|
||||
* Optimized audio augmentations support for VX_RPP
|
||||
* Optimized audio augmentations support for VX_RPP.
|
||||
|
||||
### **RCCL** (2.26.6)
|
||||
|
||||
@@ -1476,7 +1489,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
|
||||
#### Known issues
|
||||
* Package installation on SLES requires manually installing `TurboJPEG`.
|
||||
* Package installation on CentOS, RedHat, and SLES requires manually installing the `FFMPEG Dev` package.
|
||||
* Package installation on RHEL and SLES requires manually installing the `FFMPEG Dev` package.
|
||||
|
||||
#### Upcoming changes
|
||||
|
||||
@@ -1656,7 +1669,7 @@ HIP runtime has the following functional improvements which improves runtime per
|
||||
* Individual `plugins`: The `plugins` (shared libraries) are available at: `/opt/rocm/lib/rocm_bandwidth_test/plugins/`
|
||||
|
||||
```{note}
|
||||
Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/rocm-rel-7.0/README.md) file for details about the new options and outputs.
|
||||
Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/amd-mainline/README.md) file for details about the new options and outputs.
|
||||
```
|
||||
|
||||
#### Changed
|
||||
@@ -1665,7 +1678,7 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
|
||||
#### Removed
|
||||
|
||||
- The old CLI, parameters, and switches used.
|
||||
- The old CLI, parameters, and switches.
|
||||
|
||||
### **ROCm Compute Profiler** (3.2.3)
|
||||
|
||||
@@ -1714,8 +1727,6 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
|
||||
* Support for Roofline plot on CLI (single run) analysis.
|
||||
|
||||
* Roofline support for RHEL 10 OS.
|
||||
|
||||
* `FP4` and `FP6` data types have been added for roofline profiling on AMD Instinct MI350 series.
|
||||
|
||||
##### rocprofv3 support
|
||||
@@ -1784,6 +1795,8 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
|
||||
* Memory chart on ROCm Compute Profiler CLI might look corrupted if the CLI width is too narrow.
|
||||
|
||||
* Roofline feature is currently not functional on Azure Linux 3.0 and Debian 12.
|
||||
|
||||
#### Upcoming changes
|
||||
|
||||
* ``rocprof v1/v2/v3`` interfaces will be removed in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. Using ``rocprof v1/v2/v3`` interfaces will trigger a deprecation warning.
|
||||
@@ -1829,7 +1842,7 @@ Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/release/roc
|
||||
#### Removed
|
||||
|
||||
- Removed backwards compatibility for `rsmi_dev_gpu_metrics_info_get()`'s `jpeg_activity` and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` and `xcp_stats.vcn_busy`.
|
||||
- Backwards compability is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available.
|
||||
- Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available.
|
||||
- Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion for users about which field to use. By removing backward compatibility, it is easier to identify the relevant field.
|
||||
- The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`.
|
||||
|
||||
@@ -1888,7 +1901,6 @@ See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/rele
|
||||
* Added new optimization to the backend for `device_transform` when the input and output are pointers.
|
||||
* Added `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers.
|
||||
* Added `rocprim:device_transform` for n-ary transform operations API with as input `n` number of iterators inside a `rocprim::tuple`.
|
||||
* Added gfx950 support.
|
||||
* Added `rocprim::key_value_pair::operator==`.
|
||||
* Added the `rocprim::unrolled_copy` thread function to copy multiple items inside a thread.
|
||||
* Added the `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`.
|
||||
@@ -1905,12 +1917,12 @@ See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/rele
|
||||
|
||||
#### Changed
|
||||
|
||||
* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits` respectively.
|
||||
* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits`, respectively.
|
||||
* Marked the initialisation constructor of `rocprim::reverse_iterator<Iter>` `explicit`, use `rocprim::make_reverse_iterator`.
|
||||
* Merged `radix_key_codec` into type_traits system.
|
||||
* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, and renamed the original `type_traits.hpp` to `type_traits_functions.hpp`.
|
||||
* The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change.
|
||||
The previous default accumulator types could lead to situations in which unexpected overflow occured, such as when the input or inital type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing:
|
||||
The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing:
|
||||
|
||||
* `rocprim::inclusive_scan`
|
||||
* Previous default: `class AccType = typename std::iterator_traits<InputIterator>::value_type>`
|
||||
@@ -1925,7 +1937,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
* Previous default: `class AccType = detail::input_type_t<InitValueType>>`
|
||||
* Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, rocprim::detail::input_type_t<InitValueType>>`
|
||||
* Undeprecated internal `detail::raw_storage`.
|
||||
* A new version of `rocprim::thread_load` and `rocprim::thread_store` replace the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result.
|
||||
* A new version of `rocprim::thread_load` and `rocprim::thread_store` replaces the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result.
|
||||
* Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better.
|
||||
* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocprim::symbol`, letting the user link multiple libraries built with different versions of rocPRIM.
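The accumulator-type change described earlier in this list can be illustrated on the host without rocPRIM. The following is a minimal sketch, not rocPRIM code: `inclusive_sum_scan` is a hypothetical helper whose first template parameter plays the role of `AccType`. It shows how a running sum held in a narrow accumulator wraps around, while a wider accumulator keeps the expected values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a rocPRIM API): a host-side inclusive sum scan
// whose accumulator type is chosen explicitly, mirroring the role of AccType.
template <class AccType, class InputType>
std::vector<AccType> inclusive_sum_scan(const std::vector<InputType>& input)
{
    std::vector<AccType> output;
    output.reserve(input.size());

    AccType acc{};  // running total, stored in the accumulator type
    for (const InputType& value : input)
    {
        // The sum is computed, then truncated to AccType. If AccType is too
        // narrow for the running total, the stored value wraps around.
        acc = static_cast<AccType>(acc + value);
        output.push_back(acc);
    }

    assert(output.size() == input.size());
    return output;
}
```

For the 8-bit input `{200, 100, 50}`, `inclusive_sum_scan<std::uint8_t>` yields `{200, 44, 94}` because 300 wraps modulo 256, whereas `inclusive_sum_scan<std::uint32_t>` yields `{200, 300, 350}`. Code that relied on the previous default may need to pass an explicit accumulator type to keep its old behavior.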
|
||||
|
||||
@@ -1950,7 +1962,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
* `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead.
|
||||
* Removed the deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead.
|
||||
* Removed `rocprim/detail/radix_sort.hpp`, functionality can now be found in `rocprim/thread/radix_key_codec.hpp`.
|
||||
* Removed C++14 support, only C++17 is supported.
|
||||
* Removed C++14 support. Only C++17 is supported.
|
||||
* Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed:
|
||||
* `rocprim::device_warp_size()`
|
||||
* For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory.
|
||||
@@ -1974,7 +1986,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Known issues
|
||||
|
||||
* * When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs.
|
||||
* When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs.
|
||||
|
||||
### **ROCprofiler-SDK** (1.0.0)
|
||||
|
||||
@@ -2214,7 +2226,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Resolved issues
|
||||
|
||||
* Fixed an issue with internal calls to unqualified `distance()` which would be ambigious due to also visibile implementation through ADL.
|
||||
* Fixed an issue with internal calls to unqualified `distance()` which would be ambiguous due to the visible implementation through ADL.
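The kind of ambiguity described above can be reproduced in plain C++. This is a sketch under stated assumptions, not the library's actual code: the `demo` namespace and both functions are hypothetical. It shows why qualifying an internal call avoids the clash that argument-dependent lookup (ADL) would otherwise introduce.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

namespace demo  // hypothetical library namespace
{
// A library-local distance() with the same shape as std::distance.
template <class Iterator>
std::ptrdiff_t distance(Iterator first, Iterator last)
{
    std::ptrdiff_t n = 0;
    for (; first != last; ++first)
    {
        ++n;
    }
    return n;
}

template <class Iterator>
std::ptrdiff_t count_items(Iterator first, Iterator last)
{
    // An unqualified call, distance(first, last), would also consider
    // std::distance through ADL when Iterator is associated with namespace
    // std. Two equally good candidates would make that call ambiguous.
    // Qualifying the call disables ADL and selects demo::distance only.
    return demo::distance(first, last);
}

// Small self-check that can be called from a test or from main().
inline void self_check()
{
    std::vector<int> values{1, 2, 3};
    assert(count_items(values.begin(), values.end()) == 3);
}
}  // namespace demo
```

Replacing `demo::distance(first, last)` with the unqualified `distance(first, last)` is expected to fail to compile for `std::vector` iterators, which is the failure mode the fix addresses.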
|
||||
|
||||
#### Known issues
|
||||
|
||||
@@ -2228,10 +2240,10 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Added
|
||||
|
||||
* Added internal register layout transforms to support interleaved MMA layouts.
|
||||
* Added support for the gfx950 target.
|
||||
* Added mixed input `BF8`/`FP8` types for MMA support.
|
||||
* Added fragment scheduler API objects to embed thread block cooperation properties in fragments.
|
||||
* Internal register layout transforms to support interleaved MMA layouts.
|
||||
* Support for the gfx950 target.
|
||||
* Mixed input `BF8`/`FP8` types for MMA support.
|
||||
* Fragment scheduler API objects to embed thread block cooperation properties in fragments.
|
||||
|
||||
#### Changed
|
||||
|
||||
@@ -2245,9 +2257,9 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Removed
|
||||
|
||||
* Removed support for the gfx940 and gfx941 targets.
|
||||
* Removed the rocWMMA cooperative API.
|
||||
* Removed wave count template parameters from transforms APIs.
|
||||
* Support for the gfx940 and gfx941 targets.
|
||||
* The rocWMMA cooperative API.
|
||||
* Wave count template parameters from transforms APIs.
|
||||
|
||||
#### Optimized
|
||||
|
||||
@@ -2274,7 +2286,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
* Handle creation and destruction APIs have been consolidated. Use `rppCreate()` for handle initialization and `rppDestroy()` for handle destruction.
|
||||
* The `logical_operations` function category has been renamed to `bitwise_operations`.
|
||||
* TurboJPEG package installation enabled for RPP Test Suite with `sudo apt-get install libturbojpeg0-dev`. Instructions have been updated in utilities/test_suite/README.md.
|
||||
* The `swap_channels` augmentation has been changed to `channel_permute`. `channel_permute` now also accepts a new argument, `permutationTensor` (pointer to a unsigned int tensor) that provides the permutation order to swap the RGB channels of each input image in the batch in any order:
|
||||
* The `swap_channels` augmentation has been changed to `channel_permute`. `channel_permute` now also accepts a new argument, `permutationTensor` (pointer to an unsigned int tensor), that provides the permutation order to swap the RGB channels of each input image in the batch in any order:
|
||||
|
||||
`RppStatus rppt_swap_channels_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, rppHandle_t rppHandle);`
|
||||
|
||||
@@ -2289,7 +2301,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
#### Resolved issues
|
||||
|
||||
* Test package - debian packages will install required dependencies.
|
||||
* Test package - Debian packages will install required dependencies.
|
||||
|
||||
### **Tensile** (4.44.0)
|
||||
|
||||
@@ -2299,7 +2311,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
- Added code object compression via bundling.
|
||||
- Added support for non-default HIP SDK installations on Windows.
|
||||
- Added master solution library documentation.
|
||||
- Added compiler version dependent assembler and architecture capabilities.
|
||||
- Added compiler version-dependent assembler and architecture capabilities.
|
||||
- Added documentation from GitHub Wiki to ROCm docs.
|
||||
|
||||
#### Changed
|
||||
@@ -2322,7 +2334,7 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
|
||||
- Fixed configure time path not being invoked at build.
|
||||
- Fixed find_package for msgpack to work with versions 5 and 6.
|
||||
- Fixed rhel9 testing.
|
||||
- Fixed RHEL 9 testing.
|
||||
- Fixed gfx908 builds.
|
||||
- Fixed the 'argument list too long' error.
|
||||
- Fixed version typo in 6.3 changelog.
|
||||
@@ -2333,6 +2345,26 @@ The previous default accumulator types could lead to situations in which unexpec
|
||||
ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known
|
||||
issues related to individual components, review the [Detailed component changes](#detailed-component-changes).
|
||||
|
||||
### A memory error in the kernel might lead to applications using the ROCr library being unresponsive
|
||||
|
||||
Applications using the ROCr library may become unresponsive if a memory error occurs in the launched kernel when the queue from which it was launched is destroyed. The application is unable to receive further signals, resulting in a stall. The issue will be fixed in a future ROCm release.
|
||||
|
||||
### Applications using stream capture APIs may fail during stream capture
|
||||
|
||||
Applications using ``hipLaunchHostFunc`` with stream capture APIs may fail to capture graphs during stream capture, and return `hipErrorStreamCaptureUnsupported`. This issue resulted from an update in ``hipStreamAddCallback``. This issue will be fixed in a future ROCm release.
|
||||
|
||||
### Compilation failure via hipRTC when compiling with std=c++11
|
||||
|
||||
Applications compiling kernels using `hipRTC` might fail while passing the `std=c++11` compiler option. This issue will be fixed in a future ROCm release.
|
||||
|
||||
### Compilation failure when referencing std::array if _GLIBCXX_ASSERTIONS is defined
|
||||
|
||||
Compiling from a device kernel or function results in failure when attempting to reference `std::array` if `_GLIBCXX_ASSERTIONS` is defined. The issue occurs because there's no device definition for `std::__glibcxx_assert_fail()`. This issue will be resolved in a future ROCm release with the implementation of `std::__glibcxx_assert_fail()`.
|
||||
|
||||
### Segmentation fault in ROCprofiler-SDK due to ABI mismatch affecting std::regex
|
||||
|
||||
Starting with GCC 5.1, GNU `libstdc++` introduced a dual Application Binary Interface (ABI) to adopt `C++11`, primarily affecting the `std::string` and its dependencies, including `std::regex`. If your code is compiled against headers expecting one ABI but linked or run with the other, it can cause problems with `std::string` and `std::regex`, leading to a segmentation fault in ROCprofiler-SDK, which uses `std::regex`. This issue is resolved in the [ROCm Systems `develop` branch](https://github.com/ROCm/rocm-systems) and will be part of a future ROCm release.
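To check which ABI a given translation unit is built with, the libstdc++ macro `_GLIBCXX_USE_CXX11_ABI` can be inspected. The following is a minimal sketch, not part of ROCprofiler-SDK: the helper names are hypothetical, and the macro is only defined when building against libstdc++.

```cpp
#include <cassert>
#include <string>  // any libstdc++ header defines _GLIBCXX_USE_CXX11_ABI

// Hypothetical helper: 1 = C++11 ABI, 0 = pre-C++11 ABI,
// -1 = macro not defined (for example, a different standard library).
constexpr int libstdcxx_abi_mode()
{
#if defined(_GLIBCXX_USE_CXX11_ABI)
    return _GLIBCXX_USE_CXX11_ABI ? 1 : 0;
#else
    return -1;
#endif
}

// Human-readable name for logging or diagnostics.
inline const char* libstdcxx_abi_name()
{
    switch (libstdcxx_abi_mode())
    {
        case 1: return "cxx11";
        case 0: return "pre-cxx11";
        default: return "unknown";
    }
}

// Aborts in debug builds if this translation unit was not compiled with the
// expected ABI mode.
inline void require_abi_mode(int expected)
{
    assert(libstdcxx_abi_mode() == expected);
}
```

Logging `libstdcxx_abi_name()` from both the application and any separately built plugin is one way to spot a mismatch before it shows up as a crash in `std::string` or `std::regex` code.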
|
||||
|
||||
## ROCm resolved issues
|
||||
|
||||
The following are previously known issues resolved in this release. For resolved issues related to
|
||||
@@ -2340,12 +2372,16 @@ individual components, review the [Detailed component changes](#detailed-compone
|
||||
|
||||
### Failure when using a generic target with compression and vice versa
|
||||
|
||||
An issue where compilation for generic target with compression failing has been resolved in this release. This issue resulted in you being unable to compile for a generic target and use compression simultaneously. See [GitHub issue #4602](https://github.com/ROCm/ROCm/issues/4602).
|
||||
An issue where compiling a generic target with compression failed has been resolved in this release. This issue prevented you from compiling a generic target and using compression simultaneously. See [GitHub issue #4602](https://github.com/ROCm/ROCm/issues/4602).
|
||||
|
||||
### Limited support for Sparse API and Pallas functionality in JAX
|
||||
|
||||
An issue where, due to limited support for Sparse API in JAX, some of the functionality of the Pallas extension was restricted has been resolved. See [GitHub issue #4608](https://github.com/ROCm/ROCm/issues/4608).
|
||||
|
||||
### Failure to use --kokkos-trace option in ROCm Compute Profiler
|
||||
|
||||
An issue where using the ``--kokkos-trace`` option resulted in a difference between the ``--kokkos-trace`` output and the ``counter_collection.csv`` output file has been resolved. Due to this issue, the program exited with a warning message if the ``--kokkos-trace`` option was detected in the ROCm Compute Profiler. The issue was caused by the partial implementation of ``--kokkos-trace`` in the ``rocprofv3`` tool. See [GitHub issue #4604](https://github.com/ROCm/ROCm/issues/4604).
|
||||
|
||||
## ROCm upcoming changes
|
||||
|
||||
The following changes to the ROCm software stack are anticipated for future releases.
|
||||
@@ -2395,7 +2431,7 @@ and `__AMDGCN_WAVEFRONT_SIZE__` macros are deprecated and will be disabled in a
|
||||
|
||||
### Changes to ROCm Object Tooling
|
||||
|
||||
ROCm Object Tooling tools ``roc-obj-ls``, ``roc-obj-extract``, and ``roc-obj`` are
|
||||
ROCm Object Tooling tools ``roc-obj-ls``, ``roc-obj-extract``, and ``roc-obj`` were
|
||||
deprecated in ROCm 6.4, and will be removed in a future release. Functionality
|
||||
has been added to the ``llvm-objdump --offloading`` tool option to extract all
|
||||
clang-offload-bundles into individual code objects found within the objects
|
||||
|
||||
@@ -2,15 +2,14 @@ ROCm Version,7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6
|
||||
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,,
|
||||
,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2"
|
||||
,,,,,,,,,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5"
|
||||
,RHEL 10,,,,,,,,,,,,,,,,,,
|
||||
,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2"
|
||||
,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8"
|
||||
,SLES 15 SP7,"SLES 15 SP7, SP6","SLES 15 SP7, SP6",SLES 15 SP6,SLES 15 SP6,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4"
|
||||
,,,,,,,,,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9
|
||||
,"Oracle Linux 10, 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,,
|
||||
,"Oracle Linux 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,,
|
||||
,Debian 12,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,,,,,,,,,,,
|
||||
,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,,,,,,,,,,,,
|
||||
,Rocky 9,,,,,,,,,,,,,,,,,,
|
||||
,Rocky Linux 9,,,,,,,,,,,,,,,,,,
|
||||
,.. _architecture-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
|
||||
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA4,,,,,,,,,,,,,,,,,,
|
||||
,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3
|
||||
@@ -39,19 +38,19 @@ ROCm Version,7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6
|
||||
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,
|
||||
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>`,N/A,N/A,N/A,N/A,N/A,0.7.0,0.7.0,0.7.0,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,
|
||||
:doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat]_,N/A,N/A,N/A,N/A,N/A,N/A,1.8.0b1,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,
|
||||
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.1,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
|
||||
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
|
||||
,,,,,,,,,,,,,,,,,,,
|
||||
,,,,,,,,,,,,,,,,,,,
|
||||
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
|
||||
`UCC <https://github.com/ROCm/ucc>`_,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0
|
||||
`UCX <https://github.com/ROCm/ucx>`_,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1
|
||||
`UCC <https://github.com/ROCm/ucc>`_,>=1.4.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0
|
||||
`UCX <https://github.com/ROCm/ucx>`_,>=1.17.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1
|
||||
,,,,,,,,,,,,,,,,,,,
|
||||
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
|
||||
Thrust,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
|
||||
CUB,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
|
||||
,,,,,,,,,,,,,,,,,,,
|
||||
KMD & USER SPACE [#kfd_support-past-60]_,.. _kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
|
||||
:doc:`KMD versions <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"7.0.x, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x"
|
||||
:doc:`KMD versions <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x"
|
||||
,,,,,,,,,,,,,,,,,,,
|
||||
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
|
||||
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0
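The matrix rows above pack several values into one quoted CSV cell (for example the KMD versions row). The following is a minimal sketch of how such a row could be read with Python's standard `csv` module. The helper names `parse_row` and `kmd_versions` are hypothetical, the row is shortened to three version columns, and which ROCm release each column stands for depends on the table header, which is not shown in this part of the diff.

```python
import csv
import io

# One row copied from the matrix above, shortened to the first three
# version columns for readability (the full row has many more cells).
ROW = (
    ":doc:`KMD versions <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"
    '"30.10, 6.4.x, 6.3.x, 6.2.x",'
    '"6.4.x, 6.3.x, 6.2.x, 6.1.x",'
    '"6.4.x, 6.3.x, 6.2.x, 6.1.x"'
)


def parse_row(line):
    """Split one CSV line into (label, cells); quoted cells keep their inner commas."""
    fields = next(csv.reader(io.StringIO(line)))
    return fields[0], fields[1:]


def kmd_versions(cell):
    """Break a multi-value cell such as '30.10, 6.4.x' into a list of version strings."""
    return [part.strip() for part in cell.split(",") if part.strip()]


label, cells = parse_row(ROW)
first_column = kmd_versions(cells[0])
```

Splitting on commas only after the CSV reader has removed the quotes keeps the inner commas of a cell from being mistaken for column separators.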
|
||||
|
||||
|
@@ -28,14 +28,13 @@ compatibility and system requirements.
|
||||
|
||||
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2
|
||||
,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5
|
||||
,RHEL 10,,
|
||||
,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.5, 9.4"
|
||||
,RHEL 8.10,RHEL 8.10,RHEL 8.10
|
||||
,SLES 15 SP7,"SLES 15 SP7, SP6","SLES 15 SP6, SP5"
|
||||
,"Oracle Linux 10, 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-mi300x]_",Oracle Linux 8.10 [#ol-mi300x]_
|
||||
,"Oracle Linux 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-mi300x]_",Oracle Linux 8.10 [#ol-mi300x]_
|
||||
,Debian 12,Debian 12 [#single-node]_,
|
||||
,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_,
|
||||
,Rocky 9,,
|
||||
,Rocky Linux 9,,
|
||||
,.. _architecture-support-compatibility-matrix:,,
|
||||
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA4,,
|
||||
,CDNA3,CDNA3,CDNA3
|
||||
@@ -64,18 +63,18 @@ compatibility and system requirements.
|
||||
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,N/A
|
||||
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>`,N/A,N/A,0.7.0
|
||||
:doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat]_,N/A,N/A,N/A
|
||||
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.1,1.20.0,1.17.3
|
||||
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.0,1.20.0,1.17.3
|
||||
,,,
|
||||
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix:,,
|
||||
`UCC <https://github.com/ROCm/ucc>`_,>=1.3.0,>=1.3.0,>=1.3.0
|
||||
`UCX <https://github.com/ROCm/ucx>`_,>=1.15.0,>=1.15.0,>=1.15.0
|
||||
`UCC <https://github.com/ROCm/ucc>`_,>=1.4.0,>=1.3.0,>=1.3.0
|
||||
`UCX <https://github.com/ROCm/ucx>`_,>=1.17.0,>=1.15.0,>=1.15.0
|
||||
,,,
|
||||
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix:,,
|
||||
Thrust,2.6.0,2.5.0,2.3.2
|
||||
CUB,2.6.0,2.5.0,2.3.2
|
||||
,,,
|
||||
KMD & USER SPACE [#kfd_support]_,.. _kfd-userspace-support-compatibility-matrix:,,
|
||||
:doc:`KMD versions <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"7.0.x, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x"
|
||||
:doc:`KMD versions <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x"
|
||||
,,,
|
||||
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix:,,
|
||||
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0
|
||||
@@ -146,7 +145,6 @@ compatibility and system requirements.
|
||||
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.1.0,2.0.4,2.0.3
|
||||
,,,
|
||||
COMPILERS,.. _compilers-support-compatibility-matrix:,,
|
||||
`clang-ocl <https://github.com/ROCm/clang-ocl>`_,N/A,N/A,N/A
|
||||
:doc:`hipCC <hipcc:index>`,1.1.1,1.1.1,1.1.1
|
||||
`Flang <https://github.com/ROCm/flang>`_,20.0.0.25314,19.0.0.25224,18.0.0.24455
|
||||
:doc:`llvm-project <llvm-project:index>`,20.0.0.25314,19.0.0.25224,18.0.0.24491
|
||||
@@ -160,7 +158,7 @@ compatibility and system requirements.
|
||||
|
||||
.. rubric:: Footnotes
|
||||
|
||||
.. [#ol-700-mi300x] **For ROCm 7.0** - Oracle Linux 10 and 9 are supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is only supported on AMD Instinct MI300X.
|
||||
.. [#ol-700-mi300x] **For ROCm 7.0** - Oracle Linux 9 is supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is only supported on AMD Instinct MI300X.
|
||||
.. [#ol-mi300x] **Prior to ROCm 7.0** - Oracle Linux is supported only on AMD Instinct MI300X.
|
||||
.. [#single-node] Debian 12 is supported only on AMD Instinct MI300X for single-node functionality.
|
||||
.. [#az-mi300x] Starting with ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710.
|
||||
@@ -188,8 +186,6 @@ Use this lookup table to confirm which operating system and kernel versions are
|
||||
,,
|
||||
`Ubuntu <https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle>`_, 22.04.5, "5.15 [GA], 6.8 [HWE]", 2.35
|
||||
,,
|
||||
`Red Hat Enterprise Linux (RHEL 10) <https://access.redhat.com/articles/3078#RHEL10>`_, 10, 6.12.0-55, 2.39
|
||||
,,
|
||||
`Red Hat Enterprise Linux (RHEL 9) <https://access.redhat.com/articles/3078#RHEL9>`_, 9.6, 5.14.0-570, 2.34
|
||||
,9.5, 5.14+, 2.34
|
||||
,9.4, 5.14.0-427, 2.34
|
||||
@@ -200,10 +196,9 @@ Use this lookup table to confirm which operating system and kernel versions are
|
||||
,15 SP6, "6.5.0+, 6.4.0", 2.38
|
||||
,15 SP5, 5.14.21, 2.31
|
||||
,,
|
||||
`Rocky <https://wiki.rockylinux.org/rocky/version/>`_, 9, 5.14.0-570, 2.34
|
||||
`Rocky Linux <https://wiki.rockylinux.org/rocky/version/>`_, 9, 5.14.0-570, 2.34
|
||||
,,
|
||||
`Oracle Linux <https://blogs.oracle.com/scoter/post/oracle-linux-and-unbreakable-enterprise-kernel-uek-releases>`_, 10, 6.12.0 (UEK), 2.39
|
||||
,9, 6.12.0 (UEK), 2.34
|
||||
`Oracle Linux <https://blogs.oracle.com/scoter/post/oracle-linux-and-unbreakable-enterprise-kernel-uek-releases>`_, 9, 6.12.0 (UEK), 2.34
|
||||
,8, 5.15.0 (UEK), 2.28
|
||||
,,
|
||||
`Debian <https://www.debian.org/download>`_,12, 6.1.0, 2.36
|
||||
@@ -241,7 +236,7 @@ Expand for full historical view of:
|
||||
|
||||
.. rubric:: Footnotes
|
||||
|
||||
.. [#ol-700-mi300x-past-60] **For ROCm 7.0** - Oracle Linux 10 and 9 are supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is only supported on AMD Instinct MI300X.
|
||||
.. [#ol-700-mi300x-past-60] **For ROCm 7.0** - Oracle Linux 9 is supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is only supported on AMD Instinct MI300X.
|
||||
.. [#mi300x-past-60] **Prior to ROCm 7.0** - Oracle Linux is supported only on AMD Instinct MI300X.
|
||||
.. [#single-node-past-60] Debian 12 is supported only on AMD Instinct MI300X for single-node functionality.
|
||||
.. [#az-mi300x-past-60] Starting with ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710.
|
||||
|
||||
@@ -27,7 +27,7 @@ with ROCm support:
|
||||
- Offers AMD-validated and community :ref:`Docker images <jax-docker-compat>`
|
||||
with ROCm and JAX preinstalled.
|
||||
|
||||
- ROCm JAX repository: `ROCm/jax <https://github.com/ROCm/jax>`_
|
||||
- ROCm JAX repository: `ROCm/rocm-jax <https://github.com/ROCm/rocm-jax>`_
|
||||
|
||||
- See the :doc:`ROCm JAX installation guide <rocm-install-on-linux:install/3rd-party/jax-install>`
|
||||
to get started.
|
||||
@@ -310,5 +310,54 @@ For a complete and up-to-date list of JAX public modules (for example, ``jax.num
|
||||
Since version 0.1.56, JAX has full support for ROCm, and the
|
||||
:ref:`Known issues and important notes <jax_comp_known_issues>` section
|
||||
contains details about limitations specific to the ROCm backend. The list of
|
||||
JAX API modules is maintained by the JAX project and is subject to change.
|
||||
JAX API modules is maintained by the JAX project and is subject to change.
|
||||
Refer to the official JAX documentation for the most up-to-date information.
|
||||
|
||||
Key features and enhancements for ROCm 7.0
===============================================================================

- Upgraded XLA backend: Integrates a newer XLA version, enabling better
  optimizations, broader operator support, and potential performance gains.

- RNN support: Native RNN support (including LSTMs via ``jax.experimental.rnn``)
  is now available on ROCm, aiding sequence model development.

- Comprehensive linear algebra capabilities: Offers robust ``jax.linalg``
  operations, essential for scientific and machine learning tasks.

- Expanded AMD GPU architecture support: Provides ongoing support for gfx1101
  GPUs and introduces support for gfx950 and gfx12xx GPUs.

- Mixed FP8 precision support: Enables ``lax.dot_general`` operations with mixed FP8
  types, offering pathways for memory and compute efficiency.

- Streamlined PyPI packaging: Provides reliable PyPI wheels for JAX on ROCm,
  simplifying the installation process.

- Pallas experimental kernel development: Continued Pallas framework
  enhancements for custom GPU kernels, including new intrinsics (specific
  kernel behaviors under review).

- Improved build system and CI: Enhanced ROCm build system and CI for greater
  reliability and maintainability.

- Enhanced distributed computing setup: Improved JAX setup in multi-GPU
  distributed environments.

.. _jax_comp_known_issues:

Known issues and notes for ROCm 7.0
===============================================================================

- ``nn.dot_product_attention``: Certain configurations of ``jax.nn.dot_product_attention``
  may cause segmentation faults, though the majority of use cases work correctly.

- SVD with dynamic shapes: SVD on inputs with dynamic/symbolic shapes might result in an error.
  SVD with static shapes is unaffected.

- QR decomposition with symbolic shapes: QR decomposition operations may fail when using
  symbolic/dynamic shapes in shape polymorphic contexts.

- Pallas kernels: Specific advanced Pallas kernels may exhibit variations in
  numerical output or resource usage. These are actively reviewed as part of
  Pallas's experimental development.
|
||||
|
||||
Binary file not shown.
|
Before Width: | Height: | Size: 81 KiB After Width: | Height: | Size: 114 KiB |
391
docs/data/reference/precision-support/precision-support.yaml
Normal file
@@ -0,0 +1,391 @@
# rocm-library-support.yaml
library_groups:
  - group: "ML & Computer Vision"
    tag: "ml-cv"
    libraries:
      - name: "Composable Kernel"
        tag: "composable-kernel"
        doc_link: "composable_kernel:reference/Composable_Kernel_supported_scalar_types"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "float4"
            support: "✅"
          - type: "float6 (E2M3)"
            support: "✅"
          - type: "float6 (E3M2)"
            support: "✅"
          - type: "float8 (E4M3)"
            support: "✅"
          - type: "float8 (E5M2)"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "MIGraphX"
        tag: "migraphx"
        doc_link: "amdmigraphx:reference/cpp"
        data_types:
          - type: "int8"
            support: "⚠️"
          - type: "int16"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "int64"
            support: "✅"
          - type: "float8 (E4M3)"
            support: "✅"
          - type: "float8 (E5M2)"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "MIOpen"
        tag: "miopen"
        doc_link: "miopen:reference/datatypes"
        data_types:
          - type: "int8"
            support: "⚠️"
          - type: "int32"
            support: "⚠️"
          - type: "float8 (E4M3)"
            support: "⚠️"
          - type: "float8 (E5M2)"
            support: "⚠️"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "⚠️"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "⚠️"

  - group: "Communication"
    tag: "communication"
    libraries:
      - name: "RCCL"
        tag: "rccl"
        doc_link: "rccl:api-reference/library-specification"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "int64"
            support: "✅"
          - type: "float8 (E4M3)"
            support: "✅"
          - type: "float8 (E5M2)"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

  - group: "Math Libraries"
    tag: "math-libs"
    libraries:
      - name: "hipBLAS"
        tag: "hipblas"
        doc_link: "hipblas:reference/data-type-support"
        data_types:
          - type: "float16"
            support: "⚠️"
          - type: "bfloat16"
            support: "⚠️"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "hipBLASLt"
        tag: "hipblaslt"
        doc_link: "hipblaslt:reference/data-type-support"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "float4"
            support: "✅"
          - type: "float6 (E2M3)"
            support: "✅"
          - type: "float6 (E3M2)"
            support: "✅"
          - type: "float8 (E4M3)"
            support: "✅"
          - type: "float8 (E5M2)"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"

      - name: "hipFFT"
        tag: "hipfft"
        doc_link: "hipfft:reference/fft-api-usage"
        data_types:
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "hipRAND"
        tag: "hiprand"
        doc_link: "hiprand:api-reference/data-type-support"
        data_types:
          - type: "int8"
            support: "Output only"
          - type: "int16"
            support: "Output only"
          - type: "int32"
            support: "Output only"
          - type: "int64"
            support: "Output only"
          - type: "float16"
            support: "Output only"
          - type: "float32"
            support: "Output only"
          - type: "float64"
            support: "Output only"

      - name: "hipSOLVER"
        tag: "hipsolver"
        doc_link: "hipsolver:reference/precision"
        data_types:
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "hipSPARSE"
        tag: "hipsparse"
        doc_link: "hipsparse:reference/precision"
        data_types:
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "hipSPARSELt"
        tag: "hipsparselt"
        doc_link: "hipsparselt:reference/data-type-support"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "float8 (E4M3)"
            support: "✅"
          - type: "float8 (E5M2)"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"

      - name: "rocBLAS"
        tag: "rocblas"
        doc_link: "rocblas:reference/data-type-support"
        data_types:
          - type: "float16"
            support: "⚠️"
          - type: "bfloat16"
            support: "⚠️"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "rocFFT"
        tag: "rocfft"
        doc_link: "rocfft:reference/api"
        data_types:
          - type: "float16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "rocRAND"
        tag: "rocrand"
        doc_link: "rocrand:api-reference/data-type-support"
        data_types:
          - type: "int8"
            support: "Output only"
          - type: "int16"
            support: "Output only"
          - type: "int32"
            support: "Output only"
          - type: "int64"
            support: "Output only"
          - type: "float16"
            support: "Output only"
          - type: "float32"
            support: "Output only"
          - type: "float64"
            support: "Output only"

      - name: "rocSOLVER"
        tag: "rocsolver"
        doc_link: "rocsolver:reference/precision"
        data_types:
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "rocSPARSE"
        tag: "rocsparse"
        doc_link: "rocsparse:reference/precision"
        data_types:
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "rocWMMA"
        tag: "rocwmma"
        doc_link: "rocwmma:api-reference/api-reference-guide"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int32"
            support: "Output only"
          - type: "float8 (E4M3)"
            support: "Input only"
          - type: "float8 (E5M2)"
            support: "Input only"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "tensorfloat32"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "Tensile"
        tag: "tensile"
        doc_link: "tensile:reference/precision-support"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "float8 (E4M3)"
            support: "✅"
          - type: "float8 (E5M2)"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "tensorfloat32"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

  - group: "Primitives"
    tag: "primitives"
    libraries:
      - name: "hipCUB"
        tag: "hipcub"
        doc_link: "hipcub:api-reference/data-type-support"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int16"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "int64"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "hipTensor"
        tag: "hiptensor"
        doc_link: "hiptensor:api-reference/api-reference"
        data_types:
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "rocPRIM"
        tag: "rocprim"
        doc_link: "rocprim:reference/data-type-support"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int16"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "int64"
            support: "✅"
          - type: "float16"
            support: "✅"
          - type: "bfloat16"
            support: "✅"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"

      - name: "rocThrust"
        tag: "rocthrust"
        doc_link: "rocthrust:data-type-support"
        data_types:
          - type: "int8"
            support: "✅"
          - type: "int16"
            support: "✅"
          - type: "int32"
            support: "✅"
          - type: "int64"
            support: "✅"
          - type: "float16"
            support: "⚠️"
          - type: "bfloat16"
            support: "⚠️"
          - type: "float32"
            support: "✅"
          - type: "float64"
            support: "✅"
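The nesting in the file above (groups, then libraries, then per-type support markers) lends itself to a flat lookup. The following is a rough sketch of that idea, assuming the YAML has already been loaded into Python dictionaries and lists. The excerpt is hand-copied, the helper names `build_support_index` and `support_for` are hypothetical, and the markers are carried over verbatim because their legend is not part of this file.

```python
# Hand-copied excerpt of the structure above. A real script would load the
# YAML file with a parser; that step is assumed and not shown here.
LIBRARY_GROUPS = [
    {
        "group": "Math Libraries",
        "tag": "math-libs",
        "libraries": [
            {
                "name": "hipBLAS",
                "tag": "hipblas",
                "data_types": [
                    {"type": "float16", "support": "⚠️"},
                    {"type": "bfloat16", "support": "⚠️"},
                    {"type": "float32", "support": "✅"},
                    {"type": "float64", "support": "✅"},
                ],
            },
            {
                "name": "hipFFT",
                "tag": "hipfft",
                "data_types": [
                    {"type": "float32", "support": "✅"},
                    {"type": "float64", "support": "✅"},
                ],
            },
        ],
    },
]


def build_support_index(groups):
    """Map each library tag to a {data type: support marker} dictionary."""
    index = {}
    for group in groups:
        for library in group["libraries"]:
            index[library["tag"]] = {
                entry["type"]: entry["support"] for entry in library["data_types"]
            }
    return index


def support_for(index, library_tag, data_type):
    """Return the support marker, or None when the library or type is not listed."""
    return index.get(library_tag, {}).get(data_type)


index = build_support_index(LIBRARY_GROUPS)
```

Returning `None` for a type that has no entry keeps "not listed" distinct from any explicit marker in the file.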
BIN
docs/data/rocm-software-stack-7_0_0.jpg
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 350 KiB |
@@ -24,7 +24,7 @@ If you’re new to ROCm, refer to the :doc:`ROCm quick start install guide for L
|
||||
If you’re using a Radeon GPU for graphics-accelerated applications, refer to the
|
||||
`Radeon installation instructions <https://rocm.docs.amd.com/projects/radeon/en/docs-6.1.3/docs/install/native_linux/install-radeon.html>`_.
|
||||
|
||||
You can install ROCm on :ref:`compatible systems <rocm-install-on-linux:reference/system-requirements>` via your Linux
|
||||
You can install ROCm on :doc:`compatible systems <rocm-install-on-linux:reference/system-requirements>` via your Linux
|
||||
distribution's package manager. See the following documentation resources to get started:
|
||||
|
||||
* :doc:`ROCm installation overview <rocm-install-on-linux:install/install-overview>`
|
||||
|
||||
@@ -65,7 +65,7 @@ ROCm documentation is organized into the following categories:
|
||||
* [ROCm libraries](./reference/api-libraries.md)
|
||||
* [ROCm tools, compilers, and runtimes](./reference/rocm-tools.md)
|
||||
* [Accelerator and GPU hardware specifications](./reference/gpu-arch-specs.rst)
|
||||
* [Precision support](./reference/precision-support.rst)
|
||||
* [Data types and precision support](./reference/precision-support.rst)
|
||||
* [Graph safe support](./reference/graph-safe-support.rst)
|
||||
<!-- markdownlint-enable MD051 -->
|
||||
:::
|
||||
|
||||
@@ -34,6 +34,40 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
|
||||
- SGPR File (KiB)
|
||||
- GFXIP Major version
|
||||
- GFXIP Minor version
|
||||
*
|
||||
- MI355X
|
||||
- CDNA4
|
||||
- gfx950
|
||||
- 288
|
||||
- 256 (32 per XCD)
|
||||
- 64
|
||||
- 160
|
||||
- 256
|
||||
- 32 (4 per XCD)
|
||||
- 32
|
||||
- 16 per 2 CUs
|
||||
- 64 per 2 CUs
|
||||
- 512
|
||||
- 12.5
|
||||
- 9
|
||||
- 5
|
||||
*
|
||||
- MI350X
|
||||
- CDNA4
|
||||
- gfx950
|
||||
- 288
|
||||
- 256 (32 per XCD)
|
||||
- 64
|
||||
- 160
|
||||
- 256
|
||||
- 32 (4 per XCD)
|
||||
- 32
|
||||
- 16 per 2 CUs
|
||||
- 64 per 2 CUs
|
||||
- 512
|
||||
- 12.5
|
||||
- 9
|
||||
- 5
|
||||
*
|
||||
- MI325X
|
||||
- CDNA3
|
||||
|
||||
File diff suppressed because it is too large
@@ -19,7 +19,7 @@ subtrees:
|
||||
|
||||
- caption: Install
|
||||
entries:
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/
|
||||
title: ROCm on Linux
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-windows/en/latest/
|
||||
title: HIP SDK on Windows
|
||||
@@ -182,7 +182,7 @@ subtrees:
|
||||
- file: reference/gpu-arch-specs.rst
|
||||
- file: reference/gpu-atomics-operation.rst
|
||||
- file: reference/precision-support.rst
|
||||
title: Precision support
|
||||
title: Data types and precision support
|
||||
- file: reference/graph-safe-support.rst
|
||||
title: Graph safe support
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@ ROCm is a software stack, composed primarily of open-source software, that
|
||||
provides the tools for programming AMD Graphics Processing Units (GPUs), from
|
||||
low-level kernels to high-level end-user applications.
|
||||
|
||||
.. image:: data/rocm-software-stack-6_4_0.jpg
|
||||
.. image:: data/rocm-software-stack-7_0_0.jpg
|
||||
:width: 800
|
||||
:alt: AMD's ROCm software stack and enabling technologies.
|
||||
:align: center
|
||||
|
||||