diff --git a/.wordlist.txt b/.wordlist.txt index cf9f990d4..9892e2915 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -62,6 +62,7 @@ CPU CPUs Cron CSC +CSDATA CSE CSV CSn @@ -81,6 +82,7 @@ CommonMark Concretized Conda ConnectX +CountOnes CuPy da Dashboarding @@ -97,6 +99,7 @@ DIMM DKMS DL DMA +DOMContentLoaded DNN DNNL DPM @@ -126,6 +129,7 @@ ESXi EoS fas FBGEMM +FIFOs FFT FFTs FFmpeg @@ -201,6 +205,7 @@ Higgs href Hyperparameters Huggingface +IB ICD ICT ICV @@ -209,8 +214,11 @@ IDEs IFWI IMDb IncDec +instrSize +interpolators IOMMU IOP +IOPS IOPM IOV IRQ @@ -247,12 +255,15 @@ LLM LLMs LLVM LM +LRU LSAN LSan LTS LSTMs +LteAll LanguageCrossEntropy LoRA +MECO MEM MERCHANTABILITY MFMA @@ -271,6 +282,7 @@ MNIST MPI MPT MSVC +mul MVAPICH MVFFR Makefile @@ -349,6 +361,7 @@ PCC PCI PCIe PEFT +perf PEQT PIL PILImage @@ -433,6 +446,7 @@ SKUs SLES SLURM SMEM +SMFMA SMI SMT SPI @@ -444,18 +458,23 @@ SWE SerDes ShareGPT Shlens +simd Skylake Softmax Spack SplitK Supermicro Szegedy +TagRAM TCA TCC +TCCs TCI TCIU TCP TCR +THREADGROUPS +threadgroups TensorRT TensorFloat TF @@ -499,6 +518,7 @@ UltraChat Uncached Unittests Unhandled +unwindowed VALU VBIOS VCN @@ -515,11 +535,13 @@ Vanhoucke Vulkan WGP WGPs +WR WX WikiText Wojna Workgroups Writebacks +xcc XCD XCDs XGBoost @@ -540,6 +562,7 @@ ZenDNN accuracies activations addr +addEventListener ade ai alloc @@ -555,6 +578,7 @@ autogenerated autotune avx awk +az backend backends bb @@ -572,6 +596,7 @@ boson bosons br BrainFloat +btn buildable bursty bzip @@ -583,6 +608,7 @@ centric changelog checkpointing chiplet +classList cmake cmd coalescable @@ -595,6 +621,7 @@ concretization config configs conformant +const constructible convolutional convolves @@ -658,6 +685,7 @@ exascale executables ffmpeg filesystem +forEach fortran fp framebuffer @@ -666,6 +694,7 @@ galb gcc gdb gemm +getAttribute gfortran gfx githooks @@ -775,6 +804,7 @@ opencv openmp openssl optimizers +ol os oversubscription pageable @@ -822,6 +852,8 @@ recommenders 
quantile quantizer quasirandom +querySelector +querySelectorAll queueing qwen radeon @@ -884,9 +916,11 @@ scalability scalable scipy seealso +selectedTag sendmsg seqs serializers +setAttribute sglang shader sharding @@ -913,6 +947,7 @@ symlink symlinks sys tabindex +targetContainer td tensorfloat th diff --git a/CHANGELOG.md b/CHANGELOG.md index 122110a7a..23ee67539 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,1672 @@ This page is a historical overview of changes made to ROCm components. This consolidated changelog documents key modifications and improvements across different versions of the ROCm software stack and its components. +## ROCm 7.0.0 + +See the [ROCm 7.0.0 release notes](https://rocm-stg.amd.com/en/latest/about/release-notes.html#rocm-7-0-0-release-notes) +for a complete overview of this release. + +### **AMD SMI** (26.0.0) + +#### Added + +* Ability to restart the AMD GPU driver from the CLI and API. + - `amdsmi_gpu_driver_reload()` API and `amd-smi reset --reload-driver` or `amd-smi reset -r` CLI options. + - Driver reload functionality is now separated from memory partition + functions; memory partition change requests should now be followed by a driver reload. + - Driver reload requires all GPU activity on all devices to be stopped. + +* Default command: + + A default view has been added. The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through alternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`. 
+ +* Support for GPU metrics 1.8: + - Added new fields for `amdsmi_gpu_xcp_metrics_t` including: + - Metrics to allow new calculations for violation status: + - Per XCP metrics `gfx_below_host_limit_ppt_acc[XCP][MAX_XCC]` - GFX Clock Host limit Package Power Tracking violation counts + - Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts + - Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to be below application clocks. + - Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see the new violation metrics above). + - Increased available JPEG engines to 40. Current ASICs might not support all 40. Unsupported engines are indicated as `UINT16_MAX` or `N/A` in the CLI. + +* Bad page threshold count. + - Added `amdsmi_get_gpu_bad_page_threshold` to Python API and CLI; root/sudo permissions are required to display the count. + +* CPU model name for RDC. + - Added new C and Python API `amdsmi_get_cpu_model_name`. + - Not sourced from esmi library. + +* New API `amdsmi_get_cpu_affinity_with_scope()`. + +* `socket power` to `amdsmi_get_power_info` + - Previously, the C API had the value in the `amdsmi_power_info` structure, but it was unused. + - The value is representative of the socket's power agnostic of the GPU version. + +* New event notification types to `amdsmi_evt_notification_type_t`. 
+ The following values were added to the `amdsmi_evt_notification_type_t` enum: + - `AMDSMI_EVT_NOTIF_EVENT_MIGRATE_START` + - `AMDSMI_EVT_NOTIF_EVENT_MIGRATE_END` + - `AMDSMI_EVT_NOTIF_EVENT_PAGE_FAULT_START` + - `AMDSMI_EVT_NOTIF_EVENT_PAGE_FAULT_END` + - `AMDSMI_EVT_NOTIF_EVENT_QUEUE_EVICTION` + - `AMDSMI_EVT_NOTIF_EVENT_QUEUE_RESTORE` + - `AMDSMI_EVT_NOTIF_EVENT_UNMAP_FROM_GPU` + - `AMDSMI_EVT_NOTIF_PROCESS_START` + - `AMDSMI_EVT_NOTIF_PROCESS_END` + +- Power cap to `amd-smi monitor`. + - `amd-smi monitor -p` will display the power cap along with power. + +#### Changed + +* Separated driver reload functionality from `amdsmi_set_gpu_memory_partition()` and + `amdsmi_set_gpu_memory_partition_mode()` APIs -- and from the CLI `amd-smi set -M `. + +* Disabled `amd-smi monitor --violation` on guests. Modified `amd-smi metric -T/--throttle` to alias to `amd-smi metric -v/--violation`. + +* Updated `amdsmi_get_clock_info` in `amdsmi_interface.py`. + - The `clk_deep_sleep` field now returns the sleep integer value. + +* The `amd-smi topology` command has been enabled for guest environments. + - This includes full functionality so users can use the command just as they would in bare metal environments. + +* Expanded violation status tracking for GPU metrics 1.8. + - The driver no longer supports the existing single-value GFX clock below host limit fields (`acc_gfx_clk_below_host_limit`, `per_gfx_clk_below_host_limit`, `active_gfx_clk_below_host_limit`); they are replaced by new per-XCP/XCC arrays. 
+ - Added new fields to `amdsmi_violation_status_t` and related interfaces for enhanced violation breakdown: + - Per-XCP/XCC accumulators and status for: + - GFX clock below host limit (power, thermal, and total) + - Low utilization + - Added 2D arrays to track per-XCP/XCC accumulators, percentage, and active status: + - `acc_gfx_clk_below_host_limit_pwr`, `acc_gfx_clk_below_host_limit_thm`, `acc_gfx_clk_below_host_limit_total` + - `per_gfx_clk_below_host_limit_pwr`, `per_gfx_clk_below_host_limit_thm`, `per_gfx_clk_below_host_limit_total` + - `active_gfx_clk_below_host_limit_pwr`, `active_gfx_clk_below_host_limit_thm`, `active_gfx_clk_below_host_limit_total` + - `acc_low_utilization`, `per_low_utilization`, `active_low_utilization` + - Python API and CLI now report these expanded fields. + +* The char arrays in the following structures have been changed. + - `amdsmi_vbios_info_t` member `build_date` changed from `AMDSMI_MAX_DATE_LENGTH` to `AMDSMI_MAX_STRING_LENGTH`. + - `amdsmi_dpm_policy_entry_t` member `policy_description` changed from `AMDSMI_MAX_NAME` to `AMDSMI_MAX_STRING_LENGTH`. + - `amdsmi_name_value_t` member `name` changed from `AMDSMI_MAX_NAME` to `AMDSMI_MAX_STRING_LENGTH`. + +* For backwards compatibility, updated `amdsmi_bdf_t` union to have an identical unnamed struct. + +* Updated `amdsmi_get_temp_metric` and `amdsmi_temperature_type_t` with new values. + - Added new values to `amdsmi_temperature_type_t` representing various baseboard and GPU board temperature measures. + - Updated `amdsmi_get_temp_metric` API to be able to take in and return the respective values for the new temperature types. + +#### Removed + +- Unnecessary API, `amdsmi_free_name_value_pairs()` + - This API is only used internally to free up memory from the Python interface and does not need to be + exposed to the user. 
+ +- Unused definitions: + - `AMDSMI_MAX_NAME`, `AMDSMI_256_LENGTH`, `AMDSMI_MAX_DATE_LENGTH`, `MAX_AMDSMI_NAME_LENGTH`, `AMDSMI_LIB_VERSION_YEAR`, + `AMDSMI_DEFAULT_VARIANT`, `AMDSMI_MAX_NUM_POWER_PROFILES`, `AMDSMI_MAX_DRIVER_VERSION_LENGTH`. + +- Unused member `year` in struct `amdsmi_version_t`. + +- `amdsmi_io_link_type_t` has been replaced with `amdsmi_link_type_t`. + - `amdsmi_io_link_type_t` is no longer needed as `amdsmi_link_type_t` is sufficient. + - `amdsmi_link_type_t` enum has changed; primarily, the ordering of the PCI and XGMI types. + - This change will also affect `amdsmi_link_metrics_t`, where the link_type field changes from `amdsmi_io_link_type_t` to `amdsmi_link_type_t`. + +- `amdsmi_get_power_info_v2()`. + - The ``amdsmi_get_power_info()`` has been unified and the v2 function is no longer needed or used. + +- `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`. + +- The `amdsmi_get_gpu_vram_info` now provides vendor names as a string. + - `amdsmi_vram_vendor_type_t` enum structure is removed. + - `amdsmi_vram_info_t` member named `amdsmi_vram_vendor_type_t` is changed to a character string. + - `amdsmi_get_gpu_vram_info` now no longer requires decoding the vendor name as an enum. + +- Backwards compatibility for `amdsmi_get_gpu_metrics_info()`'s,`jpeg_activity`and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy`. + - Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available. + - Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion about which field to use. By removing backward compatibility, it is easier to identify the relevant field. + - The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`. 
+ +#### Optimized + +- Reduced the number of API calls the ``amd-smi`` CLI makes before reading or (re)setting GPU features. This + improves overall runtime performance of the CLI. + +- Removed partition information from the default `amd-smi static` CLI command. + - Users can still retrieve the same data by calling `amd-smi`, `amd-smi static -p`, or `amd-smi partition -c -m`/`sudo amd-smi partition -a`. + - Reading `current_compute_partition` may momentarily wake the GPU up. This is due to reading XCD registers, which is expected behavior. Changing partitions is not a trivial operation; `current_compute_partition` in SYSFS controls this action. + +- Optimized CLI command `amd-smi topology` in partition mode. + - Reduced the number of `amdsmi_topo_get_p2p_status` API calls to one fourth. + +#### Resolved issues + +- Removed duplicated GPU IDs when receiving events using the `amd-smi event` command. + +- Fixed `amd-smi monitor` decoder utilization (`DEC%`) not showing up on MI300 Series ASICs. + +#### Known issues + +- `amd-smi monitor` on Linux Guest systems triggers an attribute error. + +```{note} +See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.0/CHANGELOG.md) for details, examples, and in-depth descriptions. +``` + +### **Composable Kernel** (1.1.0) + +#### Added + +* Support for `BF16`, `F32`, and `F16` for 2D and 3D NGCHW grouped convolution backward data. +* Fully asynchronous HOST (CPU) arguments copy flow for CK grouped GEMM kernels. +* Support for GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW; the number of instances in the instance factory for NGCHW/GKYXC/NGKHW has been reduced). +* Support for GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW). +* Support for GKCYX layout for grouped convolution backward weight (NGCHW/GKCYX/NGKHW). +* Support for GKCYX layout for grouped convolution backward data (NGCHW/GKCYX/NGKHW). +* Support for Stream-K version of mixed `FP8` / `BF16` GEMM. 
+* Support for Multiple D GEMM. +* GEMM pipeline for microscaling (MX) `FP8` / `FP6` / `FP4` data types. +* Support for `FP16` 2:4 structured sparsity to universal GEMM. +* Support for Split K for grouped convolution backward data. +* Logit soft-capping support for FMHA forward kernels. +* Support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv). +* Benchmarking support for tile engine GEMM. +* Ping-pong scheduler support for GEMM operation along the K dimension. +* Rotating buffer feature for CK_Tile GEMM. +* `int8` support for CK_TILE GEMM. +* Vectorize Transpose optimization for CK Tile. +* Asynchronous copy for gfx950. + +#### Changed + +* Replaced the raw buffer load/store intrinsics with Clang20 built-ins. +* DL and DPP kernels are now enabled by default. +* Number of instances in instance factory for grouped convolution forward NGCHW/GKYXC/NGKHW has been reduced. +* Number of instances in instance factory for grouped convolution backward weight NGCHW/GKYXC/NGKHW has been reduced. +* Number of instances in instance factory for grouped convolution backward data NGCHW/GKYXC/NGKHW has been reduced. + +#### Removed + +* Removed support for gfx940 and gfx941 targets. + +#### Optimized + +* Optimized the GEMM multiply preshuffle and LDS bypass with Pack of KGroup and better instruction layout. + +### **HIP** (7.0.0) + +#### Added + +* New HIP APIs + - `hipLaunchKernelEx` dispatches the provided kernel with the given launch configuration and forwards the kernel arguments. + - `hipLaunchKernelExC` launches a HIP kernel using a generic function pointer and the specified configuration. + - `hipDrvLaunchKernelEx` dispatches the device kernel represented by a HIP function object. + - `hipMemGetHandleForAddressRange` gets a handle for the address range requested. 
+ - `__reduce_add_sync`, `__reduce_min_sync`, and `__reduce_max_sync` functions added for arithmetic reduction across lanes of a warp, and `__reduce_and_sync`, `__reduce_or_sync`, and `__reduce_xor_sync` +functions added for logical reduction. For details, see [Warp cross-lane functions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-cross-lane-functions). +* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as follows. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html). + - Data types for `FP4`/`FP6`/`FP8`. + - HIP APIs for `FP4`/`FP6`/`FP8`, which are compatible with corresponding CUDA APIs. + - HIP Extensions APIs for microscaling formats, which are supported on AMD GPUs. +* New `wptr` and `rptr` values in `ClPrint`, for better logging in dispatch barrier methods. +* The `_sync()` versions of cross-lane builtins such as `shfl_sync()` are enabled by default. These can be disabled by setting the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`. +* Added `constexpr` operators for `fp16`/`bf16`. +* Added warp level primitives: `__syncwarp` and reduce intrinsics (e.g. `__reduce_add_sync()`). +* Support for the following flags, which allow uncached memory allocation: + - `hipExtHostRegisterUncached`, used in `hipHostRegister`. + - `hipHostMallocUncached` and `hipHostAllocUncached`, used in `hipHostMalloc` and `hipHostAlloc`. +* `num_threads`, which returns the total number of threads in the group. The legacy API `size` is an alias. +* Added PCI CHIP ID information as the device attribute. +* Added new test applications for OCP data types `FP4`/`FP6`/`FP8`. +* A new attribute in HIP runtime was implemented which exposes a new device capability of how many compute dies (chiplets, xcc) are available on a given GPU. 
Developers can get this attribute via the API `hipDeviceGetAttribute` to make the best use of cache locality in a kernel and optimize the kernel launch grid layout for better performance. + +#### Changed +* Some unsupported GPUs such as gfx9, gfx8 and gfx7 are deprecated on Microsoft Windows. +* Removal of beta warnings in HIP Graph APIs. All beta warnings in usage of HIP Graph APIs are removed; the APIs are now officially and fully supported. +* `warpSize` has changed. +In order to match the CUDA specification, the `warpSize` variable is no longer `constexpr`. In general, this should be a transparent change; however, if an application was using `warpSize` as a compile-time constant, it will have to be updated to handle the new definition. For more information, see the discussion of `warpSize` within the [HIP C++ language extensions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize). +* Behavior changes + - `hipGetLastError` now returns the error code which is the last actual error caught in the current thread during the application execution. + - Additional input parameter validation checks are added to the cooperative groups functions `hipLaunchCooperativeKernelMultiDevice` and `hipLaunchCooperativeKernel`. + - `hipPointerGetAttributes` returns `hipSuccess` instead of `hipErrorInvalidValue` when a `NULL` host or attribute pointer is passed as an input parameter. It now matches the functionality of `cudaPointerGetAttributes` which changed with CUDA 11 and above releases. + - `hipFree` previously performed an implicit wait for synchronization that applied to all memory allocations. This wait is now disabled for allocations made with `hipMallocAsync` and `hipMallocFromPoolAsync`, to match the behavior of CUDA API `cudaFree`. + - `hipFreeAsync` now returns `hipSuccess` when the input pointer is NULL, instead of `hipErrorInvalidValue`, to be consistent with `hipFree`. 
+ - Exceptions occurring during kernel execution no longer abort the process; they return an error unless core dump is enabled. +* Changes in hipRTC. + - Removal of `hipRTC` symbols from HIP Runtime Library. + Any application using `hipRTC` APIs should link explicitly with the `hipRTC` library. This makes the usage of `hipRTC` library on Linux the same as on Windows and matches the behavior of CUDA `nvRTC`. + - `hipRTC` compilation + The device code compilation now uses namespace `__hip_internal`, instead of the standard headers `std`, to avoid namespace collision. + - Changes of datatypes from `hipRTC`. + Datatype definitions such as `int64_t`, `uint64_t`, `int32_t`, and `uint32_t` are removed to avoid any potential conflicts in some applications. HIP now uses internal datatypes instead, prefixed with `__hip`, for example, `__hip_int64_t`. +* HIP header clean up + - Usage of STD headers: HIP header files now include only the necessary STL headers. + - Deprecated structure `HIP_MEMSET_NODE_PARAMS` is removed. Developers can use the definition `hipMemsetParams` instead. +* API signature/struct changes + - API signatures are adjusted in some APIs to match corresponding CUDA APIs. Impacted APIs are as follows: + * `hiprtcCreateProgram` + * `hiprtcCompileProgram` + * `hipMemcpyHtoD` + * `hipCtxGetApiVersion` + - The `hipMemsetParams` struct is updated and is now compatible with CUDA. + - HIP vector constructor change in `hipComplex` initialization now generates correct values. The affected constructors will be small vector types such as `float2`, `int4`, etc. +* Stream Capture updates + - Restricted stream capture mode. This is implemented in HIP APIs by adding the macro `CHECK_STREAM_CAPTURE_SUPPORTED()`. +In the previous HIP enumeration `hipStreamCaptureMode`, three capture modes were defined. With checking in the macro, the only supported stream capture mode is now `hipStreamCaptureModeRelaxed`. 
The rest are not supported, and the macro will return `hipErrorStreamCaptureUnsupported`. This update involves the following APIs, which are allowed only in relaxed stream capture mode: + * `hipMallocManaged` + * `hipMemAdvise` + - Stream capture mode checks. The following APIs check the stream capture mode and return error codes to match the behavior of CUDA. + * `hipLaunchCooperativeKernelMultiDevice` + * `hipEventQuery` + * `hipStreamAddCallback` + - Returns error during stream capture. The following HIP APIs now return the specific error `hipErrorStreamCaptureUnsupported` on the AMD platform, rather than always returning `hipSuccess`, to match behavior with CUDA: + * `hipDeviceSetMemPool` + * `hipMemPoolCreate` + * `hipMemPoolDestroy` + * `hipDeviceSetSharedMemConfig` + * `hipDeviceSetCacheConfig` + * `hipMemcpyWithStream` +* Error code update +Returned error/value codes are updated in the following HIP APIs to match the corresponding CUDA APIs. + - Module Management Related APIs: + * `hipModuleLaunchKernel` + * `hipExtModuleLaunchKernel` + * `hipExtLaunchKernel` + * `hipDrvLaunchKernelEx` + * `hipLaunchKernel` + * `hipLaunchKernelExC` + * `hipModuleLaunchCooperativeKernel` + * `hipModuleLoad` + - Texture Management Related APIs: +The following APIs update the return codes to match the behavior with CUDA: + * `hipTexObjectCreate` supports zero width and height for 2D images. If either is zero, it will not return `false`. + * `hipBindTexture2D` adds an extra check: if the pointer for the texture reference or device is NULL, it returns `hipErrorNotFound`. + * `hipBindTextureToArray` returns `hipErrorInvalidChannelDescriptor` instead of `hipErrorInvalidValue` if a NULL pointer is passed for the texture object, resource descriptor, or texture descriptor. + * `hipGetTextureAlignmentOffset` adds the return code `hipErrorInvalidTexture` for when the texture reference pointer is NULL. 
+ - Cooperative Group Related APIs, more validations are added in the following API implementation: + * `hipLaunchCooperativeKernelMultiDevice` + * `hipLaunchCooperativeKernel` +* Invalid stream input parameter handling +In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Previously, the HIP runtime returned an error code `hipErrorContextIsDestroyed` if the stream was invalid. In CUDA version 12 and later, the equivalent behavior is to raise a segmentation fault. HIP runtime now matches CUDA by causing a segmentation fault. The APIs impacted by this change are as follows: + - Stream Management Related APIs + * `hipStreamGetCaptureInfo` + * `hipStreamGetPriority` + * `hipStreamGetFlags` + * `hipStreamDestroy` + * `hipStreamAddCallback` + * `hipStreamQuery` + * `hipLaunchHostFunc` + - Graph Management Related APIs + * `hipGraphUpload` + * `hipGraphLaunch` + * `hipStreamBeginCaptureToGraph` + * `hipStreamBeginCapture` + * `hipStreamIsCapturing` + * `hipStreamGetCaptureInfo` + * `hipGraphInstantiateWithParams` + - Memory Management Related APIs + * `hipMemcpyPeerAsync` + * `hipMemcpy2DValidateParams` + * `hipMallocFromPoolAsync` + * `hipFreeAsync` + * `hipMallocAsync` + * `hipMemcpyAsync` + * `hipMemcpyToSymbolAsync` + * `hipStreamAttachMemAsync` + * `hipMemPrefetchAsync` + * `hipDrvMemcpy3D` + * `hipDrvMemcpy3DAsync` + * `hipDrvMemcpy2DUnaligned` + * `hipMemcpyParam2D` + * `hipMemcpyParam2DAsync` + * `hipMemcpy2DArrayToArray` + * `hipMemcpy2D` + * `hipMemcpy2DAsync` + * `hipDrvMemcpy2DUnaligned` + * `hipMemcpy3D` + - Event Management Related APIs + * `hipEventRecord` + * `hipEventRecordWithFlags` + +#### Optimized + +HIP runtime has the following functional improvements that improve runtime performance and user experience: + +* Reduced usage of the lock scope in events and kernel handling. 
+ - Switches to `shared_mutex` for event validation and uses `std::unique_lock` in HIP runtime to create/destroy events, instead of `scopedLock`. + - Reduces the use of `scopedLock` in the handling of kernel execution. HIP runtime now calls `scopedLock` during kernel binary creation/initialization and doesn't call it again during kernel vector iteration before launch. +* Unified the managed buffer and kernel argument buffer so HIP runtime doesn't need to create/load a separate kernel argument buffer. +* Refactored memory validation, creating a unique function to validate a variety of memory copy operations. +* Improved kernel logging by demangling shader names. +* Advanced support for SPIRV: kernel compilation caching is now enabled by default. This feature is controlled by the environment variable `AMD_COMGR_CACHE`. For details, see [hip_rtc document](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_rtc.html). +* Programmatic support for scratch limits on the AMD Instinct MI300 and MI350 Series GPU devices. More enumeration values were added in `hipLimit_t` as follows: + - `hipExtLimitScratchMin`, minimum allowed value in bytes for scratch limit on the device. + - `hipExtLimitScratchMax`, maximum allowed value in bytes for scratch limit on the device. + - `hipExtLimitScratchCurrent`, current scratch limit threshold in bytes on the device. Must be between the value `hipExtLimitScratchMin` and `hipExtLimitScratchMax`. + Developers can now use the environment variable `HSA_SCRATCH_SINGLE_LIMIT_ASYNC` to change the default allocation size with expected scratch limit in ROCR runtime. In addition, this value can also be overridden programmatically in the application using the HIP API `hipDeviceSetLimit(hipExtLimitScratchCurrent, value)` to reset the scratch limit value. +* HIP runtime now enables peer-to-peer (P2P) memory copies to utilize all available SDMA engines, rather than being limited to a single engine. 
It also selects the best engine first to give optimal bandwidth. +* Improved launch latency for `D2D` copies and `memset` on MI300 Series. +* Introduced a threshold to handle the command submission patch to the GPU device(s), considering the synchronization with CPU, for performance improvement. + +#### Resolved issues + +* Error of "unable to find modules" in HIP clean up for code object module. +* The issue of incorrect return error `hipErrorNoDevice`, when a crash occurred on GPU device due to illegal operation or memory violation. HIP runtime now handles the failure on the GPU side properly and reports the precise error code based on the last error seen on the GPU. +* Failures in some framework test applications. HIP runtime fixed the bug in retrieving a memory object from the IPC memory handle. +* A crash in a TensorFlow-related application. HIP runtime now combines multiple definitions of `callbackQueue` into a single function and, in case of an exception, passes its handler to the application and provides the corresponding error code. +* Fixed issue of handling the kernel parameters for the graph launch. +* Failures in roc-obj tools. HIP runtime now sends the `DEPRECATED` message in roc-obj tools to `STDERR`. +* Support of `hipDeviceMallocContiguous` flags in `hipExtMallocWithFlags()`. It now enables `HSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAG` in the memory pool allocation on GPU device. +* Compilation failure. HIP runtime refactored the vector type alignment with `__hip_vec_align_v`. +* A numerical error/corruption found in PyTorch during graph replay. HIP runtime fixed the input sizes of kernel launch dimensions in hipExtModuleLaunchKernel for the execution of hipGraph capture. +* A crash during kernel execution in a customer application. The structure of kernel arguments was updated by adding the size of kernel arguments, and HIP runtime now validates the structured arguments before launching the kernel. +* Compilation error when using bfloat16 functions. 
HIP runtime removed the anonymous namespace from FP16 functions to resolve this issue. + +#### Known issues + +* `hipLaunchHostFunc` returns an error during stream capture. Any application using `hipLaunchHostFunc` might fail to capture graphs during stream capture, instead, it returns `hipErrorStreamCaptureUnsupported`. +* Compilation failure in kernels via hiprtc when using option `std=c++11`. + +### **hipBLAS** (3.0.0) + +#### Added + +* Added the `hipblasSetWorkspace()` API. +* Support for codecoverage tests. + +#### Changed + +* HIPBLAS_V2 API is the only available API using the `hipComplex` and `hipDatatype` types. +* Documentation updates. +* Verbose compilation for `hipblas.cpp`. + +#### Removed + +* `hipblasDatatype_t` type. +* `hipComplex` and `hipDoubleComplex` types. +* Support code for non-production gfx targets. + +#### Resolved issues + +* The build time `CMake` configuration for the dependency on `hipBLAS-common` is fixed. +* Compiler warnings for unhandled enumerations have been resolved. + +### **hipBLASLt** (1.0.0) + +#### Added + +* Stream-K GEMM support has been enabled for the `FP32`, `FP16`, `BF16`, `FP8`, and `BF8` data types on the Instinct MI300A APU. To activate this feature, set the `TENSILE_SOLUTION_SELECTION_METHOD` environment variable to `2`, for example, `export TENSILE_SOLUTION_SELECTION_METHOD=2`. +* Fused Swish/SiLU GEMM (enabled by ``HIPBLASLT_EPILOGUE_SWISH_EXT`` and ``HIPBLASLT_EPILOGUE_SWISH_BIAS_EXT``). +* Support for ``HIPBLASLT_EPILOGUE_GELU_AUX_BIAS`` for gfx942. +* `HIPBLASLT_TUNING_USER_MAX_WORKSPACE` to constrain the maximum workspace size for user offline tuning. +* ``HIPBLASLT_ORDER_COL16_4R16`` and ``HIPBLASLT_ORDER_COL16_4R8`` to ``hipblasLtOrder_t`` to support `FP16`/`BF16` swizzle GEMM and `FP8` / `BF8` swizzle GEMM respectively. +* TF32 emulation on gfx950. +* Support for `FP6`, `BF6`, and `FP4` on gfx950. 
+* Support for block scaling by setting `HIPBLASLT_MATMUL_DESC_A_SCALE_MODE` and `HIPBLASLT_MATMUL_DESC_B_SCALE_MODE` to `HIPBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0`. + +#### Changed + +* The non-V2 APIs (``GemmPreference``, ``GemmProblemType``, ``GemmEpilogue``, ``GemmTuning``, ``GemmInputs``) in the cpp header are now the same as the V2 APIs (``GemmPreferenceV2``, ``GemmProblemTypeV2``, ``GemmEpilogueV2``, ``GemmTuningV2``, ``GemmInputsV2``). The original non-V2 APIs are removed. + +#### Removed + +* ``HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER_VEC_EXT`` and ``HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER_VEC_EXT`` are removed. Use the ``HIPBLASLT_MATMUL_DESC_A_SCALE_MODE`` and ``HIPBLASLT_MATMUL_DESC_B_SCALE_MODE`` attributes to set scalar (``HIPBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F``) or vector (``HIPBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F``) attributes. +* The `hipblasltExtAMaxWithScale` API is removed. + +#### Optimized + +* Improved performance for 8-bit (`FP8` / `BF8` / `I8`) NN/NT cases by adding ``s_delay_alu`` to reduce stalls from dependent ALU operations on gfx12+. +* Improved performance for 8-bit and 16-bit (`FP16` / `BF16`) TN cases by enabling software dependency checks (Expert Scheduling Mode) under certain restrictions to reduce redundant hardware dependency checks on gfx12+. +* Improved performance for 8-bit, 16-bit, and 32-bit batched GEMM with a better heuristic search algorithm for gfx942. + +#### Upcoming changes + +* V2 APIs (``GemmPreferenceV2``, ``GemmProblemTypeV2``, ``GemmEpilogueV2``, ``GemmTuningV2``, ``GemmInputsV2``) are deprecated. + +### **hipCUB** (4.0.0) + +#### Added + +* A new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. 
Without compression, in some cases, the generated binary may become so large that symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default. +* Single pass operators in `agent/single_pass_scan_operators.hpp` which contains the following API: + * `BlockScanRunningPrefixOp` + * `ScanTileStatus` + * `ScanTileState` + * `ReduceByKeyScanTileState` + * `TilePrefixCallbackOp` +* Support for gfx950. +* An overload of `BlockScan::InclusiveScan` that accepts an initial value to seed the scan. +* An overload of `WarpScan::InclusiveScan` that accepts an initial value to seed the scan. +* `UnrolledThreadLoad`, `UnrolledCopy`, and `ThreadLoadVolatilePointer` were added to align hipCUB with CUB. +* `ThreadStoreVolatilePtr` and the `IterateThreadStore` struct were added to align hipCUB with CUB. +* `hipcub::InclusiveScanInit` for CUB parity. + +#### Changed + +* The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.7.0. If they aren't found, they will be downloaded from the NVIDIA CCCL repository. +* Updated `thread_load` and `thread_store` to align hipCUB with CUB. +* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, (for example, `hipcub::HIPCUB_300400_NS::symbol` instead of `hipcub::symbol`), letting the user link multiple libraries built with different versions of hipCUB. +* Modified the broadcast kernel in warp scan benchmarks. The reported performance may be different to previous versions. +* The `hipcub::detail::accumulator_t` in rocPRIM backend has been changed to utilise `rocprim::accumulator_t`. +* The usage of `rocprim::invoke_result_binary_op_t` has been replaced with `rocprim::accumulator_t`. + +#### Removed + +* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you want to build for these architectures, specify them explicitly in the `AMDGPU_TARGETS` cmake option. 
+* Deprecated `hipcub::AsmThreadLoad` is removed; use `hipcub::ThreadLoad` instead. +* Deprecated `hipcub::AsmThreadStore` is removed; use `hipcub::ThreadStore` instead. +* Deprecated `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed. +* This release removes support for custom builds on gfx940 and gfx941. +* Removed C++14 support. Only C++17 is supported. + +#### Resolved issues + +* Fixed an issue where `Sort(keys, compare_op, valid_items, oob_default)` in `block_merge_sort.hpp` would not fill in elements that are out of range (items after `valid_items`) with `oob_default`. +* Fixed an issue where `ScatterToStripedFlagged` in `block_exchange.hpp` was calling the wrong function. + +#### Known issues + +* `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed from hipCUB's CUB backend. They were already deprecated as of version 2.12.0 of hipCUB and they were removed from CCCL (CUB) as of CCCL's 2.6.0 release. +* `BlockScan::InclusiveScan` for the NVIDIA backend does not compute the block aggregate correctly when passing an initial value parameter. This behavior is not matched by the AMD backend. + +#### Upcoming changes + +* `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` were deprecated as of version 2.12.0 of hipCUB, and will be removed from the rocPRIM backend in a future release for the next ROCm major version (ROCm 7.0.0). + +### **hipFFT** (1.0.20) + +#### Added + +* Support for gfx950. + +#### Removed + +* Removed hipfft-rider legacy compatibility from clients. +* Removed support for the gfx940 and gfx941 targets from the client programs. +* Removed backward compatibility symlink for include directories.
+ +### **hipfort** (0.7.0) + +#### Added + +* Documentation clarifying how hipfort is built for the NVIDIA platform. + +#### Changed + +* Updated and reorganized documentation for clarity and consistency. + +### **HIPIFY** (20.0.0) + +#### Added + +* CUDA 12.9.1 support. +* cuDNN 9.11.0 support. +* cuTENSOR 2.2.0.0 support. +* LLVM 20.1.8 support. + +#### Resolved issues + +* `hipDNN` support is removed by default. +* [#1859](https://github.com/ROCm/HIPIFY/issues/1859)[hipify-perl] Fix warnings on unsupported Driver or Runtime APIs which were erroneously not reported. +* [#1930](https://github.com/ROCm/HIPIFY/issues/1930) Revise `JIT API`. +* [#1962](https://github.com/ROCm/HIPIFY/issues/1962) Support for cuda-samples helper headers. +* [#2035](https://github.com/ROCm/HIPIFY/issues/2035) Removed `const_cast;` in `hiprtcCreateProgram` and `hiprtcCompileProgram`. + +### **hipRAND** (3.0.0) + +#### Added + +* Support for gfx950. + +#### Changed + +* Deprecated the hipRAND Fortran API in favor of hipfort. + +#### Removed + +* Removed C++14 support, so only C++17 is supported. + +### **hipSOLVER** (3.0.0) + +#### Added + +* Added compatibility-only functions: + * csrlsvqr + * `hipsolverSpCcsrlsvqr`, `hipsolverSpZcsrlsvqr` + +#### Resolved issues + +* Corrected the value of `lwork` returned by various `bufferSize` functions to be consistent with NVIDIA cuSOLVER. The following functions now return `lwork` so that the workspace size (in bytes) is `sizeof(T) * lwork`, rather than `lwork`. To restore the original behavior, set the environment variable `HIPSOLVER_BUFFERSIZE_RETURN_BYTES`. 
+ * `hipsolverXorgbr_bufferSize`, `hipsolverXorgqr_bufferSize`, `hipsolverXorgtr_bufferSize`, `hipsolverXormqr_bufferSize`, `hipsolverXormtr_bufferSize`, `hipsolverXgesvd_bufferSize`, `hipsolverXgesvdj_bufferSize`, `hipsolverXgesvdBatched_bufferSize`, `hipsolverXgesvdaStridedBatched_bufferSize`, `hipsolverXsyevd_bufferSize`, `hipsolverXsyevdx_bufferSize`, `hipsolverXsyevj_bufferSize`, `hipsolverXsyevjBatched_bufferSize`, `hipsolverXsygvd_bufferSize`, `hipsolverXsygvdx_bufferSize`, `hipsolverXsygvj_bufferSize`, `hipsolverXsytrd_bufferSize`, `hipsolverXsytrf_bufferSize`. + +### **hipSPARSE** (4.0.1) + +#### Added + +* `int8`, `int32`, and `float16` data types to `hipDataTypeToHCCDataType` so that sparse matrix descriptors can be used with them. +* Half float mixed precision to `hipsparseAxpby` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `hipsparseSpVV` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `hipsparseSpMM` where A and B use `float16` and C and the compute type use `float`. +* Half float mixed precision to `hipsparseSDDMM` where A and B use `float16` and C and the compute type use `float`. +* Half float uniform precision to the `hipsparseScatter` and `hipsparseGather` routines. +* Half float uniform precision to the `hipsparseSDDMM` routine. +* `int8` precision to the `hipsparseCsr2cscEx2` routine. +* The `almalinux` operating system name to correct the GFortran dependency. + +#### Changed + +* Switched to defaulting to C++17 when building hipSPARSE from source. Previously hipSPARSE was using C++14 by default. + +#### Resolved issues + +* Fixed a compilation [issue](https://github.com/ROCm/hipSPARSE/issues/555) related to using `std::filesystem` and C++14. +* Fixed an issue where the clients-common package was empty by moving the `hipsparse_clientmatrices.cmake` and `hipsparse_mtx2csr` files to it. 
+ +#### Known issues + +* In `hipsparseSpSM_solve()`, the external buffer is passed as a parameter. This does not match the NVIDIA CUDA cuSPARSE API. This extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and `nullptr` passed in because it is unused internally. + +### **hipSPARSELt** (0.2.4) + +#### Added + +* Support for the LLVM target gfx950. +* Support for the following data type combinations for the LLVM target gfx950: + * `FP8`(E4M3) inputs, `F32` output, and `F32` Matrix Core accumulation. + * `BF8`(E5M2) inputs, `F32` output, and `F32` Matrix Core accumulation. +* Support for ROC-TX if `HIPSPARSELT_ENABLE_MARKER=1` is set. +* Support for the cuSPARSELt v0.6.3 backend. + +#### Removed + +* Support for LLVM targets gfx940 and gfx941 has been removed. +* `hipsparseLtDatatype_t` has been removed. + +#### Optimized + +* Improved the library loading time. +* Provided more kernels for the `FP16` data type. + +### **hipTensor** (2.0.0) + +#### Added + +* Element-wise binary operation support. +* Element-wise trinary operation support. +* Support for GPU target gfx950. +* Dynamic unary and binary operator support for element-wise operations and permutation. +* CMake check for `f8` datatype availability. +* `hiptensorDestroyOperationDescriptor` to free all resources related to the provided descriptor. +* `hiptensorOperationDescriptorSetAttribute` to set an attribute of a `hiptensorOperationDescriptor_t` object. +* `hiptensorOperationDescriptorGetAttribute` to retrieve an attribute of the provided `hiptensorOperationDescriptor_t` object. +* `hiptensorCreatePlanPreference` to allocate the `hiptensorPlanPreference_t` and enable users to limit the applicable kernels for a given plan or operation. +* `hiptensorDestroyPlanPreference` to free all resources related to the provided preference. +* `hiptensorPlanPreferenceSetAttribute` to set an attribute of a `hiptensorPlanPreference_t` object.
+* `hiptensorPlanGetAttribute` to retrieve information about an already-created plan. +* `hiptensorEstimateWorkspaceSize` to determine the required workspace size for the given operation. +* `hiptensorCreatePlan` to allocate a `hiptensorPlan_t` object, select an appropriate kernel for a given operation and prepare a plan that encodes the execution. +* `hiptensorDestroyPlan` to free all resources related to the provided plan. + +#### Changed + +* Removed architecture support for gfx940 and gfx941. +* Generalized opaque buffer for any descriptor. +* Replaced `hipDataType` with `hiptensorDataType_t` for all supported types, for example, `HIP_R_32F` to `HIPTENSOR_R_32F`. +* Replaced `hiptensorComputeType_t` with `hiptensorComputeDescriptor_t` for all supported types. +* Replaced `hiptensorInitTensorDescriptor` with `hiptensorCreateTensorDescriptor`. +* Changed handle type and API usage from `*handle` to `handle`. +* Replaced `hiptensorContractionDescriptor_t` with `hiptensorOperationDescriptor_t`. +* Replaced `hiptensorInitContractionDescriptor` with `hiptensorCreateContraction`. +* Replaced `hiptensorContractionFind_t` with `hiptensorPlanPreference_t`. +* Replaced `hiptensorInitContractionFind` with `hiptensorCreatePlanPreference`. +* Replaced `hiptensorContractionGetWorkspaceSize` with `hiptensorEstimateWorkspaceSize`. +* Replaced `HIPTENSOR_WORKSPACE_RECOMMENDED` with `HIPTENSOR_WORKSPACE_DEFAULT`. +* Replaced `hiptensorContractionPlan_t` with `hiptensorPlan_t`. +* Replaced `hiptensorInitContractionPlan` with `hiptensorCreatePlan`. +* Replaced `hiptensorContraction` with `hiptensorContract`. +* Replaced `hiptensorPermutation` with `hiptensorPermute`. +* Replaced `hiptensorReduction` with `hiptensorReduce`. +* Replaced `hiptensorElementwiseBinary` with `hiptensorElementwiseBinaryExecute`. +* Replaced `hiptensorElementwiseTrinary` with `hiptensorElementwiseTrinaryExecute`. +* Removed function `hiptensorReductionGetWorkspaceSize`.
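
As a rough aid for porting code across the hipTensor renames listed above, the following sketch rewrites a subset of the old identifiers in source text. The helper name `rename_hiptensor_symbols` is hypothetical and not part of hipTensor. The sketch only renames symbols; it does not adapt call sites whose arguments or handle usage changed, so treat its output as a starting point for manual review.

```python
import re

# Subset of the hipTensor 2.0.0 renames listed above (old name -> new name).
RENAMES = {
    "hiptensorInitTensorDescriptor": "hiptensorCreateTensorDescriptor",
    "hiptensorInitContractionDescriptor": "hiptensorCreateContraction",
    "hiptensorContractionFind_t": "hiptensorPlanPreference_t",
    "hiptensorInitContractionFind": "hiptensorCreatePlanPreference",
    "hiptensorContractionGetWorkspaceSize": "hiptensorEstimateWorkspaceSize",
    "HIPTENSOR_WORKSPACE_RECOMMENDED": "HIPTENSOR_WORKSPACE_DEFAULT",
    "hiptensorContractionPlan_t": "hiptensorPlan_t",
    "hiptensorInitContractionPlan": "hiptensorCreatePlan",
    "hiptensorContraction": "hiptensorContract",
    "hiptensorPermutation": "hiptensorPermute",
    "hiptensorReduction": "hiptensorReduce",
}

# Match whole identifiers only, so that `hiptensorContraction` does not
# rewrite the prefix of `hiptensorContractionPlan_t` or of the removed
# `hiptensorReductionGetWorkspaceSize`.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def rename_hiptensor_symbols(source: str) -> str:
    """Return `source` with the old identifiers replaced by the new ones."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)
```

The word-boundary match is the main design choice here: several old names are prefixes of other old names, so a plain substring replacement would corrupt the longer identifiers.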
+ +### **llvm-project** (20.0.0) + +#### Added + +* The compiler `-gsplit-dwarf` option to enable the generation of separate debug information file at compile time. When used, separate debug information files are generated for host and for each offload architecture. For additional information, see [DebugFission](https://gcc.gnu.org/wiki/DebugFission). +* `llvm-flang`, AMD's next-generation Fortran compiler. It's a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub. +* Comgr support for an in-memory virtual file system (VFS) for storing temporary files generated during intermediate compilation steps to improve performance in the device library link step. +* Compiler support of a new target-specific builtin `__builtin_amdgcn_processor_is` for late or deferred queries of the current target processor, and `__builtin_amdgcn_is_invocable` to determine the current target processor ability to invoke a particular builtin. +* HIPIFY support for NVIDIA CUDA 12.9.1 APIs. Added support for all new device and host APIs, including FP4, FP6, and FP128, and support for the corresponding ROCm HIP equivalents. + +#### Changed + +* Updated clang/llvm to AMD clang version 20.0.0 (equivalent to LLVM 20.0.0 with additional out-of-tree patches). +* HIPCC Perl scripts (`hipcc.pl` and `hipconfig.pl`) have been removed from this release. + +#### Optimized + +* Improved compiler memory load and store instructions. + +#### Upcoming changes + +* `__AMDGCN_WAVEFRONT_SIZE__` macro and HIP’s `warpSize` variable as `constexpr` are deprecated and will be disabled in a future release. Users are encouraged to update their code if needed to ensure future compatibility. For more information, see [AMDGCN_WAVEFRONT_SIZE deprecation](#amdgpu-wavefront-size-compiler-macro-deprecation). +* The `roc-obj-ls` and `roc-obj-extract` tools are deprecated. To extract all Clang offload bundles into separate code objects use `llvm-objdump --offloading `. 
For more information, see [Changes to ROCm Object Tooling](#changes-to-rocm-object-tooling). + +### **MIGraphX** (2.13.0) + +#### Added + +* Support for OCP `FP8` on AMD Instinct MI350X GPUs. +* Support for PyTorch 2.7 via Torch-MIGraphX. +* Support for the Microsoft ONNX Contrib Operators (Self) Attention, RotaryEmbedding, QuickGelu, BiasAdd, BiasSplitGelu, SkipLayerNorm. +* Support for Sigmoid and AddN TensorFlow operators. +* GroupQuery Attention support for LLMs. +* Support for edge mode in the ONNX Pad operator. +* ONNX runtime Python driver. +* FLUX e2e example. +* C++ and Python APIs to save arguments to a graph as a msgpack file, and then read the file back. +* rocMLIR fusion for kv-cache attention. +* Introduced a check for file-write errors. + +#### Changed + +* `quantize_bf16` for quantizing the model to `BF16` has been made visible in the MIGraphX user API. +* Additional kernel/module information is now printed in the event of a compile failure. +* hipBLASLt is now used instead of rocBLAS on newer GPUs. +* 1x1 convolutions are now rewritten to GEMMs. +* `BF16::max` is now represented by its encoding rather than its expected value. +* Direct warnings now go to `cout` rather than `cerr`. +* `FP8` uses hipBLASLt rather than rocBLAS. +* ONNX models are now topologically sorted when nodes are unordered. +* Improved layout of Graphviz output. +* Enhanced debugging for migraphx-driver: consumed environment variables are printed, and timestamps and duration are added to the summary. +* Added a trim size flag to the verify option for migraphx-driver. +* Node names are printed to track parsing within the ONNX graph when using the `MIGRAPHX_TRACE_ONNX_PARSER` flag. +* Updated the accuracy checker to output test data with the `--show-test-data` flag. +* The `MIGRAPHX_TRACE_BENCHMARKING` option now allows the problem cache file to be updated after finding the best solution. + +#### Removed + +* `ROCM_USE_FLOAT8` macro.
+* The `BF16` GEMM test was removed for Navi21, as it is unsupported by rocBLAS and hipBLASLt on that platform. + +#### Optimized + +* Use common average in `compile_ops` to reduce run-to-run variations when tuning. +* Improved the performance of the TopK operator. +* Conform to a single layout (NHWC or NCHW) during compilation rather than combining two. +* Slice Channels Conv Optimization (slice output fusion). +* Horizontal fusion optimization after pointwise operations. +* Reduced the number of literals used in `GridSample` linear sampler. +* Fuse multiple outputs for pointwise operations. +* Fuse reshapes on pointwise inputs for MLIR output fusion. +* MUL operation not folded into the GEMM when the GEMM is used more than once. +* Broadcast not fused after convolution or GEMM MLIR kernels. +* Avoid reduction fusion when operator data-types mismatch. + +#### Resolved issues + +* Compilation workaround ICE in clang 20 when using `views::transform`. +* Fix bug with `reshape_lazy` in MLIR. +* Quantizelinear fixed for Nearbyint operation. +* Check for empty strings in ONNX node inputs for operations like Resize. +* Parse Resize fix: only check `keep_aspect_ratio_policy` attribute for sizes input. +* Nonmaxsuppression: fixed issue where identical boxes/scores not ordered correctly. +* Fixed a bug where events were created on the wrong device in a multi-gpu scenario. +* Fixed out of order keys in value for comparisons and hashes when caching best kernels. +* Fixed Controlnet MUL types do not match error. +* Fixed check for scales if ROI input is present in Resize operation. +* Einsum: Fixed a crash on empty squeeze operations. + +### **MIOpen** (3.5.0) + +#### Added + +* [Conv] Misa kernels for gfx950. +* [Conv] Enabled Split-K support for CK backward data solvers (2D). +* [Conv] Enabled CK wrw solver on gfx950 for the `BF16` data type. +* [BatchNorm] Enabled NHWC in OpenCL. +* Grouped convolution + activation fusion. +* Grouped convolution + bias + activation fusion. 
+* Composable Kernel (CK) can now be built inline as part of MIOpen. + +#### Changed + +* Changed to using the median value with outliers removed when deciding on the best solution to run. +* [Conv] Updated the igemm asm solver. + +#### Optimized + +* [BatchNorm] Optimized NHWC OpenCL kernels and improved heuristics. +* [RNN] Dynamic algorithm optimization. +* [Conv] Eliminated redundant clearing of output buffers. +* [RNN] Updated selection heuristics. +* Updated tuning for the AMD Instinct MI300 Series. + +#### Resolved issues + +* Fixed a segmentation fault when the user specified a smaller workspace than what was required. +* Fixed a layout calculation logic error that returned incorrect results and enabled less restrictive layout selection. +* Fixed memory access faults in misa kernels due to out-of-bounds memory usage. +* Fixed a performance drop on the gfx950 due to transpose kernel use. +* Fixed a memory access fault caused by not allocating enough workspace. +* Fixed a name typo that caused kernel mismatches and long startup times. + +### **MIVisionX** (3.3.0) + +#### Added + +* Support to enable/disable BatchPD code in VX_RPP extensions by checking the RPP_LEGACY_SUPPORT flag. + +#### Changed + +* VX_RPP extension : Version 3.1.0 release. +* Update the parameters and kernel API of Blur, Fog, Jitter, LensCorrection, Rain, Pixelate, Vignette and ResizeCrop wrt tensor kernels replacing the legacy BatchPD API calls in VX_RPP extensions. + +#### Known issues + +* Installation on RHEL and SLES requires the manual installation of the `FFMPEG` and `OpenCV` dev packages. + +#### Upcoming changes + +* Optimized audio augmentations support for VX_RPP. + +### **RCCL** (2.26.6) + +#### Added + +* Support for the extended fine-grained system memory pool. +* Support for gfx950. +* Support for `unroll=1` in device-code generation to improve performance. +* Set a default of 112 channels for a single node with `8 * gfx950`. +* Enabled LL128 protocol on the gfx950. 
+* The ability to choose the unroll factor at runtime using `RCCL_UNROLL_FACTOR`. This can be set at runtime to 1, 2, or 4. This change currently increases compilation and linking time because it triples the number of kernels generated. +* Added MSCCL support for AllGather multinode on the gfx942 and gfx950 (for instance, 16 and 32 GPUs). To enable this feature, set the environment variable `RCCL_MSCCL_FORCE_ENABLE=1`. The maximum message size for MSCCL AllGather usage is `12292 * sizeof(datatype) * nGPUs`. +* Thread thresholds for LL/LL128 are selected in Tuning Models for the AMD Instinct MI300X. This impacts the number of channels used for AllGather and ReduceScatter. The channel tuning model is bypassed if `NCCL_THREAD_THRESHOLDS`, `NCCL_MIN_NCHANNELS`, or `NCCL_MAX_NCHANNELS` are set. +* Multi-node tuning for AllGather, AllReduce, and ReduceScatter that leverages LL/LL64/LL128 protocols to use nontemporal vector load/store for tunable message size ranges. +* LL/LL128 usage ranges for AllReduce, AllGather, and ReduceScatter are part of the tuning models, which enable architecture-specific tuning in conjunction with the existing Rome Models scheme in RCCL. +* Two new APIs are exposed as part of an initiative to separate RCCL code. These APIs are `rcclGetAlgoInfo` and `rcclFuncMaxSendRecvCount`. However, user-level invocation requires that RCCL be built with `RCCL_EXPOSE_STATIC` enabled. + +#### Changed + +* Compatibility with NCCL 2.23.4. +* Compatibility with NCCL 2.24.3. +* Compatibility with NCCL 2.25.1. +* Compatibility with NCCL 2.26.6. + +#### Resolved issues + +* Resolved an issue when using more than 64 channels when multiple collectives are used in the same `ncclGroup()` call. +* Fixed unit test failures in tests ending with the `ManagedMem` and `ManagedMemGraph` suffixes. +* Fixed a suboptimal algorithmic switching point for AllReduce on the AMD Instinct MI300X. 
+* Fixed the known issue "When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault" with a design change to use `comm` instead of `rank` for `mscclStatus`. The global map for `comm` to `mscclStatus` is still not thread safe but should be explicitly handled by mutexes for read-write operations. This is tested for correctness, but there is a plan to use a thread-safe map data structure in an upcoming release. + +### **rocAL** (2.3.0) + +#### Added +* Extended support to rocAL's video decoder to use rocDecode hardware decoder. +* Setup - installs rocdecode dev packages for Ubuntu, RedHat, and SLES. +* Setup - installs turbojpeg dev package for Ubuntu and Redhat. +* rocAL's image decoder has been extended to support the rocJPEG hardware decoder. +* Numpy reader support for reading npy files in rocAL. +* Test case for numpy reader in C++ and python tests. + +#### Resolved issues +* `TurboJPEG` no longer needs to be installed manually. It is now installed by the package installer. +* Hardware decode no longer requires that ROCm be installed with the `graphics` usecase. + +#### Known issues +* Package installation on SLES requires manually installing `TurboJPEG`. +* Package installation on RHEL and SLES requires manually installing the `FFMPEG Dev` package. + +#### Upcoming changes + +* rocJPEG support for JPEG decode. + +### **rocALUTION** (4.0.0) + +#### Added + +* Support for gfx950. + +#### Changed + +* Switch to defaulting to C++17 when building rocALUTION from source. Previously rocALUTION was using C++14 by default. + +#### Optimized + +* Improved the user documentation. + +#### Resolved issues + +* Fix for GPU hashing algorithm when not compiling with -O2/O3. + +### **rocBLAS** (5.0.0) + +#### Added + +* Support for gfx950. +* Internal API logging for `gemm` debugging using `ROCBLAS_LAYER = 8`. +* Support for the AOCL 5.0 gcc build as a client reference library. 
+* The use of `PkgConfig` for client reference library fallback detection. + +#### Changed + +* `CMAKE_CXX_COMPILER` is now passed on during compilation for a Tensile build. +* The default atomics mode is changed from `allowed` to `not allowed`. + +#### Removed + +* Support code for non-production gfx targets. +* `rocblas_hgemm_kernel_name`, `rocblas_sgemm_kernel_name`, and `rocblas_dgemm_kernel_name` API functions. +* The use of `warpSize` as a constexpr. +* The use of deprecated behavior of `hipPeekLastError`. +* `rocblas_float8.h` and `rocblas_hip_f8_impl.h` files. +* `rocblas_gemm_ex3`, `rocblas_gemm_batched_ex3`, and `rocblas_gemm_strided_batched_ex3` API functions. + +#### Optimized + +* Optimized `gemm` by using `gemv` kernels when applicable. +* Optimized `gemv` for small `m` and `n` with a large batch count on gfx942. +* Improved the performance of Level 1 `dot` for all precisions and variants when `N > 100000000` on gfx942. +* Improved the performance of Level 1 `asum` and `nrm2` for all precisions and variants on gfx942. +* Improved the performance of Level 2 `sger` (single precision) on gfx942. +* Improved the performance of Level 3 `dgmm` for all precisions and variants on gfx942. + +#### Resolved issues + +* Fixed environment variable path-based logging to append multiple handle outputs to the same file. +* Support numerics when `trsm` is running with `rocblas_status_perf_degraded`. +* Fixed the build dependency installation of `joblib` on some operating systems. +* Return `rocblas_status_internal_error` when `rocblas_[set,get]_ [matrix,vector]` is called with a host pointer in place of a device pointer. +* Reduced the default verbosity level for internal GEMM backend information. +* Updated from the deprecated rocm-cmake to ROCmCMakeBuildTools. +* Corrected AlmaLinux GFortran package dependencies. 
+ +#### Upcoming changes + +* Deprecated the use of negative indices to indicate the default solution is being used for `gemm_ex` with `rocblas_gemm_algo_solution_index`. + +### **ROCdbgapi** (0.77.3) + +#### Added +* Support for the `gfx950` architecture. + +#### Removed +* Support for the `gfx940` and `gfx941` architectures. + +### **rocDecode** (1.0.0) + +#### Added + +* VP9 IVF container file parsing support in bitstream reader. +* CTest for VP9 decode on bitstream reader. +* HEVC/AVC/AV1/VP9 stream syntax error handling. +* HEVC stream bit depth change handling and DPB buffer size change handling through decoder reconfiguration. +* AVC stream DPB buffer size change handling through decoder reconfiguration. +* A new avcodec-based decoder built as a separate `rocdecode-host` library. + +#### Changed + +* rocDecode now uses the CMake `CMAKE_PREFIX_PATH` directive. +* Changed asserts in query API calls in the RocVideoDecoder utility class to error reports, so that an error during a query no longer causes a hard stop and the caller can decide how to respond. +* `libdrm_amdgpu` is now explicitly linked with rocdecode. + +#### Removed + +* `GetStream()` interface call from RocVideoDecoder utility class. + +#### Optimized + +* Reduced decode session start latency. +* Bitstream type detection optimization in bitstream reader. + +#### Resolved issues + +* Fixed a bug in the `videoDecodePicFiles` picture files sample that could result in an incorrect output frame count. +* Fixed a decoded frame output issue in video size change cases. +* Removed incorrect asserts of `bitdepth_minus_8` in `GetBitDepth()` and `num_chroma_planes` in `GetNumChromaPlanes()` API calls in the RocVideoDecoder utility class. + +### **rocFFT** (1.0.34) + +#### Added + +* Support for gfx950. + +#### Removed + +* Removed ``rocfft-rider`` legacy compatibility from clients. +* Removed support for the gfx940 and gfx941 targets from the client programs. +* Removed backward compatibility symlink for include directories.
+ +#### Optimized + +* Removed unnecessary HIP event/stream allocation and synchronization during MPI transforms. +* Implemented single-precision 1D kernels for lengths: + - 4704 + - 5488 + - 6144 + - 6561 + - 8192 +* Implemented single-kernel plans for some large 1D problem sizes, on devices with at least 160KiB of LDS. + +#### Resolved issues + +* Fixed kernel faults on multi-device transforms that gather to a single device, when the input/output bricks are not + contiguous. + +### **ROCgdb** (16.3) + +#### Added + +- Support for the `gfx950` architectures. + +#### Removed + +- Support for the `gfx940` and `gfx941` architectures. + +### **rocJPEG** (1.1.0) + +#### Added +* cmake config files. +* CTEST - New tests were introduced for JPEG batch decoding using various output formats, such as yuv_planar, y, rgb, and rgb_planar, both with and without region-of-interest (ROI). + +#### Changed +* Readme - cleanup and updates to pre-reqs. +* The `decode_params` argument of the `rocJpegDecodeBatched` API is now an array of `RocJpegDecodeParams` structs representing the decode parameters for the batch of JPEG images. +* `libdrm_amdgpu` is now explicitly linked with rocjpeg. + +#### Removed +* Dev Package - No longer installs pkg-config. + +#### Resolved issues +* Fixed a bug that prevented copying the decoded image into the output buffer when the output buffer is larger than the input image. +* Resolved an issue with resizing the internal memory pool by utilizing the explicit constructor of the vector's type during the resizing process. +* Addressed and resolved CMake configuration warnings. + +### **ROCm Bandwidth Test** (2.6.0) + +#### Added + +* Plugin architecture: + * `rocm_bandwidth_test` is now the `framework` for individual `plugins` and features. 
The `framework` is available at: `/opt/rocm/bin/` + + * Individual `plugins`: The `plugins` (shared libraries) are available at: `/opt/rocm/lib/rocm_bandwidth_test/plugins/` + +```{note} +Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/amd-mainline/README.md) file for details about the new options and outputs. +``` + +#### Changed + +* The `CLI` and options/parameters have changed due to the new plugin architecture, where the plugin parameters are parsed by the plugin. + +#### Removed + +- The old CLI, parameters, and switches. + +### **ROCm Compute Profiler** (3.2.3) + +#### Added + +##### CDNA4 (AMD Instinct MI350/MI355) support + +* Support for AMD Instinct MI350 Series GPUs with the addition of the following counters: + * VALU co-issue (Two VALUs are issued instructions) efficiency + * Stream Processor Instruction (SPI) Wave Occupancy + * Scheduler-Pipe Wave Utilization + * Scheduler FIFO Full Rate + * CPC ADC Utilization + * F6F4 data type metrics + * Update formula for total FLOPs while taking into account F6F4 ops + * LDS STORE, LDS LOAD, LDS ATOMIC instruction count metrics + * LDS STORE, LDS LOAD, LDS ATOMIC bandwidth metrics + * LDS FIFO full rate + * Sequencer -> TA ADDR Stall rates + * Sequencer -> TA CMD Stall rates + * Sequencer -> TA DATA Stall rates + * L1 latencies + * L2 latencies + * L2 to EA stalls + * L2 to EA stalls per channel + +* Roofline support for AMD Instinct MI350 Series GPUs. + +##### Textual User Interface (TUI) (beta version) + +* Text User Interface (TUI) support for analyze mode + * A command line based user interface to support interactive single-run analysis. + * To launch, use `--tui` option in analyze mode. For example, ``rocprof-compute analyze --tui``. + +##### PC Sampling (beta version) + +* Stochastic (hardware-based) PC sampling has been enabled for AMD Instinct MI300X Series and later GPUs. + +* Host-trap PC Sampling has been enabled for AMD Instinct MI200 Series and later GPUs. 
+ +* Support for sorting of PC sampling by type: offset or count. + +* PC Sampling Support on CLI and TUI analysis. + +##### Roofline + +* Support for Roofline plot on CLI (single run) analysis. + +* `FP4` and `FP6` data types have been added for roofline profiling on AMD Instinct MI350 Series. + +##### rocprofv3 support + +* ``rocprofv3`` is supported as the default backend for profiling. +* Support to obtain performance information for all channels for TCC counters. +* Support for profiling on AMD Instinct MI100 using ``rocprofv3``. +* Deprecation warning for ``rocprofv3`` interface in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. + +##### Others + +* Docker files to package the application and dependencies into a single portable and executable standalone binary file. + +* Analysis report based filtering + * ``-b`` option in profile mode now also accepts metric id(s) for analysis report based filtering. + * ``-b`` option in profile mode also accepts hardware IP block for filtering; however, this filter support will be deprecated soon. + * ``--list-metrics`` option added in profile mode to list possible metric id(s), similar to analyze mode. + +* Support MEM chart on CLI (single run). + +* ``--specs-correction`` option to provide missing system specifications for analysis. + +#### Changed + +* Changed the default ``rocprof`` version to ``rocprofv3``. This is used when environment variable ``ROCPROF`` is not set. +* Changed ``normal_unit`` default to ``per_kernel``. +* Decreased profiling time by not collecting unused counters in post-analysis. +* Updated Dash to >=3.0.0 (for web UI). +* Changed the condition when Roofline PDFs are generated during general profiling and ``--roof-only`` profiling (skip only when ``--no-roof`` option is present). +* Updated Roofline binaries: + * Rebuilt using the latest ROCm stack. + * Minimum OS distribution support for the roofline feature is now Ubuntu 22.04, RHEL 8, and SLES15 SP6.
+ +#### Removed + +* Roofline support for Ubuntu 20.04 and SLES below 15.6. +* Removed support for AMD Instinct MI50 and MI60. + +#### Optimized + +* ROCm Compute Profiler CLI has been improved to better display the GPU architecture analytics. + +#### Resolved issues + +* Fixed kernel name and kernel dispatch filtering when using ``rocprofv3``. +* Fixed an issue of TCC channel counters collection in ``rocprofv3``. +* Fixed peak FLOPS of `F8`, `I8`, `F16`, and `BF16` on AMD Instinct MI300. +* Fixed an issue where the memory clock was not detected when using ``amd-smi``. +* Fixed standalone GUI crashing. +* Fixed L2 read/write/atomic bandwidths on AMD Instinct MI350 Series. + +#### Known issues + +* On AMD Instinct MI100, accumulation counters are not collected, resulting in the following metrics failing to show up in the analysis: Instruction Fetch Latency, Wavefront Occupancy, LDS Latency. + * As a workaround, set the environment variable ``ROCPROF=rocprof`` to use ``rocprof v1`` for profiling on AMD Instinct MI100. + +* GPU id filtering is not supported when using ``rocprofv3``. + +* Analysis of previously collected workload data will not work due to a ``sysinfo.csv`` schema change. + * As a workaround, re-run the profiling operation for the workload and interrupt the process after 10 seconds. + Then copy the ``sysinfo.csv`` file from the new data folder to the old one. + This assumes your system specification hasn't changed since the creation of the previous workload data. + +* Analysis of new workloads might require providing shader/memory clock speed using the +``--specs-correction`` option if amd-smi or rocminfo does not provide clock speeds. + +* Memory chart on ROCm Compute Profiler CLI might look corrupted if the CLI width is too narrow. + +* Roofline feature is currently not functional on Azure Linux 3.0 and Debian 12.
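
The `sysinfo.csv` workaround in the known issues above ends with a file copy, which could be scripted roughly as follows. This is a minimal sketch: the function name `refresh_sysinfo` is hypothetical, the `.bak` backup is an added precaution rather than part of the documented workaround, and it assumes `sysinfo.csv` sits directly inside each workload data folder.

```python
import shutil
from pathlib import Path


def refresh_sysinfo(new_workload_dir, old_workload_dir):
    """Copy sysinfo.csv from a freshly profiled workload folder into an older one.

    Assumption: both arguments are workload data folders that hold
    sysinfo.csv at their top level. Any existing file in the old folder is
    kept as sysinfo.csv.bak so the change can be undone.
    """
    src = Path(new_workload_dir) / "sysinfo.csv"
    dst = Path(old_workload_dir) / "sysinfo.csv"
    if not src.is_file():
        raise FileNotFoundError(f"no sysinfo.csv found in {new_workload_dir}")
    if dst.exists():
        # Preserve the old-schema file before overwriting it.
        shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
    shutil.copy2(src, dst)
    return dst
```

As the known issue notes, this is only valid when the system specification has not changed since the older workload data was collected.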
+ +#### Upcoming changes + +* ``rocprof v1/v2/v3`` interfaces will be removed in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. Using ``rocprof v1/v2/v3`` interfaces will trigger a deprecation warning. + * To use ROCprofiler-SDK interface, set environment variable `ROCPROF=rocprofiler-sdk` and optionally provide profile mode option ``--rocprofiler-sdk-library-path /path/to/librocprofiler-sdk.so``. Add ``--rocprofiler-sdk-library-path`` runtime option to choose the path to ROCprofiler-SDK library to be used. +* Hardware IP block based filtering using ``-b`` option in profile mode will be removed in favor of analysis report block based filtering using ``-b`` option in profile mode. +* MongoDB database support will be removed, and a deprecation warning has been added to the application interface. +* Usage of ``rocm-smi`` is deprecated in favor of ``amd-smi``, and a deprecation warning has been added to the application interface. + +### **ROCm Data Center Tool** (1.1.0) + +#### Added + +* More profiling and monitoring metrics, especially for AMD Instinct MI300 and newer GPUs. +* Advanced logging and debugging options, including new log levels and troubleshooting guidance. + +#### Changed + +* Completed migration from legacy [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) to [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/). +* Reorganized the configuration files internally and improved [README/installation](https://github.com/ROCm/rdc/blob/amd-staging/README.md) instructions. +* Updated metrics and monitoring support for the latest AMD data center GPUs. + +#### Optimized + +- Integration with [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/) for performance metrics collection. +- Standalone and embedded operating modes, including streamlined authentication and configuration options. 
+- Support and documentation for diagnostic commands and GPU group management. +- [RVS](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/) test integration and reporting. + +### **ROCm SMI** (7.8.0) + +#### Added + +- Support for GPU metrics 1.8. + - Added new fields for `rsmi_gpu_metrics_t` including: + - Adding the following metrics to allow new calculations for violation status: + - Per XCP metrics `gfx_below_host_limit_ppt_acc[XCP][MAX_XCC]` - GFX Clock Host limit Package Power Tracking violation counts. + - Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts. + - Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to be below application clocks. + - Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see above new violation metrics). + - Increasing available JPEG engines to 40. + Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI. + +#### Removed + +- Removed backwards compatibility for `rsmi_dev_gpu_metrics_info_get()`'s `jpeg_activity` and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` and `xcp_stats.vcn_busy`. + - Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available. + - Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion for users about which field to use. By removing backward compatibility, it is easier to identify the relevant field. + - The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`. 
+ +```{note} +See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-7.0/CHANGELOG.md) for details, examples, and in-depth descriptions. +``` + +### **ROCm Systems Profiler** (1.1.0) + +#### Added + +- Profiling and metric collection capabilities for VCN engine activity, JPEG engine activity, and API tracing for rocDecode, rocJPEG, and VA-APIs. +- How-to document for VCN and JPEG activity sampling and tracing. +- Support for tracing Fortran applications. +- Support for tracing MPI API in Fortran. + +#### Changed + +- Replaced ROCm SMI backend with AMD SMI backend for collecting GPU metrics. +- ROCprofiler-SDK is now used to trace RCCL API and collect communication counters. + - Use the setting `ROCPROFSYS_USE_RCCLP = ON` to enable profiling and tracing of RCCL application data. +- Updated the Dyninst submodule to v13.0. +- Set the default value of `ROCPROFSYS_SAMPLING_CPUS` to `none`. + +#### Resolved issues + +- Fixed GPU metric collection settings with `ROCPROFSYS_AMD_SMI_METRICS`. +- Fixed a build issue with CMake 4. +- Fixed incorrect kernel names shown for kernel dispatch tracks in Perfetto. +- Fixed formatting of some output logs. + +### **ROCm Validation Suite** (1.2.0) + +#### Added + +- Support for AMD Instinct MI350X and MI355X GPUs. +- Introduced rotating buffer mechanism for GEMM operations. +- Support for read and write tests in Babel. +- Support for AMD Radeon RX9070 and RX9070GRE graphics cards. + +#### Changed + +- Migrated SMI API usage from `rocm-smi` to `amd-smi`. +- Updated `FP8` GEMM operations to use hipBLASLt instead of rocBLAS. + +### **rocPRIM** (4.0.0) + +#### Added + +* Support for gfx950. +* `rocprim::accumulator_t` to ensure parity with CCCL. +* Test for `rocprim::accumulator_t`. +* `rocprim::invoke_result_r` to ensure parity with CCCL. +* Function `is_build_in` into `rocprim::traits::get`. 
+* Virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. +* Initial value support to device level inclusive scans. +* New optimization to the backend for `device_transform` when the input and output are pointers. +* `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers. +* `rocprim::device_transform` API for n-ary transform operations, taking `n` input iterators inside a `rocprim::tuple`. +* `rocprim::key_value_pair::operator==`. +* The `rocprim::unrolled_copy` thread function to copy multiple items inside a thread. +* The `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`. +* `rocprim::int128_t` and `rocprim::uint128_t` to benchmarks for improved performance evaluation on 128-bit integers. +* `rocprim::int128_t` to the supported autotuning types to improve performance for 128-bit integers. +* The `rocprim::merge_inplace` function for merging in-place. +* Initial value support for warp- and block-level inclusive scan. +* Support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the `WITH_ROCRAND=ON` build flag. +* Tests and documentation to `lookback_scan_state`. It is still in the `detail` namespace. + +#### Changed + +* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits`, respectively. +* Marked the initialisation constructor of `rocprim::reverse_iterator` as `explicit`. Use `rocprim::make_reverse_iterator`. +* Merged `radix_key_codec` into type_traits system. +* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, and renamed the original `type_traits.hpp` to `type_traits_functions.hpp`. 
+* The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change. +The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing: + + * `rocprim::inclusive_scan` + * Previous default: `class AccType = typename std::iterator_traits::value_type>` + * Current default: `class AccType = rocprim::accumulator_t::value_type>` + * `rocprim::deterministic_inclusive_scan` + * Previous default: `class AccType = typename std::iterator_traits::value_type>` + * Current default: `class AccType = rocprim::accumulator_t::value_type>` + * `rocprim::exclusive_scan` + * Previous default: `class AccType = detail::input_type_t>` + * Current default: `class AccType = rocprim::accumulator_t>` + * `rocprim::deterministic_exclusive_scan` + * Previous default: `class AccType = detail::input_type_t>` + * Current default: `class AccType = rocprim::accumulator_t>` +* Undeprecated internal `detail::raw_storage`. +* A new version of `rocprim::thread_load` and `rocprim::thread_store` replaces the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result. +* Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better. +* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocPRIM::symbol`, letting the user link multiple libraries built with different versions of rocPRIM. + +#### Removed + +* `rocprim::detail::float_bit_mask` and relative tests, use `rocprim::traits::float_bit_mask` instead. 
+* `rocprim::traits::is_fundamental`, use `rocprim::traits::get::is_fundamental()` directly. +* The deprecated parameters `short_radix_bits` and `ShortRadixBits` from the `segmented_radix_sort` config. They were unused, it is only an API change. +* The deprecated `operator<<` from the iterators. +* The deprecated `TwiddleIn` and `TwiddleOut`. Use `radix_key_codec` instead. +* The deprecated flags API of `block_adjacent_difference`. Use `subtract_left()` or `block_discontinuity::flag_heads()` instead. +* The deprecated `to_exclusive` functions in the warp scans. +* The `rocprim::load_cs` from the `cache_load_modifier` enum. Use `rocprim::load_nontemporal` instead. +* The `rocprim::store_cs` from the `cache_store_modifier` enum. Use `rocprim::store_nontemporal` instead. +* The deprecated header file `rocprim/detail/match_result_type.hpp`. Include `rocprim/type_traits.hpp` instead. This header included: + * `rocprim::detail::invoke_result`. Use `rocprim::invoke_result` instead. + * `rocprim::detail::invoke_result_binary_op`. Use `rocprim::invoke_result_binary_op` instead. + * `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead. +* The deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead. +* Removed `rocprim/detail/radix_sort.hpp`, functionality can now be found in `rocprim/thread/radix_key_codec.hpp`. +* Removed C++14 support. Only C++17 is supported. +* Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed: + * `rocprim::device_warp_size()` + * For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory. 
+ * For run-time constants, this is replaced with `rocprim::arch::wavefront::size().` + * `rocprim::warp_size()` + * Use `rocprim::host_warp_size()`, `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead. + * `ROCPRIM_WAVEFRONT_SIZE` + * Use `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead. + * `__AMDGCN_WAVEFRONT_SIZE` + * This was a fallback define for the compiler's removed symbol, having the same name. +* This release removes support for custom builds on gfx940 and gfx941. + +#### Optimized + +* Improved performance of `rocprim::device_select` and `rocprim::device_partition` when using multiple streams on the AMD Instinct MI300 Series. + +#### Resolved issues + +* Fixed an issue where `device_batch_memcpy` reported benchmarking throughput being 2x lower than it was in reality. +* Fixed an issue where `device_segmented_reduce` reported autotuning throughput being 5x lower than it was in reality. +* Fixed device radix sort not returning the correct required temporary storage when a double buffer contains `nullptr`. +* Fixed constness of equality operators (`==` and `!=`) in `rocprim::key_value_pair`. +* Fixed an issue for the comparison operators in `arg_index_iterator` and `texture_cache_iterator`, where `<` and `>` comparators were swapped. +* Fixed an issue for the `rocprim::thread_reduce` not working correctly with a prefix value. + +#### Known issues + +* When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs. + +#### Upcoming changes + +* `rocprim::invoke_result_binary_op` and `rocprim::invoke_result_binary_op_t` are deprecated. Use `rocprim::accumulator_t` instead. 
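The rocPRIM note above about default scan accumulator types can be illustrated with a minimal host-only sketch in standard C++. It does not call rocPRIM; `inclusive_scan_with_acc` is a hypothetical helper introduced here only to show why an accumulator as narrow as the input type can overflow, and why a wider accumulator avoids it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side helper for illustration only; not part of rocPRIM.
// Runs an inclusive sum scan, keeping the running total in AccType.
template <class AccType, class InputT>
std::vector<long> inclusive_scan_with_acc(const std::vector<InputT>& input)
{
    std::vector<long> output;
    output.reserve(input.size());
    AccType acc{}; // running total, value-initialized to zero
    for (const InputT& value : input)
    {
        // The sum is computed after integer promotion, then narrowed back to
        // AccType. If AccType is too small, the running total wraps here.
        acc = static_cast<AccType>(acc + value);
        output.push_back(static_cast<long>(acc));
    }
    return output;
}
```

With `std::uint8_t` inputs `{200, 100}`, accumulating in `std::uint8_t` wraps the second element to 44, while accumulating in `int` gives 300. This is the kind of surprise the new defaults are described as avoiding.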
+ +### **ROCprofiler-SDK** (1.0.0) + +#### Added + +- Support for [rocJPEG](https://rocm.docs.amd.com/projects/rocJPEG/en/latest/index.html) API Tracing. +- Support for AMD Instinct MI350X and MI355X GPUs. +- `rocprofiler_create_counter` to facilitate adding custom derived counters at runtime. +- Support in `rocprofv3` for iteration based counter multiplexing. +- Perfetto support for counter collection. +- Support for negating `rocprofv3` tracing options when using aggregate options such as `--sys-trace --hsa-trace=no`. +- `--agent-index` option in `rocprofv3` to specify the agent naming convention in the output: + - absolute == node_id + - relative == logical_node_id + - type-relative == logical_node_type_id +- MI300 and MI350 stochastic (hardware-based) PC sampling support in ROCProfiler-SDK and `rocprofv3`. +- Python bindings for `rocprofiler-sdk-roctx`. +- SQLite3 output support for `rocprofv3` using `--output-format rocpd`. +- `rocprofiler-sdk-rocpd` package: + - Public API in `include/rocprofiler-sdk-rocpd/rocpd.h`. + - Library implementation in `librocprofiler-sdk-rocpd.so`. + - Support for `find_package(rocprofiler-sdk-rocpd)`. + - `rocprofiler-sdk-rocpd` DEB and RPM packages. +- `--version` option in `rocprofv3`. +- `rocpd` Python package. +- Thread trace as experimental API. +- ROCprof Trace Decoder as experimental API: + - Requires [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder). +- Thread trace option in the `rocprofv3` tool under the `--att` parameters: + - See [using thread trace with rocprofv3](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/amd-mainline/how-to/using-thread-trace.html) + - Requires [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder). +- `rocpd` output format documentation: + - Requires [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder). +- Perfetto support for scratch memory. +- Support in the `rocprofv3` avail tool for command-line arguments. 
+- Documentation for `rocprofv3` advanced options. +- AQLprofile is now available as open source. + +#### Changed + +- SDK to not create a background thread when every tool returns a nullptr from `rocprofiler_configure`. +- `vaddr-to-file-offset` mapping in `disassembly.hpp` to use the dedicated comgr API. +- `rocprofiler_uuid_t` ABI to hold a 128-bit value. +- `rocprofv3` shorthand argument for `--collection-period` to `-P` (upper-case) while `-p` (lower-case) is reserved for later use. +- Default output format for `rocprofv3` to `rocpd` (SQLite3 database). +- `rocprofv3` avail tool to be renamed from `rocprofv3_avail` to `rocprofv3-avail` tool. +- `rocprofv3` tool to facilitate thread trace and PC sampling on the same agent. + +#### Removed + +* Support for compilation of gfx940 and gfx941 targets. + +#### Resolved issues + +- Fixed missing callbacks around internal thread creation within counter collection service. +- Fixed potential data race in the ROCprofiler-SDK double buffering scheme. +- Fixed usage of std::regex in the core ROCprofiler-SDK library that caused segfaults or exceptions when used under dual ABI. +- Fixed Perfetto counter collection by introducing accumulation per dispatch. +- Fixed code object disassembly for missing function inlining information. +- Fixed queue preemption error and `HSA_STATUS_ERROR_INVALID_PACKET_FORMAT` error for stochastic PC-sampling in MI300X, leading to stabler runs. +- Fixed the system hang issue for host-trap PC-sampling on AMD Instinct MI300X. +- Fixed `rocpd` counter collection issue when counter collection alone is enabled. `rocpd_kernel_dispatch` table is updated to be populated by counters data instead of kernel_dispatch data. +- Fixed `rocprofiler_*_id_t` structs for inconsistency related to a "null" handle: + - The correct definition for a null handle is `.handle = 0` while some definitions previously used `UINT64_MAX`. +- Fixed kernel trace CSV output generated by `rocpd`. 
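The null-handle fix above can be sketched as follows. `example_id_t` and `is_null` are hypothetical stand-ins, not SDK definitions. The sketch only shows one practical consequence of defining null as `.handle = 0`: a value-initialized id is null, whereas a `UINT64_MAX` sentinel is not.

```cpp
#include <cstdint>

// Hypothetical stand-in for an id struct with a 64-bit handle field.
// The real rocprofiler_*_id_t types are defined in the ROCprofiler-SDK headers.
struct example_id_t
{
    std::uint64_t handle;
};

// Assumed convention from the changelog entry: a null id has handle == 0.
constexpr bool is_null(example_id_t id)
{
    return id.handle == 0;
}
```

Under this convention, `example_id_t{}` is null without any explicit sentinel assignment.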
+ +### **rocPyDecode** (0.6.0) + +#### Added + +* ``rocpyjpegdecode`` package. +* ``src/rocjpeg`` source new subfolder. +* ``samples/rocjpeg`` new subfolder. + +#### Changed +* Minimum version for rocdecode and rocjpeg updated to V1.0.0. + +### **rocRAND** (4.0.0) + +#### Added + +* Support for gfx950. +* Additional unit tests for `test_log_normal_distribution.cpp`, `test_normal_distribution.cpp`, `test_rocrand_mtgp32_prng.cpp`, `test_rocrand_scrambled_sobol32_qrng.cpp`, `test_rocrand_scrambled_sobol64_qrng.cpp`, `test_rocrand_sobol32_qrng.cpp`, `test_rocrand_sobol64_qrng.cpp`, `test_rocrand_threefry2x32_20_prng.cpp`, `test_rocrand_threefry2x64_20_prng.cpp`, `test_rocrand_threefry4x32_20_prng.cpp`, `test_rocrand_threefry4x64_20_prng.cpp`, and `test_uniform_distribution.cpp`. +* New unit tests for `include/rocrand/rocrand_discrete.h` in `test_rocrand_discrete.cpp`, `include/rocrand/rocrand_mrg31k3p.h` in `test_rocrand_mrg31k3p_prng.cpp`, `include/rocrand/rocrand_mrg32k3a.h` in `test_rocrand_mrg32k3a_prng.cpp`, and `include/rocrand/rocrand_poisson.h` in `test_rocrand_poisson.cpp`. + +#### Changed + +* Changed the return type for `rocrand_generate_poisson` for the `SOBOL64` and `SCRAMBLED_SOBOL64` engines. +* Changed the unnecessarily large 64-bit data type for constants used for skipping in `MRG32K3A` to the 32-bit data type. +* Updated several `gfx942` auto tuning parameters. +* Modified error handling and expanded the error information for the case of double-deallocation of the (scrambled) sobol32 and sobol64 constants and direction vectors. + +#### Removed + +* Removed inline assembly and the `ENABLE_INLINE_ASM` CMake option. Inline assembly was used to optimize multiplication in the Mrg32k3a and Philox 4x32-10 generators. It is no longer needed because the current HIP compiler is able to produce code with the same or better performance. +* Removed instances of the deprecated clang definition `__AMDGCN_WAVEFRONT_SIZE`. +* Removed C++14 support. 
Beginning with this release, only C++17 is supported. +* Directly accessing the (scrambled) sobol32 and sobol64 constants and direction vectors is no longer supported. For: + * `h_scrambled_sobol32_constants`, use `rocrand_get_scramble_constants32` instead. + * `h_scrambled_sobol64_constants`, use `rocrand_get_scramble_constants64` instead. + * `rocrand_h_sobol32_direction_vectors`, use `rocrand_get_direction_vectors32` instead. + * `rocrand_h_sobol64_direction_vectors`, use `rocrand_get_direction_vectors64` instead. + * `rocrand_h_scrambled_sobol32_direction_vectors`, use `rocrand_get_direction_vectors32` instead. + * `rocrand_h_scrambled_sobol64_direction_vectors`, use `rocrand_get_direction_vectors64` instead. + +#### Resolved issues + +* Fixed an issue where `mt19937.hpp` would cause kernel errors during auto tuning. + +#### Upcoming changes + +* Deprecated the rocRAND Fortran API in favor of hipfort. + +### **ROCr Debug Agent** (2.1.0) + +#### Added + +* The `-e` and `--precise-alu-exceptions` flags to enable precise ALU exceptions reporting on supported configurations. + +### **ROCr Runtime** (1.18.0) + +#### Added + +* New API `hsa_amd_memory_get_preferred_copy_engine` to get the preferred copy engine that can be used when calling `hsa_amd_memory_async_copy_on_engine`. +* New API `hsa_amd_portable_export_dmabuf_v2`, an extension of the existing `hsa_amd_portable_export_dmabuf` API to support a new flags parameter. This allows specifying the new `HSA_AMD_DMABUF_MAPPING_TYPE_PCIE` flag when exporting dma-bufs. +* New flag `HSA_AMD_VMEM_ADDRESS_NO_REGISTER` for use when calling the `hsa_amd_vmem_address_reserve` API. This allows virtual address range reservations for SVM allocations to be tracked when running in ASAN mode. +* New sub query `HSA_AMD_AGENT_INFO_CLOCK_COUNTERS` returns a snapshot of the underlying driver's clock counters that can be used for profiling. 
+ +### **rocSHMEM** (3.0.0) + +#### Added + +* Reverse Offload conduit. +* New APIs: `rocshmem_ctx_barrier`, `rocshmem_ctx_barrier_wave`, `rocshmem_ctx_barrier_wg`, `rocshmem_barrier_all`, `rocshmem_barrier_all_wave`, `rocshmem_barrier_all_wg`, `rocshmem_ctx_sync`, `rocshmem_ctx_sync_wave`, `rocshmem_ctx_sync_wg`, `rocshmem_sync_all`, `rocshmem_sync_all_wave`, `rocshmem_sync_all_wg`, `rocshmem_init_attr`, `rocshmem_get_uniqueid`, and `rocshmem_set_attr_uniqueid_args`. +* `dlmalloc` based allocator. +* XNACK support. +* Support for initialization with MPI communicators other than `MPI_COMM_WORLD`. + +#### Changed + +* Changed collective APIs to use `_wg` suffix rather than `_wg_` infix. + +#### Resolved issues + +* Resolved segfault in `rocshmem_wg_ctx_create`, now provides `nullptr` if `ctx` cannot be created. + +### **rocSOLVER** (3.30.0) + +#### Added + +* Hybrid computation support for existing routines: STEQR + +#### Optimized + +* Improved the performance of BDSQR and downstream functions, such as GESVD. +* Improved the performance of STEQR and downstream functions, such as SYEV/HEEV. +* Improved the performance of LARFT and downstream functions, such as GEQR2 and GEQRF. + +#### Resolved issues + +* Fixed corner cases that can produce NaNs in SYEVD for valid input matrices. + +### **rocSPARSE** (4.0.2) + +#### Added + +* The `SpGEAM` generic routine for computing sparse matrix addition in CSR format. +* The `v2_SpMV` generic routine for computing sparse matrix vector multiplication. As opposed to the deprecated `rocsparse_spmv` routine, this routine does not use a fallback algorithm if a non-implemented configuration is encountered and will return an error in such a case. For the deprecated `rocsparse_spmv` routine, the user can enable warning messages in situations where a fallback algorithm is used by either calling the `rocsparse_enable_debug` routine upfront or exporting the variable `ROCSPARSE_DEBUG` (with the shell command `export ROCSPARSE_DEBUG=1`). 
+* Half float mixed precision to `rocsparse_axpby` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `rocsparse_spvv` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `rocsparse_spmv` where A and X use `float16` and Y and the compute type use `float`. +* Half float mixed precision to `rocsparse_spmm` where A and B use `float16` and C and the compute type use `float`. +* Half float mixed precision to `rocsparse_sddmm` where A and B use `float16` and C and the compute type use `float`. +* Half float uniform precision to the `rocsparse_scatter` and `rocsparse_gather` routines. +* Half float uniform precision to the `rocsparse_sddmm` routine. +* The `rocsparse_spmv_alg_csr_rowsplit` algorithm. +* Support for gfx950. +* ROC-TX instrumentation support in rocSPARSE (not available on Windows or in the static library version on Linux). +* The `almalinux` operating system name to correct the GFortran dependency. + +#### Changed + +* Switch to defaulting to C++17 when building rocSPARSE from source. Previously rocSPARSE was using C++14 by default. + +#### Removed + +* The deprecated `rocsparse_spmv_ex` routine. +* The deprecated `rocsparse_sbsrmv_ex`, `rocsparse_dbsrmv_ex`, `rocsparse_cbsrmv_ex`, and `rocsparse_zbsrmv_ex` routines. +* The deprecated `rocsparse_sbsrmv_ex_analysis`, `rocsparse_dbsrmv_ex_analysis`, `rocsparse_cbsrmv_ex_analysis`, and `rocsparse_zbsrmv_ex_analysis` routines. + +#### Optimized + +* Reduced the number of template instantiations in the library to further reduce the shared library binary size and improve compile times. +* Allow SpGEMM routines to use more shared memory when available. This can speed up performance for matrices with a large number of intermediate products. 
+* Use of the `rocsparse_spmv_alg_csr_adaptive` or `rocsparse_spmv_alg_csr_default` algorithms in `rocsparse_spmv` to perform transposed sparse matrix multiplication (`C=alpha*A^T*x+beta*y`) resulted in unnecessary analysis on A and needless slowdown during the analysis phase. This has been improved by skipping the analysis when performing the transposed sparse matrix multiplication. +* Improved the user documentation. + +#### Resolved issues + +* Fixed an issue in the public headers where `extern "C"` was not wrapped by `#ifdef __cplusplus`, which caused failures when building C programs with rocSPARSE. +* Fixed a memory access fault in the `rocsparse_Xbsrilu0` routines. +* Fixed failures that could occur in `rocsparse_Xbsrsm_solve` or `rocsparse_spsm` with BSR format when using host pointer mode. +* Fixed ASAN compilation failures. +* Fixed a failure that occurred when using const descriptor `rocsparse_create_const_csr_descr` with the generic routine `rocsparse_sparse_to_sparse`. The issue was not observed when using non-const descriptor `rocsparse_create_csr_descr` with `rocsparse_sparse_to_sparse`. +* Fixed a memory leak in the rocSPARSE handle. + +#### Upcoming changes + +* Deprecated the `rocsparse_spmv` routine. Use the `rocsparse_v2_spmv` routine instead. +* Deprecated the `rocsparse_spmv_alg_csr_stream` algorithm. Use the `rocsparse_spmv_alg_csr_rowsplit` algorithm instead. +* Deprecated the `rocsparse_itilu0_alg_sync_split_fusion` algorithm. Use one of `rocsparse_itilu0_alg_async_inplace`, `rocsparse_itilu0_alg_async_split`, or `rocsparse_itilu0_alg_sync_split` instead. + +### **rocThrust** (4.0.0) + +#### Added + +* Additional unit tests for: binary_search, complex, c99math, catrig, ccosh, cexp, clog, csin, csqrt, and ctan. +* `test_param_fixtures.hpp` to store all the parameters for typed test suites. +* `test_real_assertions.hpp` to handle unit test assertions for real numbers. 
+* `test_imag_assertions.hpp` to handle unit test assertions for imaginary numbers. +* `clang++` is now used to compile google benchmarks on Windows. +* Support for gfx950. +* Merged changes from upstream CCCL/thrust 2.6.0. + +#### Changed + +* Updated the required version of Google Benchmark from 1.8.0 to 1.9.0. +* Renamed `cpp14_required.h` to `cpp_version_check.h`. +* Refactored `test_header.hpp` into `test_param_fixtures.hpp`, `test_real_assertions.hpp`, `test_imag_assertions.hpp`, and `test_utils.hpp`. This is done to prevent unit tests from having access to modules that they're not testing. This will improve the accuracy of code coverage reports. + +#### Removed + +* `device_malloc_allocator.h` has been removed. This header file was unused and should not impact users. +* Removed C++14 support. Only C++17 is now supported. +* `test_header.hpp` has been removed. The `HIP_CHECK` function, as well as the `test` and `inter_run_bwr` namespaces, have been moved to `test_utils.hpp`. +* `test_assertions.hpp` has been split into `test_real_assertions.hpp` and `test_imag_assertions.hpp`. + +#### Resolved issues + +* Fixed an issue with internal calls to unqualified `distance()` which would be ambiguous due to the visible implementation through ADL. + +#### Known issues + +* The order of the values being compared by `thrust::exclusive_scan_by_key` and `thrust::inclusive_scan_by_key` can change between runs when integers are being compared. This can cause incorrect output when a non-commutative operator such as division is being used. + +#### Upcoming changes + +* `thrust::device_malloc_allocator` is deprecated as of this version. It will be removed in an upcoming version. + +### **rocWMMA** (2.0.0) + +#### Added + +* Internal register layout transforms to support interleaved MMA layouts. +* Support for the gfx950 target. +* Mixed input `BF8`/`FP8` types for MMA support. +* Fragment scheduler API objects to embed thread block cooperation properties in fragments. 
+ +#### Changed + +* Augmented load/store/MMA internals with static loop unrolling. +* Updated linkage of `rocwmma::synchronize_workgroup` to inline. +* rocWMMA `mma_sync` API now supports `wave tile` fragment sizes. +* rocWMMA cooperative fragments are now expressed with fragment scheduler template arguments. +* rocWMMA cooperative fragments now use the same base API as non-cooperative fragments. +* rocWMMA cooperative fragments register usage footprint has been reduced. +* rocWMMA fragments now support partial tile sizes with padding. + +#### Removed + +* Support for the gfx940 and gfx941 targets. +* The rocWMMA cooperative API. +* Wave count template parameters from transforms APIs. + +#### Optimized + +* Added internal flow control barriers to improve assembly code generation and overall performance. +* Enabled interleaved layouts by default in MMA to improve overall performance. + +#### Resolved issues + +* Fixed a validation issue for small precision compute types `< B32` on gfx9. +* Fixed CMake validation of compiler support for `BF8`/`FP8` types. + +### **RPP** (2.0.0) + +#### Added + +* Bitwise NOT, Bitwise AND, and Bitwise OR augmentations on HOST (CPU) and HIP backends. +* Tensor Concat augmentation on HOST (CPU) and HIP backends. +* JPEG Compression Distortion augmentation on HIP backend. +* `log1p`, defined as `log (1 + x)`, tensor augmentation support on HOST (CPU) and HIP backends. +* JPEG Compression Distortion augmentation on HOST (CPU) backend. + +#### Changed + +* Handle creation and destruction APIs have been consolidated. Use `rppCreate()` for handle initialization and `rppDestroy()` for handle destruction. +* The `logical_operations` function category has been renamed to `bitwise_operations`. +* TurboJPEG package installation enabled for RPP Test Suite with `sudo apt-get install libturbojpeg0-dev`. Instructions have been updated in utilities/test_suite/README.md. +* The `swap_channels` augmentation has been changed to `channel_permute`. 
`channel_permute` now also accepts a new argument, `permutationTensor` (pointer to an unsigned int tensor), that provides the permutation order to swap the RGB channels of each input image in the batch in any order: + + `RppStatus rppt_swap_channels_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, rppHandle_t rppHandle);` + + changed to: + + `RppStatus rppt_channel_permute_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, Rpp32u *permutationTensor, rppHandle_t rppHandle);` + +#### Removed + +* Older versions of the RPP handle creation API, including `rppCreateWithBatchSize()`, `rppCreateWithStream()`, and `rppCreateWithStreamAndBatchSize()`. These have been replaced with `rppCreate()`. +* Older versions of the RPP handle destruction API, including `rppDestroyGPU()` and `rppDestroyHost()`. These have been replaced with `rppDestroy()`. + +#### Resolved issues + +* Test package - Debian packages will install required dependencies. + +### **Tensile** (4.44.0) + +#### Added + +- Support for gfx950. +- Code object compression via bundling. +- Support for non-default HIP SDK installations on Windows. +- Master solution library documentation. +- Compiler version-dependent assembler and architecture capabilities. +- Documentation from GitHub Wiki to ROCm docs. + +#### Changed + +- Loosened check for CLI compiler choices. +- Introduced 4-tuple targets for bundler invocations. +- Introduced PATHEXT extensions on Windows when searching for toolchain components. +- Enabled passing fully qualified paths to toolchain components. +- Enabled environment variable overrides when searching for a ROCm stack. +- Improved default toolchain configuration. +- Ignored f824 flake errors. + +#### Removed + +- Support for the gfx940 and gfx941 targets. +- Unused tuning files. +- Disabled tests. + +#### Resolved issues + +- Fixed configure time path not being invoked at build. 
+- Fixed find_package for msgpack to work with versions 5 and 6. +- Fixed RHEL 9 testing. +- Fixed gfx908 builds. +- Fixed the 'argument list too long' error. +- Fixed version typo in 6.3 changelog. +- Fixed improper use of aliases as nested namespace specifiers. + ## ROCm 6.4.3 See the [ROCm 6.4.3 release notes](https://rocm.docs.amd.com/en/docs-6.4.3/about/release-notes.html) @@ -374,7 +2040,7 @@ for a complete overview of this release. - Changed the name of the `power` field to `energy_accumulator` in the Python API for `amdsmi_get_energy_count()`. - Added violation status output for Graphics Clock Below Host Limit to `amd-smi` CLI: `amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`. - Users can retrieve violation status through either our Python or C++ APIs. Only available for MI300 series+ ASICs. + Users can retrieve violation status through either our Python or C++ APIs. Only available for MI300 Series+ ASICs. - Updated API `amdsmi_get_violation_status()` structure and CLI `amdsmi_violation_status_t` to include GFX Clk below host limit. @@ -394,7 +2060,7 @@ for a complete overview of this release. #### Resolved issues -- Fixed `amdsmi_get_gpu_asic_info` and `amd-smi static --asic` not displaying graphics version correctly for Instinct MI200 series, Instinct MI100 series, and RDNA3-based GPUs. +- Fixed `amdsmi_get_gpu_asic_info` and `amd-smi static --asic` not displaying graphics version correctly for Instinct MI200 Series, Instinct MI100 Series, and RDNA3-based GPUs. #### Known issues @@ -1033,7 +2699,7 @@ The following lists the backward incompatible changes planned for upcoming major #### Resolved issues -- Fixed `rsmi_dev_target_graphics_version_get`, `rocm-smi --showhw`, and `rocm-smi --showprod` not displaying graphics version correctly for Instinct MI200 series, MI100 series, and RDNA3-based GPUs. 
+- Fixed `rsmi_dev_target_graphics_version_get`, `rocm-smi --showhw`, and `rocm-smi --showprod` not displaying graphics version correctly for Instinct MI200 Series, MI100 Series, and RDNA3-based GPUs. > [!NOTE] > See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions. @@ -4369,7 +6035,7 @@ for a complete overview of this release. #### Resolved issues * Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112). -* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware. +* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-Series hardware. ## ROCm 6.1.1 @@ -4810,8 +6476,8 @@ for a complete overview of this release. #### Added * Added support for additional GPU architectures. - * Navi 3 series: gfx1100, gfx1101, and gfx1102. - * MI300 series: gfx942. + * Navi 3 Series: gfx1100, gfx1101, and gfx1102. + * MI300 Series: gfx942. ### **ROCm SMI** (6.0.0) diff --git a/RELEASE.md b/RELEASE.md index 9d8835de8..6fbadb5af 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -10,13 +10,15 @@ -# ROCm 6.4.3 release notes +# ROCm 7.0.0 release notes The release notes provide a summary of notable changes since the previous ROCm release. 
- [Release highlights](#release-highlights) -- [Operating system and hardware support changes](#operating-system-and-hardware-support-changes) +- [Operating system, hardware, and virtualization support changes](#operating-system-hardware-and-virtualization-support-changes) + +- [User space, driver, and firmware dependent changes](#user-space-driver-and-firmware-dependent-changes) - [ROCm components versioning](#rocm-components) @@ -24,6 +26,8 @@ The release notes provide a summary of notable changes since the previous ROCm r - [ROCm known issues](#rocm-known-issues) +- [ROCm resolved issues](#rocm-resolved-issues) + - [ROCm upcoming changes](#rocm-upcoming-changes) ```{note} @@ -33,47 +37,512 @@ documentation to verify compatibility and system requirements. ## Release highlights -ROCm 6.4.3 is a quality release that resolves the following issues. For changes to individual components, see [Detailed component changes](#detailed-component-changes). +The following are notable new features and improvements in ROCm 7.0.0. For changes to individual components, see +[Detailed component changes](#detailed-component-changes). -### AMDGPU driver updates +### Operating system, hardware, and virtualization support changes -* Resolved an issue causing performance degradation in communication operations, caused by increased latency in certain RCCL applications. The fix prevents unnecessary queue eviction during the fork process. -* Fixed an issue in the AMDGPU driver’s scheduler constraints that could cause queue preemption to fail during workload execution. +ROCm 7.0.0 adds support for [AMD Instinct MI355X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html) and [MI350X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi350x.html). For details, see the full list of [Supported GPUs (Linux)](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/system-requirements.html#supported-gpus). 
-### ROCm SMI update -* Fixed the failure to load GPU data like System Clock (SCLK) by adjusting the logic for retrieving GPU board voltage. +ROCm 7.0.0 adds support for the following operating systems and kernel versions: -### ROCm documentation updates +* Ubuntu 24.04.3 (kernel: 6.8 [GA], 6.14 [HWE]) +* Rocky Linux 9 (kernel: 5.14.0-570) -ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases. +ROCm 7.0.0 marks the end of support (EoS) for Ubuntu 24.04.2 (kernel: 6.8 [GA], 6.11 [HWE]) and SLES 15 SP6. -* [Tutorials for AI developers](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/) have been expanded with the following five new tutorials: - * Inference tutorials - * [ChatQnA vLLM deployment and performance evaluation](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/opea_deployment_and_evaluation.html) - * [Text-to-video generation with ComfyUI](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/t2v_comfyui_radeon.html) - * [DeepSeek Janus Pro on CPU or GPU](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/deepseek_janus_cpu_gpu.html) - * [DeepSeek-R1 with vLLM V1](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/vllm_v1_DSR1.html) - * GPU development and optimization tutorial: [MLA decoding kernel of AITER library](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/gpu_dev_optimize/aiter_mla_decode_kernel.html) - - For more information about the changes, see [Changelog for the AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/changelog.html). - -* ROCm provides a comprehensive ecosystem for deep learning development. For more details, see [Deep learning frameworks for ROCm](https://rocm.docs.amd.com/en/docs-6.4.3/how-to/deep-learning-rocm.html). 
AMD ROCm adds support for the following deep learning framework: - - * Megablocks is a light-weight library for mixture-of-experts (MoE) training. The core of the system is efficient "dropless-MoE" and standard MoE layers. Megablocks is integrated with Megatron-LM, where data and pipeline parallel training of MoEs is supported. It is currently supported on ROCm 6.3.0. For more information, see [Megablocks compatibility](https://rocm.docs.amd.com/en/docs-6.4.3/compatibility/ml-compatibility/megablocks-compatibility.html). - -* The [Data types and precision support](https://rocm.docs.amd.com/en/latest/reference/precision-support.html) topic now includes new hardware and library support information. - -## Operating system and hardware support changes - -Operating system and hardware support remain unchanged in this release. +For more information about supported operating systems, see [Supported operating systems](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/system-requirements.html#supported-operating-systems) and [install instructions](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/). See the [Compatibility matrix](../../docs/compatibility/compatibility-matrix.rst) for more information about operating system and hardware compatibility. +#### Virtualization support + +ROCm 7.0.0 introduces support for KVM Passthrough for AMD Instinct MI350X and MI355X GPUs. + +All KVM-based SR-IOV supported configurations require the GIM SR-IOV driver version 8.4.0.K. In addition, support for VMware ESXi 8 has been introduced for AMD Instinct MI300X GPUs. For more information, see [Virtualization Support](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/system-requirements.html#virtualization-support). + +### Deep learning and AI framework updates + +ROCm provides a comprehensive ecosystem for deep learning development. 
For more information, see [Deep learning frameworks for ROCm](https://rocm.docs.amd.com/en/latest/how-to/deep-learning-rocm.html) and the [Compatibility +matrix](../../docs/compatibility/compatibility-matrix.rst) for the complete list of Deep learning and AI framework versions tested for compatibility with ROCm. + + +#### Updated framework support + +ROCm 7.0.0 introduces several newly supported versions of Deep learning and AI frameworks: + +##### PyTorch + +ROCm 7.0.0 enables the following PyTorch features: + +* Support for PyTorch 2.7. +* Integrated Fused Rope kernels in APEX. +* Compilation of Python C++ extensions using ``amdclang++``. +* Support for channels-last NHWC format for convolutions via MIOpen. + +##### JAX + +ROCm 7.0.0 enables support for JAX 0.6.0. + +##### Megatron-LM + +Megatron-LM for ROCm now supports: + +* Fused Gradient Accumulation via APEX. + +* Fused Rope Kernel in APEX. + +* Fused_bias_swiglu kernel. + +##### TensorFlow + +ROCm 7.0.0 enables support for TensorFlow 2.19.1. + +##### ONNX Runtime + +ROCm 7.0.0 enables support for ONNX Runtime 1.22.0. + +##### vLLM + +* Support for Open Compute Project (OCP) `FP8` data type. +* `FP4` precision for Llama 3.1 405B. + +##### Triton + +ROCm 7.0.0 enables support for Triton 3.3.0. + +#### New frameworks + +AMD ROCm has officially added support for the following Deep learning and AI frameworks: + +* Ray is a unified framework for scaling AI and Python applications from your laptop to a full cluster, without changing your code. Ray consists of a core distributed runtime and a set of AI libraries for simplifying machine learning computations. It is currently supported on ROCm 6.4.1. For more information, see [Ray compatibility](https://advanced-micro-devices-rocm-internal--500.com.readthedocs.build/en/500/compatibility/ml-compatibility/ray-compatibility.html). 
+ +* llama.cpp is an open-source framework for Large Language Model (LLM) inference that runs on both central processing units (CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing a simple, dependency-free setup. It is currently supported on ROCm 6.4.0. For more information, see [llama.cpp compatibility](https://advanced-micro-devices-rocm-internal--500.com.readthedocs.build/en/500/compatibility/ml-compatibility/llama-cpp-compatibility.html). + + +### AMD GPU Driver/ROCm packaging separation + +The AMD GPU Driver (amdgpu) is now distributed separately from the ROCm software stack and is stored in its own location, ``/amdgpu/``, in the package repository at [repo.radeon.com](https://repo.radeon.com/amdgpu/). The first release is designated as AMD GPU Driver (amdgpu) version 30.10. See the [User and kernel-space support matrix](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/user-kernel-space-compat-matrix.html) for more information. + +[AMD SMI](https://github.com/ROCm/amdsmi) remains with the ROCm software stack under the ROCm organization repository. 
+ +### Consolidation of ROCm library repositories + +The following ROCm library repositories are migrating from multiple repositories under {fab}`github` [ROCm](https://github.com/ROCm) to a single repository under {fab}`github` [rocm-libraries](https://github.com/ROCm/rocm-libraries) in the ROCm organization GitHub: [hipBLAS](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblas), [hipBLASLt](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt), +[hipCUB](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipcub), [hipFFT](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipfft), [hipRAND](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hiprand), [hipSPARSE](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipsparse), [hipSPARSELt](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipsparselt), [MIOpen](https://github.com/ROCm/rocm-libraries/tree/develop/projects/miopen), [rocBLAS](https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocblas), [rocFFT](https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocfft), [rocPRIM](https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocprim), [rocRAND](https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocrand), [rocSPARSE](https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocsparse), [rocThrust](https://github.com/ROCm/rocm-libraries/tree/develop/projects/rocthrust), and [Tensile](https://github.com/ROCm/rocm-libraries/tree/develop/shared/tensile). + +Use the new ROCm Libraries repository to access source code, clone projects, and contribute to the code base and documentation. The change helps streamline development, CI, and integration. For more information about working with the ROCm Libraries repository, see [Contributing to the ROCm Libraries](https://github.com/ROCm/rocm-libraries/blob/develop/CONTRIBUTING.md) on GitHub. 
+ +Other ROCm libraries are also in the process of migration along with ROCm tools to {fab}`github` [rocm-systems](https://github.com/ROCm/rocm-systems). For the latest status information, see the [README file](https://github.com/ROCm/rocm-systems/blob/develop/README.md). The official completion of migration will be communicated in a future ROCm release. + +### HIP API compatibility improvements + +To improve code portability between AMD ROCm and other programming models, the HIP API has been updated in ROCm 7.0.0 to simplify cross-platform programming. These changes are incompatible with prior ROCm releases and might require recompiling existing HIP applications for use with ROCm 7.0.0. For more information, see the [HIP API 7.0.0 changes](https://rocm.docs.amd.com/projects/HIP/en/docs-develop/hip-7-changes.html) and the [HIP changelog](#hip-7-0-0) below. + +### HIP runtime updates + +The HIP runtime now includes support for: + +* Open Compute Project (OCP) MX floating-point `FP4`, `FP6`, and `FP8` data types and APIs. +* Improved logging, with more precise pointer information and launch arguments for better tracking and debugging in dispatch methods. +* `constexpr` operators for `FP16` and `BF16`. +* `__syncwarp` operation. +* The `_sync()` versions of crosslane builtins such as `shfl_sync()`, which are enabled by default. These can be disabled by setting the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`. +* Warp level primitives: `__syncwarp` and reduce intrinsics (for example, `__reduce_add_sync()`). +* Uncached memory allocation through the following API flags: + - `hipExtHostRegisterUncached`, used in `hipHostRegister`. + - `hipHostMallocUncached` and `hipHostAllocUncached`, used in `hipHostMalloc` and `hipHostAlloc`. +* A new HIP runtime attribute that exposes how many compute dies (chiplets, xcc) are available on a given GPU. 
Developers can get this attribute via the API `hipDeviceGetAttribute`, to make use of the best cache locality in a kernel, and optimize the Kernel launch grid layout, for performance improvement. + +Additionally, the HIP runtime includes functional improvements, which improve functionality, runtime performance, and the user experience. For more information, see [HIP changelog](#hip-7-0-0) below. + +### Compiler changes and improvements + +ROCm 7.0.0 introduces the AMD Next-Gen Fortran compiler. ``llvm-flang`` (sometimes called ``new-flang`` or ``flang-18``) is a re-implementation of the Fortran frontend. It is a strategic replacement for ``classic-flang`` and is developed in LLVM’s upstream repo at [llvm/llvm-project](https://github.com/llvm/llvm-project/tree/main/flang). + +Key compiler enhancements include: + +* Compiler: + * Improved memory load and store instructions. + * Updated clang/llvm to AMD clang version 20.0.0git (equivalent to LLVM 20.0.0 with additional out-of-tree patches). + * Support added for separate debug file generation for device code. + * `llvm-strip` now supports AMD GPU device code objects (EM_AMDGPU). +* Comgr: + * Added support for an in-memory virtual file system (VFS) for storing temporary files generated during intermediate compilation steps. This is designed to improve performance by reducing on-disk file I/O. Currently, VFS is supported only for the device library link step, with plans for expanded support in future releases. +* SPIR-V: + * Improved [target-specific extensions](https://github.com/ROCm/llvm-project/blob/c2535466c6e40acd5ecf6ba1676a4e069c6245cc/clang/docs/LanguageExtensions.rst): + * Added a new target-specific builtin ``__builtin_amdgcn_processor_is`` for late or deferred queries of the current target processor. + * Added a new target-specific builtin ``__builtin_amdgcn_is_invocable``, enabling fine-grained, per-builtin feature availability. 
+* The compiler driver now uses parallel code generation by default when compiling using full LTO (including when using the `-fgpu-rdc` option) for HIP. This divides the optimized LLVM IR module into roughly equal partitions before instruction selection and lowering, which can help improve build times. + + Each kernel in the linked LTO module can be put in a separate partition, and any non-inlined function it depends on can be copied alongside it. Thus, while parallel code generation can improve build time, it can duplicate non-inlined, non-kernel functions across multiple partitions, potentially increasing the binary size of the final object file. + + * Compiler option `-flto-partitions=`: + + Equivalent to the `--lto-partitions=` LLD option. Controls the number of partitions used for parallel code generation when using full LTO (including when using `-fgpu-rdc`). The number of partitions must be greater than 0, and a value of 1 turns off the feature. The default value is 8. + + Developers are encouraged to experiment with different numbers of partitions using the `-flto-partitions` Clang command line option. Recommended values are 1 to 16 partitions, with especially large projects containing many kernels potentially benefiting from up to 64 partitions. It is not recommended to use a value greater than the number of threads on the machine. Smaller projects, or those containing only a few kernels, might not benefit at all from partitioning and might even experience a slight increase in build time due to the small overhead of analyzing and partitioning the modules. + +* HIPIFY now supports NVIDIA CUDA 12.9.1 APIs: + * Added support for all new device and host APIs, including `FP4`, `FP6`, and `FP128`, along with support for the corresponding ROCm HIP equivalents. + +* The HIPCC Perl scripts (`hipcc.pl` and `hipconfig.pl`) have been removed in this release. 
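The `-flto-partitions=` guidance in the compiler notes above can be condensed into a small helper. This is a minimal sketch and not part of ROCm: the function names `suggest_lto_partitions` and `lto_partitions_flag`, the `large_project_kernels` cutoff, and the idea of capping at the kernel count are assumptions made for illustration. Only the option name and its stated bounds (greater than 0, 1 turns the feature off, 1 to 16 recommended, up to 64 for especially large projects, no more than the machine's thread count) come from the notes.

```python
import os


def suggest_lto_partitions(kernel_count, cpu_threads=None, large_project_kernels=1000):
    """Pick a starting value for -flto-partitions (hypothetical heuristic)."""
    if cpu_threads is None:
        cpu_threads = os.cpu_count() or 1
    # A value of 1 turns partitioning off; projects with a single kernel
    # have nothing to split.
    if kernel_count <= 1:
        return 1
    # The notes recommend 1 to 16 partitions, and up to 64 for especially
    # large projects. The 1000-kernel cutoff is an assumption, not a ROCm rule.
    upper = 64 if kernel_count >= large_project_kernels else 16
    # Cap at the thread count (per the notes) and at the kernel count
    # (an assumption: extra partitions beyond one per kernel add no work).
    return max(1, min(kernel_count, upper, cpu_threads))


def lto_partitions_flag(partitions):
    """Format the compiler option; the value must be greater than 0."""
    if partitions < 1:
        raise ValueError("number of partitions must be greater than 0")
    return f"-flto-partitions={partitions}"


if __name__ == "__main__":
    # Example: a project with 200 kernels on the current machine.
    print(lto_partitions_flag(suggest_lto_partitions(kernel_count=200)))
```

Treat the result only as a starting point and measure build time and binary size, since the notes encourage experimenting with different partition counts.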
+ +### Library changes and improvements + +#### New data type support + +MX-compliant data types bring microscaling support to ROCm. For more information, see the [OCP Microscaling (MX) Formats Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf). ROCm 7.0.0 enables functional support for MX data types `FP4`, `FP6`, and `FP8` on AMD Instinct MI350 Series GPUs in these ROCm libraries: + +* Composable Kernel (`FP4`, `FP6`, and `FP8` only) +* hipBLASLt + +The following libraries are updated to support the Open Compute Project (OCP) floating-point `FP8` format on MI350 Series GPUs instead of the NANOO `FP8` format: + +* Composable Kernel +* hipBLASLt +* hipSPARSELt +* MIGraphX +* rocWMMA + +For more information about data types, see [Data types and precision support](https://rocm.docs.amd.com/en/latest/reference/precision-support.html). + +#### hipBLASLt improvement + +GEMM performance has been improved for `FP8`, `FP16`, `BF16`, and `FP32` data types. + +For more information about hipBLASLt changes, see the [hipBLASLt changelog](#hipblaslt-1-0-0) below. + +#### MIGraphX improvements + +* Support for OCP `FP8` on AMD Instinct MI350X and MI355X GPUs. +* Support for PyTorch 2.7 via Torch-MIGraphX. +* Improved performance of Generative AI models. +* Added MSFT Contrib Operators for an improved ONNX Runtime experience. + +For more information about MIGraphX changes, see the [MIGraphX changelog](#migraphx-2-13-0) below. + +#### rocSHMEM Reverse Offload conduit inter-node support + +The rocSHMEM communications library has added the RO (Reverse Offload) inter-node communication backend, which enables communication between GPUs on different nodes through a NIC, using a host-based CPU proxy to forward communication orders to and from the GPU. Inter-node communication requires MPI and is tested with Open MPI and CX7 IB NICs. 
For more information about installing rocSHMEM, see [available network backends](https://rocm.docs.amd.com/projects/rocSHMEM/en/develop/install.html#available-network-backends). + +See the [rocSHMEM changelog](#rocshmem-3-0-0) for more details. + +### Tool changes and improvements + +#### AMD SMI + +Key enhancements to AMD SMI include the ability to reload the AMD GPU driver from the +CLI or API. The `amd-smi` command-line interface gains a new default view, `amd-smi` topology support +in guest environments, and performance optimizations. Additionally, AMD SMI library APIs +have been refined for improved usability. See the [AMD SMI changelog](#amd-smi-26-0-0) for more details. + +#### ROCgdb + +ROCgdb now supports `FP4`, `FP6`, and `FP8` micro-scaling (MX) data types with AMD Instinct MI350 Series GPUs. + +See the [ROCgdb changelog](#rocgdb-16-3) for more details. + +#### ROCm Compute Profiler + +ROCm Compute Profiler includes the following key changes: + +* An interactive command line with a Textual User Interface (TUI) has been added to analyze mode. For more details, see [TUI analysis](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/amd-staging/how-to/analyze/tui.html). +* Support added for advanced data types: `FP4` and `FP6`. +* Support for AMD Instinct MI355X and MI350X with the addition of performance counters: CPC, SPI, SQ, TA/TD/TCP, and TCC. +* Roofline enhancement added for AMD Instinct MI350 Series. +* Improved support for Selective Kernel profiling. +* Program Counter (PC) sampling (Software-based) feature has been enabled for AMD Instinct MI200, MI300X, MI350X, and MI355X GPUs. This feature helps in GPU profiling to understand code execution patterns and hotspots during GPU kernel execution. For more details, see [Using PC sampling in ROCm Compute Profiler](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/amd-staging/how-to/pc_sampling.html). 
+* Program Counter (PC) sampling (Hardware-based, Stochastic) feature has been enabled for AMD Instinct MI300X, MI350, and MI355X GPUs. +* Docker files have been added to package the application and dependencies into a single portable and executable standalone binary file. + +See the [ROCm Compute Profiler changelog](#rocm-compute-profiler-3-2-3) for more details. + +#### ROCm Data Center (RDC) improvements + +The ROCm Data Center tool (RDC) streamlines the administration of AMD GPUs in cluster data center environments. ROCm 7.0.0 introduces new data center management and monitoring tools for system administrators. For more information, see [ROCm Data Center (RDC) tool documentation](https://rocm.docs.amd.com/projects/rdc/en/latest/index.html). + +#### ROCm Systems Profiler + +ROCm Systems Profiler includes the following key changes: + +* Improved profiling support for Computer Vision workloads through rocDecode and rocJPEG API tracing and engine activity sampling. +* Network profiling support has been added to AMD Instinct MI300X, MI350X, and MI355X. +* Improved profiling of the communication layer with RCCL and MPI API tracing. + +See the [ROCm Systems Profiler changelog](#rocm-systems-profiler-1-1-0) for more details. + +#### ROCm Validation Suite + +In ROCm 7.0.0, ROCm Validation Suite includes support for the AMD Instinct MI355X and MI350X GPUs in the IET (Integrated Execution Test), GST (GPU Stress Test), and Babel (memory bandwidth test) modules. + +See the [ROCm Validation Suite changelog](#rocm-validation-suite-1-2-0) for more details. + +#### ROCprofiler-SDK + +##### Core SDK enhancements + +* ROCprofiler-SDK is now compatible with the HIP 7.0.0 API. +* ROCprofiler-SDK adds support for AMD Instinct MI350X and MI355X GPUs. +* Stochastic and host-trap PC sampling support has been added for all AMD Instinct MI300 and MI350 Series GPUs, which +provides information particularly useful for understanding stalls during kernel execution. 
+* The added support for tracing events surfaced by AMD's Kernel Fusion Driver (KFD) captures low-level driver routines involved in mapping, invalidation, and migration of data between CPU and GPU memories. Such events are central to the support for [Unified Memory](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/memory_management/unified_memory.html) on AMD systems. Tracing of KFD events helps to detect performance problems arising from excessive data migration. +* New APIs are added for profiling applications using thread traces (beta) +which facilitates profiling wavefronts at the instruction timing level. + +##### rocpd + +The ROCm Profiling Data (``rocpd``) is now the default output format for ``rocprofv3``. +A subproject of the ROCprofiler-SDK, ``rocpd`` enables saving profiling results to a SQLite3 database, providing a structured and +efficient foundation for analysis and post-processing. + +##### rocprofv3 CLI tool enhancements + +* Added stochastic and host-trap PC sampling support for all AMD Instinct MI300 and MI350 Series GPUs. +* HIP streams translate to Queues in Time Traces in Perfetto output. +* Support for thread trace service. + +See the [ROCprofiler-SDK changelog](#rocprofiler-sdk-1-0-0) for more details. + +### ROCm Offline Installer Creator updates + +The ROCm Offline Installer Creator 7.0.0 includes the following features and improvements: + +* Added support for Rocky Linux 9.6. +* Added support for the new graphics repo structure for graphics/Mesa related packages. +* Improvements to kernel header version matching for AMDGPU driver installation. +* Added support for creating an offline installer when the kernel version of the target operating system differs from the operating system of the host creating the installer (for Ubuntu 22.04 and 24.04 only). + +See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-offline-installer.html) for more information. 
+ +### ROCm Runfile Installer updates + +The ROCm Runfile Installer 7.0.0 adds the following features and improvements: + +* Added support for Rocky Linux 9.6. +* Added `untar` mode for the `.run` file to allow extraction of ROCm to a given directory, similar to a normal tarball. +* Added an RVS test script. +* Fixes to the rocm-examples test script. +* Fixes for `clinfo` and OpenCL use after installation. + +For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-runfile-installer.html). + +### ROCm documentation updates + +ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases. + +* The ROCm AI [training](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/index.html) and + [inference](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/index.html) + benchmarking guides have been updated with expanded model coverage and + optimized Docker environments. Highlights include: + + * The [Training a model with Primus and Megatron](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.html) benchmarking guide + now leverages the unified AMD Primus framework with the Megatron backend. See [Primus: A Lightweight, Unified Training Framework for Large Models on AMD + GPUs](https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html) for an introduction to Primus. + + * The [Training a model with PyTorch](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.html) benchmarking guide + now includes fine-tuning for OpenAI GPT OSS and Qwen models. It also includes a multi-node training example. 
+ + * The [Training a model with JAX MaxText](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.html) benchmarking guide + now supports [MAD](https://github.com/ROCm/MAD)-integrated benchmarking. The MaxText training environment now uses JAX 0.6.0 or 0.5.0. FP8 quantized training is supported with JAX 0.5.0. + + * The [vLLM inference performance testing](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/benchmark-docker/vllm.html) documentation + now features clearer serving and throughput benchmarking commands -- for improved transparency of model benchmarking configurations. The vLLM inference + environment now uses vLLM 0.10.1 and includes improved default configurations. + + These training and inference resources will continue to grow with ongoing improvements and expanded model coverage. + For a searchable view of supported frameworks and models, see [AMD Infinity Hub](https://www.amd.com/en/developer/resources/infinity-hub.html). + +* [Tutorials for AI developers](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/) have been expanded with the following new inference tutorial: [PD disaggregation with SGLang](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/SGlang_PD_Disagg_On_AMD_GPU.html) + + In addition, the [AI agent with MCPs using vLLM and PydanticAI](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/build_airbnb_agent_mcp.html) tutorial has been updated. For more information about the changes, see [Changelog for the AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/changelog.html). + +* Documentation for [rocCV](https://advanced-micro-devices-roccv--28.com.readthedocs.build/en/28/), an efficient GPU-accelerated library for image pre- and post-processing, has been added. rocCV is in an early access state, and using it on production workloads is not recommended. 
+ +* ROCm Math libraries support a wide range of data types, enabling optimized performance across various precision requirements. The following Math libraries are now updated with new precision content. For more information, click the Math library’s link: + + ::::{grid} 2 + :margin: auto 0 auto auto + :::{grid} + :margin: auto 0 auto auto + * [hipBLAS](https://rocm.docs.amd.com/projects/hipBLAS/en/develop/reference/data-type-support.html) + * [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/develop/reference/data-type-support.html) + * [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/develop/reference/precision.html) + ::: + :::{grid} + :margin: auto 0 auto auto + * [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/develop/reference/precision.html) + * [Tensile](https://rocm.docs.amd.com/projects/Tensile/en/develop/src/reference/precision-support.html#precision-support) + ::: + :::: + +* ROCm offers a comprehensive ecosystem for deep learning development, featuring libraries optimized for deep learning operations and ROCm-aware versions of popular deep learning frameworks and libraries. The following deep learning frameworks' content now includes release notes and known issues: + + ::::{grid} 1 + :margin: auto 0 auto auto + :::{grid-item} + :margin: auto 0 auto auto + * [PyTorch](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/pytorch-compatibility.html) + * [JAX](https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/jax-compatibility.html) + ::: + :::: + +* ROCm components support a wide range of environment variables that can be used for testing, logging, debugging, experimental features, and more. The following components have been updated with new environment variable content. 
For more information, click the component’s link: + + ::::{grid} 2 + :margin: auto 0 auto auto + :::{grid-item} + :margin: auto 0 auto auto + * [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/develop/reference/env-variables.html) + * [hipSPARSELt](https://rocm.docs.amd.com/projects/hipSPARSELt/en/develop/reference/env-variables.html) + * [ROCm Performance Primitives (RPP)](https://rocm.docs.amd.com/projects/rpp/en/develop/reference/rpp-env-variables.html) + ::: + :::{grid-item} + :margin: auto 0 auto auto + * [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/develop/reference/env_variables.html) + * [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/develop/reference/env_variables.html) + * [Tensile](https://rocm.docs.amd.com/projects/Tensile/en/develop/src/reference/environment-variables.html) + ::: + :::: + +* Modern computing tasks often require balancing numerical precision against hardware resources and processing speed. Low precision floating point number formats in HIP include `FP4` (4-bit) and `FP6` (6-bit), which reduce memory and bandwidth requirements. For more information, see the updated [Low precision floating point types](https://rocm.docs.amd.com/projects/HIP/en/docs-develop/reference/low_fp_types.html) topic. + +## User space, driver, and firmware dependent changes + +GPU Software for AMD datacenter GPU products requires you to maintain a hardware and software stack with interdependencies between the GPU and baseboard firmware, AMD GPU drivers, and the ROCm user space software. Starting ROCm 7.0.0 release, we are publicly documenting these interdependencies. Note that while AMD publishes drivers and ROCm user space, your server or infrastructure provider publishes the GPU and baseboard firmware by bundling AMD’s firmware releases via AMD's Platform Level Data Model (PLDM) bundle (Firmware), which includes Integrated Firmware Image (IFWI). + +The GPU and baseboard firmware releases numbering may vary by GPU family. 
Note that ROCm 7.0.0 is the first release in which the AMD GPU Driver (amdgpu) is versioned independently of ROCm. + +
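+ | ROCm Version | GPU | PLDM Bundle (Firmware) | AMD GPU Driver (amdgpu) | AMD GPU Virtualization Driver (GIM) |
+ |---|---|---|---|---|
+ | ROCm 7.0.0 | MI355X | 01.25.13.04 (or later)<br>01.25.11.02 | 30.10 | 8.4.0.K |
+ | ROCm 7.0.0 | MI350X | 01.25.13.04 (or later)<br>01.25.11.02 | 30.10 | 8.4.0.K |
+ | ROCm 7.0.0 | MI325X | 01.25.04.00 (or later)<br>01.25.03.03 | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (1–3) | 8.4.0.K |
+ | ROCm 7.0.0 | MI300X | 01.25.03.02 (or later) | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (0–3)<br>6.2.x where x (1–4) | 8.4.0.K |
+ | ROCm 7.0.0 | MI300A | 26 (or later) | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (0–3)<br>6.2.x where x (1–4) | Not Applicable |
+ | ROCm 7.0.0 | MI250X | IFWI 47 (or later) | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (0–3)<br>6.2.x where x (1–4) | Not Applicable |
+ | ROCm 7.0.0 | MI250 | MU5 w/ IFWI 75 (or later) | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (0–3)<br>6.2.x where x (1–4) | Not Applicable |
+ | ROCm 7.0.0 | MI210 | MU5 w/ IFWI 75 | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (0–3)<br>6.2.x where x (1–4) | 8.4.0.K |
+ | ROCm 7.0.0 | MI100 | VBIOS D3430401-037 | 30.10<br>6.4.z where z (0–3)<br>6.3.y where y (0–3)<br>6.2.x where x (1–4) | Not Applicable |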
+ + ### New feature details + +#### AMD SMI changes dependent on PLDM bundles (firmware) + +New APIs introduced in AMD SMI for ROCm 7.0.0 provide additional data for the AMD Instinct products. To support these features, the following firmware is required for each GPU: + +* AMD Instinct MI355X - PLDM bundle 01.25.13.04 + +* AMD Instinct MI350X - PLDM bundle 01.25.13.04 + +* AMD Instinct MI325X - PLDM bundle 01.25.04.00 + +* AMD Instinct MI300X - PLDM bundle 01.25.03.12 + +If ROCm 7.0.0 is installed on a system with an earlier version of the PLDM bundle (firmware), the new APIs return `N/A` to indicate lack of support for these items. + +#### Enhanced temperature telemetry introduced in AMD SMI for MI355X and MI350X GPUs + +AMD SMI in ROCm 7.0.0 supports enhanced temperature metrics and temperature anomaly detection for AMD Instinct MI350X and MI355X GPUs when paired with PLDM bundle 01.25.13.04. + +For more information on these features, see the [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.0/CHANGELOG.md). + +#### KVM SR-IOV virtualization changes dependent on open source AMD GPU Virtualization Driver (GIM) + +KVM SR-IOV support for all Instinct GPUs requires the open source AMD GPU Virtualization Driver (GIM) 8.4.0.K. For detailed support information, see [virtualization support](https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/system-requirements.html#virtualization-support) and the [GIM release notes](https://github.com/amd/MxGPU-Virtualization/releases). + +#### GPU partitioning support for AMD Instinct MI355X and MI350X GPUs + +NPS2 and DPX partitioning on bare metal is enabled on AMD Instinct MI355X and MI350X GPUs on ROCm 7.0.0 when paired with PLDM bundle 01.25.13.04. + ## ROCm components -The following table lists the versions of ROCm components for ROCm 6.4.3. +The following table lists the versions of ROCm components for ROCm 7.0.0, including any version +changes from 6.4.3 to 7.0.0.
Click the component's updated version to go to a list of its changes. + Click {fab}`github` to go to the component's source code on GitHub.
@@ -96,47 +565,47 @@ Click {fab}`github` to go to the component's source code on GitHub. Libraries Machine learning and computer vision Composable Kernel - 1.1.0 + 1.1.0 ⇒ 1.1.0 MIGraphX - 2.12.0 + 2.12.0 ⇒ 2.13.0 MIOpen - 3.4.0 - + 3.4.0 ⇒ 3.5.0 + MIVisionX - 3.2.0 + 3.2.0 ⇒ 3.3.0 rocAL - 2.2.0 + 2.2.0 ⇒ 2.3.0 rocDecode - 0.10.0 + 0.10.0 ⇒ 1.0.0 rocJPEG - 0.8.0 + 0.8.0 ⇒ 1.1.0 rocPyDecode - 0.3.1 + 0.3.1 ⇒ 0.6.0 RPP - 1.9.10 + 1.9.10 ⇒ 2.0.0 @@ -145,12 +614,12 @@ Click {fab}`github` to go to the component's source code on GitHub. Communication RCCL - 2.22.3 + 2.22.3 ⇒ 2.26.6 rocSHMEM - 2.0.1 + 2.0.1 ⇒ 3.0.0 @@ -159,83 +628,83 @@ Click {fab}`github` to go to the component's source code on GitHub. Math hipBLAS - 2.4.0 - + 2.4.0 ⇒ 3.0.0 + hipBLASLt - 0.12.1 - + 0.12.1 ⇒ 1.0.0 + hipFFT - 1.0.18 - + 1.0.18 ⇒ 1.0.20 + hipfort - 0.6.0 + 0.6.0 ⇒ 0.7.0 hipRAND - 2.12.0 - + 2.12.0 ⇒ 3.0.0 + hipSOLVER - 2.4.0 + 2.4.0 ⇒ 3.0.0 hipSPARSE - 3.2.0 - + 3.2.0 ⇒ 4.0.1 + hipSPARSELt - 0.2.3 - + 0.2.3 ⇒ 0.2.4 + rocALUTION - 3.2.3 + 3.2.3 ⇒ 4.0.0 rocBLAS - 4.4.1 - + 4.4.1 ⇒ 5.0.0 + rocFFT - 1.0.32 - + 1.0.32 ⇒ 1.0.34 + rocRAND - 3.3.0 - + 3.3.0 ⇒ 4.0.0 + rocSOLVER - 3.28.2 + 3.28.2 ⇒ 3.30.0 rocSPARSE - 3.4.0 - + 3.4.0 ⇒ 4.0.2 + rocWMMA - 1.7.0 + 1.7.0 ⇒ 2.0.0 Tensile - 4.43.0 - + 4.43.0 ⇒ 4.44.0 + @@ -243,23 +712,23 @@ Click {fab}`github` to go to the component's source code on GitHub. Primitives hipCUB - 3.4.0 - + 3.4.0 ⇒ 4.0.0 + hipTensor - 1.5.0 + 1.5.0 ⇒ 2.0.0 rocPRIM - 3.4.1 - + 3.4.1 ⇒ 4.0.0 + rocThrust - 3.3.0 - + 3.3.0 ⇒ 4.0.0 + @@ -267,12 +736,12 @@ Click {fab}`github` to go to the component's source code on GitHub. Tools System management AMD SMI - 25.5.1 + 25.5.1 ⇒ 26.0.0 ROCm Data Center Tool - 0.3.0 + 0.3.0 ⇒ 1.1.0 @@ -282,12 +751,12 @@ Click {fab}`github` to go to the component's source code on GitHub. 
ROCm SMI - 7.5.0 ⇒ 7.7.0 + 7.7.0 ⇒ 7.8.0 ROCm Validation Suite - 1.1.0 + 1.1.0 ⇒ 1.2.0 @@ -297,19 +766,19 @@ Click {fab}`github` to go to the component's source code on GitHub. Performance ROCm Bandwidth Test - 1.4.0 + 1.4.0 ⇒ 2.6.0 ROCm Compute Profiler - 3.1.1 + 3.1.1 ⇒ 3.2.3 ROCm Systems Profiler - 1.0.2 + 1.0.2 ⇒ 1.1.0 @@ -321,7 +790,7 @@ Click {fab}`github` to go to the component's source code on GitHub. ROCprofiler-SDK - 0.6.0 + 0.6.0 ⇒ 1.0.0 @@ -337,13 +806,13 @@ Click {fab}`github` to go to the component's source code on GitHub. Development HIPIFY - 19.0.0 + 19.0.0 ⇒ 20.0.0 ROCdbgapi - 0.77.2 + 0.77.2 ⇒ 0.77.3 @@ -356,14 +825,14 @@ Click {fab}`github` to go to the component's source code on GitHub. ROCm Debugger (ROCgdb) - 15.2 + 15.2 ⇒ 16.3 ROCr Debug Agent - 2.0.4 + 2.0.4 ⇒ 2.1.0 @@ -378,7 +847,7 @@ Click {fab}`github` to go to the component's source code on GitHub. llvm-project - 19.0.0 + 19.0.0 ⇒ 20.0.0 @@ -387,12 +856,12 @@ Click {fab}`github` to go to the component's source code on GitHub. Runtimes HIP - 6.4.3 + 6.4.3 ⇒ 7.0.0 ROCr Runtime - 1.15.0 + 1.15.0 ⇒ 1.18.0 @@ -407,29 +876,1721 @@ The following sections describe key changes to ROCm components. For a historical overview of ROCm component updates, see the {doc}`ROCm consolidated changelog `. ``` -### **ROCm SMI** (7.7.0) +### **AMD SMI** (26.0.0) #### Added -- Support for getting the GPU Board voltage. +* Ability to restart the AMD GPU driver from the CLI and API. + - `amdsmi_gpu_driver_reload()` API and `amd-smi reset --reload-driver` or `amd-smi reset -r` CLI options. + - Driver reload functionality is now separated from memory partition + functions; memory partition change requests should now be followed by a driver reload. + - Driver reload requires all GPU activity on all devices to be stopped. + +* Default command: + + A default view has been added. The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. 
Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through alternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`. + +* Support for GPU metrics 1.8: + - Added new fields for `amdsmi_gpu_xcp_metrics_t` including: + - Metrics to allow new calculations for violation status: + - Per XCP metrics `gfx_below_host_limit_ppt_acc[XCP][MAX_XCC]` - GFX Clock Host limit Package Power Tracking violation counts. + - Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts. + - Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for low utilization causing the GPU to run below application clocks. + - Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see the new violation metrics above). + - Increased available JPEG engines to 40. Current ASICs might not support all 40. Unsupported engines are indicated as `UINT16_MAX` or `N/A` in the CLI. + +* Bad page threshold count. + - Added `amdsmi_get_gpu_bad_page_threshold` to Python API and CLI; root/sudo permissions are required to display the count. + +* CPU model name for RDC. + - Added new C and Python API `amdsmi_get_cpu_model_name`. + - Not sourced from esmi library. + +* New API `amdsmi_get_cpu_affinity_with_scope()`. + +* `socket power` to `amdsmi_get_power_info` + - Previously, the C API had the value in the `amdsmi_power_info` structure, but it was unused. + - The value is representative of the socket's power agnostic of the GPU version. + +* New event notification types to `amdsmi_evt_notification_type_t`.
+ The following values were added to the `amdsmi_evt_notification_type_t` enum: + - `AMDSMI_EVT_NOTIF_EVENT_MIGRATE_START` + - `AMDSMI_EVT_NOTIF_EVENT_MIGRATE_END` + - `AMDSMI_EVT_NOTIF_EVENT_PAGE_FAULT_START` + - `AMDSMI_EVT_NOTIF_EVENT_PAGE_FAULT_END` + - `AMDSMI_EVT_NOTIF_EVENT_QUEUE_EVICTION` + - `AMDSMI_EVT_NOTIF_EVENT_QUEUE_RESTORE` + - `AMDSMI_EVT_NOTIF_EVENT_UNMAP_FROM_GPU` + - `AMDSMI_EVT_NOTIF_PROCESS_START` + - `AMDSMI_EVT_NOTIF_PROCESS_END` + +- Power cap to `amd-smi monitor`. + - `amd-smi monitor -p` will display the power cap along with power. + +#### Changed + +* Separated driver reload functionality from `amdsmi_set_gpu_memory_partition()` and + `amdsmi_set_gpu_memory_partition_mode()` APIs -- and from the CLI `amd-smi set -M `. + +* Disabled `amd-smi monitor --violation` on guests. Modified `amd-smi metric -T/--throttle` to alias to `amd-smi metric -v/--violation`. + +* Updated `amdsmi_get_clock_info` in `amdsmi_interface.py`. + - The `clk_deep_sleep` field now returns the sleep integer value. + +* The `amd-smi topology` command has been enabled for guest environments. + - This includes full functionality so users can use the command just as they would in bare metal environments. + +* Expanded violation status tracking for GPU metrics 1.8. + - The driver will no longer be supporting existing single-value GFX clock below host limit fields (`acc_gfx_clk_below_host_limit`, `per_gfx_clk_below_host_limit`, `active_gfx_clk_below_host_limit`), they are now changed in favor of new per-XCP/XCC arrays. 
+ - Added new fields to `amdsmi_violation_status_t` and related interfaces for enhanced violation breakdown: + - Per-XCP/XCC accumulators and status for: + - GFX clock below host limit (power, thermal, and total) + - Low utilization + - Added 2D arrays to track per-XCP/XCC accumulators, percentage, and active status: + - `acc_gfx_clk_below_host_limit_pwr`, `acc_gfx_clk_below_host_limit_thm`, `acc_gfx_clk_below_host_limit_total` + - `per_gfx_clk_below_host_limit_pwr`, `per_gfx_clk_below_host_limit_thm`, `per_gfx_clk_below_host_limit_total` + - `active_gfx_clk_below_host_limit_pwr`, `active_gfx_clk_below_host_limit_thm`, `active_gfx_clk_below_host_limit_total` + - `acc_low_utilization`, `per_low_utilization`, `active_low_utilization` + - Python API and CLI now report these expanded fields. + +* The char arrays in the following structures have been changed. + - `amdsmi_vbios_info_t` member `build_date` changed from `AMDSMI_MAX_DATE_LENGTH` to `AMDSMI_MAX_STRING_LENGTH`. + - `amdsmi_dpm_policy_entry_t` member `policy_description` changed from `AMDSMI_MAX_NAME` to `AMDSMI_MAX_STRING_LENGTH`. + - `amdsmi_name_value_t` member `name` changed from `AMDSMI_MAX_NAME` to `AMDSMI_MAX_STRING_LENGTH`. + +* For backwards compatibility, updated `amdsmi_bdf_t` union to have an identical unnamed struct. + +* Updated `amdsmi_get_temp_metric` and `amdsmi_temperature_type_t` with new values. + - Added new values to `amdsmi_temperature_type_t` representing various baseboard and GPU board temperature measures. + - Updated `amdsmi_get_temp_metric` API to be able to take in and return the respective values for the new temperature types. + +#### Removed + +- Unnecessary API, `amdsmi_free_name_value_pairs()` + - This API is only used internally to free up memory from the Python interface and does not need to be + exposed to the user. 
+ +- Unused definitions: + - `AMDSMI_MAX_NAME`, `AMDSMI_256_LENGTH`, `AMDSMI_MAX_DATE_LENGTH`, `MAX_AMDSMI_NAME_LENGTH`, `AMDSMI_LIB_VERSION_YEAR`, + `AMDSMI_DEFAULT_VARIANT`, `AMDSMI_MAX_NUM_POWER_PROFILES`, `AMDSMI_MAX_DRIVER_VERSION_LENGTH`. + +- Unused member `year` in struct `amdsmi_version_t`. + +- `amdsmi_io_link_type_t` has been replaced with `amdsmi_link_type_t`. + - `amdsmi_io_link_type_t` is no longer needed as `amdsmi_link_type_t` is sufficient. + - `amdsmi_link_type_t` enum has changed; primarily, the ordering of the PCI and XGMI types. + - This change also affects `amdsmi_link_metrics_t`, where the `link_type` field changes from `amdsmi_io_link_type_t` to `amdsmi_link_type_t`. + +- `amdsmi_get_power_info_v2()`. + - `amdsmi_get_power_info()` has been unified and the v2 function is no longer needed or used. + +- `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`. + +- The `amdsmi_get_gpu_vram_info` now provides vendor names as a string. + - `amdsmi_vram_vendor_type_t` enum structure is removed. + - `amdsmi_vram_info_t` member named `amdsmi_vram_vendor_type_t` is changed to a character string. + - `amdsmi_get_gpu_vram_info` no longer requires decoding the vendor name as an enum. + +- Backwards compatibility for the `jpeg_activity` and `vcn_activity` fields of `amdsmi_get_gpu_metrics_info()`. Use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy` instead. + - Backwards compatibility is removed for the `jpeg_activity` and `vcn_activity` fields if the `jpeg_busy` or `vcn_busy` field is available. + - Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion about which field to use. By removing backward compatibility, it is easier to identify the relevant field. + - The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`.
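
As a rough illustration of moving from `jpeg_activity`/`vcn_activity` to the per-partition `jpeg_busy`/`vcn_busy` values, the following sketch filters out engines that report the `UINT16_MAX` sentinel. It does not call AMD SMI. The dictionary shape (`{"xcp_stats": [{"jpeg_busy": [...], "vcn_busy": [...]}]}`), the helper name `busy_by_partition`, and the sample numbers are all assumptions made for this example; check the AMD SMI changelog for the actual return type before adapting it.

```python
UINT16_MAX = 0xFFFF  # sentinel the changelog says marks engines an ASIC does not support


def busy_by_partition(metrics, engine="jpeg"):
    """Collect per-partition busy values for `engine` ("jpeg" or "vcn").

    `metrics` is a plain dict in a HYPOTHETICAL shape, not the real AMD SMI
    return type: {"xcp_stats": [{"jpeg_busy": [...], "vcn_busy": [...]}, ...]}.
    Entries equal to UINT16_MAX are dropped as "not supported".
    """
    if engine not in ("jpeg", "vcn"):
        raise ValueError(f"unknown engine: {engine!r}")
    key = f"{engine}_busy"
    result = []
    for xcp in metrics.get("xcp_stats", []):
        values = xcp.get(key, [])
        # Keep the engine index so callers can tell which slots are populated.
        result.append({i: v for i, v in enumerate(values) if v != UINT16_MAX})
    return result


# Made-up sample: two partitions, 40 JPEG slots each, most of them unsupported.
sample = {
    "xcp_stats": [
        {"jpeg_busy": [12, 0] + [UINT16_MAX] * 38, "vcn_busy": [7, UINT16_MAX]},
        {"jpeg_busy": [UINT16_MAX] * 40, "vcn_busy": [3, UINT16_MAX]},
    ]
}

jpeg = busy_by_partition(sample, "jpeg")
vcn = busy_by_partition(sample, "vcn")
print(jpeg)
print(vcn)
```

Keeping the engine index in the result, rather than returning a compacted list, makes it possible to tell a populated slot from an unsupported one when fewer than 40 JPEG engines are present.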
+ +#### Optimized + +- Reduced the number of ``amd-smi`` CLI API calls needed before reading or (re)setting GPU features. This + improves overall runtime performance of the CLI. + +- Removed partition information from the default `amd-smi static` CLI command. + - Users can still retrieve the same data by calling `amd-smi`, `amd-smi static -p`, or `amd-smi partition -c -m`/`sudo amd-smi partition -a`. + - Reading `current_compute_partition` may momentarily wake the GPU up. This is due to reading XCD registers, which is expected behavior. Changing partitions is not a trivial operation; the `current_compute_partition` SYSFS controls this action. + +- Optimized CLI command `amd-smi topology` in partition mode. + - Reduced the number of `amdsmi_topo_get_p2p_status` API calls to one fourth. + +#### Resolved issues + +- Removed duplicated GPU IDs when receiving events using the `amd-smi event` command. + +- Fixed `amd-smi monitor` decoder utilization (`DEC%`) not showing up on MI300 Series ASICs. + +#### Known issues + +- `amd-smi monitor` on Linux Guest systems triggers an attribute error. ```{note} -See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions. +See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.0/CHANGELOG.md) for details, examples, and in-depth descriptions. ``` +### **Composable Kernel** (1.1.0) + +#### Added + +* Support for `BF16`, `F32`, and `F16` for 2D and 3D NGCHW grouped convolution backward data. +* Fully asynchronous HOST (CPU) arguments copy flow for CK grouped GEMM kernels. +* Support for GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW). The number of instances in the instance factory for NGCHW/GKYXC/NGKHW has been reduced. +* Support for GKCYX layout for grouped convolution backward weight (NGCHW/GKCYX/NGKHW).
+* Support for GKCYX layout for grouped convolution backward data (NGCHW/GKCYX/NGKHW). +* Support for Stream-K version of mixed `FP8` / `BF16` GEMM. +* Support for Multiple D GEMM. +* GEMM pipeline for microscaling (MX) `FP8` / `FP6` / `FP4` data types. +* Support for `FP16` 2:4 structured sparsity to universal GEMM. +* Support for Split K for grouped convolution backward data. +* Logit soft-capping support for fMHA forward kernels. +* Support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv). +* Benchmarking support for tile engine GEMM. +* Ping-pong scheduler support for GEMM operation along the K dimension. +* Rotating buffer feature for CK_Tile GEMM. +* `int8` support for CK_TILE GEMM. +* Vectorize Transpose optimization for CK Tile. +* Asynchronous copy for gfx950. + +#### Changed + +* Replaced the raw buffer load/store intrinsics with Clang20 built-ins. +* DL and DPP kernels are now enabled by default. +* Number of instances in instance factory for grouped convolution forward NGCHW/GKYXC/NGKHW has been reduced. +* Number of instances in instance factory for grouped convolution backward weight NGCHW/GKYXC/NGKHW has been reduced. +* Number of instances in instance factory for grouped convolution backward data NGCHW/GKYXC/NGKHW has been reduced. + +#### Removed + +* Removed support for gfx940 and gfx941 targets. + +#### Optimized + +* Optimized the GEMM multiply preshuffle and lds bypass with Pack of KGroup and better instruction layout. + +### **HIP** 7.0.0 + +#### Added + +* New HIP APIs + - `hipLaunchKernelEx` dispatches the provided kernel with the given launch configuration and forwards the kernel arguments. + - `hipLaunchKernelExC` launches a HIP kernel using a generic function pointer and the specified configuration. + - `hipDrvLaunchKernelEx` dispatches the device kernel represented by a HIP function object. + - `hipMemGetHandleForAddressRange` gets a handle for the address range requested. 
+ - `__reduce_add_sync`, `__reduce_min_sync`, and `__reduce_max_sync` functions added for arithmetic reduction across lanes of a warp, and `__reduce_and_sync`, `__reduce_or_sync`, and `__reduce_xor_sync` +functions added for logical reduction. For details, see [Warp cross-lane functions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-cross-lane-functions). +* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as follows. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html). + - Data types for `FP4`/`FP6`/`FP8`. + - HIP APIs for `FP4`/`FP6`/`FP8`, which are compatible with corresponding CUDA APIs. + - HIP Extensions APIs for microscaling formats, which are supported on AMD GPUs. +* New `wptr` and `rptr` values in `ClPrint`, for better logging in dispatch barrier methods. +* The `_sync()` versions of cross-lane builtins such as `shfl_sync()` are enabled by default. These can be disabled by setting the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`. +* Added `constexpr` operators for `fp16`/`bf16`. +* Added warp level primitives: `__syncwarp` and reduce intrinsics (for example, `__reduce_add_sync()`). +* Support for the following API flags, which allow uncached memory allocation: + - `hipExtHostRegisterUncached`, used in `hipHostRegister`. + - `hipHostMallocUncached` and `hipHostAllocUncached`, used in `hipHostMalloc` and `hipHostAlloc`. +* `num_threads`, the total number of threads in the group. The legacy `size` API is an alias. +* Added PCI CHIP ID information as a device attribute. +* Added new test applications for OCP data types `FP4`/`FP6`/`FP8`. +* A new attribute in HIP runtime was implemented which exposes a new device capability of how many compute dies (chiplets, xcc) are available on a given GPU.
Developers can get this attribute via the API `hipDeviceGetAttribute` to make the best use of cache locality in a kernel and to optimize the kernel launch grid layout for better performance. + +#### Changed +* Some unsupported GPUs such as gfx9, gfx8 and gfx7 are deprecated on Microsoft Windows. +* Removal of beta warnings in HIP Graph APIs. All beta warnings in the usage of HIP Graph APIs are removed; the APIs are now officially and fully supported. +* `warpSize` has changed. +In order to match the CUDA specification, the `warpSize` variable is no longer `constexpr`. In general, this should be a transparent change; however, if an application was using `warpSize` as a compile-time constant, it will have to be updated to handle the new definition. For more information, see the discussion of `warpSize` within the [HIP C++ language extensions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize). +* Behavior changes + - `hipGetLastError` now returns the last actual error caught in the current thread during application execution. + - Additional input parameter validation checks are added for cooperative groups in the `hipLaunchCooperativeKernelMultiDevice` and `hipLaunchCooperativeKernel` functions. + - `hipPointerGetAttributes` returns `hipSuccess` instead of `hipErrorInvalidValue` when a `NULL` host or attribute pointer is passed as an input parameter. It now matches the behavior of `cudaPointerGetAttributes`, which changed in CUDA 11 and later releases. + - `hipFree`: previously, an implicit wait applied to all memory allocations for synchronization purposes. This wait is now disabled for allocations made with `hipMallocAsync` and `hipMallocFromPoolAsync`, to match the behavior of the CUDA API `cudaFree`. + - `hipFreeAsync` now returns `hipSuccess` when the input pointer is NULL, instead of `hipErrorInvalidValue`, to be consistent with `hipFree`.
+ - Exceptions occurring during a kernel execution will not abort the process anymore but will return an error unless core dump is enabled. +* Changes in hipRTC. + - Removal of `hipRTC` symbols from HIP Runtime Library. + Any application using `hipRTC` APIs should link explicitly with the `hipRTC` library. This makes the usage of `hipRTC` library on Linux the same as on Windows and matches the behavior of CUDA `nvRTC`. + - `hipRTC` compilation + The device code compilation now uses the namespace `__hip_internal` instead of the standard headers (`std`), to avoid namespace collisions. + - Changes of datatypes from `hipRTC`. + Datatype definitions such as `int64_t`, `uint64_t`, `int32_t`, and `uint32_t` are removed to avoid potential conflicts in some applications. HIP now uses internal datatypes instead, prefixed with `__hip`, for example, `__hip_int64_t`. +* HIP header clean up + - Usage of STD headers: HIP header files now include only the necessary STL headers. + - Deprecated structure `HIP_MEMSET_NODE_PARAMS` is removed. Developers can use the definition `hipMemsetParams` instead. +* API signature/struct changes + - API signatures are adjusted in some APIs to match corresponding CUDA APIs. Impacted APIs are as follows: + * `hiprtcCreateProgram` + * `hiprtcCompileProgram` + * `hipMemcpyHtoD` + * `hipCtxGetApiVersion` + - HIP struct change in `hipMemsetParams`: it is updated and compatible with CUDA. + - HIP vector constructor change in `hipComplex` initialization now generates correct values. The affected constructors are those of small vector types such as `float2` and `int4`. +* Stream Capture updates + - Restricted stream capture mode: the restriction is implemented in HIP APIs by adding the macro `CHECK_STREAM_CAPTURE_SUPPORTED()`. +In the previous HIP enumeration `hipStreamCaptureMode`, three capture modes were defined. With the check in the macro, the only supported stream capture mode is now `hipStreamCaptureModeRelaxed`.
The rest are not supported, and the macro will return `hipErrorStreamCaptureUnsupported`. This update involves the following APIs, which are allowed only in relaxed stream capture mode: + * `hipMallocManaged` + * `hipMemAdvise` + - Checks stream capture mode: the following APIs check the stream capture mode and return error codes to match the behavior of CUDA. + * `hipLaunchCooperativeKernelMultiDevice` + * `hipEventQuery` + * `hipStreamAddCallback` + - Returns error during stream capture. The following HIP APIs now return the specific error `hipErrorStreamCaptureUnsupported` on the AMD platform, rather than always returning `hipSuccess`, to match the behavior of CUDA: + * `hipDeviceSetMemPool` + * `hipMemPoolCreate` + * `hipMemPoolDestroy` + * `hipDeviceSetSharedMemConfig` + * `hipDeviceSetCacheConfig` + * `hipMemcpyWithStream` +* Error code update +Returned error/value codes are updated in the following HIP APIs to match the corresponding CUDA APIs. + - Module Management Related APIs: + * `hipModuleLaunchKernel` + * `hipExtModuleLaunchKernel` + * `hipExtLaunchKernel` + * `hipDrvLaunchKernelEx` + * `hipLaunchKernel` + * `hipLaunchKernelExC` + * `hipModuleLaunchCooperativeKernel` + * `hipModuleLoad` + - Texture Management Related APIs: +The following APIs update the return codes to match the behavior of CUDA: + * `hipTexObjectCreate` supports zero width and height for a 2D image. If either is zero, it does not return `false`. + * `hipBindTexture2D` adds an extra check: if the pointer for the texture reference or device is NULL, it returns `hipErrorNotFound`. + * `hipBindTextureToArray` returns `hipErrorInvalidChannelDescriptor`, instead of `hipErrorInvalidValue`, if a NULL pointer is passed for the texture object, resource descriptor, or texture descriptor. + * `hipGetTextureAlignmentOffset` adds the return code `hipErrorInvalidTexture` when the texture reference pointer is NULL.
+ - Cooperative Group Related APIs: more validations are added in the implementation of the following APIs: + * `hipLaunchCooperativeKernelMultiDevice` + * `hipLaunchCooperativeKernel` +* Invalid stream input parameter handling +In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Previously, the HIP runtime returned an error code `hipErrorContextIsDestroyed` if the stream was invalid. In CUDA version 12 and later, the equivalent behavior is to raise a segmentation fault. HIP runtime now matches CUDA by causing a segmentation fault. The APIs impacted by this change are as follows: + - Stream Management Related APIs + * `hipStreamGetCaptureInfo` + * `hipStreamGetPriority` + * `hipStreamGetFlags` + * `hipStreamDestroy` + * `hipStreamAddCallback` + * `hipStreamQuery` + * `hipLaunchHostFunc` + - Graph Management Related APIs + * `hipGraphUpload` + * `hipGraphLaunch` + * `hipStreamBeginCaptureToGraph` + * `hipStreamBeginCapture` + * `hipStreamIsCapturing` + * `hipStreamGetCaptureInfo` + * `hipGraphInstantiateWithParams` + - Memory Management Related APIs + * `hipMemcpyPeerAsync` + * `hipMemcpy2DValidateParams` + * `hipMallocFromPoolAsync` + * `hipFreeAsync` + * `hipMallocAsync` + * `hipMemcpyAsync` + * `hipMemcpyToSymbolAsync` + * `hipStreamAttachMemAsync` + * `hipMemPrefetchAsync` + * `hipDrvMemcpy3D` + * `hipDrvMemcpy3DAsync` + * `hipDrvMemcpy2DUnaligned` + * `hipMemcpyParam2D` + * `hipMemcpyParam2DAsync` + * `hipMemcpy2DArrayToArray` + * `hipMemcpy2D` + * `hipMemcpy2DAsync` + * `hipMemcpy3D` + - Event Management Related APIs + * `hipEventRecord` + * `hipEventRecordWithFlags` + +#### Optimized + +HIP runtime has the following functional improvements, which improve runtime performance and user experience: + +* Reduced usage of the lock scope in events and kernel handling.
+ - Switches to `shared_mutex` for event validation and uses `std::unique_lock` in HIP runtime to create/destroy events, instead of `scopedLock`. + - Reduces the use of `scopedLock` in the handling of kernel execution. HIP runtime now calls `scopedLock` during kernel binary creation/initialization and doesn't call it again during kernel vector iteration before launch. +* Unified the managed buffer and kernel argument buffer so HIP runtime doesn't need to create/load a separate kernel argument buffer. +* Refactored memory validation by creating a single function to validate a variety of memory copy operations. +* Improved kernel logging by demangling shader names. +* Advanced support for SPIRV: kernel compilation caching is now enabled by default. This feature is controlled by the environment variable `AMD_COMGR_CACHE`. For details, see [hip_rtc document](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_rtc.html). +* Programmatic support for scratch limits on the AMD Instinct MI300 and MI350 Series GPU devices. More enumeration values were added in `hipLimit_t` as follows: + - `hipExtLimitScratchMin`, minimum allowed value in bytes for scratch limit on the device. + - `hipExtLimitScratchMax`, maximum allowed value in bytes for scratch limit on the device. + - `hipExtLimitScratchCurrent`, current scratch limit threshold in bytes on the device. Must be between the values of `hipExtLimitScratchMin` and `hipExtLimitScratchMax`. + Developers can now use the environment variable `HSA_SCRATCH_SINGLE_LIMIT_ASYNC` to change the default allocation size with the expected scratch limit in ROCR runtime. In addition, this value can be overwritten programmatically in the application using the HIP API `hipDeviceSetLimit(hipExtLimitScratchCurrent, value)` to reset the scratch limit value. +* HIP runtime now enables peer-to-peer (P2P) memory copies to utilize all available SDMA engines, rather than being limited to a single engine.
It also selects the best engine first to give optimal bandwidth. +* Improved launch latency for `D2D` copies and `memset` on MI300 Series. +* Introduced a threshold to handle the command submission patch to the GPU device(s), considering the synchronization with CPU, for performance improvement. + +#### Resolved issues + +* Error of "unable to find modules" in HIP clean up for code object module. +* Incorrect return error `hipErrorNoDevice` when a crash occurred on a GPU device due to an illegal operation or memory violation. HIP runtime now handles the failure on the GPU side properly and reports the precise error code based on the last error seen on the GPU. +* Failures in some framework test applications. HIP runtime fixed the bug in retrieving a memory object from the IPC memory handle. +* A crash in a TensorFlow-related application. HIP runtime now combines multiple definitions of `callbackQueue` into a single function and, in case of an exception, passes its handler to the application and provides the corresponding error code. +* Fixed issue of handling the kernel parameters for the graph launch. +* Failures in roc-obj tools. HIP runtime now prints the `DEPRECATED` message in roc-obj tools to `STDERR`. +* Support of `hipDeviceMallocContiguous` flags in `hipExtMallocWithFlags()`. It now enables `HSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAG` in the memory pool allocation on GPU device. +* Compilation failure. HIP runtime refactored the vector type alignment with `__hip_vec_align_v`. +* A numerical error/corruption found in PyTorch during graph replay. HIP runtime fixed the input sizes of kernel launch dimensions in `hipExtModuleLaunchKernel` for the execution of hipGraph capture. +* A crash during kernel execution in a customer application. The structure of kernel arguments was updated by adding the size of the kernel arguments, and HIP runtime validates the structured arguments before launching the kernel. +* Compilation error when using bfloat16 functions.
HIP runtime removed the anonymous namespace from FP16 functions to resolve this issue. + +#### Known issues + +* `hipLaunchHostFunc` returns an error during stream capture. Any application using `hipLaunchHostFunc` might fail to capture graphs during stream capture, instead, it returns `hipErrorStreamCaptureUnsupported`. +* Compilation failure in kernels via hiprtc when using option `std=c++11`. + +### **hipBLAS** (3.0.0) + +#### Added + +* Added the `hipblasSetWorkspace()` API. +* Support for codecoverage tests. + +#### Changed + +* HIPBLAS_V2 API is the only available API using the `hipComplex` and `hipDatatype` types. +* Documentation updates. +* Verbose compilation for `hipblas.cpp`. + +#### Removed + +* `hipblasDatatype_t` type. +* `hipComplex` and `hipDoubleComplex` types. +* Support code for non-production gfx targets. + +#### Resolved issues + +* The build time `CMake` configuration for the dependency on `hipBLAS-common` is fixed. +* Compiler warnings for unhandled enumerations have been resolved. + +### **hipBLASLt** (1.0.0) + +#### Added + +* Stream-K GEMM support has been enabled for the `FP32`, `FP16`, `BF16`, `FP8`, and `BF8` data types on the Instinct MI300A APU. To activate this feature, set the `TENSILE_SOLUTION_SELECTION_METHOD` environment variable to `2`, for example, `export TENSILE_SOLUTION_SELECTION_METHOD=2`. +* Fused Swish/SiLU GEMM (enabled by ``HIPBLASLT_EPILOGUE_SWISH_EXT`` and ``HIPBLASLT_EPILOGUE_SWISH_BIAS_EXT``). +* Support for ``HIPBLASLT_EPILOGUE_GELU_AUX_BIAS`` for gfx942. +* `HIPBLASLT_TUNING_USER_MAX_WORKSPACE` to constrain the maximum workspace size for user offline tuning. +* ``HIPBLASLT_ORDER_COL16_4R16`` and ``HIPBLASLT_ORDER_COL16_4R8`` to ``hipblasLtOrder_t`` to support `FP16`/`BF16` swizzle GEMM and `FP8` / `BF8` swizzle GEMM respectively. +* TF32 emulation on gfx950. +* Support for `FP6`, `BF6`, and `FP4` on gfx950. 
+* Support for block scaling by setting `HIPBLASLT_MATMUL_DESC_A_SCALE_MODE` and `HIPBLASLT_MATMUL_DESC_B_SCALE_MODE` to `HIPBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0`. + +#### Changed + +* The non-V2 APIs (``GemmPreference``, ``GemmProblemType``, ``GemmEpilogue``, ``GemmTuning``, ``GemmInputs``) in the cpp header are now the same as the V2 APIs (``GemmPreferenceV2``, ``GemmProblemTypeV2``, ``GemmEpilogueV2``, ``GemmTuningV2``, ``GemmInputsV2``). The original non-V2 APIs are removed. + +#### Removed + +* ``HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER_VEC_EXT`` and ``HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER_VEC_EXT`` are removed. Use the ``HIPBLASLT_MATMUL_DESC_A_SCALE_MODE`` and ``HIPBLASLT_MATMUL_DESC_B_SCALE_MODE`` attributes to set scalar (``HIPBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F``) or vector (``HIPBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F``) attributes. +* The `hipblasltExtAMaxWithScale` API is removed. + +#### Optimized + +* Improved performance for 8-bit (`FP8` / `BF8` / `I8`) NN/NT cases by adding ``s_delay_alu`` to reduce stalls from dependent ALU operations on gfx12+. +* Improved performance for 8-bit and 16-bit (`FP16` / `BF16`) TN cases by enabling software dependency checks (Expert Scheduling Mode) under certain restrictions to reduce redundant hardware dependency checks on gfx12+. +* Improved performance for 8-bit, 16-bit, and 32-bit batched GEMM with a better heuristic search algorithm for gfx942. + +#### Upcoming changes + +* V2 APIs (``GemmPreferenceV2``, ``GemmProblemTypeV2``, ``GemmEpilogueV2``, ``GemmTuningV2``, ``GemmInputsV2``) are deprecated. + +### **hipCUB** (4.0.0) + +#### Added + +* A new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. 
Without compression, in some cases, the generated binary may become so large that symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default. +* Single pass operators in `agent/single_pass_scan_operators.hpp` which contains the following API: + * `BlockScanRunningPrefixOp` + * `ScanTileStatus` + * `ScanTileState` + * `ReduceByKeyScanTileState` + * `TilePrefixCallbackOp` +* Support for gfx950. +* An overload of `BlockScan::InclusiveScan` that accepts an initial value to seed the scan. +* An overload of `WarpScan::InclusiveScan` that accepts an initial value to seed the scan. +* `UnrolledThreadLoad`, `UnrolledCopy`, and `ThreadLoadVolatilePointer` were added to align hipCUB with CUB. +* `ThreadStoreVolatilePtr` and the `IterateThreadStore` struct were added to align hipCUB with CUB. +* `hipcub::InclusiveScanInit` for CUB parity. + +#### Changed + +* The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.7.0. If they aren't found, they will be downloaded from the NVIDIA CCCL repository. +* Updated `thread_load` and `thread_store` to align hipCUB with CUB. +* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, (for example, `hipcub::HIPCUB_300400_NS::symbol` instead of `hipcub::symbol`), letting the user link multiple libraries built with different versions of hipCUB. +* Modified the broadcast kernel in warp scan benchmarks. The reported performance may be different to previous versions. +* The `hipcub::detail::accumulator_t` in rocPRIM backend has been changed to utilise `rocprim::accumulator_t`. +* The usage of `rocprim::invoke_result_binary_op_t` has been replaced with `rocprim::accumulator_t`. + +#### Removed + +* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you want to build for these architectures, specify them explicitly in the `AMDGPU_TARGETS` cmake option. 
+* Deprecated `hipcub::AsmThreadLoad` is removed; use `hipcub::ThreadLoad` instead. +* Deprecated `hipcub::AsmThreadStore` is removed; use `hipcub::ThreadStore` instead. +* Deprecated `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed. +* This release removes support for custom builds on gfx940 and gfx941. +* Removed C++14 support. Only C++17 is supported. + +#### Resolved issues + +* Fixed an issue where `Sort(keys, compare_op, valid_items, oob_default)` in `block_merge_sort.hpp` would not fill in elements that are out of range (items after `valid_items`) with `oob_default`. +* Fixed an issue where `ScatterToStripedFlagged` in `block_exchange.hpp` was calling the wrong function. + +#### Known issues + +* `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed from hipCUB's CUB backend. They were already deprecated as of version 2.12.0 of hipCUB and they were removed from CCCL (CUB) as of CCCL's 2.6.0 release. +* `BlockScan::InclusiveScan` for the NVIDIA backend does not compute the block aggregate correctly when passing an initial value parameter. This behavior is not matched by the AMD backend. + +#### Upcoming changes + +* `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` were deprecated as of version 2.12.0 of hipCUB, and will be removed from the rocPRIM backend in a future release for the next ROCm major version (ROCm 7.0.0). + +### **hipFFT** (1.0.20) + +#### Added + +* Support for gfx950. + +#### Removed + +* Removed hipfft-rider legacy compatibility from clients. +* Removed support for the gfx940 and gfx941 targets from the client programs. +* Removed backward compatibility symlink for include directories.
+ +### **hipfort** (0.7.0) + +#### Added + +* Documentation clarifying how hipfort is built for the NVIDIA platform. + +#### Changed + +* Updated and reorganized documentation for clarity and consistency. + +### **HIPIFY** (20.0.0) + +#### Added + +* CUDA 12.9.1 support. +* cuDNN 9.11.0 support. +* cuTENSOR 2.2.0.0 support. +* LLVM 20.1.8 support. + +#### Resolved issues + +* `hipDNN` support is removed by default. +* [#1859](https://github.com/ROCm/HIPIFY/issues/1859)[hipify-perl] Fix warnings on unsupported Driver or Runtime APIs which were erroneously not reported. +* [#1930](https://github.com/ROCm/HIPIFY/issues/1930) Revise `JIT API`. +* [#1962](https://github.com/ROCm/HIPIFY/issues/1962) Support for cuda-samples helper headers. +* [#2035](https://github.com/ROCm/HIPIFY/issues/2035) Removed `const_cast;` in `hiprtcCreateProgram` and `hiprtcCompileProgram`. + +### **hipRAND** (3.0.0) + +#### Added + +* Support for gfx950. + +#### Changed + +* Deprecated the hipRAND Fortran API in favor of hipfort. + +#### Removed + +* Removed C++14 support, so only C++17 is supported. + +### **hipSOLVER** (3.0.0) + +#### Added + +* Added compatibility-only functions: + * csrlsvqr + * `hipsolverSpCcsrlsvqr`, `hipsolverSpZcsrlsvqr` + +#### Resolved issues + +* Corrected the value of `lwork` returned by various `bufferSize` functions to be consistent with NVIDIA cuSOLVER. The following functions now return `lwork` so that the workspace size (in bytes) is `sizeof(T) * lwork`, rather than `lwork`. To restore the original behavior, set the environment variable `HIPSOLVER_BUFFERSIZE_RETURN_BYTES`. 
+ * `hipsolverXorgbr_bufferSize`, `hipsolverXorgqr_bufferSize`, `hipsolverXorgtr_bufferSize`, `hipsolverXormqr_bufferSize`, `hipsolverXormtr_bufferSize`, `hipsolverXgesvd_bufferSize`, `hipsolverXgesvdj_bufferSize`, `hipsolverXgesvdBatched_bufferSize`, `hipsolverXgesvdaStridedBatched_bufferSize`, `hipsolverXsyevd_bufferSize`, `hipsolverXsyevdx_bufferSize`, `hipsolverXsyevj_bufferSize`, `hipsolverXsyevjBatched_bufferSize`, `hipsolverXsygvd_bufferSize`, `hipsolverXsygvdx_bufferSize`, `hipsolverXsygvj_bufferSize`, `hipsolverXsytrd_bufferSize`, `hipsolverXsytrf_bufferSize`. + +### **hipSPARSE** (4.0.1) + +#### Added + +* `int8`, `int32`, and `float16` data types to `hipDataTypeToHCCDataType` so that sparse matrix descriptors can be used with them. +* Half float mixed precision to `hipsparseAxpby` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `hipsparseSpVV` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `hipsparseSpMM` where A and B use `float16` and C and the compute type use `float`. +* Half float mixed precision to `hipsparseSDDMM` where A and B use `float16` and C and the compute type use `float`. +* Half float uniform precision to the `hipsparseScatter` and `hipsparseGather` routines. +* Half float uniform precision to the `hipsparseSDDMM` routine. +* `int8` precision to the `hipsparseCsr2cscEx2` routine. +* The `almalinux` operating system name to correct the GFortran dependency. + +#### Changed + +* Switched to defaulting to C++17 when building hipSPARSE from source. Previously hipSPARSE was using C++14 by default. + +#### Resolved issues + +* Fixed a compilation [issue](https://github.com/ROCm/hipSPARSE/issues/555) related to using `std::filesystem` and C++14. +* Fixed an issue where the clients-common package was empty by moving the `hipsparse_clientmatrices.cmake` and `hipsparse_mtx2csr` files to it. 
+ +#### Known issues + +* In `hipsparseSpSM_solve()`, the external buffer is passed as a parameter. This does not match the NVIDIA CUDA cuSPARSE API. This extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and nullptr passed in because it is unused internally. + +### **hipSPARSELt** (0.2.4) + +#### Added + +* Support for the LLVM target gfx950. +* Support for the following data type combinations for the LLVM target gfx950: + * `FP8`(E4M3) inputs, `F32` output, and `F32` Matrix Core accumulation. + * `BF8`(E5M2) inputs, `F32` output, and `F32` Matrix Core accumulation. +* Support for ROC-TX if `HIPSPARSELT_ENABLE_MARKER=1` is set. +* Support for the cuSPARSELt v0.6.3 backend. + +#### Removed + +* Support for LLVM targets gfx940 and gfx941 has been removed. +* `hipsparseLtDatatype_t` has been removed. + +#### Optimized + +* Improved the library loading time. +* Provided more kernels for the `FP16` data type. + +### **hipTensor** (2.0.0) + +#### Added + +* Element-wise binary operation support. +* Element-wise trinary operation support. +* Support for GPU target gfx950. +* Dynamic unary and binary operator support for element-wise operations and permutation. +* CMake check for `f8` datatype availability. +* `hiptensorDestroyOperationDescriptor` to free all resources related to the provided descriptor. +* `hiptensorOperationDescriptorSetAttribute` to set attribute of a `hiptensorOperationDescriptor_t` object. +* `hiptensorOperationDescriptorGetAttribute` to retrieve an attribute of the provided `hiptensorOperationDescriptor_t` object. +* `hiptensorCreatePlanPreference` to allocate the `hiptensorPlanPreference_t` and enabled users to limit the applicable kernels for a given plan or operation. +* `hiptensorDestroyPlanPreference` to free all resources related to the provided preference. +* `hiptensorPlanPreferenceSetAttribute` to set attribute of a `hiptensorPlanPreference_t` object. 
+* `hiptensorPlanGetAttribute` to retrieve information about an already-created plan. +* `hiptensorEstimateWorkspaceSize` to determine the required workspace size for the given operation. +* `hiptensorCreatePlan` to allocate a `hiptensorPlan_t` object, select an appropriate kernel for a given operation and prepare a plan that encodes the execution. +* `hiptensorDestroyPlan` to free all resources related to the provided plan. + +#### Changed + +* Removed architecture support for gfx940 and gfx941. +* Generalized opaque buffer for any descriptor. +* Replaced `hipDataType` with `hiptensorDataType_t` for all supported types, for example, `HIP_R_32F` to `HIPTENSOR_R_32F`. +* Replaced `hiptensorComputeType_t` with `hiptensorComputeDescriptor_t` for all supported types. +* Replaced `hiptensorInitTensorDescriptor` with `hiptensorCreateTensorDescriptor`. +* Changed handle type and API usage from `*handle` to `handle`. +* Replaced `hiptensorContractionDescriptor_t` with `hiptensorOperationDescriptor_t`. +* Replaced `hiptensorInitContractionDescriptor` with `hiptensorCreateContraction`. +* Replaced `hiptensorContractionFind_t` with `hiptensorPlanPreference_t`. +* Replaced `hiptensorInitContractionFind` with `hiptensorCreatePlanPreference`. +* Replaced `hiptensorContractionGetWorkspaceSize` with `hiptensorEstimateWorkspaceSize`. +* Replaced `HIPTENSOR_WORKSPACE_RECOMMENDED` with `HIPTENSOR_WORKSPACE_DEFAULT`. +* Replaced `hiptensorContractionPlan_t` with `hiptensorPlan_t`. +* Replaced `hiptensorInitContractionPlan` with `hiptensorCreatePlan`. +* Replaced `hiptensorContraction` with `hiptensorContract`. +* Replaced `hiptensorPermutation` with `hiptensorPermute`. +* Replaced `hiptensorReduction` with `hiptensorReduce`. +* Replaced `hiptensorElementwiseBinary` with `hiptensorElementwiseBinaryExecute`. +* Replaced `hiptensorElementwiseTrinary` with `hiptensorElementwiseTrinaryExecute`. +* Removed function `hiptensorReductionGetWorkspaceSize`.
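
The hipTensor identifier renames listed above are one-to-one, so a first migration pass can be scripted. The following is a minimal sketch, not an official migration tool: the helper name `rename_hiptensor_identifiers` is hypothetical, the mapping contains only the pairs from the list above, and the target `hiptensorOperationDescriptor_t` is assumed to use the lowercase spelling that appears in the "Added" list. It only renames identifiers. Argument lists, the `*handle` to `handle` change, and the `hipDataType` to `hiptensorDataType_t` change still need manual review, because `hipDataType` is also used by other libraries.

```python
import re

# Old -> new identifier pairs copied from the hipTensor 2.0.0 "Changed" list.
# Assumption: the operation descriptor type is spelled in lowercase, as in
# the "Added" list of the same release.
HIPTENSOR_RENAMES = {
    "hiptensorInitTensorDescriptor": "hiptensorCreateTensorDescriptor",
    "hiptensorContractionDescriptor_t": "hiptensorOperationDescriptor_t",
    "hiptensorInitContractionDescriptor": "hiptensorCreateContraction",
    "hiptensorContractionFind_t": "hiptensorPlanPreference_t",
    "hiptensorInitContractionFind": "hiptensorCreatePlanPreference",
    "hiptensorContractionGetWorkspaceSize": "hiptensorEstimateWorkspaceSize",
    "HIPTENSOR_WORKSPACE_RECOMMENDED": "HIPTENSOR_WORKSPACE_DEFAULT",
    "hiptensorContractionPlan_t": "hiptensorPlan_t",
    "hiptensorInitContractionPlan": "hiptensorCreatePlan",
    "hiptensorContraction": "hiptensorContract",
    "hiptensorPermutation": "hiptensorPermute",
    "hiptensorReduction": "hiptensorReduce",
    "hiptensorElementwiseBinary": "hiptensorElementwiseBinaryExecute",
    "hiptensorElementwiseTrinary": "hiptensorElementwiseTrinaryExecute",
}

# Match whole identifiers only, so `hiptensorContraction` does not rewrite
# the prefix of `hiptensorContractionPlan_t`, and the removed function
# `hiptensorReductionGetWorkspaceSize` is left untouched for manual handling.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in HIPTENSOR_RENAMES) + r")\b"
)


def rename_hiptensor_identifiers(source: str) -> str:
    """Return `source` with each listed pre-2.0.0 identifier replaced."""
    return _PATTERN.sub(lambda match: HIPTENSOR_RENAMES[match.group(1)], source)


if __name__ == "__main__":
    old_code = "hiptensorContractionPlan_t plan; // built for hiptensorContraction"
    print(rename_hiptensor_identifiers(old_code))
```

Because the new names are never whole-identifier matches for the old ones, running the helper twice over the same file leaves it unchanged.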
+ +### **llvm-project** (20.0.0) + +#### Added + +* The compiler `-gsplit-dwarf` option to enable the generation of separate debug information file at compile time. When used, separate debug information files are generated for host and for each offload architecture. For additional information, see [DebugFission](https://gcc.gnu.org/wiki/DebugFission). +* `llvm-flang`, AMD's next-generation Fortran compiler. It's a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub. +* Comgr support for an in-memory virtual file system (VFS) for storing temporary files generated during intermediate compilation steps to improve performance in the device library link step. +* Compiler support of a new target-specific builtin `__builtin_amdgcn_processor_is` for late or deferred queries of the current target processor, and `__builtin_amdgcn_is_invocable` to determine the current target processor ability to invoke a particular builtin. +* HIPIFY support for NVIDIA CUDA 12.9.1 APIs. Added support for all new device and host APIs, including FP4, FP6, and FP128, and support for the corresponding ROCm HIP equivalents. + +#### Changed + +* Updated clang/llvm to AMD clang version 20.0.0 (equivalent to LLVM 20.0.0 with additional out-of-tree patches). +* HIPCC Perl scripts (`hipcc.pl` and `hipconfig.pl`) have been removed from this release. + +#### Optimized + +* Improved compiler memory load and store instructions. + +#### Upcoming changes + +* `__AMDGCN_WAVEFRONT_SIZE__` macro and HIP’s `warpSize` variable as `constexpr` are deprecated and will be disabled in a future release. Users are encouraged to update their code if needed to ensure future compatibility. For more information, see [AMDGCN_WAVEFRONT_SIZE deprecation](#amdgpu-wavefront-size-compiler-macro-deprecation). +* The `roc-obj-ls` and `roc-obj-extract` tools are deprecated. To extract all Clang offload bundles into separate code objects use `llvm-objdump --offloading `. 
For more information, see [Changes to ROCm Object Tooling](#changes-to-rocm-object-tooling). + +### **MIGraphX** (2.13.0) + +#### Added + +* Support for OCP `FP8` on AMD Instinct MI350X GPUs. +* Support for PyTorch 2.7 via Torch-MIGraphX. +* Support for the Microsoft ONNX Contrib Operators (Self) Attention, RotaryEmbedding, QuickGelu, BiasAdd, BiasSplitGelu, SkipLayerNorm. +* Support for Sigmoid and AddN TensorFlow operators. +* GroupQuery Attention support for LLMs. +* Support for edge mode in the ONNX Pad operator. +* ONNX runtime Python driver. +* FLUX e2e example. +* C++ and Python APIs to save arguments to a graph as a msgpack file, and then read the file back. +* rocMLIR fusion for kv-cache attention. +* Introduced a check for file-write errors. + +#### Changed + +* `quantize_bf16` for quantizing the model to `BF16` has been made visible in the MIGraphX user API. +* Print additional kernel/module information in the event of compile failure. +* Use hipBLASLt instead of rocBLAS on newer GPUs. +* 1x1 convolutions are now rewritten to GEMMs. +* `BF16::max` is now represented by its encoding rather than its expected value. +* Direct warnings now go to `cout` rather than `cerr`. +* `FP8` uses hipBLASLt rather than rocBLAS. +* ONNX models are now topologically sorted when nodes are unordered. +* Improved layout of Graphviz output. +* Enhanced debugging for migraphx-driver: consumed environment variables are printed, timestamps and duration are added to the summary. +* Add a trim size flag to the verify option for migraphx-driver. +* Node names are printed to track parsing within the ONNX graph when using the `MIGRAPHX_TRACE_ONNX_PARSER` flag. +* Update accuracy checker to output test data with the `--show-test-data` flag. +* The `MIGRAPHX_TRACE_BENCHMARKING` option now allows the problem cache file to be updated after finding the best solution. + +#### Removed + +* `ROCM_USE_FLOAT8` macro.
+* The `BF16` GEMM test was removed for Navi21, as it is unsupported by rocBLAS and hipBLASLt on that platform. + +#### Optimized + +* Use common average in `compile_ops` to reduce run-to-run variations when tuning. +* Improved the performance of the TopK operator. +* Conform to a single layout (NHWC or NCHW) during compilation rather than combining two. +* Slice Channels Conv Optimization (slice output fusion). +* Horizontal fusion optimization after pointwise operations. +* Reduced the number of literals used in `GridSample` linear sampler. +* Fuse multiple outputs for pointwise operations. +* Fuse reshapes on pointwise inputs for MLIR output fusion. +* MUL operation not folded into the GEMM when the GEMM is used more than once. +* Broadcast not fused after convolution or GEMM MLIR kernels. +* Avoid reduction fusion when operator data-types mismatch. + +#### Resolved issues + +* Compilation workaround ICE in clang 20 when using `views::transform`. +* Fix bug with `reshape_lazy` in MLIR. +* Quantizelinear fixed for Nearbyint operation. +* Check for empty strings in ONNX node inputs for operations like Resize. +* Parse Resize fix: only check `keep_aspect_ratio_policy` attribute for sizes input. +* Nonmaxsuppression: fixed issue where identical boxes/scores not ordered correctly. +* Fixed a bug where events were created on the wrong device in a multi-gpu scenario. +* Fixed out of order keys in value for comparisons and hashes when caching best kernels. +* Fixed Controlnet MUL types do not match error. +* Fixed check for scales if ROI input is present in Resize operation. +* Einsum: Fixed a crash on empty squeeze operations. + +### **MIOpen** (3.5.0) + +#### Added + +* [Conv] Misa kernels for gfx950. +* [Conv] Enabled Split-K support for CK backward data solvers (2D). +* [Conv] Enabled CK wrw solver on gfx950 for the `BF16` data type. +* [BatchNorm] Enabled NHWC in OpenCL. +* Grouped convolution + activation fusion. +* Grouped convolution + bias + activation fusion. 
+* Composable Kernel (CK) can now be built inline as part of MIOpen. + +#### Changed + +* Changed to using the median value with outliers removed when deciding on the best solution to run. +* [Conv] Updated the igemm asm solver. + +#### Optimized + +* [BatchNorm] Optimized NHWC OpenCL kernels and improved heuristics. +* [RNN] Dynamic algorithm optimization. +* [Conv] Eliminated redundant clearing of output buffers. +* [RNN] Updated selection heuristics. +* Updated tuning for the AMD Instinct MI300 Series. + +#### Resolved issues + +* Fixed a segmentation fault when the user specified a smaller workspace than what was required. +* Fixed a layout calculation logic error that returned incorrect results and enabled less restrictive layout selection. +* Fixed memory access faults in misa kernels due to out-of-bounds memory usage. +* Fixed a performance drop on the gfx950 due to transpose kernel use. +* Fixed a memory access fault caused by not allocating enough workspace. +* Fixed a name typo that caused kernel mismatches and long startup times. + +### **MIVisionX** (3.3.0) + +#### Added + +* Support to enable/disable BatchPD code in VX_RPP extensions by checking the RPP_LEGACY_SUPPORT flag. + +#### Changed + +* VX_RPP extension : Version 3.1.0 release. +* Update the parameters and kernel API of Blur, Fog, Jitter, LensCorrection, Rain, Pixelate, Vignette and ResizeCrop wrt tensor kernels replacing the legacy BatchPD API calls in VX_RPP extensions. + +#### Known issues + +* Installation on RHEL and SLES requires the manual installation of the `FFMPEG` and `OpenCV` dev packages. + +#### Upcoming changes + +* Optimized audio augmentations support for VX_RPP. + +### **RCCL** (2.26.6) + +#### Added + +* Support for the extended fine-grained system memory pool. +* Support for gfx950. +* Support for `unroll=1` in device-code generation to improve performance. +* Set a default of 112 channels for a single node with `8 * gfx950`. +* Enabled LL128 protocol on the gfx950. 
+* The ability to choose the unroll factor at runtime using `RCCL_UNROLL_FACTOR`. This can be set at runtime to 1, 2, or 4. This change currently increases compilation and linking time because it triples the number of kernels generated. +* Added MSCCL support for AllGather multinode on the gfx942 and gfx950 (for instance, 16 and 32 GPUs). To enable this feature, set the environment variable `RCCL_MSCCL_FORCE_ENABLE=1`. The maximum message size for MSCCL AllGather usage is `12292 * sizeof(datatype) * nGPUs`. +* Thread thresholds for LL/LL128 are selected in Tuning Models for the AMD Instinct MI300X. This impacts the number of channels used for AllGather and ReduceScatter. The channel tuning model is bypassed if `NCCL_THREAD_THRESHOLDS`, `NCCL_MIN_NCHANNELS`, or `NCCL_MAX_NCHANNELS` are set. +* Multi-node tuning for AllGather, AllReduce, and ReduceScatter that leverages LL/LL64/LL128 protocols to use nontemporal vector load/store for tunable message size ranges. +* LL/LL128 usage ranges for AllReduce, AllGather, and ReduceScatter are part of the tuning models, which enable architecture-specific tuning in conjunction with the existing Rome Models scheme in RCCL. +* Two new APIs are exposed as part of an initiative to separate RCCL code. These APIs are `rcclGetAlgoInfo` and `rcclFuncMaxSendRecvCount`. However, user-level invocation requires that RCCL be built with `RCCL_EXPOSE_STATIC` enabled. + +#### Changed + +* Compatibility with NCCL 2.23.4. +* Compatibility with NCCL 2.24.3. +* Compatibility with NCCL 2.25.1. +* Compatibility with NCCL 2.26.6. + +#### Resolved issues + +* Resolved an issue when using more than 64 channels when multiple collectives are used in the same `ncclGroup()` call. +* Fixed unit test failures in tests ending with the `ManagedMem` and `ManagedMemGraph` suffixes. +* Fixed a suboptimal algorithmic switching point for AllReduce on the AMD Instinct MI300X. 
+* Fixed the known issue "When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault" with a design change to use `comm` instead of `rank` for `mscclStatus`. The global map for `comm` to `mscclStatus` is still not thread safe but should be explicitly handled by mutexes for read-write operations. This is tested for correctness, but there is a plan to use a thread-safe map data structure in an upcoming release. + +### **rocAL** (2.3.0) + +#### Added +* Extended support to rocAL's video decoder to use rocDecode hardware decoder. +* Setup - installs rocdecode dev packages for Ubuntu, RedHat, and SLES. +* Setup - installs turbojpeg dev package for Ubuntu and Redhat. +* rocAL's image decoder has been extended to support the rocJPEG hardware decoder. +* Numpy reader support for reading npy files in rocAL. +* Test case for numpy reader in C++ and python tests. + +#### Resolved issues +* `TurboJPEG` no longer needs to be installed manually. It is now installed by the package installer. +* Hardware decode no longer requires that ROCm be installed with the `graphics` usecase. + +#### Known issues +* Package installation on SLES requires manually installing `TurboJPEG`. +* Package installation on RHEL and SLES requires manually installing the `FFMPEG Dev` package. + +#### Upcoming changes + +* rocJPEG support for JPEG decode. + +### **rocALUTION** (4.0.0) + +#### Added + +* Support for gfx950. + +#### Changed + +* Switch to defaulting to C++17 when building rocALUTION from source. Previously rocALUTION was using C++14 by default. + +#### Optimized + +* Improved the user documentation. + +#### Resolved issues + +* Fix for GPU hashing algorithm when not compiling with -O2/O3. + +### **rocBLAS** (5.0.0) + +#### Added + +* Support for gfx950. +* Internal API logging for `gemm` debugging using `ROCBLAS_LAYER = 8`. +* Support for the AOCL 5.0 gcc build as a client reference library. 
+* The use of `PkgConfig` for client reference library fallback detection. + +#### Changed + +* `CMAKE_CXX_COMPILER` is now passed on during compilation for a Tensile build. +* The default atomics mode is changed from `allowed` to `not allowed`. + +#### Removed + +* Support code for non-production gfx targets. +* `rocblas_hgemm_kernel_name`, `rocblas_sgemm_kernel_name`, and `rocblas_dgemm_kernel_name` API functions. +* The use of `warpSize` as a constexpr. +* The use of deprecated behavior of `hipPeekLastError`. +* `rocblas_float8.h` and `rocblas_hip_f8_impl.h` files. +* `rocblas_gemm_ex3`, `rocblas_gemm_batched_ex3`, and `rocblas_gemm_strided_batched_ex3` API functions. + +#### Optimized + +* Optimized `gemm` by using `gemv` kernels when applicable. +* Optimized `gemv` for small `m` and `n` with a large batch count on gfx942. +* Improved the performance of Level 1 `dot` for all precisions and variants when `N > 100000000` on gfx942. +* Improved the performance of Level 1 `asum` and `nrm2` for all precisions and variants on gfx942. +* Improved the performance of Level 2 `sger` (single precision) on gfx942. +* Improved the performance of Level 3 `dgmm` for all precisions and variants on gfx942. + +#### Resolved issues + +* Fixed environment variable path-based logging to append multiple handle outputs to the same file. +* Support numerics when `trsm` is running with `rocblas_status_perf_degraded`. +* Fixed the build dependency installation of `joblib` on some operating systems. +* Return `rocblas_status_internal_error` when `rocblas_[set,get]_ [matrix,vector]` is called with a host pointer in place of a device pointer. +* Reduced the default verbosity level for internal GEMM backend information. +* Updated from the deprecated rocm-cmake to ROCmCMakeBuildTools. +* Corrected AlmaLinux GFortran package dependencies. 
+ +#### Upcoming changes + +* Deprecated the use of negative indices to indicate the default solution is being used for `gemm_ex` with `rocblas_gemm_algo_solution_index`. + +### **ROCdbgapi** (0.77.3) + +#### Added +* Support for the `gfx950` architectures. + +#### Removed +* Support for the `gfx940` and `gfx941` architectures. + +### **rocDecode** (1.0.0) + +#### Added + +* VP9 IVF container file parsing support in bitstream reader. +* CTest for VP9 decode on bitstream reader. +* HEVC/AVC/AV1/VP9 stream syntax error handling. +* HEVC stream bit depth change handling and DPB buffer size change handling through decoder reconfiguration. +* AVC stream DPB buffer size change handling through decoder reconfiguration. +* A new avcodec-based decoder built as a separate `rocdecode-host` library. + +#### Changed + +* rocDecode now uses the CMake `CMAKE_PREFIX_PATH` directive. +* Changed asserts in query API calls in the RocVideoDecoder utility class to error reports, to avoid a hard stop during a query if an error occurs and to let the caller decide what action to take. +* `libdrm_amdgpu` is now explicitly linked with rocdecode. + +#### Removed + +* `GetStream()` interface call from RocVideoDecoder utility class. + +#### Optimized + +* Decode session start latency reduction. +* Bitstream type detection optimization in bitstream reader. + +#### Resolved issues + +* Fixed a bug in the `videoDecodePicFiles` picture files sample that can result in an incorrect output frame count. +* Fixed a decoded frame output issue in video size change cases. +* Removed incorrect asserts of `bitdepth_minus_8` in `GetBitDepth()` and `num_chroma_planes` in `GetNumChromaPlanes()` API calls in the RocVideoDecoder utility class. + +### **rocFFT** (1.0.34) + +#### Added + +* Support for gfx950. + +#### Removed + +* Removed ``rocfft-rider`` legacy compatibility from clients. +* Removed support for the gfx940 and gfx941 targets from the client programs. +* Removed backward compatibility symlink for include directories.
+ +#### Optimized + +* Removed unnecessary HIP event/stream allocation and synchronization during MPI transforms. +* Implemented single-precision 1D kernels for lengths: + - 4704 + - 5488 + - 6144 + - 6561 + - 8192 +* Implemented single-kernel plans for some large 1D problem sizes, on devices with at least 160KiB of LDS. + +#### Resolved issues + +* Fixed kernel faults on multi-device transforms that gather to a single device, when the input/output bricks are not + contiguous. + +### **ROCgdb** (16.3) + +#### Added + +- Support for the `gfx950` architectures. + +#### Removed + +- Support for the `gfx940` and `gfx941` architectures. + +### **rocJPEG** (1.1.0) + +#### Added +* cmake config files. +* CTEST - New tests were introduced for JPEG batch decoding using various output formats, such as yuv_planar, y, rgb, and rgb_planar, both with and without region-of-interest (ROI). + +#### Changed +* Readme - cleanup and updates to pre-reqs. +* The `decode_params` argument of the `rocJpegDecodeBatched` API is now an array of `RocJpegDecodeParams` structs representing the decode parameters for the batch of JPEG images. +* `libdrm_amdgpu` is now explicitly linked with rocjpeg. + +#### Removed +* Dev Package - No longer installs pkg-config. + +#### Resolved issues +* Fixed a bug that prevented copying the decoded image into the output buffer when the output buffer is larger than the input image. +* Resolved an issue with resizing the internal memory pool by utilizing the explicit constructor of the vector's type during the resizing process. +* Addressed and resolved CMake configuration warnings. + +### **ROCm Bandwidth Test** (2.6.0) + +#### Added + +* Plugin architecture: + * `rocm_bandwidth_test` is now the `framework` for individual `plugins` and features. 
The `framework` is available at: `/opt/rocm/bin/` + + * Individual `plugins`: The `plugins` (shared libraries) are available at: `/opt/rocm/lib/rocm_bandwidth_test/plugins/` + +```{note} +Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/amd-mainline/README.md) file for details about the new options and outputs. +``` + +#### Changed + +* The `CLI` and options/parameters have changed due to the new plugin architecture, where the plugin parameters are parsed by the plugin. + +#### Removed + +- The old CLI, parameters, and switches. + +### **ROCm Compute Profiler** (3.2.3) + +#### Added + +##### CDNA4 (AMD Instinct MI350/MI355) support + +* Support for AMD Instinct MI350 Series GPUs with the addition of the following counters: + * VALU co-issue (Two VALUs are issued instructions) efficiency + * Stream Processor Instruction (SPI) Wave Occupancy + * Scheduler-Pipe Wave Utilization + * Scheduler FIFO Full Rate + * CPC ADC Utilization + * F6F4 data type metrics + * Update formula for total FLOPs while taking into account F6F4 ops + * LDS STORE, LDS LOAD, LDS ATOMIC instruction count metrics + * LDS STORE, LDS LOAD, LDS ATOMIC bandwidth metrics + * LDS FIFO full rate + * Sequencer -> TA ADDR Stall rates + * Sequencer -> TA CMD Stall rates + * Sequencer -> TA DATA Stall rates + * L1 latencies + * L2 latencies + * L2 to EA stalls + * L2 to EA stalls per channel + +* Roofline support for AMD Instinct MI350 Series GPUs. + +##### Textual User Interface (TUI) (beta version) + +* Text User Interface (TUI) support for analyze mode + * A command line based user interface to support interactive single-run analysis. + * To launch, use `--tui` option in analyze mode. For example, ``rocprof-compute analyze --tui``. + +##### PC Sampling (beta version) + +* Stochastic (hardware-based) PC sampling has been enabled for AMD Instinct MI300X Series and later GPUs. + +* Host-trap PC Sampling has been enabled for AMD Instinct MI200 Series and later GPUs. 
+ +* Support for sorting of PC sampling by type: offset or count. + +* PC sampling support in CLI and TUI analysis. + +##### Roofline + +* Support for Roofline plot on CLI (single run) analysis. + +* `FP4` and `FP6` data types have been added for roofline profiling on AMD Instinct MI350 Series. + +##### rocprofv3 support + +* ``rocprofv3`` is supported as the default backend for profiling. +* Support to obtain performance information for all channels for TCC counters. +* Support for profiling on AMD Instinct MI100 using ``rocprofv3``. +* Deprecation warning for ``rocprofv3`` interface in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. + +##### Others + +* Docker files to package the application and dependencies into a single portable and executable standalone binary file. + +* Analysis report based filtering + * ``-b`` option in profile mode now also accepts metric id(s) for analysis report based filtering. + * ``-b`` option in profile mode also accepts hardware IP block for filtering; however, this filter support will be deprecated soon. + * ``--list-metrics`` option added in profile mode to list possible metric id(s), similar to analyze mode. + +* Support for MEM chart on CLI (single run). + +* ``--specs-correction`` option to provide missing system specifications for analysis. + +#### Changed + +* Changed the default ``rocprof`` version to ``rocprofv3``. This is used when the environment variable ``ROCPROF`` is not set. +* Changed ``normal_unit`` default to ``per_kernel``. +* Decreased profiling time by not collecting unused counters in post-analysis. +* Updated Dash to >=3.0.0 (for web UI). +* Changed the condition when Roofline PDFs are generated during general profiling and ``--roof-only`` profiling (skip only when ``--no-roof`` option is present). +* Updated Roofline binaries: + * Rebuilt using the latest ROCm stack. + * The minimum supported OS distributions for the roofline feature are now Ubuntu 22.04, RHEL 8, and SLES15 SP6.
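
The default-backend rule in the ROCm Compute Profiler "Changed" list above can be pictured as a simple environment lookup. The following is only a sketch of that rule, not the profiler's actual code: the function name `select_profiler_backend` is hypothetical, and the only facts taken from the changelog are the `ROCPROF` variable, the `rocprofv3` default, and the `rocprof` value mentioned under "Known issues".

```python
import os


def select_profiler_backend(environ=None):
    """Return the backend name, falling back to rocprofv3 when ROCPROF is unset.

    Hypothetical helper: it mirrors the rule stated in the changelog, not the
    real ROCm Compute Profiler implementation.
    """
    env = os.environ if environ is None else environ
    return env.get("ROCPROF", "rocprofv3")


# An empty environment yields the new default; an explicit value overrides it.
default_backend = select_profiler_backend({})
legacy_backend = select_profiler_backend({"ROCPROF": "rocprof"})
print(default_backend, legacy_backend)
```

Passing the environment as a plain dictionary keeps the sketch testable without touching the real process environment.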
+ +#### Removed + +* Roofline support for Ubuntu 20.04 and SLES below 15.6. +* Removed support for AMD Instinct MI50 and MI60. + +#### Optimized + +* ROCm Compute Profiler CLI has been improved to better display the GPU architecture analytics. + +#### Resolved issues + +* Fixed kernel name and kernel dispatch filtering when using ``rocprofv3``. +* Fixed an issue with TCC channel counter collection in ``rocprofv3``. +* Fixed peak FLOPS of `F8`, `I8`, `F16`, and `BF16` on AMD Instinct MI300. +* Fixed an issue where the memory clock was not detected when using ``amd-smi``. +* Fixed standalone GUI crashing. +* Fixed L2 read/write/atomic bandwidths on AMD Instinct MI350 Series. + +#### Known issues + +* On AMD Instinct MI100, accumulation counters are not collected, resulting in the following metrics failing to show up in the analysis: Instruction Fetch Latency, Wavefront Occupancy, LDS Latency. + * As a workaround, set the environment variable ``ROCPROF=rocprof`` to use ``rocprof v1`` for profiling on AMD Instinct MI100. + +* GPU id filtering is not supported when using ``rocprofv3``. + +* Analysis of previously collected workload data will not work due to a change in the ``sysinfo.csv`` schema. + * As a workaround, re-run the profiling operation for the workload and interrupt the process after 10 seconds. + Then copy the ``sysinfo.csv`` file from the new data folder to the old one. + This assumes your system specification hasn't changed since the creation of the previous workload data. + +* Analysis of new workloads might require providing shader/memory clock speed using +``--specs-correction`` option if ``amd-smi`` or ``rocminfo`` does not provide clock speeds. + +* Memory chart on ROCm Compute Profiler CLI might look corrupted if the CLI width is too narrow. + +* Roofline feature is currently not functional on Azure Linux 3.0 and Debian 12.
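
The `sysinfo.csv` workaround listed under "Known issues" above amounts to a single file copy once the short re-profiling run has produced a new data folder. A minimal sketch of that copy step follows. The helper name `reuse_sysinfo`, the folder names, the CSV contents, and the `.bak` backup are all illustrative assumptions; only the file name `sysinfo.csv` and the direction of the copy come from the changelog.

```python
import shutil
import tempfile
from pathlib import Path


def reuse_sysinfo(new_workload_dir: Path, old_workload_dir: Path) -> Path:
    """Copy sysinfo.csv from a freshly profiled workload folder into an older one."""
    source = new_workload_dir / "sysinfo.csv"
    if not source.is_file():
        raise FileNotFoundError(f"no sysinfo.csv in {new_workload_dir}")
    target = old_workload_dir / "sysinfo.csv"
    if target.exists():
        # Assumption: keep the old-schema file as a backup rather than losing it.
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(source, target)
    return target


# Throwaway directories stand in for real workload folders (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    new_dir = Path(tmp) / "new_run"
    old_dir = Path(tmp) / "old_run"
    new_dir.mkdir()
    old_dir.mkdir()
    (new_dir / "sysinfo.csv").write_text("new_schema_column\nvalue\n")
    (old_dir / "sysinfo.csv").write_text("old_schema_column\nvalue\n")
    copied = reuse_sysinfo(new_dir, old_dir)
    demo_result = copied.read_text()
    demo_backup = (old_dir / "sysinfo.csv.bak").read_text()
```

As the changelog notes, this is only valid when the system specification has not changed since the older workload data was collected.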
+ +#### Upcoming changes + +* ``rocprof v1/v2/v3`` interfaces will be removed in favor of the ROCprofiler-SDK interface, which directly accesses ``rocprofv3`` C++ tool. Using ``rocprof v1/v2/v3`` interfaces will trigger a deprecation warning. + * To use ROCprofiler-SDK interface, set environment variable `ROCPROF=rocprofiler-sdk` and optionally provide profile mode option ``--rocprofiler-sdk-library-path /path/to/librocprofiler-sdk.so``. Add ``--rocprofiler-sdk-library-path`` runtime option to choose the path to ROCprofiler-SDK library to be used. +* Hardware IP block based filtering using ``-b`` option in profile mode will be removed in favor of analysis report block based filtering using ``-b`` option in profile mode. +* MongoDB database support will be removed, and a deprecation warning has been added to the application interface. +* Usage of ``rocm-smi`` is deprecated in favor of ``amd-smi``, and a deprecation warning has been added to the application interface. + +### **ROCm Data Center Tool** (1.1.0) + +#### Added + +* More profiling and monitoring metrics, especially for AMD Instinct MI300 and newer GPUs. +* Advanced logging and debugging options, including new log levels and troubleshooting guidance. + +#### Changed + +* Completed migration from legacy [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) to [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/). +* Reorganized the configuration files internally and improved [README/installation](https://github.com/ROCm/rdc/blob/amd-staging/README.md) instructions. +* Updated metrics and monitoring support for the latest AMD data center GPUs. + +#### Optimized + +- Integration with [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/) for performance metrics collection. +- Standalone and embedded operating modes, including streamlined authentication and configuration options. 
+- Support and documentation for diagnostic commands and GPU group management. +- [RVS](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/) test integration and reporting. + +### **ROCm SMI** (7.8.0) + +#### Added + +- Support for GPU metrics 1.8. + - Added new fields for `rsmi_gpu_metrics_t` including: + - Adding the following metrics to allow new calculations for violation status: + - Per XCP metrics `gfx_below_host_limit_ppt_acc[XCP][MAX_XCC]` - GFX Clock Host limit Package Power Tracking violation counts. + - Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts. + - Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to be below application clocks. + - Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see the new violation metrics above). + - Increasing available JPEG engines to 40. + Current ASICs may not support all 40. Unsupported engines are indicated as `UINT16_MAX`, or N/A in the CLI. + +#### Removed + +- Removed backwards compatibility for `rsmi_dev_gpu_metrics_info_get()`'s `jpeg_activity` and `vcn_activity` fields. Use `xcp_stats.jpeg_busy` and `xcp_stats.vcn_busy` instead. + - Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available. + - Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion for users about which field to use. By removing backward compatibility, it is easier to identify the relevant field. + - The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`.
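
The ROCm SMI entry above says JPEG engine slots that an ASIC does not support show up as `UINT16_MAX`, or N/A in the CLI. Code that consumes the metrics therefore has to filter that sentinel out before treating a reading as real data. The sketch below shows the idea on a plain list of integers. It does not call the ROCm SMI library, and the helper name, the sample readings, and the percent formatting are assumptions made for illustration.

```python
UINT16_MAX = 0xFFFF  # sentinel value the changelog associates with unsupported engines


def format_jpeg_busy(values):
    """Render per-engine readings, showing the sentinel as 'N/A'."""
    return ["N/A" if v == UINT16_MAX else f"{v}%" for v in values]


# Hypothetical readings: two live engines followed by two unsupported slots.
sample = [12, 0, UINT16_MAX, UINT16_MAX]
formatted = format_jpeg_busy(sample)
print(formatted)
```

Checking for the sentinel before aggregating avoids folding 65535 into averages or maxima.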
+ +```{note} +See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-7.0/CHANGELOG.md) for details, examples, and in-depth descriptions. +``` + +### **ROCm Systems Profiler** (1.1.0) + +#### Added + +- Profiling and metric collection capabilities for VCN engine activity, JPEG engine activity, and API tracing for rocDecode, rocJPEG, and VA-APIs. +- How-to document for VCN and JPEG activity sampling and tracing. +- Support for tracing Fortran applications. +- Support for tracing MPI API in Fortran. + +#### Changed + +- Replaced ROCm SMI backend with AMD SMI backend for collecting GPU metrics. +- ROCprofiler-SDK is now used to trace RCCL API and collect communication counters. + - Use the setting `ROCPROFSYS_USE_RCCLP = ON` to enable profiling and tracing of RCCL application data. +- Updated the Dyninst submodule to v13.0. +- Set the default value of `ROCPROFSYS_SAMPLING_CPUS` to `none`. + +#### Resolved issues + +- Fixed GPU metric collection settings with `ROCPROFSYS_AMD_SMI_METRICS`. +- Fixed a build issue with CMake 4. +- Fixed incorrect kernel names shown for kernel dispatch tracks in Perfetto. +- Fixed formatting of some output logs. + +### **ROCm Validation Suite** (1.2.0) + +#### Added + +- Support for AMD Instinct MI350X and MI355X GPUs. +- Introduced rotating buffer mechanism for GEMM operations. +- Support for read and write tests in Babel. +- Support for AMD Radeon RX9070 and RX9070GRE graphics cards. + +#### Changed + +- Migrated SMI API usage from `rocm-smi` to `amd-smi`. +- Updated `FP8` GEMM operations to use hipBLASLt instead of rocBLAS. + +### **rocPRIM** (4.0.0) + +#### Added + +* Support for gfx950. +* `rocprim::accumulator_t` to ensure parity with CCCL. +* Test for `rocprim::accumulator_t`. +* `rocprim::invoke_result_r` to ensure parity with CCCL. +* Function `is_build_in` into `rocprim::traits::get`. 
+* Virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. +* Initial value support for device-level inclusive scans. +* New optimization to the backend for `device_transform` when the input and output are pointers. +* `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers. +* `rocprim::device_transform` API for n-ary transform operations, which takes `n` input iterators inside a `rocprim::tuple`. +* `rocprim::key_value_pair::operator==`. +* The `rocprim::unrolled_copy` thread function to copy multiple items inside a thread. +* The `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`. +* `rocprim::int128_t` and `rocprim::uint128_t` to benchmarks for improved performance evaluation on 128-bit integers. +* `rocprim::int128_t` to the supported autotuning types to improve performance for 128-bit integers. +* The `rocprim::merge_inplace` function for merging in-place. +* Initial value support for warp- and block-level inclusive scan. +* Support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the `WITH_ROCRAND=ON` build flag. +* Tests and documentation to `lookback_scan_state`. It is still in the `detail` namespace. + +#### Changed + +* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits`, respectively. +* Marked the initialization constructor of `rocprim::reverse_iterator` as `explicit`. Use `rocprim::make_reverse_iterator`. +* Merged `radix_key_codec` into the type_traits system. +* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, and renamed the original `type_traits.hpp` to `type_traits_functions.hpp`.
+* The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change. +The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing: + + * `rocprim::inclusive_scan` + * Previous default: `class AccType = typename std::iterator_traits::value_type>` + * Current default: `class AccType = rocprim::accumulator_t::value_type>` + * `rocprim::deterministic_inclusive_scan` + * Previous default: `class AccType = typename std::iterator_traits::value_type>` + * Current default: `class AccType = rocprim::accumulator_t::value_type>` + * `rocprim::exclusive_scan` + * Previous default: `class AccType = detail::input_type_t>` + * Current default: `class AccType = rocprim::accumulator_t>` + * `rocprim::deterministic_exclusive_scan` + * Previous default: `class AccType = detail::input_type_t>` + * Current default: `class AccType = rocprim::accumulator_t>` +* Undeprecated internal `detail::raw_storage`. +* A new version of `rocprim::thread_load` and `rocprim::thread_store` replaces the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result. +* Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better. +* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocPRIM::symbol`, letting the user link multiple libraries built with different versions of rocPRIM. + +#### Removed + +* `rocprim::detail::float_bit_mask` and relative tests, use `rocprim::traits::float_bit_mask` instead. 
+* `rocprim::traits::is_fundamental`, use `rocprim::traits::get::is_fundamental()` directly. +* The deprecated parameters `short_radix_bits` and `ShortRadixBits` from the `segmented_radix_sort` config. They were unused, it is only an API change. +* The deprecated `operator<<` from the iterators. +* The deprecated `TwiddleIn` and `TwiddleOut`. Use `radix_key_codec` instead. +* The deprecated flags API of `block_adjacent_difference`. Use `subtract_left()` or `block_discontinuity::flag_heads()` instead. +* The deprecated `to_exclusive` functions in the warp scans. +* The `rocprim::load_cs` from the `cache_load_modifier` enum. Use `rocprim::load_nontemporal` instead. +* The `rocprim::store_cs` from the `cache_store_modifier` enum. Use `rocprim::store_nontemporal` instead. +* The deprecated header file `rocprim/detail/match_result_type.hpp`. Include `rocprim/type_traits.hpp` instead. This header included: + * `rocprim::detail::invoke_result`. Use `rocprim::invoke_result` instead. + * `rocprim::detail::invoke_result_binary_op`. Use `rocprim::invoke_result_binary_op` instead. + * `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead. +* The deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead. +* Removed `rocprim/detail/radix_sort.hpp`, functionality can now be found in `rocprim/thread/radix_key_codec.hpp`. +* Removed C++14 support. Only C++17 is supported. +* Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed: + * `rocprim::device_warp_size()` + * For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory. 
+ * For run-time constants, this is replaced with `rocprim::arch::wavefront::size()`. + * `rocprim::warp_size()` + * Use `rocprim::host_warp_size()`, `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead. + * `ROCPRIM_WAVEFRONT_SIZE` + * Use `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead. + * `__AMDGCN_WAVEFRONT_SIZE` + * This was a fallback define with the same name as the compiler's removed symbol. +* This release removes support for custom builds on gfx940 and gfx941. + +#### Optimized + +* Improved performance of `rocprim::device_select` and `rocprim::device_partition` when using multiple streams on the AMD Instinct MI300 Series. + +#### Resolved issues + +* Fixed an issue where `device_batch_memcpy` reported a benchmarking throughput 2x lower than the actual value. +* Fixed an issue where `device_segmented_reduce` reported an autotuning throughput 5x lower than the actual value. +* Fixed device radix sort not returning the correct required temporary storage when a double buffer contains `nullptr`. +* Fixed constness of equality operators (`==` and `!=`) in `rocprim::key_value_pair`. +* Fixed an issue with the comparison operators in `arg_index_iterator` and `texture_cache_iterator`, where the `<` and `>` comparators were swapped. +* Fixed an issue where `rocprim::thread_reduce` did not work correctly with a prefix value. + +#### Known issues + +* When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs. + +#### Upcoming changes + +* `rocprim::invoke_result_binary_op` and `rocprim::invoke_result_binary_op_t` are deprecated. Use `rocprim::accumulator_t` instead.
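
The rocPRIM "Changed" section explains that the scan accumulator defaults moved to `rocprim::accumulator_t` because the previous defaults could overflow when the input or initial type was smaller than the output type. The effect can be reproduced without a GPU. The sketch below is a conceptual emulation in Python, not rocPRIM code: it uses fixed-width `ctypes` integers to stand in for a narrow and a wide accumulator type, and the function name and sample values are illustrative assumptions.

```python
import ctypes


def inclusive_scan(values, acc_type):
    """Running sum in which every partial result is stored in acc_type."""
    out = []
    acc = acc_type(0)
    for v in values:
        # ctypes integer types wrap silently, like fixed-width unsigned integers.
        acc = acc_type(acc.value + v)
        out.append(acc.value)
    return out


inputs = [200, 100, 50]                           # each value fits in 8 bits
narrow = inclusive_scan(inputs, ctypes.c_uint8)   # accumulate in the input type
wide = inclusive_scan(inputs, ctypes.c_uint32)    # accumulate in a wider type
print(narrow, wide)
```

With the 8-bit accumulator the second partial sum wraps around, while the 32-bit accumulator keeps the expected running totals. Code that relied on the old defaults may see different results after upgrading, which is why the changelog flags this as a breaking change.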
+ +### **ROCprofiler-SDK** (1.0.0) + +#### Added + +- Support for [rocJPEG](https://rocm.docs.amd.com/projects/rocJPEG/en/latest/index.html) API Tracing. +- Support for AMD Instinct MI350X and MI355X GPUs. +- `rocprofiler_create_counter` to facilitate adding custom derived counters at runtime. +- Support in `rocprofv3` for iteration based counter multiplexing. +- Perfetto support for counter collection. +- Support for negating `rocprofv3` tracing options when using aggregate options such as `--sys-trace --hsa-trace=no`. +- `--agent-index` option in `rocprofv3` to specify the agent naming convention in the output: + - absolute == node_id + - relative == logical_node_id + - type-relative == logical_node_type_id +- MI300 and MI350 stochastic (hardware-based) PC sampling support in ROCProfiler-SDK and `rocprofv3`. +- Python bindings for `rocprofiler-sdk-roctx`. +- SQLite3 output support for `rocprofv3` using `--output-format rocpd`. +- `rocprofiler-sdk-rocpd` package: + - Public API in `include/rocprofiler-sdk-rocpd/rocpd.h`. + - Library implementation in `librocprofiler-sdk-rocpd.so`. + - Support for `find_package(rocprofiler-sdk-rocpd)`. + - `rocprofiler-sdk-rocpd` DEB and RPM packages. +- `--version` option in `rocprofv3`. +- `rocpd` Python package. +- Thread trace as experimental API. +- ROCprof Trace Decoder as experimental API: + - Requires [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder). +- Thread trace option in the `rocprofv3` tool under the `--att` parameters: + - See [using thread trace with rocprofv3](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/amd-mainline/how-to/using-thread-trace.html) + - Requires [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder). +- `rocpd` output format documentation: + - Requires [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder). +- Perfetto support for scratch memory. +- Support in the `rocprofv3` avail tool for command-line arguments. 
+- Documentation for `rocprofv3` advanced options. +- AQLprofile is now available as open source. + +#### Changed + +- SDK to not create a background thread when every tool returns `nullptr` from `rocprofiler_configure`. +- `vaddr-to-file-offset` mapping in `disassembly.hpp` to use the dedicated comgr API. +- `rocprofiler_uuid_t` ABI to hold a 128-bit value. +- `rocprofv3` shorthand argument for `--collection-period` to `-P` (upper-case) while `-p` (lower-case) is reserved for later use. +- Default output format for `rocprofv3` to `rocpd` (SQLite3 database). +- `rocprofv3` avail tool name from `rocprofv3_avail` to `rocprofv3-avail`. +- `rocprofv3` tool to facilitate thread trace and PC sampling on the same agent. + +#### Removed + +* Support for compilation of gfx940 and gfx941 targets. + +#### Resolved issues + +- Fixed missing callbacks around internal thread creation within the counter collection service. +- Fixed potential data race in the ROCprofiler-SDK double buffering scheme. +- Fixed usage of `std::regex` in the core ROCprofiler-SDK library that caused segfaults or exceptions when used under dual ABI. +- Fixed Perfetto counter collection by introducing accumulation per dispatch. +- Fixed code object disassembly for missing function inlining information. +- Fixed queue preemption error and `HSA_STATUS_ERROR_INVALID_PACKET_FORMAT` error for stochastic PC-sampling in MI300X, leading to more stable runs. +- Fixed the system hang issue for host-trap PC-sampling on AMD Instinct MI300X. +- Fixed `rocpd` counter collection issue when counter collection alone is enabled. `rocpd_kernel_dispatch` table is updated to be populated by counters data instead of kernel_dispatch data. +- Fixed `rocprofiler_*_id_t` structs for inconsistency related to a "null" handle: + - The correct definition for a null handle is `.handle = 0` while some definitions previously used `UINT64_MAX`. +- Fixed kernel trace CSV output generated by `rocpd`.
+ +### **rocPyDecode** (0.6.0) + +#### Added + +* ``rocpyjpegdecode`` package. +* ``src/rocjpeg`` source new subfolder. +* ``samples/rocjpeg`` new subfolder. + +#### Changed +* Minimum version for rocdecode and rocjpeg updated to V1.0.0. + +### **rocRAND** (4.0.0) + +#### Added + +* Support for gfx950. +* Additional unit tests for `test_log_normal_distribution.cpp`, `test_normal_distribution.cpp`, `test_rocrand_mtgp32_prng.cpp`, `test_rocrand_scrambled_sobol32_qrng.cpp`, `test_rocrand_scrambled_sobol64_qrng.cpp`, `test_rocrand_sobol32_qrng.cpp`, `test_rocrand_sobol64_qrng.cpp`, `test_rocrand_threefry2x32_20_prng.cpp`, `test_rocrand_threefry2x64_20_prng.cpp`, `test_rocrand_threefry4x32_20_prng.cpp`, `test_rocrand_threefry4x64_20_prng.cpp`, and `test_uniform_distribution.cpp`. +* New unit tests for `include/rocrand/rocrand_discrete.h` in `test_rocrand_discrete.cpp`, `include/rocrand/rocrand_mrg31k3p.h` in `test_rocrand_mrg31k3p_prng.cpp`, `include/rocrand/rocrand_mrg32k3a.h` in `test_rocrand_mrg32k3a_prng.cpp`, and `include/rocrand/rocrand_poisson.h` in `test_rocrand_poisson.cpp`. + +#### Changed + +* Changed the return type for `rocrand_generate_poisson` for the `SOBOL64` and `SCRAMBLED_SOBOL64` engines. +* Changed the unnecessarily large 64-bit data type for constants used for skipping in `MRG32K3A` to the 32-bit data type. +* Updated several `gfx942` auto tuning parameters. +* Modified error handling and expanded the error information for the case of double-deallocation of the (scrambled) sobol32 and sobol64 constants and direction vectors. + +#### Removed + +* Removed inline assembly and the `ENABLE_INLINE_ASM` CMake option. Inline assembly was used to optimize multiplication in the Mrg32k3a and Philox 4x32-10 generators. It is no longer needed because the current HIP compiler is able to produce code with the same or better performance. +* Removed instances of the deprecated clang definition `__AMDGCN_WAVEFRONT_SIZE`. +* Removed C++14 support. 
Beginning with this release, only C++17 is supported. +* Directly accessing the (scrambled) sobol32 and sobol64 constants and direction vectors is no longer supported. For: + * `h_scrambled_sobol32_constants`, use `rocrand_get_scramble_constants32` instead. + * `h_scrambled_sobol64_constants`, use `rocrand_get_scramble_constants64` instead. + * `rocrand_h_sobol32_direction_vectors`, use `rocrand_get_direction_vectors32` instead. + * `rocrand_h_sobol64_direction_vectors`, use `rocrand_get_direction_vectors64` instead. + * `rocrand_h_scrambled_sobol32_direction_vectors`, use `rocrand_get_direction_vectors32` instead. + * `rocrand_h_scrambled_sobol64_direction_vectors`, use `rocrand_get_direction_vectors64` instead. + +#### Resolved issues + +* Fixed an issue where `mt19937.hpp` would cause kernel errors during auto tuning. + +#### Upcoming changes + +* Deprecated the rocRAND Fortran API in favor of hipfort. + +### **ROCr Debug Agent** (2.1.0) + +#### Added + +* The `-e` and `--precise-alu-exceptions` flags to enable precise ALU exception reporting on supported configurations. + +### **ROCr Runtime** (1.18.0) + +#### Added + +* New API `hsa_amd_memory_get_preferred_copy_engine` to get the preferred copy engine to use when calling `hsa_amd_memory_async_copy_on_engine`. +* New API `hsa_amd_portable_export_dmabuf_v2`, an extension of the existing `hsa_amd_portable_export_dmabuf` API that supports a new flags parameter. This allows specifying the new `HSA_AMD_DMABUF_MAPPING_TYPE_PCIE` flag when exporting dma-bufs. +* New flag `HSA_AMD_VMEM_ADDRESS_NO_REGISTER` for use when calling the `hsa_amd_vmem_address_reserve` API. This allows virtual address range reservations for SVM allocations to be tracked when running in ASAN mode. +* New sub query `HSA_AMD_AGENT_INFO_CLOCK_COUNTERS`, which returns a snapshot of the underlying driver's clock counters that can be used for profiling.
+ +### **rocSHMEM** (3.0.0) + +#### Added + +* Reverse Offload conduit. +* New APIs: `rocshmem_ctx_barrier`, `rocshmem_ctx_barrier_wave`, `rocshmem_ctx_barrier_wg`, `rocshmem_barrier_all`, `rocshmem_barrier_all_wave`, `rocshmem_barrier_all_wg`, `rocshmem_ctx_sync`, `rocshmem_ctx_sync_wave`, `rocshmem_ctx_sync_wg`, `rocshmem_sync_all`, `rocshmem_sync_all_wave`, `rocshmem_sync_all_wg`, `rocshmem_init_attr`, `rocshmem_get_uniqueid`, and `rocshmem_set_attr_uniqueid_args`. +* `dlmalloc` based allocator. +* XNACK support. +* Support for initialization with MPI communicators other than `MPI_COMM_WORLD`. + +#### Changed + +* Changed collective APIs to use `_wg` suffix rather than `_wg_` infix. + +#### Resolved issues + +* Resolved segfault in `rocshmem_wg_ctx_create`, now provides `nullptr` if `ctx` cannot be created. + +### **rocSOLVER** (3.30.0) + +#### Added + +* Hybrid computation support for existing routines: STEQR + +#### Optimized + +* Improved the performance of BDSQR and downstream functions, such as GESVD. +* Improved the performance of STEQR and downstream functions, such as SYEV/HEEV. +* Improved the performance of LARFT and downstream functions, such as GEQR2 and GEQRF. + +#### Resolved issues + +* Fixed corner cases that can produce NaNs in SYEVD for valid input matrices. + +### **rocSPARSE** (4.0.2) + +#### Added + +* The `SpGEAM` generic routine for computing sparse matrix addition in CSR format. +* The `v2_SpMV` generic routine for computing sparse matrix vector multiplication. As opposed to the deprecated `rocsparse_spmv` routine, this routine does not use a fallback algorithm if a non-implemented configuration is encountered and will return an error in such a case. For the deprecated `rocsparse_spmv` routine, the user can enable warning messages in situations where a fallback algorithm is used by either calling the `rocsparse_enable_debug` routine upfront or exporting the variable `ROCSPARSE_DEBUG` (with the shell command `export ROCSPARSE_DEBUG=1`). 
+* Half float mixed precision to `rocsparse_axpby` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `rocsparse_spvv` where X and Y use `float16` and the result and compute type use `float`. +* Half float mixed precision to `rocsparse_spmv` where A and X use `float16` and Y and the compute type use `float`. +* Half float mixed precision to `rocsparse_spmm` where A and B use `float16` and C and the compute type use `float`. +* Half float mixed precision to `rocsparse_sddmm` where A and B use `float16` and C and the compute type use `float`. +* Half float uniform precision to the `rocsparse_scatter` and `rocsparse_gather` routines. +* Half float uniform precision to the `rocsparse_sddmm` routine. +* The `rocsparse_spmv_alg_csr_rowsplit` algorithm. +* Support for gfx950. +* ROC-TX instrumentation support in rocSPARSE (not available on Windows or in the static library version on Linux). +* The `almalinux` operating system name to correct the GFortran dependency. + +#### Changed + +* Switch to defaulting to C++17 when building rocSPARSE from source. Previously rocSPARSE was using C++14 by default. + +#### Removed + +* The deprecated `rocsparse_spmv_ex` routine. +* The deprecated `rocsparse_sbsrmv_ex`, `rocsparse_dbsrmv_ex`, `rocsparse_cbsrmv_ex`, and `rocsparse_zbsrmv_ex` routines. +* The deprecated `rocsparse_sbsrmv_ex_analysis`, `rocsparse_dbsrmv_ex_analysis`, `rocsparse_cbsrmv_ex_analysis`, and `rocsparse_zbsrmv_ex_analysis` routines. + +#### Optimized + +* Reduced the number of template instantiations in the library to further reduce the shared library binary size and improve compile times. +* Allow SpGEMM routines to use more shared memory when available. This can speed up performance for matrices with a large number of intermediate products. 
+* Use of the `rocsparse_spmv_alg_csr_adaptive` or `rocsparse_spmv_alg_csr_default` algorithms in `rocsparse_spmv` to perform transposed sparse matrix multiplication (`C=alpha*A^T*x+beta*y`) resulted in unnecessary analysis on A and needless slowdown during the analysis phase. This has been improved by skipping the analysis when performing the transposed sparse matrix multiplication. +* Improved the user documentation. + +#### Resolved issues + +* Fixed an issue in the public headers where `extern "C"` was not wrapped by `#ifdef __cplusplus`, which caused failures when building C programs with rocSPARSE. +* Fixed a memory access fault in the `rocsparse_Xbsrilu0` routines. +* Fixed failures that could occur in `rocsparse_Xbsrsm_solve` or `rocsparse_spsm` with BSR format when using host pointer mode. +* Fixed ASAN compilation failures. +* Fixed a failure that occurred when using const descriptor `rocsparse_create_const_csr_descr` with the generic routine `rocsparse_sparse_to_sparse`. The issue was not observed when using non-const descriptor `rocsparse_create_csr_descr` with `rocsparse_sparse_to_sparse`. +* Fixed a memory leak in the rocSPARSE handle. + +#### Upcoming changes + +* Deprecated the `rocsparse_spmv` routine. Use the `rocsparse_v2_spmv` routine instead. +* Deprecated the `rocsparse_spmv_alg_csr_stream` algorithm. Use the `rocsparse_spmv_alg_csr_rowsplit` algorithm instead. +* Deprecated the `rocsparse_itilu0_alg_sync_split_fusion` algorithm. Use one of `rocsparse_itilu0_alg_async_inplace`, `rocsparse_itilu0_alg_async_split`, or `rocsparse_itilu0_alg_sync_split` instead. + +### **rocThrust** (4.0.0) + +#### Added + +* Additional unit tests for: binary_search, complex, c99math, catrig, ccosh, cexp, clog, csin, csqrt, and ctan. +* `test_param_fixtures.hpp` to store all the parameters for typed test suites. +* `test_real_assertions.hpp` to handle unit test assertions for real numbers. 
+* `test_imag_assertions.hpp` to handle unit test assertions for imaginary numbers. +* `clang++` is now used to compile Google Benchmark benchmarks on Windows. +* Support for gfx950. +* Merged changes from upstream CCCL/thrust 2.6.0. + +#### Changed + +* Updated the required version of Google Benchmark from 1.8.0 to 1.9.0. +* Renamed `cpp14_required.h` to `cpp_version_check.h`. +* Refactored `test_header.hpp` into `test_param_fixtures.hpp`, `test_real_assertions.hpp`, `test_imag_assertions.hpp`, and `test_utils.hpp`. This prevents unit tests from having access to modules that they're not testing, which improves the accuracy of code coverage reports. + +#### Removed + +* `device_malloc_allocator.h` has been removed. This header file was unused and should not impact users. +* Removed C++14 support. Only C++17 is now supported. +* `test_header.hpp` has been removed. The `HIP_CHECK` function, as well as the `test` and `inter_run_bwr` namespaces, have been moved to `test_utils.hpp`. +* `test_assertions.hpp` has been split into `test_real_assertions.hpp` and `test_imag_assertions.hpp`. + +#### Resolved issues + +* Fixed an issue with internal calls to unqualified `distance()` which would be ambiguous due to the visible implementation through ADL. + +#### Known issues + +* The order of the values being compared by `thrust::exclusive_scan_by_key` and `thrust::inclusive_scan_by_key` can change between runs when integers are being compared. This can cause incorrect output when a non-commutative operator such as division is being used. + +#### Upcoming changes + +* `thrust::device_malloc_allocator` is deprecated as of this version. It will be removed in an upcoming version. + +### **rocWMMA** (2.0.0) + +#### Added + +* Internal register layout transforms to support interleaved MMA layouts. +* Support for the gfx950 target. +* Mixed input `BF8`/`FP8` types for MMA support. +* Fragment scheduler API objects to embed thread block cooperation properties in fragments.
+ +#### Changed + +* Augmented load/store/MMA internals with static loop unrolling. +* Updated linkage of `rocwmma::synchronize_workgroup` to inline. +* rocWMMA `mma_sync` API now supports `wave tile` fragment sizes. +* rocWMMA cooperative fragments are now expressed with fragment scheduler template arguments. +* rocWMMA cooperative fragments now use the same base API as non-cooperative fragments. +* rocWMMA cooperative fragments register usage footprint has been reduced. +* rocWMMA fragments now support partial tile sizes with padding. + +#### Removed + +* Support for the gfx940 and gfx941 targets. +* The rocWMMA cooperative API. +* Wave count template parameters from transforms APIs. + +#### Optimized + +* Added internal flow control barriers to improve assembly code generation and overall performance. +* Enabled interleaved layouts by default in MMA to improve overall performance. + +#### Resolved issues + +* Fixed a validation issue for small precision compute types `< B32` on gfx9. +* Fixed CMake validation of compiler support for `BF8`/`FP8` types. + +### **RPP** (2.0.0) + +#### Added + +* Bitwise NOT, Bitwise AND, and Bitwise OR augmentations on HOST (CPU) and HIP backends. +* Tensor Concat augmentation on HOST (CPU) and HIP backends. +* JPEG Compression Distortion augmentation on HIP backend. +* `log1p`, defined as `log (1 + x)`, tensor augmentation support on HOST (CPU) and HIP backends. +* JPEG Compression Distortion augmentation on HOST (CPU) backend. + +#### Changed + +* Handle creation and destruction APIs have been consolidated. Use `rppCreate()` for handle initialization and `rppDestroy()` for handle destruction. +* The `logical_operations` function category has been renamed to `bitwise_operations`. +* TurboJPEG package installation enabled for RPP Test Suite with `sudo apt-get install libturbojpeg0-dev`. Instructions have been updated in utilities/test_suite/README.md. +* The `swap_channels` augmentation has been changed to `channel_permute`. 
`channel_permute` now also accepts a new argument, `permutationTensor` (pointer to an unsigned int tensor), that specifies the order in which the RGB channels of each input image in the batch are permuted: + + `RppStatus rppt_swap_channels_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, rppHandle_t rppHandle);` + + changed to: + + `RppStatus rppt_channel_permute_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, Rpp32u *permutationTensor, rppHandle_t rppHandle);` + +#### Removed + +* Older versions of the RPP handle creation API, including `rppCreateWithBatchSize()`, `rppCreateWithStream()`, and `rppCreateWithStreamAndBatchSize()`. These have been replaced with `rppCreate()`. +* Older versions of the RPP handle destruction API, including `rppDestroyGPU()` and `rppDestroyHost()`. These have been replaced with `rppDestroy()`. + +#### Resolved issues + +* Test package - Debian packages will install required dependencies. + +### **Tensile** (4.44.0) + +#### Added + +- Support for gfx950. +- Code object compression via bundling. +- Support for non-default HIP SDK installations on Windows. +- Master solution library documentation. +- Compiler version-dependent assembler and architecture capabilities. +- Documentation from GitHub Wiki to ROCm docs. + +#### Changed + +- Loosened check for CLI compiler choices. +- Introduced 4-tuple targets for bundler invocations. +- Introduced PATHEXT extensions on Windows when searching for toolchain components. +- Enabled passing fully qualified paths to toolchain components. +- Enabled environment variable overrides when searching for a ROCm stack. +- Improved default toolchain configuration. +- Ignored f824 flake errors. + +#### Removed + +- Support for the gfx940 and gfx941 targets. +- Unused tuning files. +- Disabled tests. + +#### Resolved issues + +- Fixed configure time path not being invoked at build.
+- Fixed find_package for msgpack to work with versions 5 and 6. +- Fixed RHEL 9 testing. +- Fixed gfx908 builds. +- Fixed the 'argument list too long' error. +- Fixed version typo in 6.3 changelog. +- Fixed improper use of aliases as nested namespace specifiers. + ## ROCm known issues ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known issues related to individual components, review the [Detailed component changes](#detailed-component-changes). +### A memory error in the kernel might lead to applications using the ROCr library becoming unresponsive + +Applications using the ROCr library might become unresponsive if a memory error occurs in the launched kernel when the queue from which it was launched is destroyed. The application is unable to receive further signals, resulting in a stall. The issue will be fixed in a future ROCm release. + +### Applications using stream capture APIs might fail during stream capture + +Applications using ``hipLaunchHostFunc`` with stream capture APIs might fail to capture graphs during stream capture, and return `hipErrorStreamCaptureUnsupported`. This issue resulted from an update in ``hipStreamAddCallback``. This issue will be fixed in a future ROCm release. + +### Compilation failure via hipRTC when compiling with std=c++11 + +Applications compiling kernels using `hipRTC` might fail while passing the `std=c++11` compiler option. This issue will be fixed in a future ROCm release. + +### Compilation failure when referencing std::array if _GLIBCXX_ASSERTIONS is defined + +Compiling from a device kernel or function results in failure when attempting to reference `std::array` if `_GLIBCXX_ASSERTIONS` is defined. The issue occurs because there's no device definition for `std::__glibcxx_assert_fail()`. This issue will be resolved in a future ROCm release with the implementation of `std::__glibcxx_assert_fail()`.
+ +### Segmentation fault in ROCprofiler-SDK due to ABI mismatch affecting std::regex + +Starting with GCC 5.1, GNU `libstdc++` introduced a dual Application Binary Interface (ABI) to adopt `C++11`, primarily affecting `std::string` and its dependencies, including `std::regex`. If your code is compiled against headers expecting one ABI but linked or run with the other, it can cause problems with `std::string` and `std::regex`, leading to a segmentation fault in ROCprofiler-SDK, which uses `std::regex`. This issue is resolved in the [ROCm Systems `develop` branch](https://github.com/ROCm/rocm-systems) and will be part of a future ROCm release. + +### Decline in performance of batched GEMM operation for applications using hipBLASLt kernels + +Default batched General Matrix Multiplication (GEMM) operations for rocBLAS and hipBLAS on gfx1200 and gfx1201 may have a decline in performance in comparison with non-batched and strided_batched GEMM operations. By default, the batched GEMM uses hipBLASLt kernels, and switching to the Tensile kernel resolves the performance decline issue. The issue will be fixed in a future ROCm release. As a workaround, you can set the environment variable `ROCBLAS_USE_HIPBLASLT=0` before the batched GEMM operation is performed on gfx1200 and gfx1201. After completing the batched operation, reset the variable to `ROCBLAS_USE_HIPBLASLT=1` before calling non-batched or strided_batched operations. + +### Failure to declare out-of-band CPERs for bad memory page + +Exceeding the bad memory page threshold fails to declare out-of-band Common Platform Error Records (CPERs). This issue affects all AMD Instinct MI300 Series and MI350 Series GPUs, and will be fixed in a future AMD GPU Driver release. + +## ROCm resolved issues + +The following are previously known issues resolved in this release. For resolved issues related to +individual components, review the [Detailed component changes](#detailed-component-changes).
+ +### Failure when using a generic target with compression and vice versa + +An issue where compiling a generic target resulted in compression failing has been resolved in this release. This issue prevented you from compiling a generic target and using compression simultaneously. See [GitHub issue #4602](https://github.com/ROCm/ROCm/issues/4602). + +### Limited support for Sparse API and Pallas functionality in JAX + +An issue where, due to limited support for the Sparse API in JAX, some of the functionality of the Pallas extension was restricted has been resolved. See [GitHub issue #4608](https://github.com/ROCm/ROCm/issues/4608). + +### Failure to use the --kokkos-trace option in ROCm Compute Profiler + +An issue where using the ``--kokkos-trace`` option resulted in a difference between the output of the ``--kokkos-trace`` and the ``counter_collection.csv`` output file has been resolved. Due to this issue, the program exited with a warning message if the ``--kokkos-trace`` option was detected in the ROCm Compute Profiler. This issue was due to the partial implementation of ``--kokkos-trace`` in the ``rocprofv3`` tool. See [GitHub issue #4604](https://github.com/ROCm/ROCm/issues/4604). + ## ROCm upcoming changes The following changes to the ROCm software stack are anticipated for future releases. -### AMD SMI migration to AMDGPU driver repository - -In a future release, [AMD SMI](https://github.com/ROCm/amdsmi) will be relocated from the ROCm organization repository to a new AMDTools repository to better align with its system-level functionality. `amd-smi-lib` will no longer be included in the `rocm-developer-tools` meta-package included with your standard ROCm installation. Instead, it will be packaged with the AMDGPU driver installation.
- ### ROCm SMI deprecation [ROCm SMI](https://github.com/ROCm/rocm_smi_lib) will be phased out in an @@ -452,13 +2613,12 @@ It's anticipated that ROCTracer, ROCProfiler, `rocprof`, and `rocprofv2` will re ### AMDGPU wavefront size compiler macro deprecation Access to the wavefront size as a compile-time constant via the `__AMDGCN_WAVEFRONT_SIZE` -and `__AMDGCN_WAVEFRONT_SIZE__` macros or the `constexpr warpSize` variable is deprecated -and will be disabled in a future release. +and `__AMDGCN_WAVEFRONT_SIZE__` macros is deprecated and will be disabled in a future release. In ROCm 7.0.0, `warpSize` is only available as a non-`constexpr` variable. You're encouraged to update your code if needed to ensure future compatibility. * The `__AMDGCN_WAVEFRONT_SIZE__` macro and `__AMDGCN_WAVEFRONT_SIZE` alias will be removed in an upcoming release. It is recommended to remove any use of this macro. For more information, see [AMDGPU support](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.4.3/LLVM/clang/html/AMDGPUSupport.html). -* `warpSize` will only be available as a non-`constexpr` variable. Where required, +* `warpSize` is only available as a non-`constexpr` variable. Where required, the wavefront size should be queried via the `warpSize` variable in device code, or via `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/docs-6.4.3/how-to/hip_cpp_language_extensions.html#warpsize). * For cases where compile-time evaluation of the wavefront size cannot be avoided, @@ -474,13 +2634,9 @@ and will be disabled in a future release. #endif ``` -### HIPCC Perl scripts deprecation - -The HIPCC Perl scripts (`hipcc.pl` and `hipconfig.pl`) will be removed in an upcoming release.
- ### Changes to ROCm Object Tooling -ROCm Object Tooling tools ``roc-obj-ls``, ``roc-obj-extract``, and ``roc-obj`` are +ROCm Object Tooling tools ``roc-obj-ls``, ``roc-obj-extract``, and ``roc-obj`` were deprecated in ROCm 6.4, and will be removed in a future release. Functionality has been added to the ``llvm-objdump --offloading`` tool option to extract all clang-offload-bundles into individual code objects found within the objects @@ -488,11 +2644,3 @@ or executables passed as input. The ``llvm-objdump --offloading`` tool option a supports the ``--arch-name`` option, and only extracts code objects found with the specified target architecture. See [llvm-objdump](https://llvm.org/docs/CommandGuide/llvm-objdump.html) for more information. - -### HIP runtime API changes - -There are a number of upcoming changes planned for HIP runtime API in an upcoming major release -that are not backward compatible with prior releases. Most of these changes increase -alignment between HIP and CUDA APIs or behavior. Some of the upcoming changes are to -clean up header files, remove namespace collision, and have a clear separation between -`hipRTC` and HIP runtime. For more information, see [HIP 7.0 Is Coming: What You Need to Know to Stay Ahead](https://rocm.blogs.amd.com/ecosystems-and-partners/transition-to-hip-7.0-blog/README.html). 
diff --git a/default.xml b/default.xml index 74f068837..d20ce2103 100644 --- a/default.xml +++ b/default.xml @@ -1,7 +1,7 @@ - @@ -9,6 +9,7 @@ + @@ -22,7 +23,7 @@ - + @@ -37,36 +38,26 @@ - - - - - - - - - - - + + - - - - - diff --git a/docs/compatibility/compatibility-matrix-historical-6.0.csv b/docs/compatibility/compatibility-matrix-historical-6.0.csv index 54f5ceb50..61ac9027b 100644 --- a/docs/compatibility/compatibility-matrix-historical-6.0.csv +++ b/docs/compatibility/compatibility-matrix-historical-6.0.csv @@ -1,133 +1,134 @@ -ROCm Version,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.5, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0 - :ref:`Operating systems & kernels `,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,, - ,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2" - ,,,,,,,,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5" - ,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2" - ,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 
8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8" - ,"SLES 15 SP7, SP6","SLES 15 SP7, SP6",SLES 15 SP6,SLES 15 SP6,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4" - ,,,,,,,,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9 - ,"Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,, - ,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,,,,,,,,,,, - ,Azure Linux 3.0 [#mi300x-past-60]_,Azure Linux 3.0 [#mi300x-past-60]_,Azure Linux 3.0 [#mi300x-past-60]_,Azure Linux 3.0 [#mi300x-past-60]_,Azure Linux 3.0 [#mi300x-past-60]_,Azure Linux 3.0 [#mi300x-past-60]_,,,,,,,,,,,, - ,.. 
_architecture-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`Architecture `,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3 - ,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2 - ,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA - ,RDNA4,RDNA4,RDNA4,,,,,,,,,,,,,,, - ,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3 - ,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2 - ,.. _gpu-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`GPU / LLVM target `,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, - ,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, -,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, - ,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100 - ,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030 - ,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_ - ,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a - 
,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908 -,,,,,,,,,,,,,,,,,, - FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13" - :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1" - :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26 - :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A - :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A, - :doc:`Megablocks 
<../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat]_,N/A,N/A,N/A,N/A,N/A,1.8.0b1,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat]_,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat]_,N/A,N/A,N/A,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - `ONNX Runtime `_,1.2,1.2,1.2,1.2,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1 -,,,,,,,,,,,,,,,,,, - ,,,,,,,,,,,,,,,,,, - THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - `UCC `_,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0 - `UCX `_,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1 - ,,,,,,,,,,,,,,,,,, - THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - Thrust,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1 - CUB,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1 -,,,,,,,,,,,,,,,,,, - KMD & USER SPACE [#kfd_support-past-60]_,.. 
_kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`KMD versions `,"6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x" - ,,,,,,,,,,,,,,,,,, - ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`Composable Kernel `,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0 - :doc:`MIGraphX `,2.12.0,2.12.0,2.12.0,2.12.0,2.11.0,2.11.0,2.11.0,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0 - :doc:`MIOpen `,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 - :doc:`MIVisionX `,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0 - :doc:`rocAL `,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 - :doc:`rocDecode `,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A - :doc:`rocJPEG `,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`rocPyDecode `,0.3.1,0.3.1,0.3.1,0.3.1,0.2.0,0.2.0,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`RPP `,1.9.10,1.9.10,1.9.10,1.9.10,1.9.1,1.9.1,1.9.1,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0 - ,,,,,,,,,,,,,,,,,, - COMMUNICATION,.. 
_commlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`RCCL `,2.22.3,2.22.3,2.22.3,2.22.3,2.21.5,2.21.5,2.21.5,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3 - :doc:`rocSHMEM `,2.0.1,2.0.1,2.0.0,2.0.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - ,,,,,,,,,,,,,,,,,, - MATH LIBS,.. _mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - `half `_ ,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0 - :doc:`hipBLAS `,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0 - :doc:`hipBLASLt `,0.12.1,0.12.1,0.12.1,0.12.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0 - :doc:`hipFFT `,1.0.18,1.0.18,1.0.18,1.0.18,1.0.17,1.0.17,1.0.17,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13 - :doc:`hipfort `,0.6.0,0.6.0,0.6.0,0.6.0,0.5.1,0.5.1,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0 - :doc:`hipRAND `,2.12.0,2.12.0,2.12.0,2.12.0,2.11.1,2.11.1,2.11.1,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16 - :doc:`hipSOLVER `,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0 - :doc:`hipSPARSE `,3.2.0,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0 - :doc:`hipSPARSELt `,0.2.3,0.2.3,0.2.3,0.2.3,0.2.2,0.2.2,0.2.2,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0 - :doc:`rocALUTION `,3.2.3,3.2.3,3.2.3,3.2.2,3.2.1,3.2.1,3.2.1,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3 - :doc:`rocBLAS `,4.4.1,4.4.1,4.4.0,4.4.0,4.3.0,4.3.0,4.3.0,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0 - :doc:`rocFFT 
`,1.0.32,1.0.32,1.0.32,1.0.32,1.0.31,1.0.31,1.0.31,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23 - :doc:`rocRAND `,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17 - :doc:`rocSOLVER `,3.28.2,3.28.2,3.28.0,3.28.0,3.27.0,3.27.0,3.27.0,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0 - :doc:`rocSPARSE `,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2 - :doc:`rocWMMA `,1.7.0,1.7.0,1.7.0,1.7.0,1.6.0,1.6.0,1.6.0,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0 - :doc:`Tensile `,4.43.0,4.43.0,4.43.0,4.43.0,4.42.0,4.42.0,4.42.0,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0 - ,,,,,,,,,,,,,,,,,, - PRIMITIVES,.. _primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`hipCUB `,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 - :doc:`hipTensor `,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0 - :doc:`rocPRIM `,3.4.1,3.4.1,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 - :doc:`rocThrust `,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0 - ,,,,,,,,,,,,,,,,,, - SUPPORT LIBS,,,,,,,,,,,,,,,,,, - `hipother `_,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 - `rocm-core `_,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0 - `ROCT-Thunk-Interface `_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A 
[#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245 - ,,,,,,,,,,,,,,,,,, - SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`AMD SMI `,25.5.1,25.5.1,25.4.2,25.3.0,24.7.1,24.7.1,24.7.1,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2 - :doc:`ROCm Data Center Tool `,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0 - :doc:`rocminfo `,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 - :doc:`ROCm SMI `,7.7.0,7.5.0,7.5.0,7.5.0,7.4.0,7.4.0,7.4.0,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0 - :doc:`ROCm Validation Suite `,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60105,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000 - ,,,,,,,,,,,,,,,,,, - PERFORMANCE TOOLS,,,,,,,,,,,,,,,,,, - :doc:`ROCm Bandwidth Test `,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0 - :doc:`ROCm Compute Profiler `,3.1.1,3.1.1,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`ROCm Systems Profiler `,1.0.2,1.0.2,1.0.1,1.0.0,0.1.2,0.1.1,0.1.0,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`ROCProfiler `,2.0.60403,2.0.60402,2.0.60401,2.0.60400,2.0.60303,2.0.60302,2.0.60301,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60105,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000 - :doc:`ROCprofiler-SDK `,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`ROCTracer `,4.1.60403,4.1.60402,4.1.60401,4.1.60400,4.1.60303,4.1.60302,4.1.60301,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60105,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000 - ,,,,,,,,,,,,,,,,,, - 
DEVELOPMENT TOOLS,,,,,,,,,,,,,,,,,, - :doc:`HIPIFY `,19.0.0,19.0.0,19.0.0,19.0.0,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - :doc:`ROCm CMake `,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0 - :doc:`ROCdbgapi `,0.77.2,0.77.2,0.77.2,0.77.2,0.77.0,0.77.0,0.77.0,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0 - :doc:`ROCm Debugger (ROCgdb) `,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0 - `rocprofiler-register `_,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,0.3.0,N/A,N/A - :doc:`ROCr Debug Agent `,2.0.4,2.0.4,2.0.4,2.0.4,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3 - ,,,,,,,,,,,,,,,,,, - COMPILERS,.. 
_compilers-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - `clang-ocl `_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0 - :doc:`hipCC `,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 - `Flang `_,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - :doc:`llvm-project `,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - `OpenMP `_,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 -,,,,,,,,,,,,,,,,,, - RUNTIMES,.. 
_runtime-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,, - :doc:`AMD CLR `,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 - :doc:`HIP `,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 - `OpenCL Runtime `_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0 - :doc:`ROCr Runtime `,1.15.0,1.15.0,1.15.0,1.15.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0 +ROCm Version,7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.5, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0 + :ref:`Operating systems & kernels `,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,, + ,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2" + ,,,,,,,,,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5" + ,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 
9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2" + ,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8" + ,SLES 15 SP7,"SLES 15 SP7, SP6","SLES 15 SP7, SP6",SLES 15 SP6,SLES 15 SP6,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4" + ,,,,,,,,,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9 + ,"Oracle Linux 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,, + ,Debian 12,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,,,,,,,,,,, + ,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,,,,,,,,,,,, +,Rocky Linux 9,,,,,,,,,,,,,,,,,, + ,.. 
_architecture-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`Architecture `,CDNA4,,,,,,,,,,,,,,,,,, +,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3 + ,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2 + ,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA + ,RDNA4,RDNA4,RDNA4,RDNA4,,,,,,,,,,,,,,, + ,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3 + ,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2 + ,.. _gpu-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`GPU / LLVM target `,gfx950,,,,,,,,,,,,,,,,,, +,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, + ,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, +,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, + ,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100 + ,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030 + ,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_ + 
,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a + ,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908 +,,,,,,,,,,,,,,,,,,, + FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.7, 2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13" + :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1" + :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26 + :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A, + :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>`,N/A,N/A,N/A,N/A,N/A,85f95ae,85f95ae,85f95ae,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A, + :doc:`DGL 
<../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A, + :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>`,N/A,N/A,N/A,N/A,N/A,0.7.0,0.7.0,0.7.0,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A, + :doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat]_,N/A,N/A,N/A,N/A,N/A,N/A,1.8.0b1,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A, + `ONNX Runtime `_,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1 +,,,,,,,,,,,,,,,,,,, + ,,,,,,,,,,,,,,,,,,, + THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + `UCC `_,>=1.4.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0 + `UCX `_,>=1.17.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1 + ,,,,,,,,,,,,,,,,,,, + THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + Thrust,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1 + CUB,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1 +,,,,,,,,,,,,,,,,,,, + KMD & USER SPACE [#kfd_support-past-60]_,.. 
_kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`KMD versions `,"30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x" + ,,,,,,,,,,,,,,,,,,, + ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`Composable Kernel `,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0 + :doc:`MIGraphX `,2.13.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.0,2.11.0,2.11.0,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0 + :doc:`MIOpen `,3.5.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 + :doc:`MIVisionX `,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0 + :doc:`rocAL `,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 + :doc:`rocDecode `,1.0.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A + :doc:`rocJPEG `,1.1.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A + :doc:`rocPyDecode `,0.6.0,0.3.1,0.3.1,0.3.1,0.3.1,0.2.0,0.2.0,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A,N/A + :doc:`RPP 
`,2.0.0,1.9.10,1.9.10,1.9.10,1.9.10,1.9.1,1.9.1,1.9.1,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0 + ,,,,,,,,,,,,,,,,,,, + COMMUNICATION,.. _commlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`RCCL `,2.26.6,2.22.3,2.22.3,2.22.3,2.22.3,2.21.5,2.21.5,2.21.5,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3 + :doc:`rocSHMEM `,3.0.0,2.0.1,2.0.1,2.0.0,2.0.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A + ,,,,,,,,,,,,,,,,,,, + MATH LIBS,.. _mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + `half `_ ,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0 + :doc:`hipBLAS `,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0 + :doc:`hipBLASLt `,1.0.0,0.12.1,0.12.1,0.12.1,0.12.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0 + :doc:`hipFFT `,1.0.20,1.0.18,1.0.18,1.0.18,1.0.18,1.0.17,1.0.17,1.0.17,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13 + :doc:`hipfort `,0.7.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.1,0.5.1,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0 + :doc:`hipRAND `,3.0.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.1,2.11.1,2.11.1,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16 + :doc:`hipSOLVER `,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0 + :doc:`hipSPARSE `,4.0.1,3.2.0,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0 + :doc:`hipSPARSELt `,0.2.4,0.2.3,0.2.3,0.2.3,0.2.3,0.2.2,0.2.2,0.2.2,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0 + :doc:`rocALUTION `,4.0.0,3.2.3,3.2.3,3.2.3,3.2.2,3.2.1,3.2.1,3.2.1,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3 + :doc:`rocBLAS 
`,5.0.0,4.4.1,4.4.1,4.4.0,4.4.0,4.3.0,4.3.0,4.3.0,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0 + :doc:`rocFFT `,1.0.34,1.0.32,1.0.32,1.0.32,1.0.32,1.0.31,1.0.31,1.0.31,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23 + :doc:`rocRAND `,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17 + :doc:`rocSOLVER `,3.30.0,3.28.2,3.28.2,3.28.0,3.28.0,3.27.0,3.27.0,3.27.0,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0 + :doc:`rocSPARSE `,4.0.2,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2 + :doc:`rocWMMA `,2.0.0,1.7.0,1.7.0,1.7.0,1.7.0,1.6.0,1.6.0,1.6.0,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0 + :doc:`Tensile `,4.44.0,4.43.0,4.43.0,4.43.0,4.43.0,4.42.0,4.42.0,4.42.0,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0 + ,,,,,,,,,,,,,,,,,,, + PRIMITIVES,.. 
_primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`hipCUB `,4.0.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 + :doc:`hipTensor `,2.0.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0 + :doc:`rocPRIM `,4.0.0,3.4.1,3.4.1,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 + :doc:`rocThrust `,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0 + ,,,,,,,,,,,,,,,,,,, + SUPPORT LIBS,,,,,,,,,,,,,,,,,,, + `hipother `_,7.0.51830,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 + `rocm-core `_,7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0 + `ROCT-Thunk-Interface `_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245 + ,,,,,,,,,,,,,,,,,,, + SYSTEM MGMT TOOLS,.. 
_tools-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`AMD SMI `,26.0.0,25.5.1,25.5.1,25.4.2,25.3.0,24.7.1,24.7.1,24.7.1,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2 + :doc:`ROCm Data Center Tool `,1.1.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0 + :doc:`rocminfo `,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 + :doc:`ROCm SMI `,7.8.0,7.7.0,7.5.0,7.5.0,7.5.0,7.4.0,7.4.0,7.4.0,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0 + :doc:`ROCm Validation Suite `,1.2.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60105,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000 + ,,,,,,,,,,,,,,,,,,, + PERFORMANCE TOOLS,,,,,,,,,,,,,,,,,,, + :doc:`ROCm Bandwidth Test `,2.6.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0 + :doc:`ROCm Compute Profiler `,3.2.3,3.1.1,3.1.1,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A,N/A + :doc:`ROCm Systems Profiler `,1.1.0,1.0.2,1.0.2,1.0.1,1.0.0,0.1.2,0.1.1,0.1.0,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A,N/A + :doc:`ROCProfiler `,2.0.70000,2.0.60403,2.0.60402,2.0.60401,2.0.60400,2.0.60303,2.0.60302,2.0.60301,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60105,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000 + :doc:`ROCprofiler-SDK `,1.0.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A,N/A + :doc:`ROCTracer `,4.1.70000,4.1.60403,4.1.60402,4.1.60401,4.1.60400,4.1.60303,4.1.60302,4.1.60301,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60105,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000 + ,,,,,,,,,,,,,,,,,,, + DEVELOPMENT TOOLS,,,,,,,,,,,,,,,,,,, + :doc:`HIPIFY 
`,20.0.0,19.0.0,19.0.0,19.0.0,19.0.0,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 + :doc:`ROCm CMake `,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0 + :doc:`ROCdbgapi `,0.77.3,0.77.2,0.77.2,0.77.2,0.77.2,0.77.0,0.77.0,0.77.0,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0 + :doc:`ROCm Debugger (ROCgdb) `,16.3.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0 + `rocprofiler-register `_,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,0.3.0,N/A,N/A + :doc:`ROCr Debug Agent `,2.1.0,2.0.4,2.0.4,2.0.4,2.0.4,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3 + ,,,,,,,,,,,,,,,,,,, + COMPILERS,.. 
_compilers-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + `clang-ocl `_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0 + :doc:`hipCC `,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 + `Flang `_,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 + :doc:`llvm-project `,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 + `OpenMP `_,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 +,,,,,,,,,,,,,,,,,,, + RUNTIMES,.. 
_runtime-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,, + :doc:`AMD CLR `,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 + :doc:`HIP `,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 + `OpenCL Runtime `_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0 + :doc:`ROCr Runtime `,1.18.0,1.15.0,1.15.0,1.15.0,1.15.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0 diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst index fb1ffad43..292cc5285 100644 --- a/docs/compatibility/compatibility-matrix.rst +++ b/docs/compatibility/compatibility-matrix.rst @@ -23,26 +23,29 @@ compatibility and system requirements. .. container:: format-big-table .. 
csv-table:: - :header: "ROCm Version", "6.4.3", "6.4.2", "6.3.0" + :header: "ROCm Version", "7.0.0", "6.4.3", "6.3.0" :stub-columns: 1 - :ref:`Operating systems & kernels `,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2 + :ref:`Operating systems & kernels `,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2 ,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5 ,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.5, 9.4" ,RHEL 8.10,RHEL 8.10,RHEL 8.10 - ,"SLES 15 SP7, SP6","SLES 15 SP7, SP6","SLES 15 SP6, SP5" - ,"Oracle Linux 9, 8 [#mi300x]_","Oracle Linux 9, 8 [#mi300x]_",Oracle Linux 8.10 [#mi300x]_ - ,Debian 12 [#single-node]_,Debian 12 [#single-node]_, - ,Azure Linux 3.0 [#mi300x]_,Azure Linux 3.0 [#mi300x]_, + ,SLES 15 SP7,"SLES 15 SP7, SP6","SLES 15 SP6, SP5" + ,"Oracle Linux 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-mi300x]_",Oracle Linux 8.10 [#ol-mi300x]_ + ,Debian 12,Debian 12 [#single-node]_, + ,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_, + ,Rocky Linux 9,, ,.. _architecture-support-compatibility-matrix:,, - :doc:`Architecture `,CDNA3,CDNA3,CDNA3 + :doc:`Architecture `,CDNA4,, + ,CDNA3,CDNA3,CDNA3 ,CDNA2,CDNA2,CDNA2 ,CDNA,CDNA,CDNA ,RDNA4,RDNA4, ,RDNA3,RDNA3,RDNA3 ,RDNA2,RDNA2,RDNA2 ,.. _gpu-support-compatibility-matrix:,, - :doc:`GPU / LLVM target `,gfx1201 [#RDNA-OS]_,gfx1201 [#RDNA-OS]_, + :doc:`GPU / LLVM target `,gfx950,, + ,gfx1201 [#RDNA-OS]_,gfx1201 [#RDNA-OS]_, ,gfx1200 [#RDNA-OS]_,gfx1200 [#RDNA-OS]_, ,gfx1101 [#RDNA-OS]_ [#7700XT-OS]_,gfx1101 [#RDNA-OS]_ [#7700XT-OS]_, ,gfx1100,gfx1100,gfx1100 @@ -52,113 +55,116 @@ compatibility and system requirements. ,gfx908,gfx908,gfx908 ,,, FRAMEWORK SUPPORT,.. 
_framework-support-compatibility-matrix:,, - :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 2.1, 2.0, 1.13" - :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1" - :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.4.35,0.4.35,0.4.31 + :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.7, 2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 2.1, 2.0, 1.13" + :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1" + :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.4.35,0.4.31 + :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat]_,N/A,N/A,N/A :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>`,N/A,N/A,85f95ae + :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,N/A :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>`,N/A,N/A,0.7.0 - `ONNX Runtime `_,1.2,1.2,1.17.3 + :doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat]_,N/A,N/A,N/A + `ONNX Runtime `_,1.22.0,1.20.0,1.17.3 ,,, THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix:,, - `UCC `_,>=1.3.0,>=1.3.0,>=1.3.0 - `UCX `_,>=1.15.0,>=1.15.0,>=1.15.0 + `UCC `_,>=1.4.0,>=1.3.0,>=1.3.0 + `UCX `_,>=1.17.0,>=1.15.0,>=1.15.0 ,,, THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix:,, - Thrust,2.5.0,2.5.0,2.3.2 - CUB,2.5.0,2.5.0,2.3.2 + Thrust,2.6.0,2.5.0,2.3.2 + CUB,2.6.0,2.5.0,2.3.2 ,,, KMD & USER SPACE [#kfd_support]_,.. 
_kfd-userspace-support-compatibility-matrix:,, - :doc:`KMD versions `,"6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x" + :doc:`KMD versions `,"30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x" ,,, ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix:,, :doc:`Composable Kernel `,1.1.0,1.1.0,1.1.0 - :doc:`MIGraphX `,2.12.0,2.12.0,2.11.0 - :doc:`MIOpen `,3.4.0,3.4.0,3.3.0 - :doc:`MIVisionX `,3.2.0,3.2.0,3.1.0 - :doc:`rocAL `,2.2.0,2.2.0,2.1.0 - :doc:`rocDecode `,0.10.0,0.10.0,0.8.0 - :doc:`rocJPEG `,0.8.0,0.8.0,0.6.0 - :doc:`rocPyDecode `,0.3.1,0.3.1,0.2.0 - :doc:`RPP `,1.9.10,1.9.10,1.9.1 + :doc:`MIGraphX `,2.13.0,2.12.0,2.11.0 + :doc:`MIOpen `,3.5.0,3.4.0,3.3.0 + :doc:`MIVisionX `,3.3.0,3.2.0,3.1.0 + :doc:`rocAL `,2.3.0,2.2.0,2.1.0 + :doc:`rocDecode `,1.0.0,0.10.0,0.8.0 + :doc:`rocJPEG `,1.1.0,0.8.0,0.6.0 + :doc:`rocPyDecode `,0.6.0,0.3.1,0.2.0 + :doc:`RPP `,2.0.0,1.9.10,1.9.1 ,,, COMMUNICATION,.. _commlibs-support-compatibility-matrix:,, - :doc:`RCCL `,2.22.3,2.22.3,2.21.5 - :doc:`rocSHMEM `,2.0.1,2.0.1,N/A + :doc:`RCCL `,2.26.6,2.22.3,2.21.5 + :doc:`rocSHMEM `,3.0.0,2.0.1,N/A ,,, MATH LIBS,.. 
_mathlibs-support-compatibility-matrix:,, `half `_ ,1.12.0,1.12.0,1.12.0 - :doc:`hipBLAS `,2.4.0,2.4.0,2.3.0 - :doc:`hipBLASLt `,0.12.1,0.12.1,0.10.0 - :doc:`hipFFT `,1.0.18,1.0.18,1.0.17 - :doc:`hipfort `,0.6.0,0.6.0,0.5.0 - :doc:`hipRAND `,2.12.0,2.12.0,2.11.0 - :doc:`hipSOLVER `,2.4.0,2.4.0,2.3.0 - :doc:`hipSPARSE `,3.2.0,3.2.0,3.1.2 - :doc:`hipSPARSELt `,0.2.3,0.2.3,0.2.2 - :doc:`rocALUTION `,3.2.3,3.2.3,3.2.1 - :doc:`rocBLAS `,4.4.1,4.4.1,4.3.0 - :doc:`rocFFT `,1.0.32,1.0.32,1.0.31 - :doc:`rocRAND `,3.3.0,3.3.0,3.2.0 - :doc:`rocSOLVER `,3.28.2,3.28.2,3.27.0 - :doc:`rocSPARSE `,3.4.0,3.4.0,3.3.0 - :doc:`rocWMMA `,1.7.0,1.7.0,1.6.0 - :doc:`Tensile `,4.43.0,4.43.0,4.42.0 + :doc:`hipBLAS `,3.0.0,2.4.0,2.3.0 + :doc:`hipBLASLt `,1.0.0,0.12.1,0.10.0 + :doc:`hipFFT `,1.0.20,1.0.18,1.0.17 + :doc:`hipfort `,0.7.0,0.6.0,0.5.0 + :doc:`hipRAND `,3.0.0,2.12.0,2.11.0 + :doc:`hipSOLVER `,3.0.0,2.4.0,2.3.0 + :doc:`hipSPARSE `,4.0.1,3.2.0,3.1.2 + :doc:`hipSPARSELt `,0.2.4,0.2.3,0.2.2 + :doc:`rocALUTION `,4.0.0,3.2.3,3.2.1 + :doc:`rocBLAS `,5.0.0,4.4.1,4.3.0 + :doc:`rocFFT `,1.0.34,1.0.32,1.0.31 + :doc:`rocRAND `,4.0.0,3.3.0,3.2.0 + :doc:`rocSOLVER `,3.30.0,3.28.2,3.27.0 + :doc:`rocSPARSE `,4.0.2,3.4.0,3.3.0 + :doc:`rocWMMA `,2.0.0,1.7.0,1.6.0 + :doc:`Tensile `,4.44.0,4.43.0,4.42.0 ,,, PRIMITIVES,.. _primitivelibs-support-compatibility-matrix:,, - :doc:`hipCUB `,3.4.0,3.4.0,3.3.0 - :doc:`hipTensor `,1.5.0,1.5.0,1.4.0 - :doc:`rocPRIM `,3.4.1,3.4.1,3.3.0 - :doc:`rocThrust `,3.3.0,3.3.0,3.3.0 + :doc:`hipCUB `,4.0.0,3.4.0,3.3.0 + :doc:`hipTensor `,2.0.0,1.5.0,1.4.0 + :doc:`rocPRIM `,4.0.0,3.4.1,3.3.0 + :doc:`rocThrust `,4.0.0,3.3.0,3.3.0 ,,, SUPPORT LIBS,,, - `hipother `_,6.4.43483,6.4.43483,6.3.42131 - `rocm-core `_,6.4.3,6.4.2,6.3.0 + `hipother `_,7.0.51830,6.4.43483,6.3.42131 + `rocm-core `_,7.0.0,6.4.3,6.3.0 `ROCT-Thunk-Interface `_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_ ,,, SYSTEM MGMT TOOLS,.. 
_tools-support-compatibility-matrix:,, - :doc:`AMD SMI `,25.5.1,25.5.1,24.7.1 - :doc:`ROCm Data Center Tool `,0.3.0,0.3.0,0.3.0 + :doc:`AMD SMI `,26.0.0,25.5.1,24.7.1 + :doc:`ROCm Data Center Tool `,1.1.0,0.3.0,0.3.0 :doc:`rocminfo `,1.0.0,1.0.0,1.0.0 - :doc:`ROCm SMI `,7.7.0,7.5.0,7.4.0 - :doc:`ROCm Validation Suite `,1.1.0,1.1.0,1.1.0 + :doc:`ROCm SMI `,7.8.0,7.7.0,7.4.0 + :doc:`ROCm Validation Suite `,1.2.0,1.1.0,1.1.0 ,,, PERFORMANCE TOOLS,,, - :doc:`ROCm Bandwidth Test `,1.4.0,1.4.0,1.4.0 - :doc:`ROCm Compute Profiler `,3.1.1,3.1.1,3.0.0 - :doc:`ROCm Systems Profiler `,1.0.2,1.0.2,0.1.0 - :doc:`ROCProfiler `,2.0.60403,2.0.60402,2.0.60300 - :doc:`ROCprofiler-SDK `,0.6.0,0.6.0,0.5.0 - :doc:`ROCTracer `,4.1.60403,4.1.60402,4.1.60300 + :doc:`ROCm Bandwidth Test `,2.6.0,1.4.0,1.4.0 + :doc:`ROCm Compute Profiler `,3.2.3,3.1.1,3.0.0 + :doc:`ROCm Systems Profiler `,1.1.0,1.0.2,0.1.0 + :doc:`ROCProfiler `,2.0.70000,2.0.60403,2.0.60300 + :doc:`ROCprofiler-SDK `,1.0.0,0.6.0,0.5.0 + :doc:`ROCTracer `,4.1.70000,4.1.60403,4.1.60300 ,,, DEVELOPMENT TOOLS,,, - :doc:`HIPIFY `,19.0.0,19.0.0,18.0.0.24455 + :doc:`HIPIFY `,20.0.0,19.0.0,18.0.0.24455 :doc:`ROCm CMake `,0.14.0,0.14.0,0.14.0 - :doc:`ROCdbgapi `,0.77.2,0.77.2,0.77.0 - :doc:`ROCm Debugger (ROCgdb) `,15.2.0,15.2.0,15.2.0 - `rocprofiler-register `_,0.4.0,0.4.0,0.4.0 - :doc:`ROCr Debug Agent `,2.0.4,2.0.4,2.0.3 + :doc:`ROCdbgapi `,0.77.3,0.77.2,0.77.0 + :doc:`ROCm Debugger (ROCgdb) `,16.3.0,15.2.0,15.2.0 + `rocprofiler-register `_,0.5.0,0.4.0,0.4.0 + :doc:`ROCr Debug Agent `,2.1.0,2.0.4,2.0.3 ,,, COMPILERS,.. 
_compilers-support-compatibility-matrix:,, - `clang-ocl `_,N/A,N/A,N/A :doc:`hipCC `,1.1.1,1.1.1,1.1.1 - `Flang `_,19.0.0.25224,19.0.0.25224,18.0.0.24455 - :doc:`llvm-project `,19.0.0.25224,19.0.0.25224,18.0.0.24491 - `OpenMP `_,19.0.0.25224,19.0.0.25224,18.0.0.24491 + `Flang `_,20.0.0.25314,19.0.0.25224,18.0.0.24455 + :doc:`llvm-project `,20.0.0.25314,19.0.0.25224,18.0.0.24491 + `OpenMP `_,20.0.0.25314,19.0.0.25224,18.0.0.24491 ,,, RUNTIMES,.. _runtime-support-compatibility-matrix:,, - :doc:`AMD CLR `,6.4.43484,6.4.43484,6.3.42131 - :doc:`HIP `,6.4.43484,6.4.43484,6.3.42131 + :doc:`AMD CLR `,7.0.51830,6.4.43484,6.3.42131 + :doc:`HIP `,7.0.51830,6.4.43484,6.3.42131 `OpenCL Runtime `_,2.0.0,2.0.0,2.0.0 - :doc:`ROCr Runtime `,1.15.0,1.15.0,1.14.0 - + :doc:`ROCr Runtime `,1.18.0,1.15.0,1.14.0 .. rubric:: Footnotes -.. [#mi300x] Oracle Linux and Azure Linux are supported only on AMD Instinct MI300X. +.. [#ol-700-mi300x] **For ROCm 7.0** - Oracle Linux 9 is supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is supported only on AMD Instinct MI300X. +.. [#ol-mi300x] **Prior to ROCm 7.0** - Oracle Linux is supported only on AMD Instinct MI300X. .. [#single-node] Debian 12 is supported only on AMD Instinct MI300X for single-node functionality. +.. [#az-mi300x] Starting with ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710. .. [#RDNA-OS] Radeon AI PRO R9700, Radeon RX 9070 XT (gfx1201), Radeon RX 9060 XT (gfx1200), Radeon PRO W7700 (gfx1101), and Radeon RX 7800 XT (gfx1101) are supported only on Ubuntu 24.04.2, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4. .. [#7700XT-OS] Radeon RX 7700 XT (gfx1101) is supported only on Ubuntu 24.04.2 and RHEL 9.6. -.. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD Kernel-mode GPU Driver (KMD) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases.
The tested user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and kernel-space support matrix `_. +.. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD Kernel-mode GPU Driver (KMD) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and kernel-space support matrix `_. .. [#ROCT-rocr] Starting from ROCm 6.3.0, the ROCT Thunk Interface is included as part of the ROCr runtime package. @@ -174,28 +180,30 @@ Use this lookup table to confirm which operating system and kernel versions are :widths: 40, 20, 30, 20 :stub-columns: 1 - `Ubuntu `_, 24.04.2, "6.8 GA, 6.11 HWE", 2.39 + `Ubuntu `_, 24.04.3, "6.8 [GA], 6.14 [HWE]", 2.39 ,, - `Ubuntu `_, 22.04.5, "5.15 GA, 6.8 HWE", 2.35 + `Ubuntu `_, 24.04.2, "6.8 [GA], 6.11 [HWE]", 2.39 ,, - `Red Hat Enterprise Linux (RHEL 9) `_, 9.6, 5.14+, 2.34 + `Ubuntu `_, 22.04.5, "5.15 [GA], 6.8 [HWE]", 2.35 + ,, + `Red Hat Enterprise Linux (RHEL 9) `_, 9.6, 5.14.0-570, 2.34 ,9.5, 5.14+, 2.34 - ,9.4, 5.14+, 2.34 - ,9.3, 5.14+, 2.34 + ,9.4, 5.14.0-427, 2.34 ,, - `Red Hat Enterprise Linux (RHEL 8) `_, 8.10, 4.18.0+, 2.28 - ,8.9, 4.18.0, 2.28 + `Red Hat Enterprise Linux (RHEL 8) `_, 8.10, 4.18.0-553, 2.28 ,, - `SUSE Linux Enterprise Server (SLES) `_, 15 SP7, 6.11.0+, 2.38 + `SUSE Linux Enterprise Server (SLES) `_, 15 SP7, 6.40-150700.51, 2.38 ,15 SP6, "6.5.0+, 6.4.0", 2.38 ,15 SP5, 5.14.21, 2.31 ,, - `Oracle Linux `_, 9, 5.15.0 (UEK), 2.35 + `Rocky Linux `_, 9, 5.14.0-570, 2.34 + ,, + `Oracle Linux `_, 9, 6.12.0 (UEK), 2.34 ,8, 5.15.0 (UEK), 2.28 ,, - `Debian `_,12, 6.1, 2.36 + `Debian `_,12, 6.1.0, 2.36 ,, - `Azure 
Linux `_,3.0, 6.6.60, 2.38 + `Azure Linux `_,3.0, 6.6.92, 2.38 ,, .. note:: @@ -228,8 +236,11 @@ Expand for full historical view of: .. rubric:: Footnotes - .. [#mi300x-past-60] Oracle Linux and Azure Linux are supported only on AMD Instinct MI300X. + .. [#ol-700-mi300x-past-60] **For ROCm 7.0** - Oracle Linux 9 is supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is supported only on AMD Instinct MI300X. + .. [#mi300x-past-60] **Prior to ROCm 7.0** - Oracle Linux is supported only on AMD Instinct MI300X. .. [#single-node-past-60] Debian 12 is supported only on AMD Instinct MI300X for single-node functionality. + .. [#az-mi300x-past-60] Starting with ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710. + .. [#az-mi300x-630-past-60] **Prior to ROCm 6.4.0** - Azure Linux 3.0 is supported only on AMD Instinct MI300X. .. [#RDNA-OS-past-60] Radeon AI PRO R9700, Radeon RX 9070 XT (gfx1201), Radeon RX 9060 XT (gfx1200), Radeon PRO W7700 (gfx1101), and Radeon RX 7800 XT (gfx1101) are supported only on Ubuntu 24.04.2, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4. .. [#7700XT-OS-past-60] Radeon RX 7700 XT (gfx1101) is supported only on Ubuntu 24.04.2 and RHEL 9.6. .. [#mi300_624-past-60] **For ROCm 6.2.4** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE]. diff --git a/docs/compatibility/ml-compatibility/jax-compatibility.rst b/docs/compatibility/ml-compatibility/jax-compatibility.rst index f85a3d722..cf2eabd6e 100644 --- a/docs/compatibility/ml-compatibility/jax-compatibility.rst +++ b/docs/compatibility/ml-compatibility/jax-compatibility.rst @@ -27,7 +27,7 @@ with ROCm support: - Offers AMD-validated and community :ref:`Docker images ` with ROCm and JAX preinstalled. - - ROCm JAX repository: `ROCm/jax `_ + - ROCm JAX repository: `ROCm/rocm-jax `_ - See the :doc:`ROCm JAX installation guide ` to get started.
@@ -310,5 +310,54 @@ For a complete and up-to-date list of JAX public modules (for example, ``jax.num Since version 0.1.56, JAX has full support for ROCm, and the :ref:`Known issues and important notes ` section contains details about limitations specific to the ROCm backend. The list of - JAX API modules is maintained by the JAX project and is subject to change. + JAX API modules is maintained by the JAX project and is subject to change. Refer to the official JAX documentation for the most up-to-date information. + +Key features and enhancements for ROCm 7.0 +=============================================================================== + +- Upgraded XLA backend: Integrates a newer XLA version, enabling better + optimizations, broader operator support, and potential performance gains. + +- RNN support: Native RNN support (including LSTMs via ``jax.experimental.rnn``) + is now available on ROCm, aiding sequence model development. + +- Comprehensive linear algebra capabilities: Offers robust ``jax.linalg`` + operations, essential for scientific and machine learning tasks. + +- Expanded AMD GPU architecture support: Provides ongoing support for gfx1101 + GPUs and introduces support for gfx950 and gfx12xx GPUs. + +- Mixed FP8 precision support: Enables ``lax.dot_general`` operations with mixed FP8 + types, offering pathways for memory and compute efficiency. + +- Streamlined PyPI packaging: Provides reliable PyPI wheels for JAX on ROCm, + simplifying the installation process. + +- Pallas experimental kernel development: Continued Pallas framework + enhancements for custom GPU kernels, including new intrinsics (specific + kernel behaviors under review). + +- Improved build system and CI: Enhanced ROCm build system and CI for greater + reliability and maintainability. + +- Enhanced distributed computing setup: Improved JAX setup in multi-GPU + distributed environments. + +..
_jax_comp_known_issues: + +Known issues and notes for ROCm 7.0 +=============================================================================== + +- ``nn.dot_product_attention``: Certain configurations of ``jax.nn.dot_product_attention`` + may cause segmentation faults, though the majority of use cases work correctly. + +- SVD with dynamic shapes: SVD on inputs with dynamic/symbolic shapes might result in an error. + SVD with static shapes is unaffected. + +- QR decomposition with symbolic shapes: QR decomposition operations may fail when using + symbolic/dynamic shapes in shape polymorphic contexts. + +- Pallas kernels: Specific advanced Pallas kernels may exhibit variations in + numerical output or resource usage. These are actively reviewed as part of + Pallas's experimental development. diff --git a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst b/docs/compatibility/ml-compatibility/pytorch-compatibility.rst index cd325c8c5..243afb022 100644 --- a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst +++ b/docs/compatibility/ml-compatibility/pytorch-compatibility.rst @@ -366,7 +366,8 @@ feature set available to developers. Supported modules and data types ================================================================================ -The following section outlines the supported data types, modules, and domain libraries available in PyTorch on ROCm. +The following section outlines the supported data types, modules, and domain +libraries available in PyTorch on ROCm. Supported data types -------------------------------------------------------------------------------- @@ -533,3 +534,72 @@ with ROCm. dispatching. **Note:** Only official release exists. 
+ +Key features and enhancements for PyTorch 2.7 with ROCm 7.0 +================================================================================ + +- Enhanced TunableOp framework: Introduces ``tensorfloat32`` support for + TunableOp operations, improved offline tuning for ScaledGEMM operations, + submatrix offline tuning capabilities, and better logging for BLAS operations + without bias vectors. + +- Expanded GPU architecture support: Provides optimized support for newer GPU + architectures, including gfx1200 and gfx1201 with preferred hipBLASLt backend + selection, along with improvements for gfx950 and gfx1100 series GPUs. + +- Advanced Triton Integration: AOTriton 0.10b introduces official support for + gfx950 and gfx1201, along with experimental support for gfx1101, gfx1151, + gfx1150, and gfx1200. + +- Improved element-wise kernel performance: Delivers enhanced vectorized + element-wise kernels with better support for heterogeneous tensor types and + optimized input vectorization for tensors with mixed data types. + +- MIOpen deep learning optimizations: Enables NHWC BatchNorm by default on + ROCm 7.0+, provides ``maxpool`` forward and backward performance improvements + targeting ResNet scenarios, and includes updated launch configurations for + better performance. + +- Enhanced memory and tensor operations: Features fixes for in-place ``aten`` + sum operations with specialized templated kernels, improved 3D tensor + performance with NHWC format, and better handling of memory-bound matrix + multiplication operations. + +- Robust testing and quality improvements: Includes comprehensive test suite + updates with improved tolerance handling for Navi3x architectures, generalized + ROCm-specific test conditions, and enhanced unit test coverage for Flash + Attention and Memory Efficient operations. 
+ +- Build system and infrastructure improvements: Provides updated CentOS Stream 9 + support, improved Docker configuration, migration to the public MAGMA repository, + and enhanced QA automation scripts for PyTorch unit testing. + +- Composable Kernel (CK) updates: Features updated CK submodule integration with + the latest optimizations and performance improvements for core mathematical + operations. + +- Development and debugging enhancements: Includes improved source handling for + dynamic compilation, better error handling for atomic operations, and enhanced + state checking for trace operations. + +- Integrates APEX fused layer normalization, which can have a positive impact on + text-to-video models. + +- Integrates APEX distributed fused LAMB and distributed fused ADAM, which can + have a positive impact on BERT-L and Llama2-SFT. + +- FlashAttention v3 has been integrated for AMD GPUs. + +- `PyTorch C++ extensions `_ + provide a mechanism for compiling custom operations that can be used during + network training or inference. For AMD platforms, ``amdclang++`` has been + validated as the supported compiler for building these extensions. + +Known issues and notes for PyTorch 2.7 with ROCm 7.0 +================================================================================ + +- The ``matmul.allow_fp16_reduced_precision_reduction`` and + ``matmul.allow_bf16_reduced_precision_reduction`` options under + ``torch.backends.cuda`` are not supported. As a result, + reduced-precision reductions using FP16 or BF16 accumulation types are not + available. diff --git a/docs/conceptual/gpu-arch.md b/docs/conceptual/gpu-arch.md index e60c6a653..f14a39421 100644 --- a/docs/conceptual/gpu-arch.md +++ b/docs/conceptual/gpu-arch.md @@ -21,7 +21,8 @@ architecture.
* [AMD Instinct™ MI300 microarchitecture](./gpu-arch/mi300.md) * [AMD Instinct MI300/CDNA3 ISA](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf) * [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf) -* [Performance counters](./gpu-arch/mi300-mi200-performance-counters.rst) +* [MI300 performance counters](./gpu-arch/mi300-mi200-performance-counters.rst) +* [MI350 series performance counters](./gpu-arch/mi350-performance-counters.rst) ::: :::{grid-item-card} diff --git a/docs/conceptual/gpu-arch/mi350-performance-counters.rst b/docs/conceptual/gpu-arch/mi350-performance-counters.rst new file mode 100644 index 000000000..fe103c17c --- /dev/null +++ b/docs/conceptual/gpu-arch/mi350-performance-counters.rst @@ -0,0 +1,530 @@ +.. meta:: + :description: MI355 series performance counters and metrics + :keywords: MI355, MI355X, MI3XX + +*********************************** +MI350 series performance counters +*********************************** + +This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 accelerators. These counters are available for profiling using `ROCprofiler-SDK `_ and `ROCm Compute Profiler `_. + +The following sections list the performance counters based on the IP blocks. + +Command processor packet processor counters (CPC) +================================================== + +.. list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - CPC_ALWAYS_COUNT + - Always count. + + * - CPC_ADC_VALID_CHUNK_NOT_AVAIL + - ADC valid chunk is not available when dispatch walking is in progress in the multi-xcc mode. + + * - CPC_ADC_DISPATCH_ALLOC_DONE + - ADC dispatch allocation is done. + + * - CPC_ADC_VALID_CHUNK_END + - ADC crawler's valid chunk end in the multi-xcc mode. 
+ + * - CPC_SYNC_FIFO_FULL_LEVEL + - SYNC FIFO full last cycles. + + * - CPC_SYNC_FIFO_FULL + - SYNC FIFO full times. + + * - CPC_GD_BUSY + - ADC busy. + + * - CPC_TG_SEND + - ADC thread group send. + + * - CPC_WALK_NEXT_CHUNK + - ADC walking next valid chunk in the multi-xcc mode. + + * - CPC_STALLED_BY_SE0_SPI + - ADC CSDATA stalled by SE0SPI. + + * - CPC_STALLED_BY_SE1_SPI + - ADC CSDATA stalled by SE1SPI. + + * - CPC_STALLED_BY_SE2_SPI + - ADC CSDATA stalled by SE2SPI. + + * - CPC_STALLED_BY_SE3_SPI + - ADC CSDATA stalled by SE3SPI. + + * - CPC_LTE_ALL + - CPC sync counter LteAll. Only Master XCD manages LteAll. + + * - CPC_SYNC_WRREQ_FIFO_BUSY + - CPC sync counter request FIFO is not empty. + + * - CPC_CANE_BUSY + - CPC CANE bus is busy, which indicates the presence of inflight sync counter requests. + + * - CPC_CANE_STALL + - CPC sync counter sending is stalled by CANE. + +Shader pipe interpolators (SPI) counters +========================================= + +.. list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - SPI_CS0_WINDOW_VALID + - Clock count enabled by PIPE0 perfcounter_start event. + + * - SPI_CS0_BUSY + - Number of clocks with outstanding waves for PIPE0 (SPI or SH). + + * - SPI_CS0_NUM_THREADGROUPS + - Number of thread groups launched for PIPE0. + + * - SPI_CS0_CRAWLER_STALL + - Number of clocks when PIPE0 event or wave order FIFO is full. + + * - SPI_CS0_EVENT_WAVE + - Number of PIPE0 events and waves. + + * - SPI_CS0_WAVE + - Number of PIPE0 waves. + + * - SPI_CS1_WINDOW_VALID + - Clock count enabled by PIPE1 perfcounter_start event. + + * - SPI_CS1_BUSY + - Number of clocks with outstanding waves for PIPE1 (SPI or SH). + + * - SPI_CS1_NUM_THREADGROUPS + - Number of thread groups launched for PIPE1. + + * - SPI_CS1_CRAWLER_STALL + - Number of clocks when PIPE1 event or wave order FIFO is full. + + * - SPI_CS1_EVENT_WAVE + - Number of PIPE1 events and waves. + + * - SPI_CS1_WAVE + - Number of PIPE1 waves. 
+ + * - SPI_CS2_WINDOW_VALID + - Clock count enabled by PIPE2 perfcounter_start event. + + * - SPI_CS2_BUSY + - Number of clocks with outstanding waves for PIPE2 (SPI or SH). + + * - SPI_CS2_NUM_THREADGROUPS + - Number of thread groups launched for PIPE2. + + * - SPI_CS2_CRAWLER_STALL + - Number of clocks when PIPE2 event or wave order FIFO is full. + + * - SPI_CS2_EVENT_WAVE + - Number of PIPE2 events and waves. + + * - SPI_CS2_WAVE + - Number of PIPE2 waves. + + * - SPI_CS3_WINDOW_VALID + - Clock count enabled by PIPE3 perfcounter_start event. + + * - SPI_CS3_BUSY + - Number of clocks with outstanding waves for PIPE3 (SPI or SH). + + * - SPI_CS3_NUM_THREADGROUPS + - Number of thread groups launched for PIPE3. + + * - SPI_CS3_CRAWLER_STALL + - Number of clocks when PIPE3 event or wave order FIFO is full. + + * - SPI_CS3_EVENT_WAVE + - Number of PIPE3 events and waves. + + * - SPI_CS3_WAVE + - Number of PIPE3 waves. + + * - SPI_CSQ_P0_Q0_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue0. + + * - SPI_CSQ_P0_Q1_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue1. + + * - SPI_CSQ_P0_Q2_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue2. + + * - SPI_CSQ_P0_Q3_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue3. + + * - SPI_CSQ_P0_Q4_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue4. + + * - SPI_CSQ_P0_Q5_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue5. + + * - SPI_CSQ_P0_Q6_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue6. + + * - SPI_CSQ_P0_Q7_OCCUPANCY + - Sum of occupancy info for PIPE0 Queue7. + + * - SPI_CSQ_P1_Q0_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue0. + + * - SPI_CSQ_P1_Q1_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue1. + + * - SPI_CSQ_P1_Q2_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue2. + + * - SPI_CSQ_P1_Q3_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue3. + + * - SPI_CSQ_P1_Q4_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue4. + + * - SPI_CSQ_P1_Q5_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue5. 
+ + * - SPI_CSQ_P1_Q6_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue6. + + * - SPI_CSQ_P1_Q7_OCCUPANCY + - Sum of occupancy info for PIPE1 Queue7. + + * - SPI_CSQ_P2_Q0_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue0. + + * - SPI_CSQ_P2_Q1_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue1. + + * - SPI_CSQ_P2_Q2_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue2. + + * - SPI_CSQ_P2_Q3_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue3. + + * - SPI_CSQ_P2_Q4_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue4. + + * - SPI_CSQ_P2_Q5_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue5. + + * - SPI_CSQ_P2_Q6_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue6. + + * - SPI_CSQ_P2_Q7_OCCUPANCY + - Sum of occupancy info for PIPE2 Queue7. + + * - SPI_CSQ_P3_Q0_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue0. + + * - SPI_CSQ_P3_Q1_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue1. + + * - SPI_CSQ_P3_Q2_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue2. + + * - SPI_CSQ_P3_Q3_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue3. + + * - SPI_CSQ_P3_Q4_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue4. + + * - SPI_CSQ_P3_Q5_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue5. + + * - SPI_CSQ_P3_Q6_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue6. + + * - SPI_CSQ_P3_Q7_OCCUPANCY + - Sum of occupancy info for PIPE3 Queue7. + + * - SPI_CSQ_P0_OCCUPANCY + - Sum of occupancy info for all PIPE0 queues. + + * - SPI_CSQ_P1_OCCUPANCY + - Sum of occupancy info for all PIPE1 queues. + + * - SPI_CSQ_P2_OCCUPANCY + - Sum of occupancy info for all PIPE2 queues. + + * - SPI_CSQ_P3_OCCUPANCY + - Sum of occupancy info for all PIPE3 queues. + + * - SPI_VWC0_VDATA_VALID_WR + - Number of clocks VGPR bus_0 writes VGPRs. + + * - SPI_VWC1_VDATA_VALID_WR + - Number of clocks VGPR bus_1 writes VGPRs. + + * - SPI_CSC_WAVE_CNT_BUSY + - Number of cycles when there is any wave in the pipe. + +Compute unit (SQ) counters +=========================== + +.. 
list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - SQ_INSTS_VALU_MFMA_F6F4 + - Number of VALU V_MFMA_*_F6F4 instructions. + + * - SQ_INSTS_VALU_MFMA_MOPS_F6F4 + - Number of VALU matrix with the performed math operations (add or mul) divided by 512, assuming a full EXEC mask of F6 or F4 data type. + + * - SQ_ACTIVE_INST_VALU2 + - Number of quad-cycles when two VALU instructions are issued (per-simd, nondeterministic). + + * - SQ_INSTS_LDS_LOAD + - Number of LDS load instructions issued (per-simd, emulated). + + * - SQ_INSTS_LDS_STORE + - Number of LDS store instructions issued (per-simd, emulated). + + * - SQ_INSTS_LDS_ATOMIC + - Number of LDS atomic instructions issued (per-simd, emulated). + + * - SQ_INSTS_LDS_LOAD_BANDWIDTH + - Total number of 64-bytes loaded (instrSize * CountOnes(EXEC))/64 (per-simd, emulated). + + * - SQ_INSTS_LDS_STORE_BANDWIDTH + - Total number of 64-bytes written (instrSize * CountOnes(EXEC))/64 (per-simd, emulated). + + * - SQ_INSTS_LDS_ATOMIC_BANDWIDTH + - Total number of 64-bytes atomic (instrSize * CountOnes(EXEC))/64 (per-simd, emulated). + + * - SQ_INSTS_VALU_FLOPS_FP16 + - Counts FLOPS per instruction on float 16 excluding MFMA/SMFMA. + + * - SQ_INSTS_VALU_FLOPS_FP32 + - Counts FLOPS per instruction on float 32 excluding MFMA/SMFMA. + + * - SQ_INSTS_VALU_FLOPS_FP64 + - Counts FLOPS per instruction on float 64 excluding MFMA/SMFMA. + + * - SQ_INSTS_VALU_FLOPS_FP16_TRANS + - Counts FLOPS per instruction on float 16 trans excluding MFMA/SMFMA. + + * - SQ_INSTS_VALU_FLOPS_FP32_TRANS + - Counts FLOPS per instruction on float 32 trans excluding MFMA/SMFMA. + + * - SQ_INSTS_VALU_FLOPS_FP64_TRANS + - Counts FLOPS per instruction on float 64 trans excluding MFMA/SMFMA. + + * - SQ_INSTS_VALU_IOPS + - Counts OPS per instruction on integer or unsigned or bit data (per-simd, emulated). + + * - SQ_LDS_DATA_FIFO_FULL + - Number of cycles LDS data FIFO is full (nondeterministic, unwindowed). 
+ + * - SQ_LDS_CMD_FIFO_FULL + - Number of cycles LDS command FIFO is full (nondeterministic, unwindowed). + + * - SQ_VMEM_TA_ADDR_FIFO_FULL + - Number of cycles texture requests are stalled due to full address FIFO in TA (nondeterministic, unwindowed). + + * - SQ_VMEM_TA_CMD_FIFO_FULL + - Number of cycles texture requests are stalled due to full cmd FIFO in TA (nondeterministic, unwindowed). + + * - SQ_VMEM_WR_TA_DATA_FIFO_FULL + - Number of cycles texture writes are stalled due to full data FIFO in TA (nondeterministic, unwindowed). + + * - SQC_ICACHE_MISSES_DUPLICATE + - Number of duplicate misses (access to a non-resident, miss pending CL) (per-SQ, per-Bank, nondeterministic). + + * - SQC_DCACHE_MISSES_DUPLICATE + - Number of duplicate misses (access to a non-resident, miss pending CL) (per-SQ, per-Bank, nondeterministic). + +Texture addressing (TA) unit counters +====================================== + +.. list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - TA_BUFFER_READ_LDS_WAVEFRONTS + - Number of buffer read wavefronts for LDS return processed by the TA. + + * - TA_FLAT_READ_LDS_WAVEFRONTS + - Number of flat opcode reads for LDS return processed by the TA. + +Texture data (TD) unit counters +================================ + +.. list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - TD_WRITE_ACKT_WAVEFRONT + - Number of write acknowledgments, sent to SQ and not to SP. + + * - TD_TD_SP_TRAFFIC + - Number of times this TD sends data to the SP. + +Texture cache per pipe (TCP) counters +====================================== + +.. list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - TCP_TCP_TA_ADDR_STALL_CYCLES + - TCP stalls TA addr interface. + + * - TCP_TCP_TA_DATA_STALL_CYCLES + - TCP stalls TA data interface. Now windowed. + + * - TCP_LFIFO_STALL_CYCLES + - Memory latency FIFOs full stall. + + * - TCP_RFIFO_STALL_CYCLES + - Memory Request FIFOs full stall. 
+ + * - TCP_TCR_RDRET_STALL + - Write into cache stalled by read return from TCR. + + * - TCP_PENDING_STALL_CYCLES + - Stall due to data pending from L2. + + * - TCP_UTCL1_SERIALIZATION_STALL + - Total number of stalls caused due to serializing translation requests through the UTCL1. + + * - TCP_UTCL1_THRASHING_STALL + - Stall caused by thrashing feature in any probe. Lacks accuracy when the stall signal overlaps between probe0 and probe1, which is worse with MECO of thrashing deadlock. Some probe0 events could miss being counted in with MECO on. This perf count provides a rough thrashing estimate. + + * - TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS + - Translation miss_under_miss. + + * - TCP_UTCL1_STALL_INFLIGHT_MAX + - Total UTCL1 stalls due to inflight counter saturation. + + * - TCP_UTCL1_STALL_LRU_INFLIGHT + - Total UTCL1 stalls due to LRU cache line with inflight traffic. + + * - TCP_UTCL1_STALL_MULTI_MISS + - Total UTCL1 stalls due to arbitrated multiple misses. + + * - TCP_UTCL1_LFIFO_FULL + - Total UTCL1 and UTCL2 latency, which hides FIFO full cycles. + + * - TCP_UTCL1_STALL_LFIFO_NOT_RES + - Total UTCL1 stalls due to UTCL2 latency, which hides FIFO output (not resident). + + * - TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS + - Total UTCL1 stalls due to UTCL2_req being out of credits. + + * - TCP_CLIENT_UTCL1_INFLIGHT + - The sum of inflight client to UTCL1 requests per cycle. + + * - TCP_TAGRAM0_REQ + - Total L2 requests mapping to TagRAM 0 from this TCP to all TCCs. + + * - TCP_TAGRAM1_REQ + - Total L2 requests mapping to TagRAM 1 from this TCP to all TCCs. + + * - TCP_TAGRAM2_REQ + - Total L2 requests mapping to TagRAM 2 from this TCP to all TCCs. + + * - TCP_TAGRAM3_REQ + - Total L2 requests mapping to TagRAM 3 from this TCP to all TCCs. + + * - TCP_TCP_LATENCY + - Total TCP wave latency (from the first clock of wave entering to the first clock of wave leaving). Divide by TA_TCP_STATE_READ to find average wave latency. 
+ + * - TCP_TCC_READ_REQ_LATENCY + - Total TCP to TCC request latency for reads and atomics with return. Not Windowed. + + * - TCP_TCC_WRITE_REQ_LATENCY + - Total TCP to TCC request latency for writes and atomics without return. Not Windowed. + + * - TCP_TCC_WRITE_REQ_HOLE_LATENCY + - Total TCP req to TCC hole latency for writes and atomics. Not Windowed. + +Texture cache per channel (TCC) counters +========================================= + +.. list-table:: + :header-rows: 1 + + * - Hardware counter + - Definition + + * - TCC_READ_SECTORS + - Total number of 32B data sectors in read requests. + + * - TCC_WRITE_SECTORS + - Total number of 32B data sectors in write requests. + + * - TCC_ATOMIC_SECTORS + - Total number of 32B data sectors in atomic requests. + + * - TCC_BYPASS_REQ + - Number of bypass requests. This is measured at the tag block. + + * - TCC_LATENCY_FIFO_FULL + - Number of cycles when the latency FIFO is full. + + * - TCC_SRC_FIFO_FULL + - Number of cycles when the SRC FIFO is assumed to be full as measured at the IB block. + + * - TCC_EA0_RDREQ_64B + - Number of 64-byte TCC/EA read requests. + + * - TCC_EA0_RDREQ_128B + - Number of 128-byte TCC/EA read requests. + + * - TCC_IB_REQ + - Number of requests through the IB. This measures the number of raw requests from graphics clients to this TCC. + + * - TCC_IB_STALL + - Number of cycles when the IB output is stalled. + + * - TCC_EA0_WRREQ_WRITE_DRAM + - Number of TCC/EA write requests (32-byte or 64-byte) destined for DRAM (MC). + + * - TCC_EA0_WRREQ_ATOMIC_DRAM + - Number of TCC/EA atomic requests (32-byte or 64-byte) destined for DRAM (MC). + + * - TCC_EA0_RDREQ_DRAM_32B + - Number of 32-byte TCC/EA read requests due to DRAM traffic. One 64-byte request is counted as two and one 128-byte as four. + + * - TCC_EA0_RDREQ_GMI_32B + - Number of 32-byte TCC/EA read requests due to GMI traffic. One 64-byte request is counted as two and one 128-byte as four. 
+ + * - TCC_EA0_RDREQ_IO_32B + - Number of 32-byte TCC/EA read requests due to IO traffic. One 64-byte request is counted as two and one 128-byte as four. + + * - TCC_EA0_WRREQ_WRITE_DRAM_32B + - Number of 32-byte TCC/EA write requests due to DRAM traffic. One 64-byte request is counted as two. + + * - TCC_EA0_WRREQ_ATOMIC_DRAM_32B + - Number of 32-byte TCC/EA atomic requests due to DRAM traffic. One 64-byte request is counted as two. + + * - TCC_EA0_WRREQ_WRITE_GMI_32B + - Number of 32-byte TCC/EA write requests due to GMI traffic. One 64-byte request is counted as two. + + * - TCC_EA0_WRREQ_ATOMIC_GMI_32B + - Number of 32-byte TCC/EA atomic requests due to GMI traffic. One 64-byte request is counted as two. + + * - TCC_EA0_WRREQ_WRITE_IO_32B + - Number of 32-byte TCC/EA write requests due to IO traffic. One 64-byte request is counted as two. + + * - TCC_EA0_WRREQ_ATOMIC_IO_32B + - Number of 32-byte TCC/EA atomic requests due to IO traffic. One 64-byte request is counted as two. diff --git a/docs/conf.py b/docs/conf.py index f852b6697..208063655 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -89,15 +89,15 @@ project = "ROCm Documentation" project_path = os.path.abspath(".").replace("\\", "/") author = "Advanced Micro Devices, Inc." copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved." 
-version = "6.4.3" -release = "6.4.3" +version = "7.0.0" +release = "7.0.0" setting_all_article_info = True all_article_info_os = ["linux", "windows"] all_article_info_author = "" # pages with specific settings article_pages = [ - {"file": "about/release-notes", "os": ["linux"], "date": "2025-08-07"}, + {"file": "about/release-notes", "os": ["linux"], "date": "2025-08-28"}, {"file": "release/changelog", "os": ["linux"],}, {"file": "compatibility/compatibility-matrix", "os": ["linux"]}, {"file": "compatibility/ml-compatibility/pytorch-compatibility", "os": ["linux"]}, diff --git a/docs/data/about/compatibility/floating-point-data-types.png b/docs/data/about/compatibility/floating-point-data-types.png index b59b40be4..87c3afe29 100644 Binary files a/docs/data/about/compatibility/floating-point-data-types.png and b/docs/data/about/compatibility/floating-point-data-types.png differ diff --git a/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv index cb909bbb0..b375886eb 100644 --- a/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv @@ -1,325 +1,325 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X,MI300A -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ 
CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit 
float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 
-64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicMax,❌ NOP,❌ NOP,❌ 
NOP,✅ CAS,✅ CAS -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -16bx2 half2 
atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit 
atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ 
CAS,✅ CAS -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ CAS -32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ CAS -64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS -32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -16bx2 half2 atomicAdd,❌ 
NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float 
atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ 
CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - 
CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ Native,⚠️ Scope Downgrade - CAS +32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ Native,⚠️ Scope Downgrade - CAS +64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ 
CAS,⚠️ Scope Downgrade - CAS +64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,❌ NOP,❌ 
NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ CAS,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ CAS,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicSub,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicInc,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicDec,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope 
Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atoimcExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ Native,⚠️ Scope Downgrade - CAS +32 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicExch,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicCAS,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAnd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicOr,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit atomicXor,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS diff --git a/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv index 74bbfed10..2e0e40fc1 100644 --- a/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv @@ -1,325 +1,325 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X,MI300A -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float 
atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit 
atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ 
Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float 
atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS -32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ 
CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit 
atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit 
atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 
+64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ 
CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicInc,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicDec,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 
+32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 half2 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atoimcExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicExch,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicOr,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit atomicXor,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS diff --git a/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv index 18f0bf55c..483684089 100644 --- a/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv @@ -1,325 +1,325 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X,MI300A -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native 
-32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ 
Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit 
atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ 
Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit 
atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float 
atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade,✅ Native -32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ 
Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicSub,❌ 
NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS -32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS -64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade,✅ Native -32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit 
atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit 
atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit 
atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,❌ NOP,❌ NOP,✅ 
Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope 
Downgrade +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ 
Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ 
Native,✅ Native,✅ Native +64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAdd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicMax,❌ 
NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicMin,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +32 bit float atomicMax,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade - CAS,✅ CAS,⚠️ Scope Downgrade - CAS +64 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMin,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMax,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 bfloat162 atomicAdd,❌ NOP,❌ NOP,✅ CAS,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atoimcExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicExch,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicCAS,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade diff --git a/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv index cf4136864..5ea596069 100644 --- a/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv +++ 
b/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv @@ -1,325 +1,325 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X,MI300A -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ 
NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,❌ NOP,❌ 
NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,⚠️ Scope Downgrade,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,✅ 
NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 half2 
atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native -32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS -64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native -16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,⚠️ Scope Downgrade,✅ Native -32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ 
Scope Downgrade,✅ Native -32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native -64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native -64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 
bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit 
atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ 
Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit 
atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit 
atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope 
Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,✅ NoReturn,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,✅ Native,✅ Native,✅ 
Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ Native,✅ Native,✅ Native +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicOr,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ 
Native +32 bit atomicSub,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicInc,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicDec,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicMin,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicMax,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit float atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +32 bit float atomicMax,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS +64 bit float atomicAdd,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMin,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit float atomicMax,✅ CAS,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 half2 atomicAdd,❌ NOP,❌ NOP,❌ NOP,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +16bx2 bfloat162 atomicAdd,✅ CAS,✅ CAS,✅ CAS,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atoimcExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +32 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicOr,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +32 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicExch,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicCAS,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native +64 bit atomicAnd,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicOr,❌ NOP,❌ 
NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade +64 bit atomicXor,❌ NOP,❌ NOP,✅ Native,⚠️ Scope Downgrade,✅ Native,⚠️ Scope Downgrade diff --git a/docs/data/reference/precision-support/precision-support.yaml b/docs/data/reference/precision-support/precision-support.yaml new file mode 100644 index 000000000..6f4773319 --- /dev/null +++ b/docs/data/reference/precision-support/precision-support.yaml @@ -0,0 +1,391 @@ +# rocm-library-support.yaml +library_groups: + - group: "ML & Computer Vision" + tag: "ml-cv" + libraries: + - name: "Composable Kernel" + tag: "composable-kernel" + doc_link: "composable_kernel:reference/Composable_Kernel_supported_scalar_types" + data_types: + - type: "int8" + support: "✅" + - type: "int32" + support: "✅" + - type: "float4" + support: "✅" + - type: "float6 (E2M3)" + support: "✅" + - type: "float6 (E3M2)" + support: "✅" + - type: "float8 (E4M3)" + support: "✅" + - type: "float8 (E5M2)" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "MIGraphX" + tag: "migraphx" + doc_link: "amdmigraphx:reference/cpp" + data_types: + - type: "int8" + support: "⚠️" + - type: "int16" + support: "✅" + - type: "int32" + support: "✅" + - type: "int64" + support: "✅" + - type: "float8 (E4M3)" + support: "✅" + - type: "float8 (E5M2)" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "MIOpen" + tag: "miopen" + doc_link: "miopen:reference/datatypes" + data_types: + - type: "int8" + support: "⚠️" + - type: "int32" + support: "⚠️" + - type: "float8 (E4M3)" + support: "⚠️" + - type: "float8 (E5M2)" + support: "⚠️" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "⚠️" + - type: "float32" + support: "✅" + - type: "float64" + support: "⚠️" + + - group: "Communication" + tag: "communication" + libraries: + 
- name: "RCCL" + tag: "rccl" + doc_link: "rccl:api-reference/library-specification" + data_types: + - type: "int8" + support: "✅" + - type: "int32" + support: "✅" + - type: "int64" + support: "✅" + - type: "float8 (E4M3)" + support: "✅" + - type: "float8 (E5M2)" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - group: "Math Libraries" + tag: "math-libs" + libraries: + - name: "hipBLAS" + tag: "hipblas" + doc_link: "hipblas:reference/data-type-support" + data_types: + - type: "float16" + support: "⚠️" + - type: "bfloat16" + support: "⚠️" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "hipBLASLt" + tag: "hipblaslt" + doc_link: "hipblaslt:reference/data-type-support" + data_types: + - type: "int8" + support: "✅" + - type: "float4" + support: "✅" + - type: "float6 (E2M3)" + support: "✅" + - type: "float6 (E3M2)" + support: "✅" + - type: "float8 (E4M3)" + support: "✅" + - type: "float8 (E5M2)" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + + - name: "hipFFT" + tag: "hipfft" + doc_link: "hipfft:reference/fft-api-usage" + data_types: + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "hipRAND" + tag: "hiprand" + doc_link: "hiprand:api-reference/data-type-support" + data_types: + - type: "int8" + support: "Output only" + - type: "int16" + support: "Output only" + - type: "int32" + support: "Output only" + - type: "int64" + support: "Output only" + - type: "float16" + support: "Output only" + - type: "float32" + support: "Output only" + - type: "float64" + support: "Output only" + + - name: "hipSOLVER" + tag: "hipsolver" + doc_link: "hipsolver:reference/precision" + data_types: + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "hipSPARSE" + tag: "hipsparse" + doc_link: 
"hipsparse:reference/precision" + data_types: + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "hipSPARSELt" + tag: "hipsparselt" + doc_link: "hipsparselt:reference/data-type-support" + data_types: + - type: "int8" + support: "✅" + - type: "float8 (E4M3)" + support: "✅" + - type: "float8 (E5M2)" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + + - name: "rocBLAS" + tag: "rocblas" + doc_link: "rocblas:reference/data-type-support" + data_types: + - type: "float16" + support: "⚠️" + - type: "bfloat16" + support: "⚠️" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "rocFFT" + tag: "rocfft" + doc_link: "rocfft:reference/api" + data_types: + - type: "float16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "rocRAND" + tag: "rocrand" + doc_link: "rocrand:api-reference/data-type-support" + data_types: + - type: "int8" + support: "Output only" + - type: "int16" + support: "Output only" + - type: "int32" + support: "Output only" + - type: "int64" + support: "Output only" + - type: "float16" + support: "Output only" + - type: "float32" + support: "Output only" + - type: "float64" + support: "Output only" + + - name: "rocSOLVER" + tag: "rocsolver" + doc_link: "rocsolver:reference/precision" + data_types: + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "rocSPARSE" + tag: "rocsparse" + doc_link: "rocsparse:reference/precision" + data_types: + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "rocWMMA" + tag: "rocwmma" + doc_link: "rocwmma:api-reference/api-reference-guide" + data_types: + - type: "int8" + support: "✅" + - type: "int32" + support: "Output only" + - type: "float8 (E4M3)" + support: "Input only" + - type: "float8 (E5M2)" + support: "Input only" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + 
- type: "tensorfloat32" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "Tensile" + tag: "tensile" + doc_link: "tensile:reference/precision-support" + data_types: + - type: "int8" + support: "✅" + - type: "int32" + support: "✅" + - type: "float8 (E4M3)" + support: "✅" + - type: "float8 (E5M2)" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "tensorfloat32" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - group: "Primitives" + tag: "primitives" + libraries: + - name: "hipCUB" + tag: "hipcub" + doc_link: "hipcub:api-reference/data-type-support" + data_types: + - type: "int8" + support: "✅" + - type: "int16" + support: "✅" + - type: "int32" + support: "✅" + - type: "int64" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "hipTensor" + tag: "hiptensor" + doc_link: "hiptensor:api-reference/api-reference" + data_types: + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "rocPRIM" + tag: "rocprim" + doc_link: "rocprim:reference/data-type-support" + data_types: + - type: "int8" + support: "✅" + - type: "int16" + support: "✅" + - type: "int32" + support: "✅" + - type: "int64" + support: "✅" + - type: "float16" + support: "✅" + - type: "bfloat16" + support: "✅" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" + + - name: "rocThrust" + tag: "rocthrust" + doc_link: "rocthrust:data-type-support" + data_types: + - type: "int8" + support: "✅" + - type: "int16" + support: "✅" + - type: "int32" + support: "✅" + - type: "int64" + support: "✅" + - type: "float16" + support: "⚠️" + - type: "bfloat16" + support: "⚠️" + - type: "float32" + support: "✅" + - type: "float64" + support: "✅" diff --git 
a/docs/data/rocm-software-stack-7_0_0.jpg b/docs/data/rocm-software-stack-7_0_0.jpg new file mode 100644 index 000000000..dd15df471 Binary files /dev/null and b/docs/data/rocm-software-stack-7_0_0.jpg differ diff --git a/docs/index.md b/docs/index.md index 3a134adb9..5a784c73d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -65,7 +65,7 @@ ROCm documentation is organized into the following categories: * [ROCm libraries](./reference/api-libraries.md) * [ROCm tools, compilers, and runtimes](./reference/rocm-tools.md) * [Accelerator and GPU hardware specifications](./reference/gpu-arch-specs.rst) -* [Precision support](./reference/precision-support.rst) +* [Data types and precision support](./reference/precision-support.rst) * [Graph safe support](./reference/graph-safe-support.rst) ::: diff --git a/docs/reference/gpu-arch-specs.rst b/docs/reference/gpu-arch-specs.rst index ea70ef70a..7d7ef607c 100644 --- a/docs/reference/gpu-arch-specs.rst +++ b/docs/reference/gpu-arch-specs.rst @@ -34,6 +34,40 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil - SGPR File (KiB) - GFXIP Major version - GFXIP Minor version + * + - MI355X + - CDNA4 + - gfx950 + - 288 + - 256 (32 per XCD) + - 64 + - 160 + - 256 + - 32 (4 per XCD) + - 32 + - 16 per 2 CUs + - 64 per 2 CUs + - 512 + - 12.5 + - 9 + - 5 + * + - MI350X + - CDNA4 + - gfx950 + - 288 + - 256 (32 per XCD) + - 64 + - 160 + - 256 + - 32 (4 per XCD) + - 32 + - 16 per 2 CUs + - 64 per 2 CUs + - 512 + - 12.5 + - 9 + - 5 * - MI325X - CDNA3 diff --git a/docs/reference/gpu-atomics-operation.rst b/docs/reference/gpu-atomics-operation.rst index dddbef0c0..faa9a8320 100644 --- a/docs/reference/gpu-atomics-operation.rst +++ b/docs/reference/gpu-atomics-operation.rst @@ -14,16 +14,26 @@ completed as an indivisible unit, preventing race conditions where simultaneous access to the same memory location could lead to incorrect or undefined behavior. 
-This document details the various support of atomic read-modify-write -(atomicRMW) operations on gfx9, gfx10, gfx11, gfx12, MI100, MI200 and MI300 AMD -GPUs. The atomics operation type behavior effected by the memory locations, -memory granularity or scope of operations. +This topic summarizes the support of atomic read-modify-write +(atomicRMW) operations on AMD GPUs and accelerators. This includes gfx9, gfx10, +gfx11, and gfx12 targets and the following Instinct™ series: + +- MI100 + +- MI200 + +- MI300 + +- MI350 + +The atomics operation type behavior is affected by the memory locations, memory +granularity, and scope of operations. Memory locations: -- :ref:`Device memory `, i.e. VRAM, the RAM on a discrete GPU - device or in framebuffer carveout for APUs. This includes peer-device memory - within an Infinity Fabric™ hive. +- :ref:`Device memory `, that is, VRAM, the RAM on a discrete + GPU device or in framebuffer carveout for APUs. This includes peer-device + memory within an Infinity Fabric™ hive. - :ref:`Host memory `: in DRAM associated with the CPU (or peer device memory using PCIe® (PCI Express) peer-to-peer). This can be two sub-types: @@ -69,10 +79,10 @@ Scopes of operations: Support summary ================================================================================ -AMD Instinct™ accelerators +AMD Instinct accelerators -------------------------------------------------------------------------------- -**MI300** +**MI300 and MI350 series** - All atomicRMW operations are forwarded out to the Infinity Fabric. - Infinity Fabric supports common integer and bitwise atomics, FP32 atomic add, @@ -85,7 +95,7 @@ AMD Instinct™ accelerators It will seem like atomics to the wave, but the CPU sees it as a non-atomic load-op-store sequence. This downgrades system-scope atomics to device-scope. -**MI200** +**MI200 series** - L2 cache and Infinity Fabric both support common integer and bitwise atomics. 
- L2 cache supports FP32 atomic add, packed-FP16 atomic add, and FP64 add, diff --git a/docs/reference/precision-support.rst b/docs/reference/precision-support.rst index 4f5be7e33..8ee81e4b3 100644 --- a/docs/reference/precision-support.rst +++ b/docs/reference/precision-support.rst @@ -9,8 +9,8 @@ Data types and precision support ************************************************************* -This topic lists the data types support on AMD GPUs, ROCm libraries along -with corresponding :doc:`HIP ` data types. +This topic summarizes the data types supported on AMD GPUs and accelerators and +ROCm libraries, along with corresponding :doc:`HIP ` data types. Integral types ============== @@ -61,18 +61,38 @@ The floating-point types supported by ROCm are listed in the following table. - Type name - HIP type - Description + + * + - float4 (E2M1) + - | ``__hip_fp4_e2m1`` + - A 4-bit floating-point number with **E2M1** bit layout, as described + in :doc:`low precision floating point types page `. + + * + - float6 (E3M2) + - | ``__hip_fp6_e3m2`` + - A 6-bit floating-point number with **E3M2** bit layout, as described + in :doc:`low precision floating point types page `. + + * + - float6 (E2M3) + - | ``__hip_fp6_e2m3`` + - A 6-bit floating-point number with **E2M3** bit layout, as described + in :doc:`low precision floating point types page `. + * - float8 (E4M3) - | ``__hip_fp8_e4m3_fnuz``, | ``__hip_fp8_e4m3`` - - An 8-bit floating-point number with **S1E4M3** bit layout, as described in :doc:`low precision floating point types page `. + - An 8-bit floating-point number with **E4M3** bit layout, as described in :doc:`low precision floating point types page `. The FNUZ variant has expanded range with no infinity or signed zero (NaN represented as negative zero), while the OCP variant follows the Open Compute Project specification. 
+ * - float8 (E5M2) - | ``__hip_fp8_e5m2_fnuz``, | ``__hip_fp8_e5m2`` - - An 8-bit floating-point number with **S1E5M2** bit layout, as described in :doc:`low precision floating point types page `. + - An 8-bit floating-point number with **E5M2** bit layout, as described in :doc:`low precision floating point types page `. The FNUZ variant has expanded range with no infinity or signed zero (NaN represented as negative zero), while the OCP variant follows the Open Compute Project specification. @@ -81,22 +101,26 @@ The floating-point types supported by ROCm are listed in the following table. - ``half`` - A 16-bit floating-point number that conforms to the IEEE 754-2008 half-precision storage format. + * - bfloat16 - ``bfloat16`` - A shortened 16-bit version of the IEEE 754 single-precision storage format. + * - tensorfloat32 - Not available - A floating-point number that occupies 32 bits or less of storage, providing improved range compared to half (16-bit) format, at (potentially) greater throughput than single-precision (32-bit) formats. + * - float32 - ``float`` - A 32-bit floating-point number that conforms to the IEEE 754 single-precision storage format. + * - float64 - ``double`` @@ -108,8 +132,8 @@ The floating-point types supported by ROCm are listed in the following table. * The float8 and tensorfloat32 types are internal types used in calculations in Matrix Cores and can be stored in any type of the same size. - * CNDA3 natively supports FP8 FNUZ (E4M3 and E5M2), which differs from the customised - FP8 format used in NVIDIA's H100 + * CDNA3 natively supports FP8 FNUZ (E4M3 and E5M2), which differs from the customized + FP8 format used with NVIDIA H100 (`FP8 Formats for Deep Learning `_). * In some AMD documents and articles, float8 (E5M2) is referred to as bfloat8. 
@@ -168,11 +192,13 @@ Data type support by hardware architecture AMD's GPU lineup spans multiple architecture generations: -* CDNA1 architecture: includes models such as MI100 -* CDNA2 architecture: includes models such as MI210, MI250, and MI250X -* CDNA3 architecture: includes models such as MI300A, MI300X, and MI325X -* RDNA3 architecture: includes models such as RX 7900XT and RX 7900XTX -* RDNA4 architecture: includes models such as RX 9070 and RX 9070XT +* CDNA1 such as MI100 +* CDNA2 such as MI210, MI250, and MI250X +* CDNA3 such as MI300A, MI300X, and MI325X +* CDNA4 such as MI350X and MI355X +* RDNA2 such as PRO W6800 and PRO V620 +* RDNA3 such as RX 7900XT and RX 7900XTX +* RDNA4 such as RX 9070 and RX 9070XT HIP C++ type implementation support ----------------------------------- @@ -188,6 +214,8 @@ following table. - CDNA1 - CDNA2 - CDNA3 + - CDNA4 + - RDNA2 - RDNA3 - RDNA4 @@ -198,6 +226,8 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ * - ``int16_t``, ``uint16_t`` @@ -206,6 +236,8 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ * - ``int32_t``, ``uint32_t`` @@ -214,6 +246,8 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ * - ``int64_t``, ``uint64_t`` @@ -222,6 +256,38 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ + + * + - ``__hip_fp4_e2m1`` + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + - ❌ + - ❌ + + * + - ``__hip_fp6_e2m3`` + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + - ❌ + - ❌ + + * + - ``__hip_fp6_e3m2`` + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + - ❌ + - ❌ * - ``__hip_fp8_e4m3_fnuz`` @@ -230,6 +296,8 @@ following table. - ✅ - ❌ - ❌ + - ❌ + - ❌ * - ``__hip_fp8_e5m2_fnuz`` @@ -238,12 +306,16 @@ following table. - ✅ - ❌ - ❌ + - ❌ + - ❌ * - ``__hip_fp8_e4m3`` - ❌ - ❌ - ❌ + - ✅ + - ❌ - ❌ - ✅ @@ -252,6 +324,8 @@ following table. - ❌ - ❌ - ❌ + - ✅ + - ❌ - ❌ - ✅ @@ -262,6 +336,8 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ * - ``bfloat16`` @@ -270,6 +346,8 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ * - ``float`` @@ -278,6 +356,8 @@ following table. 
- ✅ - ✅ - ✅ + - ✅ + - ✅ * - ``double`` @@ -286,6 +366,8 @@ following table. - ✅ - ✅ - ✅ + - ✅ + - ✅ .. note:: @@ -314,18 +396,21 @@ The following table lists data type support for compute units. - int16 - int32 - int64 + * - CDNA1 - ✅ - ✅ - ✅ - ✅ + * - CDNA2 - ✅ - ✅ - ✅ - ✅ + * - CDNA3 - ✅ @@ -333,6 +418,20 @@ The following table lists data type support for compute units. - ✅ - ✅ + * + - CDNA4 + - ✅ + - ✅ + - ✅ + - ✅ + + * + - RDNA2 + - ✅ + - ✅ + - ✅ + - ✅ + * - RDNA3 - ✅ @@ -347,53 +446,132 @@ The following table lists data type support for compute units. - ✅ - ✅ - .. tab-item:: Floating-point types - :sync: floating-point-type + .. tab-item:: Low precision floating-point types + :sync: floating-point-type-low .. list-table:: :header-rows: 1 * - Type name + - float4 + - float6 (E2M3) + - float6 (E3M2) - float8 (E4M3) - float8 (E5M2) + + * + - CDNA1 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - CDNA2 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - CDNA3 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - CDNA4 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - RDNA2 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - RDNA3 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - RDNA4 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + .. tab-item:: High precision floating-point types + :sync: floating-point-type-high + + .. list-table:: + :header-rows: 1 + + * + - Type name - float16 - bfloat16 - tensorfloat32 - float32 - float64 + * - CDNA1 - - ❌ - - ❌ - ✅ - ✅ - ❌ - ✅ - ✅ + * - CDNA2 - - ❌ - - ❌ - ✅ - ✅ - ❌ - ✅ - ✅ + * - CDNA3 + - ✅ + - ✅ - ❌ + - ✅ + - ✅ + + * + - CDNA4 + - ✅ + - ✅ - ❌ - ✅ - ✅ + + * + - RDNA2 + - ✅ + - ✅ - ❌ - ✅ - ✅ * - RDNA3 - - ❌ - - ❌ - ✅ - ✅ - ❌ @@ -402,8 +580,6 @@ The following table lists data type support for compute units. * - RDNA4 - - ❌ - - ❌ - ✅ - ✅ - ❌ @@ -429,18 +605,21 @@ The following table lists data type support for AMD GPU matrix cores. 
- int16 - int32 - int64 + * - CDNA1 - ✅ - ❌ - ❌ - ❌ + * - CDNA2 - ✅ - ❌ - ❌ - ❌ + * - CDNA3 - ✅ @@ -448,6 +627,20 @@ The following table lists data type support for AMD GPU matrix cores. - ❌ - ❌ + * + - CDNA4 + - ✅ + - ❌ + - ❌ + - ❌ + + * + - RDNA2 + - ✅ + - ❌ + - ❌ + - ❌ + * - RDNA3 - ✅ @@ -462,53 +655,132 @@ The following table lists data type support for AMD GPU matrix cores. - ❌ - ❌ - .. tab-item:: Floating-point types - :sync: floating-point-type + .. tab-item:: Low precision floating-point types + :sync: floating-point-type-low .. list-table:: :header-rows: 1 * - Type name + - float4 + - float6 (E2M3) + - float6 (E3M2) - float8 (E4M3) - float8 (E5M2) - - float16 - - bfloat16 - - tensorfloat32 - - float32 - - float64 + * - CDNA1 - ❌ - ❌ - - ✅ - - ✅ - ❌ - - ✅ - ❌ + - ❌ + * - CDNA2 - ❌ - ❌ - - ✅ - - ✅ + - ❌ + - ❌ + - ❌ + + * + - CDNA3 + - ❌ + - ❌ - ❌ - ✅ - ✅ + * - - CDNA3 - - ✅ - - ✅ + - CDNA4 - ✅ - ✅ - ✅ - ✅ - ✅ + * + - RDNA2 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + * - RDNA3 - ❌ - ❌ + - ❌ + - ❌ + - ❌ + + * + - RDNA4 + - ❌ + - ❌ + - ❌ + - ✅ + - ✅ + + .. tab-item:: High precision floating-point types + :sync: floating-point-type-high + + .. list-table:: + :header-rows: 1 + + * + - Type name + - float16 + - bfloat16 + - tensorfloat32 + - float32 + - float64 + + * + - CDNA1 + - ✅ + - ✅ + - ❌ + - ✅ + - ❌ + + * + - CDNA2 + - ✅ + - ✅ + - ❌ + - ✅ + - ✅ + + * + - CDNA3 + - ✅ + - ✅ + - ✅ + - ✅ + - ✅ + + * + - CDNA4 + - ✅ + - ✅ + - ✅ + - ✅ + - ✅ + + * + - RDNA2 + - ✅ + - ✅ + - ❌ + - ❌ + - ❌ + + * + - RDNA3 - ✅ - ✅ - ❌ @@ -519,8 +791,6 @@ The following table lists data type support for AMD GPU matrix cores. - RDNA4 - ✅ - ✅ - - ✅ - - ✅ - ❌ - ❌ - ❌ @@ -582,48 +852,59 @@ page. - ✅ - ✅ - .. tab-item:: Floating-point types - :sync: floating-point-type + .. tab-item:: Low precision floating-point types + :sync: floating-point-type-low .. 
list-table:: :header-rows: 1 * - Type name + - float4 + - float6 (E2M3) + - float6 (E3M2) - float8 (E4M3) - float8 (E5M2) - - 2 x float16 - - 2 x bfloat16 - - tensorfloat32 - - float32 - - float64 + * - CDNA1 - ❌ - ❌ - - ✅ - - ✅ - ❌ - - ✅ - ❌ + - ❌ + * - CDNA2 - ❌ - ❌ - - ✅ - - ✅ - ❌ - - ✅ - - ✅ + - ❌ + - ❌ + * - CDNA3 - ❌ - ❌ - - ✅ - - ✅ - ❌ - - ✅ - - ✅ + - ❌ + - ❌ + + * + - CDNA4 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ + + * + - RDNA2 + - ❌ + - ❌ + - ❌ + - ❌ + - ❌ * - RDNA3 @@ -632,13 +913,79 @@ page. - ❌ - ❌ - ❌ - - ✅ - - ❌ * - RDNA4 - ❌ - ❌ + - ❌ + - ❌ + - ❌ + + .. tab-item:: High precision floating-point types + :sync: floating-point-type-high + + .. list-table:: + :header-rows: 1 + + * + - Type name + - 2 x float16 + - 2 x bfloat16 + - tensorfloat32 + - float32 + - float64 + + * + - CDNA1 + - ✅ + - ✅ + - ❌ + - ✅ + - ❌ + + * + - CDNA2 + - ✅ + - ✅ + - ❌ + - ✅ + - ✅ + + * + - CDNA3 + - ✅ + - ✅ + - ❌ + - ✅ + - ✅ + + * + - CDNA4 + - ✅ + - ✅ + - ❌ + - ✅ + - ✅ + + * + - RDNA2 + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + + * + - RDNA3 + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + + * + - RDNA4 - ✅ - ✅ - ❌ @@ -662,295 +1009,64 @@ Libraries input/output type support ----------------------------------- The following tables list ROCm library support for specific input and output -data types. Refer to the corresponding library data type support page for a -detailed description. +data types. Select a library from the below table to view the supported data +types. -.. tab-set:: +.. datatemplate:yaml:: /data/reference/precision-support/precision-support.yaml - .. tab-item:: Integral types - :sync: integral-type + {% set library_groups = data.library_groups %} - .. list-table:: - :header-rows: 1 + .. raw:: html - * - - Library input/output data type name - - int8 - - int16 - - int32 - - int64 +
+
+
Category
+
+ {% for group in library_groups %} +
{{ group.group }}
+ {% endfor %} +
+
- * - - :doc:`Composable Kernel ` - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ❌/❌ +
+
Library
+
+ {% for group in library_groups %} + {% for library in group.libraries %} +
{{ library.name }}
+ {% endfor %} + {% endfor %} +
+
+
- * - - :doc:`hipCUB ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ + {% for group in library_groups %} + {% for library in group.libraries %} - * - - :doc:`hipRAND ` - - NA/✅ - - NA/✅ - - NA/✅ - - NA/✅ + .. container:: model-doc {{ library.tag }} - * - - :doc:`hipSOLVER ` - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ + For more information, please visit :doc:`{{ library.name }} <{{ library.doc_link }}>`. - * - - :doc:`hipSPARSELt ` - - ✅/✅ - - ❌/❌ - - ❌/❌ - - ❌/❌ + .. list-table:: + :header-rows: 1 + :widths: 70, 30 - * - - :doc:`hipTensor ` - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ + * + - Data Type + - Support + {% for data_type in library.data_types %} + * + - {{ data_type.type }} + - {{ data_type.support }} + {% endfor %} - * - - :doc:`MIGraphX ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ + {% endfor %} + {% endfor %} - * - - :doc:`MIOpen ` - - ⚠️/⚠️ - - ❌/❌ - - ⚠️/⚠️ - - ❌/❌ +.. note:: - * - - :doc:`RCCL ` - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocFFT ` - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - * - - :doc:`rocPRIM ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocRAND ` - - NA/✅ - - NA/✅ - - NA/✅ - - NA/✅ - - * - - :doc:`rocSOLVER ` - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - * - - :doc:`rocThrust ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocWMMA ` - - ✅/✅ - - ❌/❌ - - ❌/✅ - - ❌/❌ - - - .. tab-item:: Floating-point types - :sync: floating-point-type - - .. 
list-table:: - :header-rows: 1 - - * - - Library input/output data type name - - float8 (E4M3) - - float8 (E5M2) - - float16 - - bfloat16 - - tensorfloat32 - - float32 - - float64 - - * - - :doc:`Composable Kernel ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`hipCUB ` - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`hipRAND ` - - NA/❌ - - NA/❌ - - NA/✅ - - NA/❌ - - NA/❌ - - NA/✅ - - NA/✅ - - * - - :doc:`hipSOLVER ` - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`hipSPARSELt ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - * - - :doc:`hipTensor ` - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`MIGraphX ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - * - - :doc:`MIOpen ` - - ⚠️/⚠️ - - ⚠️/⚠️ - - ✅/✅ - - ⚠️/⚠️ - - ❌/❌ - - ✅/✅ - - ⚠️/⚠️ - - * - - :doc:`RCCL ` - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocFFT ` - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocPRIM ` - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocRAND ` - - NA/❌ - - NA/❌ - - NA/✅ - - NA/❌ - - NA/❌ - - NA/✅ - - NA/✅ - - * - - :doc:`rocSOLVER ` - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocThrust ` - - ❌/❌ - - ❌/❌ - - ⚠️/⚠️ - - ⚠️/⚠️ - - ❌/❌ - - ✅/✅ - - ✅/✅ - - * - - :doc:`rocWMMA ` - - ✅/❌ - - ✅/❌ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ - - ✅/✅ + The meaning of partial support depends on the library. Please refer to the individual + libraries' documentation for more information. .. note:: @@ -958,6 +1074,15 @@ detailed description. data types for the random values they generate, with no need for input data types. +.. note:: + + hipBLASLt supports additional data types as internal compute types, which may + differ from the supported input/output types shown in the tables above. 
While + TensorFloat32 is not supported as an input or output type in this library, it + is available as an internal compute type. For complete details on supported + compute types, refer to the :doc:`hipBLASLt ` + documentation. + hipDataType enumeration ----------------------- @@ -1049,6 +1174,24 @@ following table with descriptions and values. - 29 - 8-bit real bfloat8 precision floating-point (OCP version). + * + - ``HIP_R_6F_E2M3`` + - ``__hip_fp6_e2m3`` + - 31 + - 6-bit real float6 precision floating-point. + + * + - ``HIP_R_6F_E3M2`` + - ``__hip_fp6_e3m2`` + - 32 + - 6-bit real bfloat6 precision floating-point. + + * + - ``HIP_R_4F_E2M1`` + - ``__hip_fp4_e2m1`` + - 33 + - 4-bit real float4 precision floating-point. + * - ``HIP_R_8F_E4M3_FNUZ`` - ``__hip_fp8_e4m3_fnuz`` @@ -1061,4 +1204,4 @@ following table with descriptions and values. - 1001 - 8-bit real bfloat8 precision floating-point (FNUZ version). -The full list of the ``hipDataType`` enumeration listed in `library_types.h `_ . +The full list of the ``hipDataType`` enumeration is listed in `library_types.h `_. 
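Reviewer sketch for the three ``hipDataType`` rows added in the hunk above: a minimal Python example that ties each new enumerator to its HIP type, its value, and the storage width implied by its layout name. It assumes one implied sign bit, which matches the ``S1E4M3`` and ``S1E5M2`` notation in the removed lines. The dictionary ``NEW_HIP_DATA_TYPES`` and the helpers ``storage_bits`` and ``layout_of`` are hypothetical illustrations and are not part of HIP or ROCm.

```python
import re

# Enumerator name -> (HIP type, enum value), copied from the three rows
# added in the diff. Illustration only; HIP does not provide this dict.
NEW_HIP_DATA_TYPES = {
    "HIP_R_6F_E2M3": ("__hip_fp6_e2m3", 31),
    "HIP_R_6F_E3M2": ("__hip_fp6_e3m2", 32),
    "HIP_R_4F_E2M1": ("__hip_fp4_e2m1", 33),
}


def storage_bits(layout: str) -> int:
    """Return sign + exponent + mantissa bits for a layout such as 'E2M3'.

    Assumes exactly one sign bit, which the 'ExMy' names leave implicit.
    """
    match = re.fullmatch(r"E(\d+)M(\d+)", layout)
    if match is None:
        raise ValueError(f"unrecognized layout: {layout!r}")
    exponent_bits, mantissa_bits = (int(group) for group in match.groups())
    return 1 + exponent_bits + mantissa_bits


def layout_of(enumerator: str) -> str:
    """Extract the trailing layout token, e.g. 'HIP_R_6F_E2M3' -> 'E2M3'."""
    return enumerator.rsplit("_", 1)[1]


if __name__ == "__main__":
    for name, (hip_type, value) in NEW_HIP_DATA_TYPES.items():
        bits = storage_bits(layout_of(name))
        print(f"{name} = {value} -> {hip_type} ({bits} bits)")
```

The same helper gives 8 bits for ``E4M3`` and ``E5M2``, so it can be used to cross-check the "4-bit", "6-bit", and "8-bit" wording in the description column against the enumerator names.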
diff --git a/docs/release/versions.md b/docs/release/versions.md index 943449d03..f492d2742 100644 --- a/docs/release/versions.md +++ b/docs/release/versions.md @@ -10,6 +10,7 @@ | Version | Release date | | ------- | ------------ | +| [7.0.0](https://rocm.docs.amd.com/en/docs-7.0.0/) | September 16, 2025 | | [6.4.3](https://rocm.docs.amd.com/en/docs-6.4.3/) | August 7, 2025 | | [6.4.2](https://rocm.docs.amd.com/en/docs-6.4.2/) | July 21, 2025 | | [6.4.1](https://rocm.docs.amd.com/en/docs-6.4.1/) | May 21, 2025 | diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 732aab15e..b8cbd0e31 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -12,14 +12,14 @@ subtrees: - file: compatibility/compatibility-matrix.rst title: Compatibility matrix entries: - - url: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html + - url: https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/reference/system-requirements.html title: Linux system requirements - url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/reference/system-requirements.html title: Windows system requirements - caption: Install entries: - - url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/ + - url: https://rocm.docs.amd.com/projects/install-on-linux-internal/en/latest/ title: ROCm on Linux - url: https://rocm.docs.amd.com/projects/install-on-windows/en/latest/ title: HIP SDK on Windows @@ -147,7 +147,7 @@ subtrees: - file: how-to/setting-cus title: Set the number of CUs - file: how-to/Bar-Memory.rst - title: Troubleshoot BAR access limitation + title: Troubleshoot BAR access limitation - url: https://github.com/amd/rocm-examples title: ROCm examples @@ -167,7 +167,9 @@ subtrees: - url: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf title: White paper - file: conceptual/gpu-arch/mi300-mi200-performance-counters.rst - title: 
MI300 and MI200 Performance counter + title: MI300 and MI200 performance counters + - file: conceptual/gpu-arch/mi350-performance-counters.rst + title: MI350 series performance counters - file: conceptual/gpu-arch/mi250.md title: MI250 microarchitecture subtrees: @@ -202,7 +204,7 @@ subtrees: - file: reference/gpu-arch-specs.rst - file: reference/gpu-atomics-operation.rst - file: reference/precision-support.rst - title: Precision support + title: Data types and precision support - file: reference/graph-safe-support.rst title: Graph safe support diff --git a/docs/what-is-rocm.rst b/docs/what-is-rocm.rst index 4accd00ed..b8ade66c5 100644 --- a/docs/what-is-rocm.rst +++ b/docs/what-is-rocm.rst @@ -10,7 +10,7 @@ ROCm is a software stack, composed primarily of open-source software, that provides the tools for programming AMD Graphics Processing Units (GPUs), from low-level kernels to high-level end-user applications. -.. image:: data/rocm-software-stack-6_4_0.jpg +.. image:: data/rocm-software-stack-7_0_0.jpg :width: 800 :alt: AMD's ROCm software stack and enabling technologies. :align: center @@ -45,6 +45,10 @@ Machine Learning & Computer Vision ":doc:`rocJPEG `", "Library for decoding JPG images on AMD GPUs" ":doc:`rocPyDecode `", "Provides access to rocDecode APIs in both Python and C/C++ languages" +.. note:: + + `rocCV `_ is an efficient GPU-accelerated library for image pre- and post-processing. rocCV is in an early access state. Using it on production workloads is not recommended. + Communication ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/manifest_700.xml b/manifest_700.xml new file mode 100644 index 000000000..4f7d505d8 --- /dev/null +++ b/manifest_700.xml @@ -0,0 +1,80 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file