Update RELEASE.md (#505)

* Update RELEASE.md

Updated with Changelog info from Julia

* Update RELEASE.md

* Update RELEASE.md

* Update RELEASE.md
@@ -770,24 +770,28 @@ See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/roc
- `hipMemGetHandleForAddressRange` gets a handle for the address range requested.
- `num_threads` returns the total number of threads in the group; the legacy API `size` is an alias.
- `__reduce_add_sync`, `__reduce_min_sync`, and `__reduce_max_sync` functions added for reduction across lanes of a warp. For details, see [Warp cross-lane functions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-cross-lane-functions).
* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8`, as follows (a conversion round-trip sketch follows this list). For details, see the [low-precision floating point documentation](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html).
- Data types for `FP4`/`FP6`/`FP8`.
- HIP APIs for `FP4`/`FP6`/`FP8`, which are compatible with corresponding CUDA APIs.
- HIP extension APIs for microscaling formats, which are supported on AMD GPUs.
* New `wptr` and `rptr` values in `ClPrint` for better logging in dispatch barrier methods.
* New debug mask to print precise code object information for logging.
* The `_sync()` versions of cross-lane builtins such as `__shfl_sync()` are enabled by default. They can be disabled by defining the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`.
* Added `constexpr` operators for `fp16`/`bf16`.
* Added warp-level primitives: `__syncwarp` and reduce intrinsics such as `__reduce_add_sync()` (see the warp-reduction sketch after this list).
* Extended fine-grained system memory pool.
* `num_threads` returns the total number of threads in the group; the legacy API `size` is an alias.
* Added PCI chip ID information as a device attribute.
* Added new test applications for OCP data types `FP4`/`FP6`/`FP8`.
* A new HIP runtime attribute exposes how many compute dies (chiplets, XCCs) are available on a given GPU. Developers can query this attribute via the `hipDeviceGetAttribute` API to make use of the best cache locality in a kernel and to optimize the kernel launch grid layout for improved performance (see the attribute-query sketch below).
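As a hedged illustration of the new OCP types, the round-trip below assumes the `__hip_fp8_e4m3` class from `<hip/hip_fp8.h>` described in the linked low-precision documentation; exact type names and hardware support vary by ROCm release and GPU target, so treat this as a sketch, not canonical usage.

```cpp
// Minimal host-side sketch: quantize a float to OCP FP8 (E4M3) and back.
// Assumption: __hip_fp8_e4m3 from <hip/hip_fp8.h>, mirroring CUDA's
// __nv_fp8_e4m3 (float constructor, explicit conversion back to float).
#include <hip/hip_runtime.h>
#include <hip/hip_fp8.h>
#include <cstdio>

int main() {
    float value = 0.34f;
    __hip_fp8_e4m3 q(value);                 // float -> FP8, lossy
    float restored = static_cast<float>(q);  // FP8 -> float
    std::printf("original=%f restored=%f\n", value, restored);
    return 0;
}
```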
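A minimal sketch of the warp-level primitives named above. It assumes a 64-wide wavefront with a 64-bit lane mask (typical for AMD GPUs) and an `int` overload of `__reduce_add_sync`; consult the cross-lane functions documentation linked earlier for the exact overload set.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Sum one value per lane across the wavefront using __syncwarp and
// __reduce_add_sync. Assumes a 64-lane wavefront, all lanes active.
__global__ void warp_sum(const int* in, int* out) {
    const unsigned long long mask = ~0ull;   // all 64 lanes
    int v = in[threadIdx.x];
    __syncwarp();                            // converge lanes before the reduce
    int total = __reduce_add_sync(mask, v);  // reduction across warp lanes
    if (threadIdx.x == 0) *out = total;
}

int main() {
    int h_in[64], h_out = 0;
    for (int i = 0; i < 64; ++i) h_in[i] = 1;

    int *d_in = nullptr, *d_out = nullptr;
    (void)hipMalloc(&d_in, sizeof(h_in));
    (void)hipMalloc(&d_out, sizeof(h_out));
    (void)hipMemcpy(d_in, h_in, sizeof(h_in), hipMemcpyHostToDevice);

    warp_sum<<<1, 64>>>(d_in, d_out);

    (void)hipMemcpy(&h_out, d_out, sizeof(h_out), hipMemcpyDeviceToHost);
    std::printf("warp sum: %d\n", h_out);    // expect 64
    (void)hipFree(d_in);
    (void)hipFree(d_out);
    return 0;
}
```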
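Since this changelog does not name the new enumerator for the compute-die count, the sketch below demonstrates only the `hipDeviceGetAttribute` query pattern, using the long-standing compute-unit count attribute as a stand-in; substitute the new XCC-count enumerator from `hip_runtime_api.h` once identified.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // Query pattern only: the new XCC-count enumerator is not named in this
    // changelog, so we query the multiprocessor (CU) count instead; swap in
    // the real enum from hip_runtime_api.h for the compute-die count.
    int count = 0;
    hipError_t err =
        hipDeviceGetAttribute(&count, hipDeviceAttributeMultiprocessorCount, 0);
    if (err == hipSuccess) {
        std::printf("compute units on device 0: %d\n", count);
    }
    return 0;
}
```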
#### Changed
* Deprecated GPUs.
Some unsupported GPUs, such as gfx9, gfx8, and gfx7, are deprecated on Microsoft Windows.
* Removal of Beta warnings in HIP Graph APIs.
All Beta warnings on the usage of HIP Graph APIs have been removed; these APIs are now officially and fully supported.
* Behavior changes
- `hipGetLastError` now returns the error code of the last actual error caught in the current thread during application execution (see the error-checking sketch after this list).
- Additional input parameter validation checks were added to the cooperative-groups functions `hipLaunchCooperativeKernelMultiDevice` and `hipLaunchCooperativeKernel`.
- `hipPointerGetAttributes` returns `hipSuccess` instead of the error `hipErrorInvalidValue` when a `NULL` host or attribute pointer is passed as an input parameter. It now matches the functionality of `cudaPointerGetAttributes`, which changed in CUDA 11 and later releases.
- `hipFree` previously performed an implicit wait, applicable to all memory allocations, for synchronization purposes. This wait is now disabled for allocations made with `hipMallocAsync` and `hipMallocFromPoolAsync`, to match the behavior of the CUDA API `cudaFree`.
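A short sketch of the clarified `hipGetLastError` contract described in the behavior changes above; the no-op kernel is illustrative only.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();                    // launch errors are recorded per thread
    hipError_t err = hipGetLastError();  // fetch (and clear) the last error
    if (err != hipSuccess) {
        std::printf("launch failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    (void)hipDeviceSynchronize();        // surface asynchronous execution errors
    return 0;
}
```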
@@ -901,9 +905,9 @@ HIP runtime has the following functional improvements which greatly improve runt
* Refactored memory validation into a single function that validates a variety of memory copy operations.
* Improved kernel logging by demangling shader names.
* Advanced support for SPIR-V: kernel compilation caching is now enabled by default. This feature is controlled by the environment variable `AMD_COMGR_CACHE`; for details, see the [hip_rtc document](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_rtc.html).
* Programmatic support for scratch limits on MI300 and MI350 series and later GPU devices. The following enumeration values were added to `hipLimit_t`:
- `hipExtLimitScratchMin`: minimum allowed value, in bytes, for the scratch limit on the device.
- `hipExtLimitScratchMax`: maximum allowed value, in bytes, for the scratch limit on the device.
- `hipExtLimitScratchCurrent`: current scratch limit threshold, in bytes, on the device. Must be between `hipExtLimitScratchMin` and `hipExtLimitScratchMax`.
Developers can now use the environment variable `HSA_SCRATCH_SINGLE_LIMIT_ASYNC` to change the default allocation size for the expected scratch limit in the ROCr runtime. This value can also be overridden programmatically by calling the HIP API `hipDeviceSetLimit(hipExtLimitScratchCurrent, value)` to reset the scratch limit (see the scratch-limit sketch below).
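A minimal sketch of the new scratch-limit controls. Only the `hipDeviceSetLimit` call is shown in the text above; that `hipDeviceGetLimit` also reports the new `hipExtLimitScratch*` values is an assumption here.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // Assumption: hipDeviceGetLimit reports the new scratch-limit values.
    size_t lo = 0, hi = 0;
    (void)hipDeviceGetLimit(&lo, hipExtLimitScratchMin);
    (void)hipDeviceGetLimit(&hi, hipExtLimitScratchMax);
    std::printf("scratch limit range: [%zu, %zu] bytes\n", lo, hi);

    size_t wanted = 64u * 1024u * 1024u;  // 64 MiB, arbitrary example value
    if (wanted >= lo && wanted <= hi) {
        // Must stay within [hipExtLimitScratchMin, hipExtLimitScratchMax].
        (void)hipDeviceSetLimit(hipExtLimitScratchCurrent, wanted);
    }
    return 0;
}
```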
* HIP runtime now enables peer-to-peer (P2P) memory copies to utilize all available SDMA engines, rather than being limited to a single engine. It also selects the best engine first to give optimal bandwidth.
@@ -919,6 +923,11 @@ HIP runtime has the following functional improvements which greatly improve runt
* A crash in a TensorFlow-related application. The HIP runtime now combines multiple definitions of `callbackQueue` into a single function; in case of an exception, it passes the handler to the application and provides the corresponding error code.
* An issue in handling kernel parameters for graph launches.
* Failures in the roc-obj tools. The HIP runtime now emits the `DEPRECATED` message in the roc-obj tools on `STDERR`.
* Support for the `hipDeviceMallocContiguous` flag in `hipExtMallocWithFlags()`. It now enables `HSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAG` in the memory pool allocation on the GPU device (see the contiguous-allocation sketch after this list).
* A compilation failure. The HIP runtime refactored vector type alignment with `__hip_vec_align_v`.
* A numerical error/corruption found in PyTorch during graph replay. The HIP runtime fixed the input sizes of kernel launch dimensions in `hipExtModuleLaunchKernel` for the execution of hipGraph capture.
* A crash during kernel execution in a customer application. The kernel argument structure was updated to include the size of the kernel arguments, and the HIP runtime now validates the structured arguments before launching the kernel.
* A permission-related error during execution of `hipLaunchHostFunc`. The API is now supported and allowed to run during stream capture, to match the behavior of CUDA (see the stream-capture sketch after this list).
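A hedged sketch of the contiguous-allocation fix. The `hipExtMallocWithFlags` signature shown (pointer, size, flags) is assumed from the AMD extension API, and the 1 MiB size is arbitrary.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // Assumption: hipExtMallocWithFlags(void**, size_t, unsigned int) and
    // the hipDeviceMallocContiguous flag named in the fix above.
    void* buf = nullptr;
    hipError_t err = hipExtMallocWithFlags(&buf, 1 << 20, hipDeviceMallocContiguous);
    if (err == hipSuccess) {
        std::printf("allocated 1 MiB of contiguous device memory\n");
        (void)hipFree(buf);
    }
    return 0;
}
```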
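And a sketch of `hipLaunchHostFunc` inside stream capture, now permitted to match CUDA: the host callback is recorded as a host node in the captured graph. Error handling is elided for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

static void host_cb(void* userData) {
    std::printf("host node ran: %s\n", static_cast<const char*>(userData));
}

int main() {
    hipStream_t stream;
    (void)hipStreamCreate(&stream);

    // Capture a host-function launch into a graph.
    (void)hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
    (void)hipLaunchHostFunc(stream, host_cb, (void*)"captured");
    hipGraph_t graph;
    (void)hipStreamEndCapture(stream, &graph);

    // Instantiate and replay the captured graph.
    hipGraphExec_t exec;
    (void)hipGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    (void)hipGraphLaunch(exec, stream);
    (void)hipStreamSynchronize(stream);

    (void)hipGraphExecDestroy(exec);
    (void)hipGraphDestroy(graph);
    (void)hipStreamDestroy(stream);
    return 0;
}
```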
### **hipBLAS** (3.0.0)