Files
ROCm/RELEASE.md
2023-06-22 18:46:12 +02:00

19 KiB
Raw Permalink Blame History

Release Notes

The release notes for the ROCm platform.


ROCm 5.2.0

What's New in This Release

HIP Enhancements

The ROCm v5.2 release consists of the following HIP enhancements:

HIP Installation Guide Updates

The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms.

For more details, refer to the HIP Installation Guide v5.2.

Support for device-side malloc on HIP-Clang

HIP-Clang now supports device-side malloc. This implementation does not require the use of hipDeviceSetLimit(hipLimitMallocHeapSize,value) nor respect any setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.

The test codes at the following link show how to implement applications using malloc and free functions in device kernels:

https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp

New HIP APIs in This Release

The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs:

Device management HIP APIs

The new device management HIP APIs are as follows:

  • Gets a UUID for the device. This API returns a UUID for the device.

    hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
    

    Note

    This new API corresponds to the following CUDA API:

    CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
    
  • Gets default memory pool of the specified device

    hipError_t hipDeviceGetDefaultMemPool(hipMemPool_t* mem_pool, int device);
    
  • Sets the current memory pool of a device

    hipError_t hipDeviceSetMemPool(int device, hipMemPool_t mem_pool);
    
  • Gets the current memory pool for the specified device

    hipError_t hipDeviceGetMemPool(hipMemPool_t* mem_pool, int device);
    
New HIP Runtime APIs in Memory Management

The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are as follows:

  • Allocates memory with stream ordered semantics

    hipError_t hipMallocAsync(void** dev_ptr, size_t size, hipStream_t stream);
    
  • Frees memory with stream ordered semantics

    hipError_t hipFreeAsync(void* dev_ptr, hipStream_t stream);
    
  • Releases freed memory back to the OS

    hipError_t hipMemPoolTrimTo(hipMemPool_t mem_pool, size_t min_bytes_to_hold);
    
  • Sets attributes of a memory pool

    hipError_t hipMemPoolSetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
    
  • Gets attributes of a memory pool

    hipError_t hipMemPoolGetAttribute(hipMemPool_t mem_pool, hipMemPoolAttr attr, void* value);
    
  • Controls visibility of the specified pool between devices

    hipError_t hipMemPoolSetAccess(hipMemPool_t mem_pool, const hipMemAccessDesc* desc_list, size_t count);
    
  • Returns the accessibility of a pool from a device

    hipError_t hipMemPoolGetAccess(hipMemAccessFlags* flags, hipMemPool_t mem_pool, hipMemLocation* location);
    
  • Creates a memory pool

    hipError_t hipMemPoolCreate(hipMemPool_t* mem_pool, const hipMemPoolProps* pool_props);
    
  • Destroys the specified memory pool

    hipError_t hipMemPoolDestroy(hipMemPool_t mem_pool);
    
  • Allocates memory from a specified pool with stream ordered semantics

    hipError_t hipMallocFromPoolAsync(void** dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream);
    
  • Exports a memory pool to the requested handle type

    hipError_t hipMemPoolExportToShareableHandle(
        void*                      shared_handle,
        hipMemPool_t               mem_pool,
        hipMemAllocationHandleType handle_type,
        unsigned int               flags);
    
  • Imports a memory pool from a shared handle

    hipError_t hipMemPoolImportFromShareableHandle(
        hipMemPool_t*              mem_pool,
        void*                      shared_handle,
        hipMemAllocationHandleType handle_type,
        unsigned int               flags);
    
  • Exports data to share a memory pool allocation between processes

    hipError_t hipMemPoolExportPointer(hipMemPoolPtrExportData* export_data, void* dev_ptr);
    Import a memory pool allocation from another process.t
    hipError_t hipMemPoolImportPointer(
        void**                   dev_ptr,
        hipMemPool_t             mem_pool,
        hipMemPoolPtrExportData* export_data);
    
HIP Graph Management APIs

The new HIP Graph Management APIs are as follows:

  • Enqueues a host function call in a stream

    hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
    
  • Swaps the stream capture mode of a thread

    hipError_t hipThreadExchangeStreamCaptureMode(hipStreamCaptureMode* mode);
    
  • Sets a node attribute

    hipError_t hipGraphKernelNodeSetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, const hipKernelNodeAttrValue* value);
    
  • Gets a node attribute

    hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr,                                          hipKernelNodeAttrValue* value);
    
Support for Virtual Memory Management APIs

The new APIs for virtual memory management are as follows:

  • Frees an address range reservation made via hipMemAddressReserve

    hipError_t hipMemAddressFree(void* devPtr, size_t size);
    
  • Reserves an address range

    hipError_t hipMemAddressReserve(void** ptr, size_t size, size_t alignment, void* addr, unsigned long long flags);
    
  • Creates a memory allocation described by the properties and size

    hipError_t hipMemCreate(hipMemGenericAllocationHandle_t* handle, size_t size, const hipMemAllocationProp* prop, unsigned long long flags);
    
  • Exports an allocation to a requested shareable handle type

    hipError_t hipMemExportToShareableHandle(void* shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags);
    
  • Gets the access flags set for the given location and ptr

    hipError_t hipMemGetAccess(unsigned long long* flags, const hipMemLocation* location, void* ptr);
    
  • Calculates either the minimal or recommended granularity

    hipError_t hipMemGetAllocationGranularity(size_t* granularity, const hipMemAllocationProp* prop, hipMemAllocationGranularity_flags option);
    
  • Retrieves the property structure of the given handle

    hipError_t hipMemGetAllocationPropertiesFromHandle(hipMemAllocationProp* prop, hipMemGenericAllocationHandle_t handle);
    
  • Imports an allocation from a requested shareable handle type

    hipError_t hipMemImportFromShareableHandle(hipMemGenericAllocationHandle_t* handle, void* osHandle, hipMemAllocationHandleType shHandleType);
    
  • Maps an allocation handle to a reserved virtual address range

    hipError_t hipMemMap(void* ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags);
    
  • Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays

    hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int  count, hipStream_t stream);
    
  • Release a memory handle representing a memory allocation, that was previously allocated through hipMemCreate

    hipError_t hipMemRelease(hipMemGenericAllocationHandle_t handle);
    
  • Returns the allocation handle of the backing memory allocation given the address

    hipError_t hipMemRetainAllocationHandle(hipMemGenericAllocationHandle_t* handle, void* addr);
    
  • Sets the access flags for each location specified in desc for the given virtual address range

    hipError_t hipMemSetAccess(void* ptr, size_t size, const hipMemAccessDesc* desc, size_t count);
    
  • Unmaps memory allocation of a given address range

    hipError_t hipMemUnmap(void* ptr, size_t size);
    

For more information, refer to the HIP API documentation at {doc}hip:.doxygen/docBin/html/modules.

Planned HIP Changes in Future Releases

Changes to hipDeviceProp_t, HIPMEMCPY_3D, and hipArray structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility.

Refer to the Release Notes document in subsequent releases for more information. ROCm Math and Communication Libraries

In this release, ROCm Math and Communication Libraries consist of the following enhancements and fixes: New rocWMMA for Matrix Multiplication and Accumulation Operations Acceleration

This release introduces a new ROCm C++ library for accelerating mixed precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.

rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.

For more information, refer to Communication Libraries.

OpenMP Enhancements in This Release

OMPT Target Support

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.

Following is an example demonstrating how a tool would use the OMPT target APIs supported. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the provided example as indicated below:

cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
make run

The file veccopy-ompt-target-tracing.c simulates how a tool would initiate device activity tracing. The file callbacks.h shows the callbacks that may be registered and implemented by the tool.

Deprecations and Warnings

Linux Filesystem Hierarchy Standard for ROCm

ROCm packages have adopted the Linux foundation filesystem hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new filesystem hierarchy, ROCm ensures backward compatibility with its 5.1 version or older filesystem hierarchy. See below for a detailed explanation of the new filesystem hierarchy and backward compatibility.

New Filesystem Hierarchy

The following is the new filesystem hierarchy:

/opt/rocm-<ver>
    | --bin
      | --All externally exposed Binaries
    | --libexec
        | --<component>
            | -- Component specific private non-ISA executables (architecture independent)
    | --include
        | -- <component>
            | --<header files>
    | --lib
        | --lib<soname>.so -> lib<soname>.so.major -> lib<soname>.so.major.minor.patch
            (public libraries linked with application)
        | --<component> (component specific private library, executable data)
        | --<cmake>
            | --components
                | --<component>.config.cmake
    | --share
        | --html/<component>/*.html
        | --info/<component>/*.[pdf, md, txt]
        | --man
        | --doc
            | --<component>
                | --<licenses>
        | --<component>
            | --<misc files> (arch independent non-executable)
            | --samples

Note

ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.

For more information, refer to https://refspecs.linuxfoundation.org/fhs.shtml.

Backward Compatibility with Older Filesystems

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.

Note

ROCm will continue supporting backward compatibility until the next major release.

Wrapper header files

Wrapper header files are placed in the old location (/opt/rocm-xxx/<component>/include) with a warning message to include files from the new location (/opt/rocm-xxx/include) as shown in the example below:

// Code snippet from hip_runtime.h
#pragma message “This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip”.
#include "hip/hip_runtime.h"

The wrapper header files backward compatibility deprecation is as follows:

  • #pragma message announcing deprecation -- ROCm v5.2 release
  • #pragma message changed to #warning -- Future release
  • #warning changed to #error -- Future release
  • Backward compatibility wrappers removed -- Future release
Library files

Library files are available in the /opt/rocm-xxx/lib folder. For backward compatibility, the old library location (/opt/rocm-xxx/<component>/lib) has a soft link to the library at the new location.

Example:

$ ls -l /opt/rocm/hip/lib/
total 4
drwxr-xr-x 4 root root 4096 May 12 10:45 cmake
lrwxrwxrwx 1 root root   24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
CMake Config files

All CMake configuration files are available in the /opt/rocm-xxx/lib/cmake/<component> folder. For backward compatibility, the old CMake locations (/opt/rocm-xxx/<component>/lib/cmake) consist of a soft link to the new CMake config.

Example:

$ ls -l /opt/rocm/hip/lib/cmake/hip/
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake

Planned deprecation of hip-rocclr and hip-base packages

In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously.

Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are supported. Deprecation of Integrated HIP Directed Tests

The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in future release.

Fixed Defects

Fixed Defect Fix
ROCmInfo does not list gpus Code fix
Hang observed while restoring cooperative group samples Code fix
ROCM-SMI over SRIOV: Unsupported commands do not return proper error message Code fix

Known Issues

This section consists of known issues in this release.

Compiler Error on gfx1030 When Compiling at -O0

Issue

A compiler error occurs when using -O0 flag to compile code for gfx1030 that calls atomicAddNoRet, which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030.

Workaround

The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction.

System Freeze Observed During CUDA Memtest Checkpoint

Issue

Checkpoint/Restore in Userspace (CRIU) requires 20 MB of VRAM approximately to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications.

Workaround

To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available.

HPC test fails with the “HSA_STATUS_ERROR_MEMORY_FAULT” error

Issue

The compiler may incorrectly compile a program that uses the __shfl_sync(mask, value, srcLane) function when the "value" parameter to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior, but the definition for __shfl_sync should allow for undefined values.

Workaround

The workaround is to initialize the parameters to __shfl_sync.

Note

When the -Wall compilation flag is used, the compiler generates a warning indicating the variable is initialized along some path.

Example:

double res = 0.0; // Initialize the input to __shfl_sync.
if (lane == 0) {
  res = <some expression>
}
res = __shfl_sync(mask, res, 0);

Kernel produces incorrect result

Issue

In recent changes to Clang, insertion of the noundef attribute to all the function arguments has been enabled by default.

In the HIP kernel, variable var in shfl_sync may not be initialized, so LLVM IR treats it as undef.

So, the function argument that is potentially undef (because it is not intialized) has always been assumed to be noundef by LLVM IR (since Clang has inserted noundef attribute). This leads to ambiguous kernel execution.

Workaround
  • Skip adding noundef attribute to functions tagged with convergent attribute. Refer to https://reviews.llvm.org/D124158 for more information.

  • Introduce shuffle attribute and add it to __shfl like APIs at hip headers. Clang can skip adding noundef attribute, if it finds that argument is tagged with shuffle attribute. Refer to https://reviews.llvm.org/D125378 for more information.

  • Introduce clang builtin for __shfl to identify it and skip adding noundef attribute.

  • Introduce __builtin_freeze to use on the relevant arguments in library wrappers. The library/header need to insert freezes on the relevant inputs.

Issue with Applications Triggering Oversubscription

There is a known issue with applications that trigger oversubscription. A hardware hang occurs when ROCgdb is used on AMD Instinct™ MI50 and MI100 systems.

This issue is under investigation and will be fixed in a future release.