diff --git a/CHANGELOG.md b/CHANGELOG.md deleted file mode 100644 index 16d4032fb..000000000 --- a/CHANGELOG.md +++ /dev/null @@ -1,9283 +0,0 @@ -# ROCm consolidated changelog - -This page is a historical overview of changes made to ROCm components. This -consolidated changelog documents key modifications and improvements across -different versions of the ROCm software stack and its components. - -## ROCm 7.0.2 - -See the [ROCm 7.0.2 release notes](https://rocm.docs.amd.com/en/docs-7.0.2/about/release-notes.html#rocm-7-0-2-release-notes) -for a complete overview of this release. - -### **AMD SMI** (26.0.2) - -#### Added - -* Added `bad_page_threshold_exceeded` field to `amd-smi static --ras`, which compares retired pages count against bad page threshold. This field displays `True` if retired pages exceed the threshold, `False` if within threshold, or `N/A` if threshold data is unavailable. Note that `sudo` is required to have the `bad_page_threshold_exceeded` field populated. - -#### Removed - -* Removed gpuboard and baseboard temperatures enums in amdsmi Python Library. - * `AmdSmiTemperatureType` had issues with referencing the correct attribute. As such, the following duplicate enums have been removed: - - `AmdSmiTemperatureType.GPUBOARD_NODE_FIRST` - - `AmdSmiTemperatureType.GPUBOARD_VR_FIRST` - - `AmdSmiTemperatureType.BASEBOARD_FIRST` - -#### Resolved Issues - -* Fixed `attribute error` in `amd-smi monitor` on Linux Guest systems, where the violations argument caused CLI to break. -* Fixed certain output in `amd-smi monitor` when GPUs are partitioned. - * It fixes the amd-smi monitor such as: `amd-smi monitor -Vqt`, `amd-smi monitor -g 0 -Vqt -w 1`, `amd-smi monitor -Vqt --file /tmp/test1`, etc. These commands will now be able to display as normal in partitioned GPU scenarios. - -* Fixed an issue where using `amd-smi ras --folder ` was forcing the created folder's name to be lowercase. This fix also allows all string input options to be case insensitive. - -* Fixed an issue of some processes not being detected by AMD SMI despite making use of KFD resources. This fix, with the addition of KFD Fallback for process detection, ensures that all KFD processes will be detected. - -* Multiple CPER issues were fixed. - - Issue of being unable to query for additional CPERs after 20 were generated on a single device. - - Issue where the RAS HBM CRC read was failing due to an incorrect AFID value. - - Issue where RAS injections were not consistently producing related CPERs. - -### **HIP** (7.0.2) - -#### Added - -* Support for the `hipMemAllocationTypeUncached` flag, enabling developers to allocate uncached memory. This flag is now supported in the following APIs: - - `hipMemGetAllocationGranularity` determines the recommended allocation granularity for uncached memory. - - `hipMemCreate` allocates memory with uncached properties. - -#### Resolved issues - -* A compilation failure affecting applications that compile kernels using `hiprtc` with the compiler option `std=c++11`. -* A permission-related error occurred during the execution of `hipLaunchHostFunc`. This API is now supported and permitted to run during stream capture, aligning its behavior with CUDA. -* A numerical error during graph capture of kernels that rely on a remainder in `globalWorkSize`, in frameworks like MIOpen and PyTorch, where the grid size is not a multiple of the block size. 
To ensure correct replay behavior, HIP runtime now stores this remainder in `hip::GraphKernelNode` during `hipExtModuleLaunchKernel` capture, enabling accurate execution and preventing corruption. -* A page fault occurred during viewport rendering while running the file undo.blend in Blender. The issue was resolved by the HIP runtime, which reused the same context during image creation. -* Resolved a segmentation fault in `gpu_metrics`, which is used in threshold logic for command submission patches to GPU device(s) during CPU synchronization. - -### **hipBLAS** (3.0.2) - -#### Added - -* Enabled support for gfx1150, gfx1151, gfx1200, and gfx1201 AMD hardware. - -### **RCCL** (2.26.6) - -#### Added - -* Enabled double-buffering in `reduceCopyPacks` to trigger pipelining, especially to overlap bf16 arithmetic. -* Added `--force-reduce-pipeline` as an option that can be passed to the `install.sh` script. Passing this option will enable software-triggered pipelining `bfloat16` reductions (that is, `all_reduce`, `reduce_scatter`, and `reduce`). - -### **rocBLAS** (5.0.2) - -#### Added - -* Enabled gfx1150 and gfx1151. -* The `ROCBLAS_USE_HIPBLASLT_BATCHED` variable to independently control the batched hipblaslt backend. Set `ROCBLAS_USE_HIPBLASLT_BATCHED=0` to disable batched GEMM use of the hipblaslt backend. - -#### Resolved issues - -* Set the imaginary portion of the main diagonal of the output matrix to zero in syrk and herk. - -### **ROCdbgapi** (0.77.4) - -#### Added - -* ROCdbgapi documentation link in the README.md file. - -### **ROCm Systems Profiler** (1.1.1) - -#### Resolved issues - -* Fixed an issue where ROC-TX ranges were displayed as two separate events instead of a single spanning event. - -### **rocPRIM** (4.0.1) - -#### Resolved issues - -* Fixed compilation issue when using `rocprim::texture_cache_iterator`. -* Fixed a HIP version check used to determine whether `hipStreamLegacy` is supported. This resolves runtime errors that occur when `hipStreamLegacy` is used in ROCm 7.0.0 and later. - -### **rocSPARSE** (4.0.3) - -#### Resolved issues - -* Fixed an issue causing premature deallocation of internal buffers while still in use. - -### **rocSOLVER** (3.30.1) - -#### Optimized - -Improved the performance of: - -* LARFT and downstream functions such as GEQRF and ORMTR. -* LARF and downstream functions such as GEQR2. -* ORMTR and downstream functions such as SYEVD. -* GEQR2 and downstream functions such as GEQRF. - -## ROCm 7.0.1 - -ROCm 7.0.1 is a quality release that resolves the existing issue. There is no change in component from the previous ROCm 7.0.0 release. See the [ROCm 7.0.1 release notes](https://rocm.docs.amd.com/en/docs-7.0.1/about/release-notes.html#rocm-7-0-1-release-notes) for a complete overview of this release. - -## ROCm 7.0.0 - -See the [ROCm 7.0.0 release notes](https://rocm.docs.amd.com/en/docs-7.0.0/about/release-notes.html#rocm-7-0-0-release-notes) -for a complete overview of this release. - -### **AMD SMI** (26.0.0) - -#### Added - -* Ability to restart the AMD GPU driver from the CLI and API. - - `amdsmi_gpu_driver_reload()` API and `amd-smi reset --reload-driver` or `amd-smi reset -r` CLI options. - - Driver reload functionality is now separated from memory partition - functions; memory partition change requests should now be followed by a driver reload. - - Driver reload requires all GPU activity on all devices to be stopped. - -* Default command: - - A default view has been added. 
The default view provides a snapshot of commonly requested information such as bdf, current partition mode, version information, and more. Users can access that information by simply typing `amd-smi` with no additional commands or arguments. Users may also obtain this information through alternate output formats such as json or csv by using the default command with the respective output format: `amd-smi default --json` or `amd-smi default --csv`. - -* Support for GPU metrics 1.8: - - Added new fields for `amdsmi_gpu_xcp_metrics_t` including: - - Metrics to allow new calculations for violation status: - - Per XCP metrics `gfx_below_host_limit_ppt_acc[XCP][MAX_XCC]` - GFX Clock Host limit Package Power Tracking violation counts - - Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts - - Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to be below application clocks. - - Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see the new violation metrics above). - - Increased available JPEG engines to 40. Current ASICs might not support all 40. These are indicated as `UINT16_MAX` or `N/A` in CLI. - -* Bad page threshold count. - - Added `amdsmi_get_gpu_bad_page_threshold` to Python API and CLI; root/sudo permissions are required to display the count. - -* CPU model name for RDC. - - Added new C and Python API `amdsmi_get_cpu_model_name`. - - Not sourced from the esmi library. - -* New API `amdsmi_get_cpu_affinity_with_scope()`. - -* `socket power` to `amdsmi_get_power_info` - - Previously, the C API had the value in the `amdsmi_power_info` structure, but it was unused. - - The value is representative of the socket's power, agnostic of the GPU version. - -* New event notification types to `amdsmi_evt_notification_type_t`. - The following values were added to the `amdsmi_evt_notification_type_t` enum: - - `AMDSMI_EVT_NOTIF_EVENT_MIGRATE_START` - - `AMDSMI_EVT_NOTIF_EVENT_MIGRATE_END` - - `AMDSMI_EVT_NOTIF_EVENT_PAGE_FAULT_START` - - `AMDSMI_EVT_NOTIF_EVENT_PAGE_FAULT_END` - - `AMDSMI_EVT_NOTIF_EVENT_QUEUE_EVICTION` - - `AMDSMI_EVT_NOTIF_EVENT_QUEUE_RESTORE` - - `AMDSMI_EVT_NOTIF_EVENT_UNMAP_FROM_GPU` - - `AMDSMI_EVT_NOTIF_PROCESS_START` - - `AMDSMI_EVT_NOTIF_PROCESS_END` - -- Power cap to `amd-smi monitor`. - - `amd-smi monitor -p` will display the power cap along with power. - -#### Changed - -* Separated driver reload functionality from `amdsmi_set_gpu_memory_partition()` and - `amdsmi_set_gpu_memory_partition_mode()` APIs -- and from the CLI `amd-smi set -M `. - -* Disabled `amd-smi monitor --violation` on guests. Modified `amd-smi metric -T/--throttle` to alias to `amd-smi metric -v/--violation`. - -* Updated `amdsmi_get_clock_info` in `amdsmi_interface.py`. - - The `clk_deep_sleep` field now returns the sleep integer value. - -* The `amd-smi topology` command has been enabled for guest environments. - - This includes full functionality so users can use the command just as they would in bare metal environments. - -* Expanded violation status tracking for GPU metrics 1.8. - - The driver no longer supports the existing single-value GFX clock below host limit fields (`acc_gfx_clk_below_host_limit`, `per_gfx_clk_below_host_limit`, `active_gfx_clk_below_host_limit`); they have been replaced by new per-XCP/XCC arrays.
- - Added new fields to `amdsmi_violation_status_t` and related interfaces for enhanced violation breakdown: - - Per-XCP/XCC accumulators and status for: - - GFX clock below host limit (power, thermal, and total) - - Low utilization - - Added 2D arrays to track per-XCP/XCC accumulators, percentage, and active status: - - `acc_gfx_clk_below_host_limit_pwr`, `acc_gfx_clk_below_host_limit_thm`, `acc_gfx_clk_below_host_limit_total` - - `per_gfx_clk_below_host_limit_pwr`, `per_gfx_clk_below_host_limit_thm`, `per_gfx_clk_below_host_limit_total` - - `active_gfx_clk_below_host_limit_pwr`, `active_gfx_clk_below_host_limit_thm`, `active_gfx_clk_below_host_limit_total` - - `acc_low_utilization`, `per_low_utilization`, `active_low_utilization` - - Python API and CLI now report these expanded fields. - -* The char arrays in the following structures have been changed. - - `amdsmi_vbios_info_t` member `build_date` changed from `AMDSMI_MAX_DATE_LENGTH` to `AMDSMI_MAX_STRING_LENGTH`. - - `amdsmi_dpm_policy_entry_t` member `policy_description` changed from `AMDSMI_MAX_NAME` to `AMDSMI_MAX_STRING_LENGTH`. - - `amdsmi_name_value_t` member `name` changed from `AMDSMI_MAX_NAME` to `AMDSMI_MAX_STRING_LENGTH`. - -* For backwards compatibility, updated `amdsmi_bdf_t` union to have an identical unnamed struct. - -* Updated `amdsmi_get_temp_metric` and `amdsmi_temperature_type_t` with new values. - - Added new values to `amdsmi_temperature_type_t` representing various baseboard and GPU board temperature measures. - - Updated `amdsmi_get_temp_metric` API to be able to take in and return the respective values for the new temperature types. - -#### Removed - -- Unnecessary API, `amdsmi_free_name_value_pairs()` - - This API is only used internally to free up memory from the Python interface and does not need to be - exposed to the user. - -- Unused definitions: - - `AMDSMI_MAX_NAME`, `AMDSMI_256_LENGTH`, `AMDSMI_MAX_DATE_LENGTH`, `MAX_AMDSMI_NAME_LENGTH`, `AMDSMI_LIB_VERSION_YEAR`, - `AMDSMI_DEFAULT_VARIANT`, `AMDSMI_MAX_NUM_POWER_PROFILES`, `AMDSMI_MAX_DRIVER_VERSION_LENGTH`. - -- Unused member `year` in struct `amdsmi_version_t`. - -- `amdsmi_io_link_type_t` has been replaced with `amdsmi_link_type_t`. - - `amdsmi_io_link_type_t` is no longer needed as `amdsmi_link_type_t` is sufficient. - - `amdsmi_link_type_t` enum has changed; primarily, the ordering of the PCI and XGMI types. - - This change will also affect `amdsmi_link_metrics_t`, where the link_type field changes from `amdsmi_io_link_type_t` to `amdsmi_link_type_t`. - -- `amdsmi_get_power_info_v2()`. - - The ``amdsmi_get_power_info()`` has been unified and the v2 function is no longer needed or used. - -- `AMDSMI_EVT_NOTIF_RING_HANG` event notification type in `amdsmi_evt_notification_type_t`. - -- The `amdsmi_get_gpu_vram_info` now provides vendor names as a string. - - `amdsmi_vram_vendor_type_t` enum structure is removed. - - `amdsmi_vram_info_t` member named `amdsmi_vram_vendor_type_t` is changed to a character string. - - `amdsmi_get_gpu_vram_info` now no longer requires decoding the vendor name as an enum. - -- Backwards compatibility for `amdsmi_get_gpu_metrics_info()`'s,`jpeg_activity`and `vcn_activity` fields. Alternatively use `xcp_stats.jpeg_busy` or `xcp_stats.vcn_busy`. - - Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available. - - Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion about which field to use. 
By removing backward compatibility, it is easier to identify the relevant field. - - The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`. - -#### Optimized - -- Reduced the number of API calls the ``amd-smi`` CLI needs to make before reading or (re)setting GPU features. This - improves overall runtime performance of the CLI. - -- Removed partition information from the default `amd-smi static` CLI command. - - Users can still retrieve the same data by calling `amd-smi`, `amd-smi static -p`, or `amd-smi partition -c -m`/`sudo amd-smi partition -a`. - - Reading `current_compute_partition` may momentarily wake the GPU up. This is due to reading XCD registers, which is expected behavior. Changing partitions is not a trivial operation; the `current_compute_partition` SYSFS entry controls this action. - -- Optimized CLI command `amd-smi topology` in partition mode. - - Reduced the number of `amdsmi_topo_get_p2p_status` API calls to one fourth. - -#### Resolved issues - -- Removed duplicated GPU IDs when receiving events using the `amd-smi event` command. - -- Fixed `amd-smi monitor` decoder utilization (`DEC%`) not showing up on MI300 Series ASICs. - -#### Known issues - -- `amd-smi monitor` on Linux Guest systems triggers an attribute error. - -```{note} -See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-7.0/CHANGELOG.md) for details, examples, and in-depth descriptions. -``` - -### **Composable Kernel** (1.1.0) - -#### Added - -* Support for `BF16`, `F32`, and `F16` for 2D and 3D NGCHW grouped convolution backward data. -* Fully asynchronous HOST (CPU) arguments copy flow for CK grouped GEMM kernels. -* Support for GKCYX layout for grouped convolution forward (NGCHW/GKCYX/NGKHW); the number of instances in the instance factory for NGCHW/GKYXC/NGKHW has been reduced. -* Support for GKCYX layout for grouped convolution backward weight (NGCHW/GKCYX/NGKHW). -* Support for GKCYX layout for grouped convolution backward data (NGCHW/GKCYX/NGKHW). -* Support for Stream-K version of mixed `FP8` / `BF16` GEMM. -* Support for Multiple D GEMM. -* GEMM pipeline for microscaling (MX) `FP8` / `FP6` / `FP4` data types. -* Support for `FP16` 2:4 structured sparsity to universal GEMM. -* Support for Split K for grouped convolution backward data. -* Logit soft-capping support for fMHA forward kernels. -* Support for hdim as a multiple of 32 for FMHA (fwd/fwd_splitkv). -* Benchmarking support for tile engine GEMM. -* Ping-pong scheduler support for GEMM operation along the K dimension. -* Rotating buffer feature for CK_Tile GEMM. -* `int8` support for CK_TILE GEMM. -* Vectorize Transpose optimization for CK Tile. -* Asynchronous copy for gfx950. - -#### Changed - -* Replaced the raw buffer load/store intrinsics with Clang20 built-ins. -* DL and DPP kernels are now enabled by default. -* Number of instances in instance factory for grouped convolution forward NGCHW/GKYXC/NGKHW has been reduced. -* Number of instances in instance factory for grouped convolution backward weight NGCHW/GKYXC/NGKHW has been reduced. -* Number of instances in instance factory for grouped convolution backward data NGCHW/GKYXC/NGKHW has been reduced. - -#### Removed - -* Removed support for gfx940 and gfx941 targets. - -#### Optimized - -* Optimized the GEMM multiply preshuffle and LDS bypass with Pack of KGroup and better instruction layout.
- -### **HIP** 7.0.0 - -#### Added - -* New HIP APIs - - `hipLaunchKernelEx` dispatches the provided kernel with the given launch configuration and forwards the kernel arguments. - - `hipLaunchKernelExC` launches a HIP kernel using a generic function pointer and the specified configuration. - - `hipDrvLaunchKernelEx` dispatches the device kernel represented by a HIP function object. - - `hipMemGetHandleForAddressRange` gets a handle for the address range requested. - - `__reduce_add_sync`, `__reduce_min_sync`, and `__reduce_max_sync` functions added for arithmetic reduction across lanes of a warp, and `__reduce_and_sync`, `__reduce_or_sync`, and `__reduce_xor_sync` -functions added for logical reduction. For details, see [Warp cross-lane functions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warp-cross-lane-functions). -* New support for Open Compute Project (OCP) floating-point `FP4`/`FP6`/`FP8` as follows. For details, see [Low precision floating point document](https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html). - - Data types for `FP4`/`FP6`/`FP8`. - - HIP APIs for `FP4`/`FP6`/`FP8`, which are compatible with corresponding CUDA APIs. - - HIP Extensions APIs for microscaling formats, which are supported on AMD GPUs. -* New `wptr` and `rptr` values in `ClPrint`, for better logging in dispatch barrier methods. -* The `_sync()` versions of crosslane builtins such as `shfl_sync()` are enabled by default. These can be disabled by setting the preprocessor macro `HIP_DISABLE_WARP_SYNC_BUILTINS`. -* Added `constexpr` operators for `fp16`/`bf16`. -* Added warp level primitives: `__syncwarp` and reduce intrinsics (e.g. `__reduce_add_sync()`). -* Support for the following flags, which now allow uncached memory allocation. - - `hipExtHostRegisterUncached`, used in `hipHostRegister`. - - `hipHostMallocUncached` and `hipHostAllocUncached`, used in `hipHostMalloc` and `hipHostAlloc`. -* `num_threads` returns the total number of threads in the group. The legacy API `size` is an alias. -* Added PCI CHIP ID information as a device attribute. -* Added new test applications for OCP data types `FP4`/`FP6`/`FP8`. -* A new attribute was implemented in the HIP runtime which exposes a new device capability: how many compute dies (chiplets, XCCs) are available on a given GPU. Developers can query this attribute via the API `hipDeviceGetAttribute` to make use of the best cache locality in a kernel and optimize the kernel launch grid layout for improved performance. - -#### Changed -* Some unsupported GPUs such as gfx9, gfx8, and gfx7 are deprecated on Microsoft Windows. -* Removal of beta warnings in HIP Graph APIs. All beta warnings in usage of HIP Graph APIs are removed; they are now officially and fully supported. -* `warpSize` has changed. -In order to match the CUDA specification, the `warpSize` variable is no longer `constexpr`. In general, this should be a transparent change; however, if an application was using `warpSize` as a compile-time constant, it will have to be updated to handle the new definition. For more information, see the discussion of `warpSize` within the [HIP C++ language extensions](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize). -* Behavior changes - - `hipGetLastError` now returns the error code which is the last actual error caught in the current thread during the application execution.
- For cooperative groups in the `hipLaunchCooperativeKernelMultiDevice` and `hipLaunchCooperativeKernel` functions, additional input parameter validation checks are added. - - `hipPointerGetAttributes` returns `hipSuccess` instead of the invalid value error `hipErrorInvalidValue` in case a `NULL` host or attribute pointer is passed as an input parameter. It now matches the functionality of `cudaPointerGetAttributes`, which changed with CUDA 11 and above releases. - - `hipFree` previously performed an implicit wait, applicable to all memory allocations, for synchronization purposes. This wait is now disabled for allocations made with `hipMallocAsync` and `hipMallocFromPoolAsync`, to match the behavior of the CUDA API `cudaFree`. - - `hipFreeAsync` now returns `hipSuccess` when the input pointer is NULL, instead of `hipErrorInvalidValue`, to be consistent with `hipFree`. - - Exceptions occurring during a kernel execution will not abort the process anymore but will return an error unless core dump is enabled. -* Changes in hipRTC. - - Removal of `hipRTC` symbols from HIP Runtime Library. - Any application using `hipRTC` APIs should link explicitly with the `hipRTC` library. This makes the usage of the `hipRTC` library on Linux the same as on Windows and matches the behavior of CUDA `nvRTC`. - - `hipRTC` compilation - The device code compilation now uses the namespace `__hip_internal`, instead of the standard headers `std`, to avoid namespace collision. - - Changes of datatypes from `hipRTC`. - Datatype definitions such as `int64_t`, `uint64_t`, `int32_t`, and `uint32_t`, etc. are removed to avoid any potential conflicts in some applications. HIP now uses internal datatypes instead, prefixed with `__hip`, for example, `__hip_int64_t`. -* HIP header clean up - - Usage of STD headers: HIP header files now only include necessary STL headers. - - Deprecated structure `HIP_MEMSET_NODE_PARAMS` is removed. Developers can use the definition `hipMemsetParams` instead. -* API signature/struct changes - - API signatures are adjusted in some APIs to match corresponding CUDA APIs. Impacted APIs are as follows: - * `hiprtcCreateProgram` - * `hiprtcCompileProgram` - * `hipMemcpyHtoD` - * `hipCtxGetApiVersion` - - HIP struct change in `hipMemsetParams`; it is updated to be compatible with CUDA. - - HIP vector constructor change: `hipComplex` initialization now generates correct values. The affected constructors are for small vector types such as `float2`, `int4`, etc. -* Stream Capture updates - - Restricted stream capture mode; this is enforced in HIP APIs by adding the macro `CHECK_STREAM_CAPTURE_SUPPORTED()`. -In the previous HIP enumeration `hipStreamCaptureMode`, three capture modes were defined. With the check in the macro, the only supported stream capture mode is now `hipStreamCaptureModeRelaxed`. The rest are not supported, and the macro will return `hipErrorStreamCaptureUnsupported`. This update involves the following APIs, which are allowed only in relaxed stream capture mode: - * `hipMallocManaged` - * `hipMemAdvise` - - Checks stream capture mode: the following APIs check the stream capture mode and return error codes to match the behavior of CUDA. - * `hipLaunchCooperativeKernelMultiDevice` - * `hipEventQuery` - * `hipStreamAddCallback` - - Returns error during stream capture.
The following HIP APIs now return the specific error `hipErrorStreamCaptureUnsupported` on the AMD platform, instead of always returning `hipSuccess`, to match behavior with CUDA: - * `hipDeviceSetMemPool` - * `hipMemPoolCreate` - * `hipMemPoolDestroy` - * `hipDeviceSetSharedMemConfig` - * `hipDeviceSetCacheConfig` - * `hipMemcpyWithStream` -* Error code update -Returned error/value codes are updated in the following HIP APIs to match the corresponding CUDA APIs. - - Module Management Related APIs: - * `hipModuleLaunchKernel` - * `hipExtModuleLaunchKernel` - * `hipExtLaunchKernel` - * `hipDrvLaunchKernelEx` - * `hipLaunchKernel` - * `hipLaunchKernelExC` - * `hipModuleLaunchCooperativeKernel` - * `hipModuleLoad` - - Texture Management Related APIs: -The following APIs update the return codes to match the behavior with CUDA: - * `hipTexObjectCreate` supports zero width and height for a 2D image. If either is zero, it will not return `false`. - * `hipBindTexture2D` adds an extra check: if the pointer for the texture reference or device is NULL, it returns `hipErrorNotFound`. - * `hipBindTextureToArray` returns the error `hipErrorInvalidChannelDescriptor`, instead of `hipErrorInvalidValue`, if any NULL pointer is input for the texture object, resource descriptor, or texture descriptor. - * `hipGetTextureAlignmentOffset` adds a return code `hipErrorInvalidTexture` when the texture reference pointer is NULL. - - Cooperative Group Related APIs: more validations are added in the following API implementations: - * `hipLaunchCooperativeKernelMultiDevice` - * `hipLaunchCooperativeKernel` -* Invalid stream input parameter handling -In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Previously, the HIP runtime returned an error code `hipErrorContextIsDestroyed` if the stream was invalid. In CUDA version 12 and later, the equivalent behavior is to raise a segmentation fault. The HIP runtime now matches CUDA by causing a segmentation fault. The APIs impacted by this change are as follows: - - Stream Management Related APIs - * `hipStreamGetCaptureInfo` - * `hipStreamGetPriority` - * `hipStreamGetFlags` - * `hipStreamDestroy` - * `hipStreamAddCallback` - * `hipStreamQuery` - * `hipLaunchHostFunc` - - Graph Management Related APIs - * `hipGraphUpload` - * `hipGraphLaunch` - * `hipStreamBeginCaptureToGraph` - * `hipStreamBeginCapture` - * `hipStreamIsCapturing` - * `hipStreamGetCaptureInfo` - * `hipGraphInstantiateWithParams` - - Memory Management Related APIs - * `hipMemcpyPeerAsync` - * `hipMemcpy2DValidateParams` - * `hipMallocFromPoolAsync` - * `hipFreeAsync` - * `hipMallocAsync` - * `hipMemcpyAsync` - * `hipMemcpyToSymbolAsync` - * `hipStreamAttachMemAsync` - * `hipMemPrefetchAsync` - * `hipDrvMemcpy3D` - * `hipDrvMemcpy3DAsync` - * `hipDrvMemcpy2DUnaligned` - * `hipMemcpyParam2D` - * `hipMemcpyParam2DAsync` - * `hipMemcpy2DArrayToArray` - * `hipMemcpy2D` - * `hipMemcpy2DAsync` - * `hipDrvMemcpy2DUnaligned` - * `hipMemcpy3D` - - Event Management Related APIs - * `hipEventRecord` - * `hipEventRecordWithFlags` - -#### Optimized - -HIP runtime has the following functional improvements, which improve runtime performance and user experience: - -* Reduced usage of the lock scope in events and kernel handling. - - Switches to `shared_mutex` for event validation and uses `std::unique_lock` in the HIP runtime to create/destroy events, instead of `scopedLock`. - - Reduces the `scopedLock` in handling of kernel execution.
The HIP runtime now calls `scopedLock` during kernel binary creation/initialization and doesn't call it again during kernel vector iteration before launch. -* Unified the managed buffer and kernel argument buffer so the HIP runtime doesn't need to create/load a separate kernel argument buffer. -* Refactored memory validation, creating a unique function to validate a variety of memory copy operations. -* Improved kernel logging by demangling shader names. -* Advanced support for SPIRV: kernel compilation caching is now enabled by default. This feature is controlled by the environment variable `AMD_COMGR_CACHE`; for details, see [hip_rtc document](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_rtc.html). -* Programmatic support for scratch limits on AMD Instinct MI300 and MI350 Series GPU devices. More enumeration values were added in `hipLimit_t` as follows: - - `hipExtLimitScratchMin`, minimum allowed value in bytes for the scratch limit on the device. - - `hipExtLimitScratchMax`, maximum allowed value in bytes for the scratch limit on the device. - - `hipExtLimitScratchCurrent`, current scratch limit threshold in bytes on the device. Must be between the values `hipExtLimitScratchMin` and `hipExtLimitScratchMax`. - Developers can now use the environment variable `HSA_SCRATCH_SINGLE_LIMIT_ASYNC` to change the default allocation size to the expected scratch limit in the ROCR runtime. In addition, this value can also be overridden programmatically in the application using the HIP API `hipDeviceSetLimit(hipExtLimitScratchCurrent, value)` to reset the scratch limit value. -* HIP runtime now enables peer-to-peer (P2P) memory copies to utilize all available SDMA engines, rather than being limited to a single engine. It also selects the best engine first to give optimal bandwidth. -* Improved launch latency for `D2D` copies and `memset` on MI300 Series. -* Introduced a threshold to handle the command submission patch to the GPU device(s), considering the synchronization with the CPU, for performance improvement. - -#### Resolved issues - -* An "unable to find modules" error in HIP cleanup for the code object module. -* The issue of the incorrect return error `hipErrorNoDevice` when a crash occurred on a GPU device due to an illegal operation or memory violation. The HIP runtime now handles the failure on the GPU side properly and reports the precise error code based on the last error seen on the GPU. -* Failures in some framework test applications; the HIP runtime fixed the bug in retrieving a memory object from the IPC memory handle. -* A crash in a TensorFlow-related application. The HIP runtime now combines multiple definitions of `callbackQueue` into a single function and, in case of an exception, passes its handler to the application and provides the corresponding error code. -* Fixed an issue in handling the kernel parameters for graph launch. -* Failures in roc-obj tools. The HIP runtime now emits the `DEPRECATED` message in roc-obj tools on `STDERR`. -* Support of the `hipDeviceMallocContiguous` flag in `hipExtMallocWithFlags()`. It now enables `HSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAG` in the memory pool allocation on the GPU device. -* A compilation failure; the HIP runtime refactored the vector type alignment with `__hip_vec_align_v`. -* A numerical error/corruption found in PyTorch during graph replay. The HIP runtime fixed the input sizes of kernel launch dimensions in `hipExtModuleLaunchKernel` for the execution of hipGraph capture. -* A crash during kernel execution in a customer application.
The structure of kernel arguments was updated via adding the size of kernel arguments, and HIP runtime does validation before launch kernel with the structured arguments. -* Compilation error when using bfloat16 functions. HIP runtime removed the anonymous namespace from FP16 functions to resolve this issue. - -#### Known issues - -* `hipLaunchHostFunc` returns an error during stream capture. Any application using `hipLaunchHostFunc` might fail to capture graphs during stream capture, instead, it returns `hipErrorStreamCaptureUnsupported`. -* Compilation failure in kernels via hiprtc when using option `std=c++11`. - -### **hipBLAS** (3.0.0) - -#### Added - -* Added the `hipblasSetWorkspace()` API. -* Support for codecoverage tests. - -#### Changed - -* HIPBLAS_V2 API is the only available API using the `hipComplex` and `hipDatatype` types. -* Documentation updates. -* Verbose compilation for `hipblas.cpp`. - -#### Removed - -* `hipblasDatatype_t` type. -* `hipComplex` and `hipDoubleComplex` types. -* Support code for non-production gfx targets. - -#### Resolved issues - -* The build time `CMake` configuration for the dependency on `hipBLAS-common` is fixed. -* Compiler warnings for unhandled enumerations have been resolved. - -### **hipBLASLt** (1.0.0) - -#### Added - -* Stream-K GEMM support has been enabled for the `FP32`, `FP16`, `BF16`, `FP8`, and `BF8` data types on the Instinct MI300A APU. To activate this feature, set the `TENSILE_SOLUTION_SELECTION_METHOD` environment variable to `2`, for example, `export TENSILE_SOLUTION_SELECTION_METHOD=2`. -* Fused Swish/SiLU GEMM (enabled by ``HIPBLASLT_EPILOGUE_SWISH_EXT`` and ``HIPBLASLT_EPILOGUE_SWISH_BIAS_EXT``). -* Support for ``HIPBLASLT_EPILOGUE_GELU_AUX_BIAS`` for gfx942. -* `HIPBLASLT_TUNING_USER_MAX_WORKSPACE` to constrain the maximum workspace size for user offline tuning. -* ``HIPBLASLT_ORDER_COL16_4R16`` and ``HIPBLASLT_ORDER_COL16_4R8`` to ``hipblasLtOrder_t`` to support `FP16`/`BF16` swizzle GEMM and `FP8` / `BF8` swizzle GEMM respectively. -* TF32 emulation on gfx950. -* Support for `FP6`, `BF6`, and `FP4` on gfx950. -* Support for block scaling by setting `HIPBLASLT_MATMUL_DESC_A_SCALE_MODE` and `HIPBLASLT_MATMUL_DESC_B_SCALE_MODE` to `HIPBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0`. - -#### Changed - -* The non-V2 APIs (``GemmPreference``, ``GemmProblemType``, ``GemmEpilogue``, ``GemmTuning``, ``GemmInputs``) in the cpp header are now the same as the V2 APIs (``GemmPreferenceV2``, ``GemmProblemTypeV2``, ``GemmEpilogueV2``, ``GemmTuningV2``, ``GemmInputsV2``). The original non-V2 APIs are removed. - -#### Removed - -* ``HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER_VEC_EXT`` and ``HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER_VEC_EXT`` are removed. Use the ``HIPBLASLT_MATMUL_DESC_A_SCALE_MODE`` and ``HIPBLASLT_MATMUL_DESC_B_SCALE_MODE`` attributes to set scalar (``HIPBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F``) or vector (``HIPBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F``) attributes. -* The `hipblasltExtAMaxWithScale` API is removed. - -#### Optimized - -* Improved performance for 8-bit (`FP8` / `BF8` / `I8`) NN/NT cases by adding ``s_delay_alu`` to reduce stalls from dependent ALU operations on gfx12+. -* Improved performance for 8-bit and 16-bit (`FP16` / `BF16`) TN cases by enabling software dependency checks (Expert Scheduling Mode) under certain restrictions to reduce redundant hardware dependency checks on gfx12+. -* Improved performance for 8-bit, 16-bit, and 32-bit batched GEMM with a better heuristic search algorithm for gfx942. 
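The block-scaling and fused Swish/SiLU epilogue entries listed above for hipBLASLt are all configured through the matmul-descriptor attribute interface. The following is a minimal, hedged sketch of that wiring only, not a complete GEMM call; the `int32_t` storage for the scale mode and the `HIP_R_32F` scale type are assumptions rather than details taken from this changelog.

```cpp
#include <hipblaslt/hipblaslt.h>
#include <cstdio>

// Sketch: create a matmul descriptor and attach the attributes named in this
// release. The value types used for the scale-mode attributes are assumptions;
// check the hipBLASLt headers for the exact types.
int main() {
    hipblasLtHandle_t handle;
    if (hipblasLtCreate(&handle) != HIPBLAS_STATUS_SUCCESS) return 1;

    hipblasLtMatmulDesc_t matmul;
    if (hipblasLtMatmulDescCreate(&matmul, HIPBLASLT_COMPUTE_F32, HIP_R_32F) !=
        HIPBLAS_STATUS_SUCCESS) return 1;

    // Block scaling: request UE8M0 scales applied per 32-element vector.
    int32_t scale_mode = HIPBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0;
    hipblasLtMatmulDescSetAttribute(matmul, HIPBLASLT_MATMUL_DESC_A_SCALE_MODE,
                                    &scale_mode, sizeof(scale_mode));
    hipblasLtMatmulDescSetAttribute(matmul, HIPBLASLT_MATMUL_DESC_B_SCALE_MODE,
                                    &scale_mode, sizeof(scale_mode));

    // Fused Swish/SiLU epilogue introduced in this release.
    hipblasLtEpilogue_t epilogue = HIPBLASLT_EPILOGUE_SWISH_EXT;
    hipblasLtMatmulDescSetAttribute(matmul, HIPBLASLT_MATMUL_DESC_EPILOGUE,
                                    &epilogue, sizeof(epilogue));

    std::printf("descriptor configured\n");
    hipblasLtMatmulDescDestroy(matmul);
    hipblasLtDestroy(handle);
    return 0;
}
```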
- -#### Upcoming changes - -* V2 APIs (``GemmPreferenceV2``, ``GemmProblemTypeV2``, ``GemmEpilogueV2``, ``GemmTuningV2``, ``GemmInputsV2``) are deprecated. - -### **hipCUB** (4.0.0) - -#### Added - -* A new cmake option, `BUILD_OFFLOAD_COMPRESS`. When hipCUB is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful in cases where you are compiling for a large number of targets, since this often results in a large binary. Without compression, in some cases, the generated binary may become so large that symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default. -* Single pass operators in `agent/single_pass_scan_operators.hpp` which contains the following API: - * `BlockScanRunningPrefixOp` - * `ScanTileStatus` - * `ScanTileState` - * `ReduceByKeyScanTileState` - * `TilePrefixCallbackOp` -* Support for gfx950. -* An overload of `BlockScan::InclusiveScan` that accepts an initial value to seed the scan. -* An overload of `WarpScan::InclusiveScan` that accepts an initial value to seed the scan. -* `UnrolledThreadLoad`, `UnrolledCopy`, and `ThreadLoadVolatilePointer` were added to align hipCUB with CUB. -* `ThreadStoreVolatilePtr` and the `IterateThreadStore` struct were added to align hipCUB with CUB. -* `hipcub::InclusiveScanInit` for CUB parity. - -#### Changed - -* The CUDA backend now requires CUB, Thrust, and libcu++ 2.7.0. If they aren't found, they will be downloaded from the CUDA CCCL repository. -* Updated `thread_load` and `thread_store` to align hipCUB with CUB. -* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, (for example, `hipcub::HIPCUB_300400_NS::symbol` instead of `hipcub::symbol`), letting the user link multiple libraries built with different versions of hipCUB. -* Modified the broadcast kernel in warp scan benchmarks. The reported performance may be different to previous versions. -* The `hipcub::detail::accumulator_t` in rocPRIM backend has been changed to utilise `rocprim::accumulator_t`. -* The usage of `rocprim::invoke_result_binary_op_t` has been replaced with `rocprim::accumulator_t`. - -#### Removed - -* The AMD GPU targets `gfx803` and `gfx900` are no longer built by default. If you want to build for these architectures, specify them explicitly in the `AMDGPU_TARGETS` cmake option. -* Deprecated `hipcub::AsmThreadLoad` is removed, use `hipcub::ThreadLoad` instead. -* Deprecated `hipcub::AsmThreadStore` is removed, use `hipcub::ThreadStore` instead. -* Deprecated `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed. -* This release removes support for custom builds on gfx940 and gfx941. -* Removed C++14 support. Only C++17 is supported. - -#### Resolved issues - -* Fixed an issue where `Sort(keys, compare_op, valid_items, oob_default)` in `block_merge_sort.hpp` would not fill in elements that are out of range (items after `valid_items`) with `oob_default`. -* Fixed an issue where `ScatterToStripedFlagged` in `block_exhange.hpp` was calling the wrong function. - -#### Known issues - -* `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` have been removed from hipCUB's CUB backend. 
They were already deprecated as of version 2.12.0 of hipCUB and they were removed from CCCL (CUB) as of CCCL's 2.6.0 release. -* `BlockScan::InclusiveScan` for the CUDA backend does not compute the block aggregate correctly when passing an initial value parameter. This behavior is not matched by the AMD backend. - -#### Upcoming changes - -* `BlockAdjacentDifference::FlagHeads`, `BlockAdjacentDifference::FlagTails` and `BlockAdjacentDifference::FlagHeadsAndTails` were deprecated as of version 2.12.0 of hipCUB, and will be removed from the rocPRIM backend in a future release for the next ROCm major version (ROCm 7.0.0). - -### **hipFFT** (1.0.20) - -#### Added - -* Support for gfx950. - -#### Removed - -* Removed hipfft-rider legacy compatibility from clients. -* Removed support for the gfx940 and gfx941 targets from the client programs. -* Removed backward compatibility symlink for include directories. - -### **hipfort** (0.7.0) - -#### Added - -* Documentation clarifying how hipfort is built for the CUDA platform. - -#### Changed - -* Updated and reorganized documentation for clarity and consistency. - -### **HIPIFY** (20.0.0) - -#### Added - -* CUDA 12.9.1 support. -* cuDNN 9.11.0 support. -* cuTENSOR 2.2.0.0 support. -* LLVM 20.1.8 support. - -#### Resolved issues - -* `hipDNN` support is removed by default. -* [#1859](https://github.com/ROCm/HIPIFY/issues/1859)[hipify-perl] Fix warnings on unsupported Driver or Runtime APIs which were erroneously not reported. -* [#1930](https://github.com/ROCm/HIPIFY/issues/1930) Revise `JIT API`. -* [#1962](https://github.com/ROCm/HIPIFY/issues/1962) Support for cuda-samples helper headers. -* [#2035](https://github.com/ROCm/HIPIFY/issues/2035) Removed `const_cast;` in `hiprtcCreateProgram` and `hiprtcCompileProgram`. - -### **hipRAND** (3.0.0) - -#### Added - -* Support for gfx950. - -#### Changed - -* Deprecated the hipRAND Fortran API in favor of hipfort. - -#### Removed - -* Removed C++14 support, so only C++17 is supported. - -### **hipSOLVER** (3.0.0) - -#### Added - -* Added compatibility-only functions: - * csrlsvqr - * `hipsolverSpCcsrlsvqr`, `hipsolverSpZcsrlsvqr` - -#### Resolved issues - -* Corrected the value of `lwork` returned by various `bufferSize` functions to be consistent with CUDA cuSOLVER. The following functions now return `lwork` so that the workspace size (in bytes) is `sizeof(T) * lwork`, rather than `lwork`. To restore the original behavior, set the environment variable `HIPSOLVER_BUFFERSIZE_RETURN_BYTES`. - * `hipsolverXorgbr_bufferSize`, `hipsolverXorgqr_bufferSize`, `hipsolverXorgtr_bufferSize`, `hipsolverXormqr_bufferSize`, `hipsolverXormtr_bufferSize`, `hipsolverXgesvd_bufferSize`, `hipsolverXgesvdj_bufferSize`, `hipsolverXgesvdBatched_bufferSize`, `hipsolverXgesvdaStridedBatched_bufferSize`, `hipsolverXsyevd_bufferSize`, `hipsolverXsyevdx_bufferSize`, `hipsolverXsyevj_bufferSize`, `hipsolverXsyevjBatched_bufferSize`, `hipsolverXsygvd_bufferSize`, `hipsolverXsygvdx_bufferSize`, `hipsolverXsygvj_bufferSize`, `hipsolverXsytrd_bufferSize`, `hipsolverXsytrf_bufferSize`. - -### **hipSPARSE** (4.0.1) - -#### Added - -* `int8`, `int32`, and `float16` data types to `hipDataTypeToHCCDataType` so that sparse matrix descriptors can be used with them. -* Half float mixed precision to `hipsparseAxpby` where X and Y use `float16` and the result and compute type use `float`. -* Half float mixed precision to `hipsparseSpVV` where X and Y use `float16` and the result and compute type use `float`. 
-* Half float mixed precision to `hipsparseSpMM` where A and B use `float16` and C and the compute type use `float`. -* Half float mixed precision to `hipsparseSDDMM` where A and B use `float16` and C and the compute type use `float`. -* Half float uniform precision to the `hipsparseScatter` and `hipsparseGather` routines. -* Half float uniform precision to the `hipsparseSDDMM` routine. -* `int8` precision to the `hipsparseCsr2cscEx2` routine. -* The `almalinux` operating system name to correct the GFortran dependency. - -#### Changed - -* Switched to defaulting to C++17 when building hipSPARSE from source. Previously hipSPARSE was using C++14 by default. - -#### Resolved issues - -* Fixed a compilation [issue](https://github.com/ROCm/hipSPARSE/issues/555) related to using `std::filesystem` and C++14. -* Fixed an issue where the clients-common package was empty by moving the `hipsparse_clientmatrices.cmake` and `hipsparse_mtx2csr` files to it. - -#### Known issues - -* In `hipsparseSpSM_solve()`, the external buffer is passed as a parameter. This does not match the CUDA cuSPARSE API. This extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and nullptr passed in because it is unused internally. - -### **hipSPARSELt** (0.2.4) - -#### Added - -* Support for the LLVM target gfx950. -* Support for the following data type combinations for the LLVM target gfx950: - * `FP8`(E4M3) inputs, `F32` output, and `F32` Matrix Core accumulation. - * `BF8`(E5M2) inputs, `F32` output, and `F32` Matrix Core accumulation. -* Support for ROC-TX if `HIPSPARSELT_ENABLE_MARKER=1` is set. -* Support for the cuSPARSELt v0.6.3 backend. - -#### Removed - -* Support for LLVM targets gfx940 and gfx941 has been removed. -* `hipsparseLtDatatype_t` has been removed. - -#### Optimized - -* Improved the library loading time. -* Provided more kernels for the `FP16` data type. - -### **hipTensor** (2.0.0) - -#### Added - -* Element-wise binary operation support. -* Element-wise trinary operation support. -* Support for GPU target gfx950. -* Dynamic unary and binary operator support for element-wise operations and permutation. -* CMake check for `f8` datatype availability. -* `hiptensorDestroyOperationDescriptor` to free all resources related to the provided descriptor. -* `hiptensorOperationDescriptorSetAttribute` to set attribute of a `hiptensorOperationDescriptor_t` object. -* `hiptensorOperationDescriptorGetAttribute` to retrieve an attribute of the provided `hiptensorOperationDescriptor_t` object. -* `hiptensorCreatePlanPreference` to allocate the `hiptensorPlanPreference_t` and enabled users to limit the applicable kernels for a given plan or operation. -* `hiptensorDestroyPlanPreference` to free all resources related to the provided preference. -* `hiptensorPlanPreferenceSetAttribute` to set attribute of a `hiptensorPlanPreference_t` object. -* `hiptensorPlanGetAttribute` to retrieve information about an already-created plan. -* `hiptensorEstimateWorkspaceSize` to determine the required workspace size for the given operation. -* `hiptensorCreatePlan` to allocate a `hiptensorPlan_t` object, select an appropriate kernel for a given operation and prepare a plan that encodes the execution. -* `hiptensorDestroyPlan` to free all resources related to the provided plan. - -#### Changed - -* Removed architecture support for gfx940 and gfx941. -* Generalized opaque buffer for any descriptor. 
-* Replaced `hipDataType` with `hiptensorDataType_t` for all supported types, for example, `HIP_R_32F` to `HIPTENSOR_R_32F`. -* Replaced `hiptensorComputeType_t` with `hiptensorComputeDescriptor_t` for all supported types. -* Replaced `hiptensorInitTensorDescriptor` with `hiptensorCreateTensorDescriptor`. -* Changed handle type and API usage from `*handle` to `handle`. -* Replaced `hiptensorContractionDescriptor_t` with `hipTensorOperationDescriptor_t`. -* Replaced `hiptensorInitContractionDescriptor` with `hiptensorCreateContraction`. -* Replaced `hiptensorContractionFind_t` with `hiptensorPlanPreference_t`. -* Replaced `hiptensorInitContractionFind` with `hiptensorCreatePlanPreference`. -* Replaced `hiptensorContractionGetWorkspaceSize` with `hiptensorEstimateWorkspaceSize`. -* Replaced `HIPTENSOR_WORKSPACE_RECOMMENDED` with `HIPTENSOR_WORKSPACE_DEFAULT`. -* Replaced `hiptensorContractionPlan_t` with `hiptensorPlan_t`. -* Replaced `hiptensorInitContractionPlan` with `hiptensorCreatePlan`. -* Replaced `hiptensorContraction` with `hiptensorContract`. -* Replaced `hiptensorPermutation` with `hiptensorPermute`. -* Replaced `hiptensorReduction` with `hiptensorReduce`. -* Replaced `hiptensorElementwiseBinary` with `hiptensorElementwiseBinaryExecute`. -* Replaced `hiptensorElementwiseTrinary` with `hiptensorElementwiseTrinaryExecute`. -* Removed function `hiptensorReductionGetWorkspaceSize`. - -### **llvm-project** (20.0.0) - -#### Added - -* The compiler `-gsplit-dwarf` option to enable the generation of separate debug information file at compile time. When used, separate debug information files are generated for host and for each offload architecture. For additional information, see [DebugFission](https://gcc.gnu.org/wiki/DebugFission). -* `llvm-flang`, AMD's next-generation Fortran compiler. It's a re-implementation of the Fortran frontend that can be found at `llvm/llvm-project/flang` on GitHub. -* Comgr support for an in-memory virtual file system (VFS) for storing temporary files generated during intermediate compilation steps to improve performance in the device library link step. -* Compiler support of a new target-specific builtin `__builtin_amdgcn_processor_is` for late or deferred queries of the current target processor, and `__builtin_amdgcn_is_invocable` to determine the current target processor ability to invoke a particular builtin. -* HIPIFY support for CUDA 12.9.1 APIs. Added support for all new device and host APIs, including FP4, FP6, and FP128, and support for the corresponding ROCm HIP equivalents. - -#### Changed - -* Updated clang/llvm to AMD clang version 20.0.0 (equivalent to LLVM 20.0.0 with additional out-of-tree patches). -* HIPCC Perl scripts (`hipcc.pl` and `hipconfig.pl`) have been removed from this release. - -#### Optimized - -* Improved compiler memory load and store instructions. - -#### Upcoming changes - -* `__AMDGCN_WAVEFRONT_SIZE__` macro and HIP’s `warpSize` variable as `constexpr` are deprecated and will be disabled in a future release. Users are encouraged to update their code if needed to ensure future compatibility. For more information, see [AMDGCN_WAVEFRONT_SIZE deprecation](#amdgpu-wavefront-size-compiler-macro-deprecation). -* The `roc-obj-ls` and `roc-obj-extract` tools are deprecated. To extract all Clang offload bundles into separate code objects use `llvm-objdump --offloading `. For more information, see [Changes to ROCm Object Tooling](#changes-to-rocm-object-tooling). 
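As a small illustration of the `warpSize`/`__AMDGCN_WAVEFRONT_SIZE__` deprecations noted above, the hedged HIP sketch below avoids compile-time wavefront-size assumptions: it queries the wavefront size at runtime on the host and treats `warpSize` as a runtime value in device code. Error checking is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Kernel that treats warpSize as a runtime value (it is no longer constexpr),
// here just to compute a lane ID that works on both wave32 and wave64 devices.
__global__ void laneIds(int* out) {
    int lane = threadIdx.x % warpSize;
    out[threadIdx.x] = lane;
}

int main() {
    // Host-side alternative to the deprecated __AMDGCN_WAVEFRONT_SIZE__ macro:
    // query the warp (wavefront) size of the active device at runtime.
    int device = 0, warpSizeHost = 0;
    hipGetDevice(&device);
    hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, device);
    std::printf("wavefront size: %d\n", warpSizeHost);

    int* out = nullptr;
    hipMalloc(&out, 64 * sizeof(int));
    laneIds<<<1, 64>>>(out);
    hipDeviceSynchronize();
    hipFree(out);
    return 0;
}
```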
- -### **MIGraphX** (2.13.0) - -#### Added - -* Support for OCP `FP8` on AMD Instinct MI350X GPUs. -* Support for PyTorch 2.7 via Torch-MIGraphX. -* Support for the Microsoft ONNX Contrib Operators (Self) Attention, RotaryEmbedding, QuickGelu, BiasAdd, BiasSplitGelu, SkipLayerNorm. -* Support for Sigmoid and AddN TensorFlow operators. -* GroupQuery Attention support for LLMs. -* Support for edge mode in the ONNX Pad operator. -* ONNX runtime Python driver. -* FLUX e2e example. -* C++ and Python APIs to save arguments to a graph as a msgpack file, and then read the file back. -* rocMLIR fusion for kv-cache attention. -* Introduced a check for file-write errors. - -#### Changed - -* `quantize_bf16` for quantizing the model to `BF16` has been made visible in the MIGraphX user API. -* Print additional kernel/module information in the event of compile failure. -* Use hipBLASLt instead of rocBLAS on newer GPUs. -* 1x1 convolutions are now rewritten to GEMMs. -* `BF16::max` is now represented by its encoding rather than its expected value. -* Direct warnings now go to `cout` rather than `cerr`. -* `FP8` uses hipBLASLt rather than rocBLAS. -* ONNX models are now topologically sorted when nodes are unordered. -* Improved layout of Graphviz output. -* Enhanced debugging for migraphx-driver: consumed environment variables are printed, and timestamps and duration are added to the summary. -* Added a trim size flag to the verify option for migraphx-driver. -* Node names are printed to track parsing within the ONNX graph when using the `MIGRAPHX_TRACE_ONNX_PARSER` flag. -* Updated the accuracy checker to output test data with the `--show-test-data` flag. -* The `MIGRAPHX_TRACE_BENCHMARKING` option now allows the problem cache file to be updated after finding the best solution. - -#### Removed - -* `ROCM_USE_FLOAT8` macro. -* The `BF16` GEMM test was removed for Navi21, as it is unsupported by rocBLAS and hipBLASLt on that platform. - -#### Optimized - -* Use a common average in `compile_ops` to reduce run-to-run variations when tuning. -* Improved the performance of the TopK operator. -* Conform to a single layout (NHWC or NCHW) during compilation rather than combining the two. -* Slice Channels Conv optimization (slice output fusion). -* Horizontal fusion optimization after pointwise operations. -* Reduced the number of literals used in the `GridSample` linear sampler. -* Fuse multiple outputs for pointwise operations. -* Fuse reshapes on pointwise inputs for MLIR output fusion. -* The MUL operation is not folded into the GEMM when the GEMM is used more than once. -* Broadcast is not fused after convolution or GEMM MLIR kernels. -* Avoid reduction fusion when operator data types mismatch. - -#### Resolved issues - -* Worked around a compilation ICE in clang 20 when using `views::transform`. -* Fixed a bug with `reshape_lazy` in MLIR. -* Fixed Quantizelinear for the Nearbyint operation. -* Check for empty strings in ONNX node inputs for operations like Resize. -* Parse Resize fix: only check the `keep_aspect_ratio_policy` attribute for the sizes input. -* Nonmaxsuppression: fixed an issue where identical boxes/scores were not ordered correctly. -* Fixed a bug where events were created on the wrong device in a multi-GPU scenario. -* Fixed out-of-order keys in values for comparisons and hashes when caching best kernels. -* Fixed a Controlnet "MUL types do not match" error. -* Fixed the check for scales when the ROI input is present in the Resize operation. -* Einsum: Fixed a crash on empty squeeze operations. - -### **MIOpen** (3.5.0) - -#### Added - -* [Conv] Misa kernels for gfx950.
-* [Conv] Enabled Split-K support for CK backward data solvers (2D). -* [Conv] Enabled CK wrw solver on gfx950 for the `BF16` data type. -* [BatchNorm] Enabled NHWC in OpenCL. -* Grouped convolution + activation fusion. -* Grouped convolution + bias + activation fusion. -* Composable Kernel (CK) can now be built inline as part of MIOpen. - -#### Changed - -* Changed to using the median value with outliers removed when deciding on the best solution to run. -* [Conv] Updated the igemm asm solver. - -#### Optimized - -* [BatchNorm] Optimized NHWC OpenCL kernels and improved heuristics. -* [RNN] Dynamic algorithm optimization. -* [Conv] Eliminated redundant clearing of output buffers. -* [RNN] Updated selection heuristics. -* Updated tuning for the AMD Instinct MI300 Series. - -#### Resolved issues - -* Fixed a segmentation fault when the user specified a smaller workspace than what was required. -* Fixed a layout calculation logic error that returned incorrect results and enabled less restrictive layout selection. -* Fixed memory access faults in misa kernels due to out-of-bounds memory usage. -* Fixed a performance drop on the gfx950 due to transpose kernel use. -* Fixed a memory access fault caused by not allocating enough workspace. -* Fixed a name typo that caused kernel mismatches and long startup times. - -### **MIVisionX** (3.3.0) - -#### Added - -* Support to enable/disable BatchPD code in VX_RPP extensions by checking the RPP_LEGACY_SUPPORT flag. - -#### Changed - -* VX_RPP extension: Version 3.1.0 release. -* Update the parameters and kernel API of Blur, Fog, Jitter, LensCorrection, Rain, Pixelate, Vignette and ResizeCrop wrt tensor kernels replacing the legacy BatchPD API calls in VX_RPP extensions. - -#### Known issues - -* Installation on RHEL and SLES requires the manual installation of the `FFMPEG` and `OpenCV` dev packages. - -#### Upcoming changes - -* Optimized audio augmentations support for VX_RPP. - -### **RCCL** (2.26.6) - -#### Added - -* Support for the extended fine-grained system memory pool. -* Support for gfx950. -* Support for `unroll=1` in device-code generation to improve performance. -* Set a default of 112 channels for a single node with `8 * gfx950`. -* Enabled LL128 protocol on the gfx950. -* The ability to choose the unroll factor at runtime using `RCCL_UNROLL_FACTOR`. This can be set at runtime to 1, 2, or 4. This change currently increases compilation and linking time because it triples the number of kernels generated. -* Added MSCCL support for AllGather multinode on the gfx942 and gfx950 (for instance, 16 and 32 GPUs). To enable this feature, set the environment variable `RCCL_MSCCL_FORCE_ENABLE=1`. The maximum message size for MSCCL AllGather usage is `12292 * sizeof(datatype) * nGPUs`. -* Thread thresholds for LL/LL128 are selected in Tuning Models for the AMD Instinct MI300X. This impacts the number of channels used for AllGather and ReduceScatter. The channel tuning model is bypassed if `NCCL_THREAD_THRESHOLDS`, `NCCL_MIN_NCHANNELS`, or `NCCL_MAX_NCHANNELS` are set. -* Multi-node tuning for AllGather, AllReduce, and ReduceScatter that leverages LL/LL64/LL128 protocols to use nontemporal vector load/store for tunable message size ranges. -* LL/LL128 usage ranges for AllReduce, AllGather, and ReduceScatter are part of the tuning models, which enable architecture-specific tuning in conjunction with the existing Rome Models scheme in RCCL. -* Two new APIs are exposed as part of an initiative to separate RCCL code. 
These APIs are `rcclGetAlgoInfo` and `rcclFuncMaxSendRecvCount`. However, user-level invocation requires that RCCL be built with `RCCL_EXPOSE_STATIC` enabled. - -#### Changed - -* Compatibility with NCCL 2.23.4. -* Compatibility with NCCL 2.24.3. -* Compatibility with NCCL 2.25.1. -* Compatibility with NCCL 2.26.6. - -#### Resolved issues - -* Resolved an issue when using more than 64 channels when multiple collectives are used in the same `ncclGroup()` call. -* Fixed unit test failures in tests ending with the `ManagedMem` and `ManagedMemGraph` suffixes. -* Fixed a suboptimal algorithmic switching point for AllReduce on the AMD Instinct MI300X. -* Fixed the known issue "When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault" with a design change to use `comm` instead of `rank` for `mscclStatus`. The global map for `comm` to `mscclStatus` is still not thread safe but should be explicitly handled by mutexes for read-write operations. This is tested for correctness, but there is a plan to use a thread-safe map data structure in an upcoming release. - -### **rocAL** (2.3.0) - -#### Added -* Extended support to rocAL's video decoder to use rocDecode hardware decoder. -* Setup - installs rocdecode dev packages for Ubuntu, RedHat, and SLES. -* Setup - installs turbojpeg dev package for Ubuntu and Redhat. -* rocAL's image decoder has been extended to support the rocJPEG hardware decoder. -* Numpy reader support for reading npy files in rocAL. -* Test case for numpy reader in C++ and python tests. - -#### Resolved issues -* `TurboJPEG` no longer needs to be installed manually. It is now installed by the package installer. -* Hardware decode no longer requires that ROCm be installed with the `graphics` usecase. - -#### Known issues -* Package installation on SLES requires manually installing `TurboJPEG`. -* Package installation on RHEL and SLES requires manually installing the `FFMPEG Dev` package. - -#### Upcoming changes - -* rocJPEG support for JPEG decode. - -### **rocALUTION** (4.0.0) - -#### Added - -* Support for gfx950. - -#### Changed - -* Switch to defaulting to C++17 when building rocALUTION from source. Previously rocALUTION was using C++14 by default. - -#### Optimized - -* Improved the user documentation. - -#### Resolved issues - -* Fix for GPU hashing algorithm when not compiling with -O2/O3. - -### **rocBLAS** (5.0.0) - -#### Added - -* Support for gfx950. -* Internal API logging for `gemm` debugging using `ROCBLAS_LAYER = 8`. -* Support for the AOCL 5.0 gcc build as a client reference library. -* The use of `PkgConfig` for client reference library fallback detection. - -#### Changed - -* `CMAKE_CXX_COMPILER` is now passed on during compilation for a Tensile build. -* The default atomics mode is changed from `allowed` to `not allowed`. - -#### Removed - -* Support code for non-production gfx targets. -* `rocblas_hgemm_kernel_name`, `rocblas_sgemm_kernel_name`, and `rocblas_dgemm_kernel_name` API functions. -* The use of `warpSize` as a constexpr. -* The use of deprecated behavior of `hipPeekLastError`. -* `rocblas_float8.h` and `rocblas_hip_f8_impl.h` files. -* `rocblas_gemm_ex3`, `rocblas_gemm_batched_ex3`, and `rocblas_gemm_strided_batched_ex3` API functions. - -#### Optimized - -* Optimized `gemm` by using `gemv` kernels when applicable. -* Optimized `gemv` for small `m` and `n` with a large batch count on gfx942. 
-* Improved the performance of Level 1 `dot` for all precisions and variants when `N > 100000000` on gfx942.
-* Improved the performance of Level 1 `asum` and `nrm2` for all precisions and variants on gfx942.
-* Improved the performance of Level 2 `sger` (single precision) on gfx942.
-* Improved the performance of Level 3 `dgmm` for all precisions and variants on gfx942.
-
-#### Resolved issues
-
-* Fixed environment variable path-based logging to append multiple handle outputs to the same file.
-* Support numerics when `trsm` is running with `rocblas_status_perf_degraded`.
-* Fixed the build dependency installation of `joblib` on some operating systems.
-* Return `rocblas_status_internal_error` when `rocblas_[set,get]_[matrix,vector]` is called with a host pointer in place of a device pointer.
-* Reduced the default verbosity level for internal GEMM backend information.
-* Updated from the deprecated rocm-cmake to ROCmCMakeBuildTools.
-* Corrected AlmaLinux GFortran package dependencies.
-
-#### Upcoming changes
-
-* Deprecated the use of negative indices to indicate the default solution is being used for `gemm_ex` with `rocblas_gemm_algo_solution_index`.
-
-### **ROCdbgapi** (0.77.3)
-
-#### Added
-* Support for the `gfx950` architectures.
-
-#### Removed
-* Support for the `gfx940` and `gfx941` architectures.
-
-### **rocDecode** (1.0.0)
-
-#### Added
-
-* VP9 IVF container file parsing support in the bitstream reader.
-* CTest for VP9 decode on the bitstream reader.
-* HEVC/AVC/AV1/VP9 stream syntax error handling.
-* HEVC stream bit depth change handling and DPB buffer size change handling through decoder reconfiguration.
-* AVC stream DPB buffer size change handling through decoder reconfiguration.
-* A new avcodec-based decoder built as a separate `rocdecode-host` library.
-
-#### Changed
-
-* rocDecode now uses the CMake `CMAKE_PREFIX_PATH` directive.
-* Changed asserts in query API calls in the RocVideoDecoder utility class to error reports, so that a query error no longer causes a hard stop and the caller can decide how to proceed.
-* `libdrm_amdgpu` is now explicitly linked with rocdecode.
-
-#### Removed
-
-* `GetStream()` interface call from the RocVideoDecoder utility class.
-
-#### Optimized
-
-* Reduced decode session start latency.
-* Optimized bitstream type detection in the bitstream reader.
-
-#### Resolved issues
-
-* Fixed a bug in the `videoDecodePicFiles` picture files sample that can result in an incorrect output frame count.
-* Fixed a decoded frame output issue in video size change cases.
-* Removed incorrect asserts of `bitdepth_minus_8` in `GetBitDepth()` and `num_chroma_planes` in `GetNumChromaPlanes()` API calls in the RocVideoDecoder utility class.
-
-### **rocFFT** (1.0.34)
-
-#### Added
-
-* Support for gfx950.
-
-#### Removed
-
-* Removed ``rocfft-rider`` legacy compatibility from clients.
-* Removed support for the gfx940 and gfx941 targets from the client programs.
-* Removed backward compatibility symlink for include directories.
-
-#### Optimized
-
-* Removed unnecessary HIP event/stream allocation and synchronization during MPI transforms.
-* Implemented single-precision 1D kernels for the following lengths (see the usage sketch at the end of this section):
-  - 4704
-  - 5488
-  - 6144
-  - 6561
-  - 8192
-* Implemented single-kernel plans for some large 1D problem sizes, on devices with at least 160KiB of LDS.
-
-#### Resolved issues
-
-* Fixed kernel faults on multi-device transforms that gather to a single device, when the input/output bricks are not contiguous.
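-
-The newly optimized single-precision 1D lengths can be exercised through the standard rocFFT plan API. The following is a minimal, illustrative sketch only, with error checking omitted; the header path and the choice of length 8192 are assumptions based on the list above rather than part of this release's documentation.
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <rocfft/rocfft.h> // may be plain <rocfft.h> on older include layouts
-
-int main()
-{
-    rocfft_setup();
-
-    // One of the newly optimized single-precision 1D lengths listed above.
-    const size_t length = 8192;
-
-    // In-place complex buffer on the device (signal initialization omitted).
-    float2* buffer = nullptr;
-    hipMalloc(&buffer, length * sizeof(float2));
-
-    rocfft_plan plan = nullptr;
-    rocfft_plan_create(&plan,
-                       rocfft_placement_inplace,
-                       rocfft_transform_type_complex_forward,
-                       rocfft_precision_single,
-                       1,        // rank
-                       &length,  // lengths
-                       1,        // number of transforms
-                       nullptr); // default plan description
-
-    void* buffers[] = {buffer};
-    rocfft_execute(plan, buffers, nullptr, nullptr); // in-place execution
-    hipDeviceSynchronize();
-
-    rocfft_plan_destroy(plan);
-    hipFree(buffer);
-    rocfft_cleanup();
-    return 0;
-}
-```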
- -### **ROCgdb** (16.3) - -#### Added - -- Support for the `gfx950` architectures. - -#### Removed - -- Support for the `gfx940` and `gfx941` architectures. - -### **rocJPEG** (1.1.0) - -#### Added -* cmake config files. -* CTEST - New tests were introduced for JPEG batch decoding using various output formats, such as yuv_planar, y, rgb, and rgb_planar, both with and without region-of-interest (ROI). - -#### Changed -* Readme - cleanup and updates to pre-reqs. -* The `decode_params` argument of the `rocJpegDecodeBatched` API is now an array of `RocJpegDecodeParams` structs representing the decode parameters for the batch of JPEG images. -* `libdrm_amdgpu` is now explicitly linked with rocjpeg. - -#### Removed -* Dev Package - No longer installs pkg-config. - -#### Resolved issues -* Fixed a bug that prevented copying the decoded image into the output buffer when the output buffer is larger than the input image. -* Resolved an issue with resizing the internal memory pool by utilizing the explicit constructor of the vector's type during the resizing process. -* Addressed and resolved CMake configuration warnings. - -### **ROCm Bandwidth Test** (2.6.0) - -#### Added - -* Plugin architecture: - * `rocm_bandwidth_test` is now the `framework` for individual `plugins` and features. The `framework` is available at: `/opt/rocm/bin/` - - * Individual `plugins`: The `plugins` (shared libraries) are available at: `/opt/rocm/lib/rocm_bandwidth_test/plugins/` - -```{note} -Review the [README](https://github.com/ROCm/rocm_bandwidth_test/blob/amd-mainline/README.md) file for details about the new options and outputs. -``` - -#### Changed - -* The `CLI` and options/parameters have changed due to the new plugin architecture, where the plugin parameters are parsed by the plugin. - -#### Removed - -- The old CLI, parameters, and switches. - -### **ROCm Compute Profiler** (3.2.3) - -#### Added - -##### CDNA4 (AMD Instinct MI350/MI355) support - -* Support for AMD Instinct MI350 Series GPUs with the addition of the following counters: - * VALU co-issue (Two VALUs are issued instructions) efficiency - * Stream Processor Instruction (SPI) Wave Occupancy - * Scheduler-Pipe Wave Utilization - * Scheduler FIFO Full Rate - * CPC ADC Utilization - * F6F4 data type metrics - * Update formula for total FLOPs while taking into account F6F4 ops - * LDS STORE, LDS LOAD, LDS ATOMIC instruction count metrics - * LDS STORE, LDS LOAD, LDS ATOMIC bandwidth metrics - * LDS FIFO full rate - * Sequencer -> TA ADDR Stall rates - * Sequencer -> TA CMD Stall rates - * Sequencer -> TA DATA Stall rates - * L1 latencies - * L2 latencies - * L2 to EA stalls - * L2 to EA stalls per channel - -* Roofline support for AMD Instinct MI350 Series GPUs. - -##### Textual User Interface (TUI) (beta version) - -* Text User Interface (TUI) support for analyze mode - * A command line based user interface to support interactive single-run analysis. - * To launch, use `--tui` option in analyze mode. For example, ``rocprof-compute analyze --tui``. - -##### PC Sampling (beta version) - -* Stochastic (hardware-based) PC sampling has been enabled for AMD Instinct MI300X Series and later GPUs. - -* Host-trap PC Sampling has been enabled for AMD Instinct MI200 Series and later GPUs. - -* Support for sorting of PC sampling by type: offset or count. - -* PC Sampling Support on CLI and TUI analysis. - -##### Roofline - -* Support for Roofline plot on CLI (single run) analysis. 
-
-* `FP4` and `FP6` data types have been added for roofline profiling on AMD Instinct MI350 Series.
-
-##### rocprofv3 support
-
-* ``rocprofv3`` is supported as the default backend for profiling.
-* Support for obtaining performance information for all channels of TCC counters.
-* Support for profiling on AMD Instinct MI100 using ``rocprofv3``.
-* Deprecation warning for the ``rocprofv3`` interface in favor of the ROCprofiler-SDK interface, which directly accesses the ``rocprofv3`` C++ tool.
-
-##### Others
-
-* Docker files to package the application and dependencies into a single portable and executable standalone binary file.
-
-* Analysis report based filtering
-  * The ``-b`` option in profile mode now also accepts metric id(s) for analysis report based filtering.
-  * The ``-b`` option in profile mode also accepts hardware IP block for filtering; however, this filter support will be deprecated soon.
-  * The ``--list-metrics`` option was added in profile mode to list possible metric id(s), similar to analyze mode.
-
-* Support for the MEM chart on the CLI (single run).
-
-* ``--specs-correction`` option to provide missing system specifications for analysis.
-
-#### Changed
-
-* Changed the default ``rocprof`` version to ``rocprofv3``. This is used when the environment variable ``ROCPROF`` is not set.
-* Changed the ``normal_unit`` default to ``per_kernel``.
-* Decreased profiling time by not collecting unused counters in post-analysis.
-* Updated Dash to >=3.0.0 (for the web UI).
-* Changed the condition under which Roofline PDFs are generated during general profiling and ``--roof-only`` profiling (skipped only when the ``--no-roof`` option is present).
-* Updated Roofline binaries:
-  * Rebuilt using the latest ROCm stack.
-  * The minimum supported OS distributions for the roofline feature are now Ubuntu 22.04, RHEL 8, and SLES15 SP6.
-
-#### Removed
-
-* Roofline support for Ubuntu 20.04 and SLES versions below 15.6.
-* Removed support for AMD Instinct MI50 and MI60.
-
-#### Optimized
-
-* The ROCm Compute Profiler CLI has been improved to better display the GPU architecture analytics.
-
-#### Resolved issues
-
-* Fixed kernel name and kernel dispatch filtering when using ``rocprofv3``.
-* Fixed an issue with TCC channel counter collection in ``rocprofv3``.
-* Fixed peak FLOPS of `F8`, `I8`, `F16`, and `BF16` on AMD Instinct MI300.
-* Fixed an issue where the memory clock was not detected when using ``amd-smi``.
-* Fixed standalone GUI crashing.
-* Fixed L2 read/write/atomic bandwidths on AMD Instinct MI350 Series.
-
-#### Known issues
-
-* On AMD Instinct MI100, accumulation counters are not collected, resulting in the following metrics failing to show up in the analysis: Instruction Fetch Latency, Wavefront Occupancy, and LDS Latency.
-  * As a workaround, use the environment variable ``ROCPROF=rocprof`` to use ``rocprof v1`` for profiling on AMD Instinct MI100.
-
-* GPU id filtering is not supported when using ``rocprofv3``.
-
-* Analysis of previously collected workload data will not work due to a sysinfo.csv schema change.
-  * As a workaround, re-run the profiling operation for the workload and interrupt the process after 10 seconds, then copy the ``sysinfo.csv`` file from the new data folder to the old one. This assumes your system specification hasn't changed since the previous workload data was created.
-
-* Analysis of new workloads might require providing the shader/memory clock speed using the ``--specs-correction`` option if amd-smi or rocminfo does not provide clock speeds.
-
-* The memory chart on the ROCm Compute Profiler CLI might look corrupted if the CLI width is too narrow.
-
-* The Roofline feature is currently not functional on Azure Linux 3.0 and Debian 12.
-
-#### Upcoming changes
-
-* ``rocprof v1/v2/v3`` interfaces will be removed in favor of the ROCprofiler-SDK interface, which directly accesses the ``rocprofv3`` C++ tool. Using the ``rocprof v1/v2/v3`` interfaces will trigger a deprecation warning.
-  * To use the ROCprofiler-SDK interface, set the environment variable `ROCPROF=rocprofiler-sdk` and optionally pass the profile mode option ``--rocprofiler-sdk-library-path /path/to/librocprofiler-sdk.so`` to choose the ROCprofiler-SDK library to be used.
-* Hardware IP block based filtering using the ``-b`` option in profile mode will be removed in favor of analysis report block based filtering using the same option.
-* MongoDB database support will be removed, and a deprecation warning has been added to the application interface.
-* Usage of ``rocm-smi`` is deprecated in favor of ``amd-smi``, and a deprecation warning has been added to the application interface.
-
-### **ROCm Data Center Tool** (1.1.0)
-
-#### Added
-
-* More profiling and monitoring metrics, especially for AMD Instinct MI300 and newer GPUs.
-* Advanced logging and debugging options, including new log levels and troubleshooting guidance.
-
-#### Changed
-
-* Completed migration from the legacy [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) to [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/).
-* Reorganized the configuration files internally and improved the [README/installation](https://github.com/ROCm/rdc/blob/release/rocm-rel-7.0/README.md) instructions.
-* Updated metrics and monitoring support for the latest AMD data center GPUs.
-
-#### Optimized
-
-- Integration with [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/) for performance metrics collection.
-- Standalone and embedded operating modes, including streamlined authentication and configuration options.
-- Support and documentation for diagnostic commands and GPU group management.
-- [RVS](https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/latest/) test integration and reporting.
-
-### **ROCm SMI** (7.8.0)
-
-#### Added
-
-- Support for GPU metrics 1.8.
-  - Added new fields for `rsmi_gpu_metrics_t`, including:
-    - The following metrics to allow new violation status calculations:
-      - Per XCP metrics `gfx_below_host_limit_ppt_acc[XCP][MAX_XCC]` - GFX Clock Host limit Package Power Tracking violation counts.
-      - Per XCP metrics `gfx_below_host_limit_thm_acc[XCP][MAX_XCC]` - GFX Clock Host limit Thermal (TVIOL) violation counts.
-      - Per XCP metrics `gfx_low_utilization_acc[XCP][MAX_XCC]` - violation counts for how long low utilization caused the GPU to run below application clocks.
-      - Per XCP metrics `gfx_below_host_limit_total_acc[XCP][MAX_XCC]` - violation counts for how long the GPU was held below application clocks by any limiter (see the new violation metrics above).
-    - Increased the available JPEG engines to 40. Current ASICs may not support all 40; unsupported engines are indicated as UINT16_MAX or N/A in the CLI.
-
-#### Removed
-
-- Removed backwards compatibility for `rsmi_dev_gpu_metrics_info_get()`'s `jpeg_activity` and `vcn_activity` fields. Use `xcp_stats.jpeg_busy` and `xcp_stats.vcn_busy` instead.
- - Backwards compatibility is removed for `jpeg_activity` and `vcn_activity` fields, if the `jpeg_busy` or `vcn_busy` field is available. - - Providing both `vcn_activity`/`jpeg_activity` and XCP (partition) stats `vcn_busy`/`jpeg_busy` caused confusion for users about which field to use. By removing backward compatibility, it is easier to identify the relevant field. - - The `jpeg_busy` field increased in size (for supported ASICs), making backward compatibility unable to fully copy the structure into `jpeg_activity`. - -```{note} -See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-7.0/CHANGELOG.md) for details, examples, and in-depth descriptions. -``` - -### **ROCm Systems Profiler** (1.1.0) - -#### Added - -- Profiling and metric collection capabilities for VCN engine activity, JPEG engine activity, and API tracing for rocDecode, rocJPEG, and VA-APIs. -- How-to document for VCN and JPEG activity sampling and tracing. -- Support for tracing Fortran applications. -- Support for tracing MPI API in Fortran. - -#### Changed - -- Replaced ROCm SMI backend with AMD SMI backend for collecting GPU metrics. -- ROCprofiler-SDK is now used to trace RCCL API and collect communication counters. - - Use the setting `ROCPROFSYS_USE_RCCLP = ON` to enable profiling and tracing of RCCL application data. -- Updated the Dyninst submodule to v13.0. -- Set the default value of `ROCPROFSYS_SAMPLING_CPUS` to `none`. - -#### Resolved issues - -- Fixed GPU metric collection settings with `ROCPROFSYS_AMD_SMI_METRICS`. -- Fixed a build issue with CMake 4. -- Fixed incorrect kernel names shown for kernel dispatch tracks in Perfetto. -- Fixed formatting of some output logs. - -### **ROCm Validation Suite** (1.2.0) - -#### Added - -- Support for AMD Instinct MI350X and MI355X GPUs. -- Introduced rotating buffer mechanism for GEMM operations. -- Support for read and write tests in Babel. -- Support for AMD Radeon RX9070 and RX9070GRE graphics cards. - -#### Changed - -- Migrated SMI API usage from `rocm-smi` to `amd-smi`. -- Updated `FP8` GEMM operations to use hipBLASLt instead of rocBLAS. - -### **rocPRIM** (4.0.0) - -#### Added - -* Support for gfx950. -* `rocprim::accumulator_t` to ensure parity with CCCL. -* Test for `rocprim::accumulator_t`. -* `rocprim::invoke_result_r` to ensure parity with CCCL. -* Function `is_build_in` into `rocprim::traits::get`. -* Virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature. -* Initial value support to device level inclusive scans. -* New optimization to the backend for `device_transform` when the input and output are pointers. -* `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers. -* `rocprim:device_transform` for n-ary transform operations API with as input `n` number of iterators inside a `rocprim::tuple`. -* `rocprim::key_value_pair::operator==`. -* The `rocprim::unrolled_copy` thread function to copy multiple items inside a thread. -* The `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`. -* `rocprim::int128_t` and `rocprim::uint128_t` to benchmarks for improved performance evaluation on 128-bit integers. -* `rocprim::int128_t` to the supported autotuning types to improve performance for 128-bit integers. 
-* The `rocprim::merge_inplace` function for merging in-place. -* Initial value support for warp- and block-level inclusive scan. -* Support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the `WITH_ROCRAND=ON` build flag. -* Tests and documentation to `lookback_scan_state`. It is still in the `detail` namespace. - -#### Changed - -* Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits`, respectively. -* Marked the initialisation constructor of `rocprim::reverse_iterator` `explicit`, use `rocprim::make_reverse_iterator`. -* Merged `radix_key_codec` into type_traits system. -* Renamed `type_traits_interface.hpp` to `type_traits.hpp`, rename the original `type_traits.hpp` to `type_traits_functions.hpp`. -* The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change. -The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type. This is a complete list of affected functions and how their default accumulator types are changing: - - * `rocprim::inclusive_scan` - * Previous default: `class AccType = typename std::iterator_traits::value_type>` - * Current default: `class AccType = rocprim::accumulator_t::value_type>` - * `rocprim::deterministic_inclusive_scan` - * Previous default: `class AccType = typename std::iterator_traits::value_type>` - * Current default: `class AccType = rocprim::accumulator_t::value_type>` - * `rocprim::exclusive_scan` - * Previous default: `class AccType = detail::input_type_t>` - * Current default: `class AccType = rocprim::accumulator_t>` - * `rocprim::deterministic_exclusive_scan` - * Previous default: `class AccType = detail::input_type_t>` - * Current default: `class AccType = rocprim::accumulator_t>` -* Undeprecated internal `detail::raw_storage`. -* A new version of `rocprim::thread_load` and `rocprim::thread_store` replaces the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result. -* Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better. -* All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocPRIM::symbol`, letting the user link multiple libraries built with different versions of rocPRIM. - -#### Removed - -* `rocprim::detail::float_bit_mask` and relative tests, use `rocprim::traits::float_bit_mask` instead. -* `rocprim::traits::is_fundamental`, use `rocprim::traits::get::is_fundamental()` directly. -* The deprecated parameters `short_radix_bits` and `ShortRadixBits` from the `segmented_radix_sort` config. They were unused, it is only an API change. -* The deprecated `operator<<` from the iterators. -* The deprecated `TwiddleIn` and `TwiddleOut`. Use `radix_key_codec` instead. -* The deprecated flags API of `block_adjacent_difference`. Use `subtract_left()` or `block_discontinuity::flag_heads()` instead. -* The deprecated `to_exclusive` functions in the warp scans. -* The `rocprim::load_cs` from the `cache_load_modifier` enum. Use `rocprim::load_nontemporal` instead. 
-* The `rocprim::store_cs` from the `cache_store_modifier` enum. Use `rocprim::store_nontemporal` instead. -* The deprecated header file `rocprim/detail/match_result_type.hpp`. Include `rocprim/type_traits.hpp` instead. This header included: - * `rocprim::detail::invoke_result`. Use `rocprim::invoke_result` instead. - * `rocprim::detail::invoke_result_binary_op`. Use `rocprim::invoke_result_binary_op` instead. - * `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead. -* The deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead. -* Removed `rocprim/detail/radix_sort.hpp`, functionality can now be found in `rocprim/thread/radix_key_codec.hpp`. -* Removed C++14 support. Only C++17 is supported. -* Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed: - * `rocprim::device_warp_size()` - * For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory. - * For run-time constants, this is replaced with `rocprim::arch::wavefront::size().` - * `rocprim::warp_size()` - * Use `rocprim::host_warp_size()`, `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead. - * `ROCPRIM_WAVEFRONT_SIZE` - * Use `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead. - * `__AMDGCN_WAVEFRONT_SIZE` - * This was a fallback define for the compiler's removed symbol, having the same name. -* This release removes support for custom builds on gfx940 and gfx941. - -#### Optimized - -* Improved performance of `rocprim::device_select` and `rocprim::device_partition` when using multiple streams on the AMD Instinct MI300 Series. - -#### Resolved issues - -* Fixed an issue where `device_batch_memcpy` reported benchmarking throughput being 2x lower than it was in reality. -* Fixed an issue where `device_segmented_reduce` reported autotuning throughput being 5x lower than it was in reality. -* Fixed device radix sort not returning the correct required temporary storage when a double buffer contains `nullptr`. -* Fixed constness of equality operators (`==` and `!=`) in `rocprim::key_value_pair`. -* Fixed an issue for the comparison operators in `arg_index_iterator` and `texture_cache_iterator`, where `<` and `>` comparators were swapped. -* Fixed an issue for the `rocprim::thread_reduce` not working correctly with a prefix value. - -#### Known issues - -* When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x. However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs. - -#### Upcoming changes - -* `rocprim::invoke_result_binary_op` and `rocprim::invoke_result_binary_op_t` are deprecated. Use `rocprim::accumulator_t` instead. - -### **ROCprofiler-SDK** (1.0.0) - -#### Added - -- Support for [rocJPEG](https://rocm.docs.amd.com/projects/rocJPEG/en/latest/index.html) API Tracing. -- Support for AMD Instinct MI350X and MI355X GPUs. -- `rocprofiler_create_counter` to facilitate adding custom derived counters at runtime. -- Support in `rocprofv3` for iteration based counter multiplexing. -- Perfetto support for counter collection. 
-- Support for negating `rocprofv3` tracing options when using aggregate options such as `--sys-trace --hsa-trace=no`.
-- `--agent-index` option in `rocprofv3` to specify the agent naming convention in the output:
-  - absolute == node_id
-  - relative == logical_node_id
-  - type-relative == logical_node_type_id
-- MI300 and MI350 stochastic (hardware-based) PC sampling support in ROCprofiler-SDK and `rocprofv3`.
-- Python bindings for `rocprofiler-sdk-roctx`.
-- SQLite3 output support for `rocprofv3` using `--output-format rocpd`.
-- `rocprofiler-sdk-rocpd` package:
-  - Public API in `include/rocprofiler-sdk-rocpd/rocpd.h`.
-  - Library implementation in `librocprofiler-sdk-rocpd.so`.
-  - Support for `find_package(rocprofiler-sdk-rocpd)`.
-  - `rocprofiler-sdk-rocpd` DEB and RPM packages.
-- `--version` option in `rocprofv3`.
-- `rocpd` Python package.
-- Thread trace as an experimental API.
-- ROCprof Trace Decoder as an experimental API:
-  - Requires the [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder).
-- Thread trace option in the `rocprofv3` tool under the `--att` parameters:
-  - See [using thread trace with rocprofv3](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-7.0.0/how-to/using-thread-trace.html).
-  - Requires the [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder).
-- `rocpd` output format documentation:
-  - Requires the [ROCprof Trace Decoder plugin](https://github.com/rocm/rocprof-trace-decoder).
-- Perfetto support for scratch memory.
-- Support in the `rocprofv3` avail tool for command-line arguments.
-- Documentation for `rocprofv3` advanced options.
-- AQLprofile is now available as open source.
-
-#### Changed
-
-- SDK to not create a background thread when every tool returns a nullptr from `rocprofiler_configure` (see the sketch at the end of this section).
-- `vaddr-to-file-offset` mapping in `disassembly.hpp` to use the dedicated comgr API.
-- `rocprofiler_uuid_t` ABI to hold a 128-bit value.
-- `rocprofv3` shorthand argument for `--collection-period` to `-P` (upper-case) while `-p` (lower-case) is reserved for later use.
-- Default output format for `rocprofv3` to `rocpd` (SQLite3 database).
-- `rocprofv3` avail tool renamed from `rocprofv3_avail` to `rocprofv3-avail`.
-- `rocprofv3` tool to facilitate thread trace and PC sampling on the same agent.
-
-#### Removed
-
-- Support for compilation of gfx940 and gfx941 targets.
-
-#### Resolved issues
-
-- Fixed missing callbacks around internal thread creation within the counter collection service.
-- Fixed a potential data race in the ROCprofiler-SDK double buffering scheme.
-- Fixed usage of std::regex in the core ROCprofiler-SDK library that caused segfaults or exceptions when used under dual ABI.
-- Fixed Perfetto counter collection by introducing accumulation per dispatch.
-- Fixed code object disassembly for missing function inlining information.
-- Fixed a queue preemption error and `HSA_STATUS_ERROR_INVALID_PACKET_FORMAT` error for stochastic PC sampling on MI300X, leading to more stable runs.
-- Fixed the system hang issue for host-trap PC sampling on AMD Instinct MI300X.
-- Fixed a `rocpd` counter collection issue when counter collection alone is enabled. The `rocpd_kernel_dispatch` table is now populated from counter data instead of kernel_dispatch data.
-- Fixed `rocprofiler_*_id_t` structs for inconsistency related to a "null" handle:
-  - The correct definition for a null handle is `.handle = 0`, while some definitions previously used `UINT64_MAX`.
-- Fixed the kernel trace CSV output generated by `rocpd`.
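-
-The `rocprofiler_configure` change above is easiest to see from a tool's side. The following is a minimal, illustrative stub only; it assumes the public ROCprofiler-SDK registration header and the documented `rocprofiler_configure` entry-point signature, and the library-loading mechanism mentioned in the comment may differ on your setup.
-
-```cpp
-// Build as a shared library and load it alongside a profiled application
-// (for example through the tool-library environment variable used by rocprofv3;
-// the exact loading mechanism is an assumption here).
-#include <rocprofiler-sdk/registration.h>
-
-extern "C" rocprofiler_tool_configure_result_t*
-rocprofiler_configure(uint32_t                 version,
-                      const char*              runtime_version,
-                      uint32_t                 priority,
-                      rocprofiler_client_id_t* client_id)
-{
-    (void) version;
-    (void) runtime_version;
-    (void) priority;
-    if(client_id) client_id->name = "no-op-tool";
-
-    // Returning nullptr opts this tool out of configuration. With the change
-    // noted above, the SDK no longer creates its background thread when every
-    // loaded tool returns nullptr here.
-    return nullptr;
-}
-```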
- -### **rocPyDecode** (0.6.0) - -#### Added - -* ``rocpyjpegdecode`` package. -* ``src/rocjpeg`` source new subfolder. -* ``samples/rocjpeg`` new subfolder. - -#### Changed -* Minimum version for rocdecode and rocjpeg updated to V1.0.0. - -### **rocRAND** (4.0.0) - -#### Added - -* Support for gfx950. -* Additional unit tests for `test_log_normal_distribution.cpp`, `test_normal_distribution.cpp`, `test_rocrand_mtgp32_prng.cpp`, `test_rocrand_scrambled_sobol32_qrng.cpp`, `test_rocrand_scrambled_sobol64_qrng.cpp`, `test_rocrand_sobol32_qrng.cpp`, `test_rocrand_sobol64_qrng.cpp`, `test_rocrand_threefry2x32_20_prng.cpp`, `test_rocrand_threefry2x64_20_prng.cpp`, `test_rocrand_threefry4x32_20_prng.cpp`, `test_rocrand_threefry4x64_20_prng.cpp`, and `test_uniform_distribution.cpp`. -* New unit tests for `include/rocrand/rocrand_discrete.h` in `test_rocrand_discrete.cpp`, `include/rocrand/rocrand_mrg31k3p.h` in `test_rocrand_mrg31k3p_prng.cpp`, `include/rocrand/rocrand_mrg32k3a.h` in `test_rocrand_mrg32k3a_prng.cpp`, and `include/rocrand/rocrand_poisson.h` in `test_rocrand_poisson.cpp`. - -#### Changed - -* Changed the return type for `rocrand_generate_poisson` for the `SOBOL64` and `SCRAMBLED_SOBOL64` engines. -* Changed the unnecessarily large 64-bit data type for constants used for skipping in `MRG32K3A` to the 32-bit data type. -* Updated several `gfx942` auto tuning parameters. -* Modified error handling and expanded the error information for the case of double-deallocation of the (scrambled) sobol32 and sobol64 constants and direction vectors. - -#### Removed - -* Removed inline assembly and the `ENABLE_INLINE_ASM` CMake option. Inline assembly was used to optimize multiplication in the Mrg32k3a and Philox 4x32-10 generators. It is no longer needed because the current HIP compiler is able to produce code with the same or better performance. -* Removed instances of the deprecated clang definition `__AMDGCN_WAVEFRONT_SIZE`. -* Removed C++14 support. Beginning with this release, only C++17 is supported. -* Directly accessing the (scrambled) sobol32 and sobol64 constants and direction vectors is no longer supported. For: - * `h_scrambled_sobol32_constants`, use `rocrand_get_scramble_constants32` instead. - * `h_scrambled_sobol64_constants`, use `rocrand_get_scramble_constants64` instead. - * `rocrand_h_sobol32_direction_vectors`, use `rocrand_get_direction_vectors32` instead. - * `rocrand_h_sobol64_direction_vectors`, use `rocrand_get_direction_vectors64` instead. - * `rocrand_h_scrambled_sobol32_direction_vectors`, use `rocrand_get_direction_vectors32` instead. - * `rocrand_h_scrambled_sobol64_direction_vectors`, use `rocrand_get_direction_vectors64` instead. - -#### Resolved issues - -* Fixed an issue where `mt19937.hpp` would cause kernel errors during auto tuning. - -#### Upcoming changes - -* Deprecated the rocRAND Fortran API in favor of hipfort. - -### **ROCr Debug Agent** (2.1.0) - -#### Added - -* The `-e` and `--precise-alu-exceptions` flags to enable precise ALU exceptions reporting on supported configurations. - -### **ROCr Runtime** (1.18.0) - -#### Added - -* New API `hsa_amd_memory_get_preferred_copy_engine` to get preferred copy engine that can be used to when calling `hsa_amd_memory_async_copy_on_engine`. -* New API `hsa_amd_portable_export_dmabuf_v2` extension of existing `hsa_amd_portable_export_dmabuf` API to support new flags parameter. This allows specifying the new `HSA_AMD_DMABUF_MAPPING_TYPE_PCIE` flag when exporting dma-bufs. 
-* New flag `HSA_AMD_VMEM_ADDRESS_NO_REGISTER` adds support for new `HSA_AMD_VMEM_ADDRESS_NO_REGISTER` when calling `hsa_amd_vmem_address_reserve` API. This allows virtual address range reservations for SVM allocations to be tracked when running in ASAN mode. -* New sub query `HSA_AMD_AGENT_INFO_CLOCK_COUNTERS` returns a snapshot of the underlying driver's clock counters that can be used for profiling. - -### **rocSHMEM** (3.0.0) - -#### Added - -* Reverse Offload conduit. -* New APIs: `rocshmem_ctx_barrier`, `rocshmem_ctx_barrier_wave`, `rocshmem_ctx_barrier_wg`, `rocshmem_barrier_all`, `rocshmem_barrier_all_wave`, `rocshmem_barrier_all_wg`, `rocshmem_ctx_sync`, `rocshmem_ctx_sync_wave`, `rocshmem_ctx_sync_wg`, `rocshmem_sync_all`, `rocshmem_sync_all_wave`, `rocshmem_sync_all_wg`, `rocshmem_init_attr`, `rocshmem_get_uniqueid`, and `rocshmem_set_attr_uniqueid_args`. -* `dlmalloc` based allocator. -* XNACK support. -* Support for initialization with MPI communicators other than `MPI_COMM_WORLD`. - -#### Changed - -* Changed collective APIs to use `_wg` suffix rather than `_wg_` infix. - -#### Resolved issues - -* Resolved segfault in `rocshmem_wg_ctx_create`, now provides `nullptr` if `ctx` cannot be created. - -### **rocSOLVER** (3.30.0) - -#### Added - -* Hybrid computation support for existing routines: STEQR - -#### Optimized - -* Improved the performance of BDSQR and downstream functions, such as GESVD. -* Improved the performance of STEQR and downstream functions, such as SYEV/HEEV. -* Improved the performance of LARFT and downstream functions, such as GEQR2 and GEQRF. - -#### Resolved issues - -* Fixed corner cases that can produce NaNs in SYEVD for valid input matrices. - -### **rocSPARSE** (4.0.2) - -#### Added - -* The `SpGEAM` generic routine for computing sparse matrix addition in CSR format. -* The `v2_SpMV` generic routine for computing sparse matrix vector multiplication. As opposed to the deprecated `rocsparse_spmv` routine, this routine does not use a fallback algorithm if a non-implemented configuration is encountered and will return an error in such a case. For the deprecated `rocsparse_spmv` routine, the user can enable warning messages in situations where a fallback algorithm is used by either calling the `rocsparse_enable_debug` routine upfront or exporting the variable `ROCSPARSE_DEBUG` (with the shell command `export ROCSPARSE_DEBUG=1`). -* Half float mixed precision to `rocsparse_axpby` where X and Y use `float16` and the result and compute type use `float`. -* Half float mixed precision to `rocsparse_spvv` where X and Y use `float16` and the result and compute type use `float`. -* Half float mixed precision to `rocsparse_spmv` where A and X use `float16` and Y and the compute type use `float`. -* Half float mixed precision to `rocsparse_spmm` where A and B use `float16` and C and the compute type use `float`. -* Half float mixed precision to `rocsparse_sddmm` where A and B use `float16` and C and the compute type use `float`. -* Half float uniform precision to the `rocsparse_scatter` and `rocsparse_gather` routines. -* Half float uniform precision to the `rocsparse_sddmm` routine. -* The `rocsparse_spmv_alg_csr_rowsplit` algorithm. -* Support for gfx950. -* ROC-TX instrumentation support in rocSPARSE (not available on Windows or in the static library version on Linux). -* The `almalinux` operating system name to correct the GFortran dependency. - -#### Changed - -* Switch to defaulting to C++17 when building rocSPARSE from source. 
Previously rocSPARSE was using C++14 by default. - -#### Removed - -* The deprecated `rocsparse_spmv_ex` routine. -* The deprecated `rocsparse_sbsrmv_ex`, `rocsparse_dbsrmv_ex`, `rocsparse_cbsrmv_ex`, and `rocsparse_zbsrmv_ex` routines. -* The deprecated `rocsparse_sbsrmv_ex_analysis`, `rocsparse_dbsrmv_ex_analysis`, `rocsparse_cbsrmv_ex_analysis`, and `rocsparse_zbsrmv_ex_analysis` routines. - -#### Optimized - -* Reduced the number of template instantiations in the library to further reduce the shared library binary size and improve compile times. -* Allow SpGEMM routines to use more shared memory when available. This can speed up performance for matrices with a large number of intermediate products. -* Use of the `rocsparse_spmv_alg_csr_adaptive` or `rocsparse_spmv_alg_csr_default` algorithms in `rocsparse_spmv` to perform transposed sparse matrix multiplication (`C=alpha*A^T*x+beta*y`) resulted in unnecessary analysis on A and needless slowdown during the analysis phase. This has been improved by skipping the analysis when performing the transposed sparse matrix multiplication. -* Improved the user documentation. - -#### Resolved issues - -* Fixed an issue in the public headers where `extern "C"` was not wrapped by `#ifdef __cplusplus`, which caused failures when building C programs with rocSPARSE. -* Fixed a memory access fault in the `rocsparse_Xbsrilu0` routines. -* Fixed failures that could occur in `rocsparse_Xbsrsm_solve` or `rocsparse_spsm` with BSR format when using host pointer mode. -* Fixed ASAN compilation failures. -* Fixed a failure that occurred when using const descriptor `rocsparse_create_const_csr_descr` with the generic routine `rocsparse_sparse_to_sparse`. The issue was not observed when using non-const descriptor `rocsparse_create_csr_descr` with `rocsparse_sparse_to_sparse`. -* Fixed a memory leak in the rocSPARSE handle. - -#### Upcoming changes - -* Deprecated the `rocsparse_spmv` routine. Use the `rocsparse_v2_spmv` routine instead. -* Deprecated the `rocsparse_spmv_alg_csr_stream` algorithm. Use the `rocsparse_spmv_alg_csr_rowsplit` algorithm instead. -* Deprecated the `rocsparse_itilu0_alg_sync_split_fusion` algorithm. Use one of `rocsparse_itilu0_alg_async_inplace`, `rocsparse_itilu0_alg_async_split`, or `rocsparse_itilu0_alg_sync_split` instead. - -### **rocThrust** (4.0.0) - -#### Added - -* Additional unit tests for: binary_search, complex, c99math, catrig, ccosh, cexp, clog, csin, csqrt, and ctan. -* `test_param_fixtures.hpp` to store all the parameters for typed test suites. -* `test_real_assertions.hpp` to handle unit test assertions for real numbers. -* `test_imag_assertions.hpp` to handle unit test assertions for imaginary numbers. -* `clang++` is now used to compile google benchmarks on Windows. -* Support for gfx950. -* Merged changes from upstream CCCL/thrust 2.6.0. - -#### Changed - -* Updated the required version of Google Benchmark from 1.8.0 to 1.9.0. -* Renamed `cpp14_required.h` to `cpp_version_check.h`. -* Refactored `test_header.hpp` into `test_param_fixtures.hpp`, `test_real_assertions.hpp`, `test_imag_assertions.hpp`, and `test_utils.hpp`. This is done to prevent unit tests from having access to modules that they're not testing. This will improve the accuracy of code coverage reports. - -#### Removed - -* `device_malloc_allocator.h` has been removed. This header file was unused and should not impact users. -* Removed C++14 support. Only C++17 is now supported. -* `test_header.hpp` has been removed. 
The `HIP_CHECK` function, as well as the `test` and `inter_run_bwr` namespaces, have been moved to `test_utils.hpp`. -* `test_assertions.hpp` has been split into `test_real_assertions.hpp` and `test_imag_assertions.hpp`. - -#### Resolved issues - -* Fixed an issue with internal calls to unqualified `distance()` which would be ambiguous due to the visible implementation through ADL. - -#### Known issues - -* The order of the values being compared by `thrust::exclusive_scan_by_key` and `thrust::inclusive_scan_by_key` can change between runs when integers are being compared. This can cause incorrect output when a non-commutative operator such as division is being used. - -#### Upcoming changes - -* `thrust::device_malloc_allocator` is deprecated as of this version. It will be removed in an upcoming version. - -### **rocWMMA** (2.0.0) - -#### Added - -* Internal register layout transforms to support interleaved MMA layouts. -* Support for the gfx950 target. -* Mixed input `BF8`/`FP8` types for MMA support. -* Fragment scheduler API objects to embed thread block cooperation properties in fragments. - -#### Changed - -* Augmented load/store/MMA internals with static loop unrolling. -* Updated linkage of `rocwmma::synchronize_workgroup` to inline. -* rocWMMA `mma_sync` API now supports `wave tile` fragment sizes. -* rocWMMA cooperative fragments are now expressed with fragment scheduler template arguments. -* rocWMMA cooperative fragments now use the same base API as non-cooperative fragments. -* rocWMMA cooperative fragments register usage footprint has been reduced. -* rocWMMA fragments now support partial tile sizes with padding. - -#### Removed - -* Support for the gfx940 and gfx941 targets. -* The rocWMMA cooperative API. -* Wave count template parameters from transforms APIs. - -#### Optimized - -* Added internal flow control barriers to improve assembly code generation and overall performance. -* Enabled interleaved layouts by default in MMA to improve overall performance. - -#### Resolved issues - -* Fixed a validation issue for small precision compute types `< B32` on gfx9. -* Fixed CMake validation of compiler support for `BF8`/`FP8` types. - -### **RPP** (2.0.0) - -#### Added - -* Bitwise NOT, Bitwise AND, and Bitwise OR augmentations on HOST (CPU) and HIP backends. -* Tensor Concat augmentation on HOST (CPU) and HIP backends. -* JPEG Compression Distortion augmentation on HIP backend. -* `log1p`, defined as `log (1 + x)`, tensor augmentation support on HOST (CPU) and HIP backends. -* JPEG Compression Distortion augmentation on HOST (CPU) backend. - -#### Changed - -* Handle creation and destruction APIs have been consolidated. Use `rppCreate()` for handle initialization and `rppDestroy()` for handle destruction. -* The `logical_operations` function category has been renamed to `bitwise_operations`. -* TurboJPEG package installation enabled for RPP Test Suite with `sudo apt-get install libturbojpeg0-dev`. Instructions have been updated in utilities/test_suite/README.md. -* The `swap_channels` augmentation has been changed to `channel_permute`. 
`channel_permute` now also accepts a new argument, `permutationTensor` (a pointer to an unsigned int tensor), that provides the permutation order used to swap the RGB channels of each input image in the batch in any order:
-
-  `RppStatus rppt_swap_channels_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, rppHandle_t rppHandle);`
-
-  changed to:
-
-  `RppStatus rppt_channel_permute_host(RppPtr_t srcPtr, RpptDescPtr srcDescPtr, RppPtr_t dstPtr, RpptDescPtr dstDescPtr, Rpp32u *permutationTensor, rppHandle_t rppHandle);`
-
-#### Removed
-
-* Older versions of RPP handle creation, including `rppCreateWithBatchSize()`, `rppCreateWithStream()`, and `rppCreateWithStreamAndBatchSize()`. These have been replaced with `rppCreate()`.
-* Older versions of the RPP handle destruction API, including `rppDestroyGPU()` and `rppDestroyHost()`. These have been replaced with `rppDestroy()`.
-
-#### Resolved issues
-
-* Test package - Debian packages now install the required dependencies.
-
-### **Tensile** (4.44.0)
-
-#### Added
-
-- Support for gfx950.
-- Code object compression via bundling.
-- Support for non-default HIP SDK installations on Windows.
-- Master solution library documentation.
-- Compiler version-dependent assembler and architecture capabilities.
-- Documentation from the GitHub Wiki to ROCm docs.
-
-#### Changed
-
-- Loosened the check for CLI compiler choices.
-- Introduced 4-tuple targets for bundler invocations.
-- Introduced PATHEXT extensions on Windows when searching for toolchain components.
-- Enabled passing fully qualified paths to toolchain components.
-- Enabled environment variable overrides when searching for a ROCm stack.
-- Improved the default toolchain configuration.
-- Ignored f824 flake errors.
-
-#### Removed
-
-- Support for the gfx940 and gfx941 targets.
-- Unused tuning files.
-- Disabled tests.
-
-#### Resolved issues
-
-- Fixed the configure-time path not being invoked at build time.
-- Fixed find_package for msgpack to work with versions 5 and 6.
-- Fixed RHEL 9 testing.
-- Fixed gfx908 builds.
-- Fixed the 'argument list too long' error.
-- Fixed a version typo in the 6.3 changelog.
-- Fixed improper use of aliases as nested namespace specifiers.
-
-## ROCm 6.4.3
-
-See the [ROCm 6.4.3 release notes](https://rocm.docs.amd.com/en/docs-6.4.3/about/release-notes.html)
-for a complete overview of this release.
-
-### **ROCm SMI** (7.7.0)
-
-#### Added
-
-- Support for getting the GPU Board voltage.
-
-```{note}
-See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions.
-```
-
-## ROCm 6.4.2
-
-See the [ROCm 6.4.2 release notes](https://rocm.docs.amd.com/en/docs-6.4.2/about/release-notes.html)
-for a complete overview of this release.
-
-### **AMD SMI** (25.5.1)
-
-#### Added
-
-- Compute Unit Occupancy information per process.
-
-- Support for getting the GPU Board voltage.
-
-- New firmware PLDM_BUNDLE. `amd-smi firmware` can now show the PLDM Bundle on supported systems.
-
-- `amd-smi ras --afid --cper-file ` to decode CPER records.
-
-#### Changed
-
-- Padded `asic_serial` in `amdsmi_get_asic_info` with 0s.
-
-- Renamed field `COMPUTE_PARTITION` to `ACCELERATOR_PARTITION` in the CLI call `amd-smi --partition`.
-
-#### Resolved issues
-
-- Corrected the VRAM memory calculation in `amdsmi_get_gpu_process_list`. Previously, the VRAM memory usage reported by `amdsmi_get_gpu_process_list` was inaccurate because it was calculated using KB instead of KiB.
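-
-The corrected per-process VRAM accounting can be observed directly through the AMD SMI C API. The sketch below is illustrative only: it assumes the usual discovery pattern (`amdsmi_init`, `amdsmi_get_socket_handles`, `amdsmi_get_processor_handles`) and the `amdsmi_get_gpu_process_list`/`amdsmi_proc_info_t` layout from `amdsmi.h`; field names such as `memory_usage.vram_mem` and the list-capacity convention should be verified against your installed header.
-
-```cpp
-#include <cstdio>
-#include <vector>
-#include <amd_smi/amdsmi.h>
-
-int main()
-{
-    amdsmi_init(AMDSMI_INIT_AMD_GPUS);
-
-    // Enumerate sockets, then the GPU processors on each socket.
-    uint32_t socket_count = 0;
-    amdsmi_get_socket_handles(&socket_count, nullptr);
-    std::vector<amdsmi_socket_handle> sockets(socket_count);
-    amdsmi_get_socket_handles(&socket_count, sockets.data());
-
-    for (auto socket : sockets) {
-        uint32_t processor_count = 0;
-        amdsmi_get_processor_handles(socket, &processor_count, nullptr);
-        std::vector<amdsmi_processor_handle> processors(processor_count);
-        amdsmi_get_processor_handles(socket, &processor_count, processors.data());
-
-        for (auto processor : processors) {
-            uint32_t max_processes = 64; // capacity of the list below (assumption)
-            std::vector<amdsmi_proc_info_t> procs(max_processes);
-            if (amdsmi_get_gpu_process_list(processor, &max_processes, procs.data()) ==
-                AMDSMI_STATUS_SUCCESS) {
-                for (uint32_t i = 0; i < max_processes; ++i) {
-                    // Per-process VRAM usage; the 25.5.1 fix above corrects the
-                    // KB-vs-KiB calculation behind this value.
-                    std::printf("pid %u vram %llu\n",
-                                static_cast<unsigned>(procs[i].pid),
-                                static_cast<unsigned long long>(procs[i].memory_usage.vram_mem));
-                }
-            }
-        }
-    }
-
-    amdsmi_shut_down();
-    return 0;
-}
-```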
-
-```{note}
-See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions.
-```
-
-### **HIP** (6.4.2)
-
-#### Added
-
-* HIP API implementation for `hipEventRecordWithFlags`, which records an event in the specified stream with flags.
-* Support for the pointer attribute `HIP_POINTER_ATTRIBUTE_CONTEXT`.
-* Support for the flags `hipEventWaitDefault` and `hipEventWaitExternal`.
-
-#### Optimized
-
-* Improved the implementation of `hipEventSynchronize`; the HIP runtime now makes internal callbacks non-blocking operations to improve performance.
-
-#### Resolved issues
-
-* Issue of dependency on `libgcc-s1` during rocm-dev installation on Debian Buster. The HIP runtime removed this Debian package dependency and uses `libgcc1` instead on this distro.
-* Build issue for `COMGR` dynamic load on Fedora and other distros. The HIP runtime no longer links against `libamd_comgr.so`.
-* Failure in the `hipStreamDestroy` API when the stream type is `hipStreamLegacy`. The API now returns the error code `hipErrorInvalidResourceHandle` in this condition.
-* Kernel launch errors, such as `shared object initialization failed`, `invalid device function`, or `kernel execution failure`. The HIP runtime now loads `COMGR` properly, taking into account the file name and the mapped image.
-* Memory access fault in some applications. The HIP runtime fixed offset accumulation in the memory address.
-* A memory leak in virtual memory management (VMM). The HIP runtime now uses the handle size for the allocated memory range instead of the actual size of physical memory, which fixes the address clash with VMM.
-* Large memory allocation issue. The HIP runtime now checks GPU video RAM and system RAM properly and sets size limits during memory allocation on either the host or the GPU device.
-* Support for the `hipDeviceMallocContiguous` flag in `hipExtMallocWithFlags()`. It now enables `HSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAG` in the memory pool allocation on the GPU device.
-* Random memory segmentation fault when handling `GraphExec` object release and device synchronization. The HIP runtime now uses an internal device synchronize function in `__hipUnregisterFatBinary`.
-
-### **hipBLASLt** (0.12.1)
-
-#### Added
-
-* Support for gfx1151 on Linux, complementing the previous support in the HIP SDK for Windows.
-
-### **RCCL** (2.22.3)
-
-#### Added
-
-* Added support for the LL128 protocol on gfx942.
-
-### **rocBLAS** (4.4.1)
-
-#### Resolved issues
-
-* rocBLAS might have failed to produce correct results for cherk/zherk on gfx90a/gfx942 with problem sizes k > 500 due to the imaginary portion on the C matrix diagonal not being zero. rocBLAS now zeros the imaginary portion.
-
-### **ROCm Compute Profiler** (3.1.1)
-
-#### Added
-
-* 8-bit floating point (FP8) metrics support for AMD Instinct MI300 GPUs.
-* Additional data types for roofline: FP8, FP16, BF16, FP32, FP64, I8, I32, I64 (dependent on the GPU architecture).
-* Data type selection option ``--roofline-data-type / -R`` for roofline profiling. The default data type is FP32.
-
-#### Changed
-
-* Changed the dependency from `rocm-smi` to `amd-smi`.
-
-#### Resolved issues
-
-* Fixed a crash related to Agent ID caused by the new format of the `rocprofv3` output CSV file.
-
-### **ROCm Systems Profiler** (1.0.2)
-
-#### Optimized
-
-* Improved readability of the OpenMP target offload traces by showing them on a single Perfetto track.
- -#### Resolved issues - -* Fixed the file path to the script that merges Perfetto files from multi-process MPI runs. The script has also been renamed from `merge-multiprocess-output.sh` to `rocprof-sys-merge-output.sh`. - -### **ROCm Validation Suite** (1.1.0) - -#### Added - -* NPS2/DPX and NPS4/CPX partition modes support for AMD Instinct MI300X. - -### **rocPRIM** (3.4.1) - -#### Upcoming changes - -* Changes to the template parameters of warp and block algorithms will be made in an upcoming release. -* Due to an upcoming compiler change, the following symbols related to warp size have been marked as deprecated and will be removed in an upcoming major release: - * `rocprim::device_warp_size()`. This has been replaced by `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()` for compile-time constants. Use these when allocating global or shared memory. For run-time constants, use `rocprim::arch::wavefront::size()`. - * `rocprim::warp_size()` - * `ROCPRIM_WAVEFRONT_SIZE` - -* The default scan accumulator types for device-level scan algorithms will be changed in an upcoming release, resulting in a breaking change. Previously, the default accumulator type was set to the input type for the inclusive scans and to the initial value type for the exclusive scans. This could lead to unexpected overflow if the input or initial type was smaller than the output type when the accumulator type wasn't explicitly set using the `AccType` template parameter. The new default accumulator types will be set to the type that results when the input or initial value type is applied to the scan operator. - - The following is the complete list of affected functions and how their default accumulator types are changing: - - * `rocprim::inclusive_scan` - * current default: `class AccType = typename std::iterator_traits::value_type>` - * future default: `class AccType = rocprim::invoke_result_binary_op_t::value_type, BinaryFunction>` - * `rocprim::deterministic_inclusive_scan` - * current default: `class AccType = typename std::iterator_traits::value_type>` - * future default: `class AccType = rocprim::invoke_result_binary_op_t::value_type, BinaryFunction>` - * `rocprim::exclusive_scan` - * current default: `class AccType = detail::input_type_t>` - * future default: `class AccType = rocprim::invoke_result_binary_op_t, BinaryFunction>` - * `rocprim::deterministic_exclusive_scan` - * current default: `class AccType = detail::input_type_t>` - * future default: `class AccType = rocprim::invoke_result_binary_op_t, BinaryFunction>` - -* `rocprim::load_cs` and `rocprim::store_cs` are deprecated and will be removed in an upcoming release. Alternatively, you can use `rocprim::load_nontemporal` and `rocprim::store_nontemporal` to load and store values in specific conditions (like bypassing the cache) for `rocprim::thread_load` and `rocprim::thread_store`. - -### **rocSHMEM** (2.0.1) - -#### Resolved issues - -* Incorrect output for `rocshmem_ctx_my_pe` and `rocshmem_ctx_n_pes`. -* Multi-team errors by providing team specific buffers in `rocshmem_ctx_wg_team_sync`. -* Missing implementation of `rocshmem_g` for IPC conduit. - -### **rocSOLVER** (3.28.2) - -#### Added - -* Hybrid computation support for existing routines, such as STERF. -* SVD for general matrices based on Cuppen's Divide and Conquer algorithm: - - GESDD (with batched and strided\_batched versions) - -#### Optimized - -* Reduced the device memory requirements for STEDC, SYEVD/HEEVD, and SYGVD/HEGVD. 
-* Improved the performance of STEDC and divide and conquer Eigensolvers. -* Improved the performance of SYTRD, the initial step of the Eigensolvers that start with the tridiagonalization of the input matrix. - -## ROCm 6.4.1 - -See the [ROCm 6.4.1 release notes](https://rocm.docs.amd.com/en/docs-6.4.1/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (25.4.2) - -#### Added - -* Dumping CPER entries from RAS tool `amdsmi_get_gpu_cper_entries()` to Python and C APIs. - - Dumping CPER entries consist of `amdsmi_cper_hdr_t`. - - Dumping CPER entries is also enabled in the CLI interface through `sudo amd-smi ras --cper`. -* `amdsmi_get_gpu_busy_percent` to the C API. - -#### Changed - -* Modified VRAM display for `amd-smi monitor -v`. - -#### Optimized - -* Improved load times for CLI commands when the GPU has multiple partitions. - -#### Resolved issues - -* Fixed partition enumeration in `amd-smi list -e`, `amdsmi_get_gpu_enumeration_info()`, `amdsmi_enumeration_info_t`, `drm_card`, and `drm_render` fields. - -#### Known issues - -* When using the `--follow` flag with `amd-smi ras --cper`, CPER entries are not streamed continuously as intended. This will be fixed in an upcoming ROCm release. - -> [!NOTE] -> See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions. - -### **HIP** (6.4.1) - -#### Added - -* New log mask enumeration `LOG_COMGR` enables logging precise code object information. - -#### Changed - -* HIP runtime uses device bitcode before SPIRV. -* The implementation of preventing `hipLaunchKernel` latency degradation with number of idle streams is reverted or disabled by default. - -#### Optimized - -* Improved kernel logging includes de-mangling shader names. -* Refined implementation in HIP APIs `hipEventRecords` and `hipStreamWaitEvent` for performance improvement. - -#### Resolved issues - -* Stale state during the graph capture. The return error was fixed, HIP runtime now always uses the latest dependent nodes during `hipEventRecord` capture. -* Segmentation fault during kernel execution. HIP runtime now allows maximum stack size as per ISA on the GPU device. - -### **hipBLASLt** (0.12.1) - -#### Resolved issues - -* Fixed an accuracy issue for some solutions using an `FP32` or `TF32` data type with a TT transpose. - -### **RCCL** (2.22.3) - -#### Changed - -* MSCCL++ is now disabled by default. To enable it, set `RCCL_MSCCLPP_ENABLE=1`. - -#### Resolved issues - -* Fixed an issue where early termination, in rare circumstances, could cause the application to stop responding by adding synchronization before destroying a proxy thread. -* Fixed the accuracy issue for the MSCCLPP `allreduce7` kernel in graph mode. - -#### Known issues - -* When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault. The recommended workaround is to disable MSCCL with `export RCCL_MSCCL_ENABLE=0`. - This issue will be fixed in a future ROCm release. - -* Within the RCCL-UnitTests test suite, failures occur in tests ending with the - `.ManagedMem` and `.ManagedMemGraph` suffixes. These failures only affect the - test results and do not affect the RCCL component itself. This issue will be - resolved in a future ROCm release. - -### **rocALUTION** (3.2.3) - -#### Added - -* The `-a` option has been added to the `rmake.py` build script. 
This option allows you to select specific architectures when building on Microsoft Windows. - -#### Resolved issues - -* Fixed an issue where the `HIP_PATH` environment variable was being ignored when compiling on Microsoft Windows. - -### **ROCm Data Center Tool** (0.3.0) - -#### Added - -- Support for GPU partitions. -- `RDC_FI_GPU_BUSY_PERCENT` metric. - -#### Changed - -- Updated `rdc_field` to align with `rdc_bootstrap` for current metrics. - -#### Resolved issues - -- Fixed [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/docs-6.4.0/index.html) eval metrics and memory leaks. - -### **ROCm SMI** (7.5.0) - -#### Resolved issues - -- Fixed partition enumeration. It now refers to the correct DRM Render and Card paths. - -> [!NOTE] -> See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions. - -### **ROCm Systems Profiler** (1.0.1) - -#### Added - -* How-to document for [network performance profiling](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/how-to/nic-profiling.html) for standard Network Interface Cards (NICs). - -#### Resolved issues - -* Fixed a build issue with Dyninst on GCC 13. - -### **ROCr Runtime** (1.15.0) - -#### Resolved issues - -* Fixed a rare occurrence issue on AMD Instinct MI25, MI50, and MI100 GPUs, where the `SDMA` copies might start before the dependent Kernel finishes and could cause memory corruption. - -## ROCm 6.4.0 - -See the [ROCm 6.4.0 release notes](https://rocm.docs.amd.com/en/docs-6.4.0/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (25.3.0) - -#### Added - -- Added enumeration mapping `amdsmi_get_gpu_enumeration_info()` to Python and C APIs. The mapping is also enabled in the CLI interface via `amd-smi list -e`. - -- Added dynamic virtualization mode detection. - - Added new C and Python API `amdsmi_get_gpu_virtualization_mode_info`. - - Added new C and Python enum `amdsmi_virtualization_mode_t`. - -- Added TVIOL_ACTIVE to `amd-smi monitor`. - -- Added support for GPU metrics 1.7 to `amdsmi_get_gpu_metrics_info()`. - -- Added new API `amdsmi_get_gpu_xgmi_link_status()` and CLI `amd-smi xgmi --link-status`. - -- Added fclk and socclk info to `amd-smi metric -c/--clock`. - -- Added new command `amd-smi set -c/--clock-level`. - -- Added new command `amd-smi static -C/--clock`. - -#### Changed - -- Updated AMD SMI library version number format to reflect changes in backward compatibility and offer more semantic versioning. - - Removed year from AMD SMI library version number. - - Version format changed from 25.3.0.0 (Year.Major.Minor.Patch) to 25.3.0 (Major.Minor.Patch). - - Removed year in all version references. - -- Added new Python dependencies: `python3-setuptools` and `python3-wheel`. - -- Removed initialization requirements for `amdsmi_get_lib_version()` and added `amdsmi_get_rocm_version()` to the Python API & CLI. - -- Added an additional argument `sensor_ind` to `amdsmi_get_power_info()`. - - This change breaks previous C API calls and will require a change. - - Python API now accepts `sensor_ind` as an optional argument. This does not impact previous usage. - -- Deprecated enum `AMDSMI_NORMAL_STRING_LENGTH` in favor of `AMDSMI_MAX_STRING_LENGTH`. - -- Changed to use thread local mutex by default. - - Most `sysfs` reads do not require cross-process level mutex and writes to `sysfs` should be protected by the kernel already. 
- - Users can still switch to the old behavior by setting the environment variable `AMDSMI_MUTEX_CROSS_PROCESS=1`. - -- Changed `amdsmi_vram_vendor_type_t` enum names impacting the `amdsmi_vram_info_t` structure. This change also impacts the usage of the `vram_vendor` output of `amdsmi_get_gpu_vram_info()`. - -- Changed the `amdsmi_nps_caps_t` struct impacting `amdsmi_memory_partition_config_t`, `amdsmi_accelerator_partition_t`, `amdsmi_accelerator_partition_profile_config_t`. - Affected functions are: - - - `amdsmi_get_gpu_memory_partition_config()` - - - `amdsmi_get_gpu_accelerator_partition_profile()` - - - `amdsmi_get_gpu_accelerator_partition_profile_config()` - -- Corrected CLI CPU argument name. `--cpu-pwr-svi-telemtry-rails` is now `--cpu-pwr-svi-telemetry-rails`. - -- Added amdgpu driver version and amd_hsmp driver version to the `amd-smi version` command. - -- All `amd-smi set` and `amd-smi reset` options are now mutually exclusive. You can now only use one `set` option as a time. - -- Changed the name of the `power` field to `energy_accumulator` in the Python API for `amdsmi_get_energy_count()`. - -- Added violation status output for Graphics Clock Below Host Limit to `amd-smi` CLI: `amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`. - Users can retrieve violation status through either our Python or C++ APIs. Only available for MI300 Series+ ASICs. - -- Updated API `amdsmi_get_violation_status()` structure and CLI `amdsmi_violation_status_t` to include GFX Clk below host limit. - -- Updated API `amdsmi_get_gpu_vram_info()` structure and CLI `amd-smi static --vram`. - -#### Removed - -- Removed `GFX_BUSY_ACC` from `amd-smi metric --usage` as it did not provide helpful output to users. - -#### Optimized - -- Added additional help information to `amd-smi set --help` command. The subcommands now detail what values are accepted as input. - -- Modified `amd-smi` CLI to allow case insensitive arguments if the argument does not begin with a single dash. - -- Converted `xgmi` read and write from KBs to dynamically selected readable units. - -#### Resolved issues - -- Fixed `amdsmi_get_gpu_asic_info` and `amd-smi static --asic` not displaying graphics version correctly for Instinct MI200 Series, Instinct MI100 Series, and RDNA3-based GPUs. - -#### Known issues - -- AMD SMI only reports 63 GPU devices when setting CPX on all 8 GPUs. When setting CPX as a partition mode, there is a DRM node limitation of 64. - -This is a known limitation of the Linux kernel; not the driver. Other drivers, such as those using PCIe space (for example, `ast`), might occupy the necessary DRM nodes. You can check the number of DRM nodes using `ls /sys/class/drm`. - -Some workaround options are as follows: - - - Remove other devices occupying DRM nodes. - - Recommended steps for removing unnecessary drivers: - - 1. Unload amdgpu - `sudo rmmod amdgpu`. - - 2. Remove unnecessary driver(s) - ex. `sudo rmmod ast`. - - 3. Reload amgpu - `sudo modprobe amdgpu`. - - 4. Confirm `amd-smi list` reports all nodes (this can vary per MI ASIC). - - - Update your OS kernel. - - - Build and install your own kernel. - -#### Upcoming changes - -- The `AMDSMI_LIB_VERSION_YEAR` enum and API fields will be deprecated in a future ROCm release. - -- The `pasid` field in struct `amdsmi_process_info_t` will be deprecated in a future ROCm release. 
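-
-As an aside on the `amdsmi_get_power_info()` change noted above: the entry says the added `sensor_ind` argument breaks existing C callers, so the sketch below shows what an updated call might look like. It is only a sketch; the count-then-fill enumeration pattern, the placement of `sensor_ind` (0 assumed to mean the primary sensor), and the `average_socket_power` field name are assumptions to verify against your installed `amdsmi.h`.
-
-```cpp
-#include <amd_smi/amdsmi.h>
-#include <cstdio>
-#include <vector>
-
-int main() {
-    if (amdsmi_init(AMDSMI_INIT_AMD_GPUS) != AMDSMI_STATUS_SUCCESS) return 1;
-
-    // Count-then-fill enumeration of sockets and processors.
-    uint32_t socket_count = 0;
-    amdsmi_get_socket_handles(&socket_count, nullptr);
-    std::vector<amdsmi_socket_handle> sockets(socket_count);
-    amdsmi_get_socket_handles(&socket_count, sockets.data());
-
-    for (auto socket : sockets) {
-        uint32_t device_count = 0;
-        amdsmi_get_processor_handles(socket, &device_count, nullptr);
-        std::vector<amdsmi_processor_handle> devices(device_count);
-        amdsmi_get_processor_handles(socket, &device_count, devices.data());
-
-        for (auto dev : devices) {
-            amdsmi_power_info_t info{};
-            // Assumed updated signature: sensor_ind now precedes the output struct.
-            if (amdsmi_get_power_info(dev, 0, &info) == AMDSMI_STATUS_SUCCESS) {
-                std::printf("average socket power: %llu W\n",
-                            (unsigned long long)info.average_socket_power);
-            }
-        }
-    }
-
-    amdsmi_shut_down();
-    return 0;
-}
-```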
-
-> [!NOTE]
-> See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions.
-
-### **AMDMIGraphX** (2.12.0)
-
-#### Added
-
-* Support for gfx1201.
-* hipBLASLt support for contiguous transpose GEMM fusion and GEMM pointwise fusions for improved performance.
-* Support for hardware-specific FP8 datatypes (FP8 OCP and FP8 FNUZ).
-* Support for the BF16 datatype.
-* ONNX operator support for `com.microsoft.MultiHeadAttention`, `com.microsoft.NhwcConv`, and `com.microsoft.MatMulIntegerToFloat`.
-* The `migraphx-driver` can now produce output for Netron.
-* The `migraphx-driver` now includes a `time` parameter (similar to `perf`) that is more accurate for very fast kernels.
-* An end-to-end Stable Diffusion 3 example with an option to disable the T5 encoder on VRAM-limited GPUs has been added.
-* Support to track broadcast axes in `shape_transform_descriptor`.
-* Support for unsigned types with `rocMLIR`.
-* Script to convert mxr files to ONNX models.
-* The `MIGRAPHX_SET_GEMM_PROVIDER` environment variable to choose between rocBLAS and hipBLASLt. Set `MIGRAPHX_SET_GEMM_PROVIDER` to `rocblas` to use rocBLAS, or to `hipblaslt` to use hipBLASLt.
-
-#### Changed
-
-* Switched to using hipBLASLt instead of rocBLAS (except for the gfx90a GPU architecture).
-* Included the min/max/median of the `perf` run as part of the summary report.
-* Enabled non-packed inputs for `rocMLIR`.
-* Always output a packed type for q/dq after determining non-packed tensors were inefficient.
-* Even if using NHWC, MIGraphX will always convert group convolutions to NCHW for improved performance.
-* Renamed `layout_nhwc` to `layout_convolution` and ensured that either the weights have the same layout as the inputs or the input and weights are both set to NHWC.
-* The minimum CMake version is now 3.27.
-
-#### Removed
-
-* Removed `fp8e5m2fnuz` rocBLAS support.
-* `__AMDGCN_WAVEFRONT_SIZE` has been deprecated.
-* Removed a warning that printed to stdout when using FP8 types.
-* Removed the zero-point parameter for dequantizelinear when it is zero.
-
-#### Optimized
-
-* Prefill buffers when MLIR produces a multioutput buffer.
-* Improved the resize operator performance, which should improve the overall performance of models that use it.
-* Allowed the `reduce` operator to be split across an axis to improve fusion performance. The `MIGRAPHX_SPLIT_REDUCE_SIZE` environment variable has been added to allow the minimum size of the reduction to be adjusted for a possible model-specific performance improvement.
-* Added the `MIGRAPHX_DISABLE_PASSES` environment variable for debugging.
-* Added the `MIGRAPHX_MLIR_DUMP` environment variable, which can be set to a folder where individual final rocMLIR modules are saved for investigation.
-* Improved the C++ API to allow onnxruntime access to fp8 quantization.
-
-#### Resolved issues
-
-* Fixed multistream execution with larger models.
-* Fixed a Peephole LSTM error.
-* Fixed the BertSquad example, which could include a broken tokenizers package.
-* Fixed Attention fusion to not error with a shape mismatch when a trailing pointwise contains a literal.
-* Fixed instruction::replace() logic to handle more complex cases.
-* Fixed an issue where MatMulNBits could fail with a shape error.
-* Fixed an issue where some models might fail to compile with the error `flatten: Shapes are not in standard layout`.
-
-### **Composable Kernel** (1.1.0)
-
-#### Added
-
-* Batched CK Tile General Matrix Multiplication (GEMM) with splitK support.
-* Grouped CK Tile GEMM with splitK support.
-* CK Tile GEMM compute pipeline v3.
-* Universal CK Tile block GEMM with interwave and intrawave schedulers.
-* BF16 and INT8 WMMA GEMMs for Navi3x and Navi4x.
-* Batched GEMM with output elementwise operation optimized for gfx942.
-* Interwave scheduler for CK Tile GEMM mem pipeline.
-* Spatially local tile partitioner in CK Tile GEMM.
-* Multiple FMHA forward splitKV optimizations for decode, including the new N-Warp S-Shuffle pipeline.
-* General FMHA forward optimizations, including refined tensor view padding configurations.
-* FMHA fwd N-Warp S-Shuffle pipeline (FMHA fwd splitKV pipeline variant).
-* FMHA fwd splitKV optimization for decode (`seqlen_q=1`).
-* hdim=96 support for FMHA forward.
-* Variable-length paged KV cache support for FMHA forward.
-* Paged KV cache support in group mode FMHA fwd splitKV kernels.
-* Grouped convolution backward weight optimized irregular vector size loads.
-* NGCHW BF16 grouped convolution forward support.
-* Generic support for two-stage grouped convolution backward weight.
-* Dynamic elementwise operation selected at runtime for convolutions.
-* CK Tile transpose operator.
-* CK Tile MOE operators: fused, sorting, and smooth quant.
-* OCP FP8 support for gfx12.
-* Support for FP8, BF16, FP16, OCP FP8, BF8, and pk_int4 data types in CK Tile GEMM.
-* Support for microscaling data types: MX FP4, FP6, and FP8.
-* Support for the gfx1201 target.
-* Support for large batch tensors in grouped convolution backward data.
-* Support for grouped convolution backward weight BF16 NGCHW.
-* Support for the cshuffle algorithm in the CK Tile GEMM epilogue.
-* Backend support for PyTorch 2.6.
-* Test filters to select smoke tests or regression tests.
-* Error threshold calculation for CK Tile GEMM examples.
-
-#### Changed
-
-* Expanded code generation to support dynamic compilation using hipRTC.
-* Updated the attention forward qs_ks_vs pipeline to support hdim=512.
-
-#### Removed
-
-* Removed support for gfx940 and gfx941.
-
-#### Optimized
-
-* Improved accuracy of BFP16 convolution.
-* Improved memory access pattern for all CK Tile GEMM layouts.
-* Improved CK Tile Layernorm performance and added different quantization methods.
-
-#### Resolved issues
-
-* Fixed the CK Tile GEMM hotloop scheduler to use proper MFMA attributes.
-
-
-### **HIP** (6.4.0)
-
-#### Added
-
-* New HIP APIs:
-
-  - `hipDeviceGetTexture1DLinearMaxWidth` returns the maximum width of elements in a 1D linear texture that can be allocated on the specified device (see the sketch after this list).
-  - `hipStreamBatchMemOp` enqueues an array of batch memory operations in the stream for stream synchronization.
-  - `hipGraphAddBatchMemOpNode` creates a batch memory operation node and adds it to a graph.
-  - `hipGraphBatchMemOpNodeGetParams` returns the pointer to the parameters of the batch memory operation node.
-  - `hipGraphBatchMemOpNodeSetParams` sets parameters for the batch memory operation node.
-  - `hipGraphExecBatchMemOpNodeSetParams` sets the parameters for a batch memory operation node in the given executable graph.
-  - `hipLinkAddData` adds SPIR-V code object data to a linker instance with options.
-  - `hipLinkAddFile` adds a SPIR-V code object file to a linker instance with options.
-  - `hipLinkCreate` creates a linker instance at runtime with options.
-  - `hipLinkComplete` completes linking of the program and outputs the linker binary for use with `hipModuleLoadData`.
-  - `hipLinkDestroy` deletes a linker instance.
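-
-As a quick illustration of `hipDeviceGetTexture1DLinearMaxWidth` from the list above, the sketch below queries the 1D linear texture width limit for `float` elements. The argument order is assumed to mirror the CUDA counterpart (`cudaDeviceGetTexture1DLinearMaxWidth(size_t*, const cudaChannelFormatDesc*, int)`), so double-check it against your `hip_runtime_api.h`.
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <cstdio>
-
-int main() {
-    const int device = 0;
-    hipSetDevice(device);
-
-    // Channel descriptor for a 1D linear texture of float elements.
-    hipChannelFormatDesc desc = hipCreateChannelDesc<float>();
-
-    size_t max_width = 0;
-    hipError_t err = hipDeviceGetTexture1DLinearMaxWidth(&max_width, &desc, device);
-    if (err != hipSuccess) {
-        std::printf("query failed: %s\n", hipGetErrorString(err));
-        return 1;
-    }
-    std::printf("max 1D linear texture width: %zu float elements\n", max_width);
-    return 0;
-}
-```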
-
-#### Changed
-
-* The `roc-obj` tools are deprecated and will be removed in an upcoming release.
-
-  - Perl package installation is no longer required; users who still want the Perl-based tools must install the packages themselves.
-  - Support for ROCm Object tooling has moved into `llvm-objdump`, provided by the `rocm-llvm` package.
-
-* SDMA retainer logic has been removed from engine selection in runtime buffer copy operations.
-
-#### Optimized
-
-* `hipGraphLaunch` parallelism is improved for complex data-parallel graphs.
-* Made queue selection in command scheduling round-robin. For multi-stream execution, the HSA queue lock from the null stream is freed and no longer occupies the queue ID after the kernel in the stream finishes.
-* The HIP runtime no longer frees the bitcode object before code generation. It adds a cache that allows compiled code objects to be reused instead of recompiled, which improves performance on multi-GPU systems.
-* Runtime now uses a unified copy approach:
-
-  - Unpinned `H2D` copies are no longer blocking for sizes up to 1 MB.
-  - The kernel copy path is enabled for unpinned `H2D`/`D2H` methods.
-  - The default environment variable `GPU_FORCE_BLIT_COPY_SIZE` is set to `16`, which limits the kernel copy to sizes less than 16 KB, while larger copies are handled by the `SDMA` engine.
-  - Blit code is refactored, and ASAN instrumentation is cleaned up.
-
-#### Resolved issues
-
-* Out-of-memory error on Microsoft Windows. When the user calls `hipMalloc` with a size larger than the available device memory, the HIP runtime now allocates from the available device memory plus system memory (shared virtual memory).
-* Error of dependency on `libgcc-s1` during rocm-dev install on Debian Buster. The HIP runtime now uses `libgcc1` for this distribution.
-* Stack corruption during kernel execution. The HIP runtime now adds a maximum stack size limit based on the GPU device feature.
-
-#### Upcoming changes
-
-The following lists the backward incompatible changes planned for upcoming major ROCm releases.
-
-* Signature changes in APIs to correspond with NVIDIA CUDA APIs,
-
-  - `hiprtcCreateProgram`
-  - `hiprtcCompileProgram`
-  - `hipCtxGetApiVersion`
-
-* Behavior of `hipPointerGetAttributes` is changed to match the corresponding CUDA API in version 11 and later releases.
-* Return error/value code updates in the following HIP APIs to match the corresponding CUDA APIs,
-
-  - `hipModuleLaunchKernel`
-  - `hipExtModuleLaunchKernel`
-  - `hipModuleLaunchCooperativeKernel`
-  - `hipGetTextureAlignmentOffset`
-  - `hipTexObjectCreate`
-  - `hipBindTexture2D`
-  - `hipBindTextureToArray`
-  - `hipModuleLoad`
-  - `hipLaunchCooperativeKernelMultiDevice`
-  - `hipExtLaunchCooperativeKernelMultiDevice`
-
-* HIPRTC implementation: `hiprtc` compilation now uses the namespace `__hip_internal` instead of the standard `std` headers.
-* Stream capture mode updates in the following HIP APIs. Streams can only be captured in relaxed mode, to match the behavior of the corresponding CUDA APIs,
-
-  - `hipMallocManaged`
-  - `hipMemAdvise`
-  - `hipLaunchCooperativeKernelMultiDevice`
-  - `hipDeviceSetCacheConfig`
-  - `hipDeviceSetSharedMemConfig`
-  - `hipMemPoolCreate`
-  - `hipMemPoolDestroy`
-  - `hipDeviceSetMemPool`
-  - `hipEventQuery`
-
-* The implementation of `hipStreamAddCallback` is updated to match the behavior of CUDA.
-* Removal of `hiprtc` symbols from the HIP library.
-
-  - `hiprtc` will be an independent library, and all `hiprtc` symbols will be removed from the HIP library.
- - Any application using `hiprtc` APIs should link explicitly with `hiprtc` library. - - This change makes the use of `hiprtc` library on Linux the same as on Windows, and matches the behavior of CUDA `nvrtc`. - -* Removal of deprecated struct `HIP_MEMSET_NODE_PARAMS`, Developers can use definition `hipMemsetParams` instead. - -### **hipBLAS** (2.4.0) - -#### Changed - -* Updated the build dependencies. - -#### Resolved issues - -* Fixed the Windows reference library interface for rocSOLVER functions for hipBLAS clients. - -### **hipBLASLt** (0.12.0) - -#### Added - -* Support ROC-TX if `HIPBLASLT_ENABLE_MARKER=1` is set. -* Output the profile logging if `HIPBLASLT_LOG_MASK=64` is set. -* Support for the `FP16` compute type. -* Memory bandwidth information to the hipblaslt-bench output. -* Support the user offline tuning mechanism. -* More samples. - -#### Changed - -* Output the bench command along with the solution index if `HIPBLASLT_LOG_MASK=32` is set. - -#### Optimized - -* Improve the overall performance of the XF32/FP16/BF16/FP8/BF8 data types. -* Reduce the library size. - -#### Resolved issues - -* Fixed multi-threads bug. -* Fixed multi-streams bug. - -### **hipCUB** (3.4.0) - -#### Added - -* Added regression tests to `rtest.py`. These tests recreate scenarios that have caused hardware problems in past emulation environments. Use `python rtest.py [--emulation|-e|--test|-t]=regression` to run these tests. -* Added extended tests to `rtest.py`. These tests are extra tests that did not fit the criteria of smoke and regression tests. These tests will take much longer than smoke and regression tests. Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests. -* Added `ForEach`, `ForEachN`, `ForEachCopy`, `ForEachCopyN` and `Bulk` functions to have parity with CUB. -* Added the `hipcub::CubVector` type for CUB parity. -* Added `--emulation` option for `rtest.py` -* Unit tests can be run with `[--emulation|-e|--test|-t]=;`. -* Added `DeviceSelect::FlaggedIf` and its inplace overload. -* Added CUB macros missing from hipCUB: `HIPCUB_MAX`, `HIPCUB_MIN`, `HIPCUB_QUOTIENT_FLOOR`, `HIPCUB_QUOTIENT_CEILING`, `HIPCUB_ROUND_UP_NEAREST` and `HIPCUB_ROUND_DOWN_NEAREST`. -* Added `hipcub::AliasTemporaries` function for CUB parity. - -#### Changed - -* Removed usage of `std::unary_function` and `std::binary_function` in `test_hipcub_device_adjacent_difference.cpp`. -* Changed the subset of tests that are run for smoke tests such that the smoke test will complete with faster run time and never exceed 2 GB of VRAM usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests. -* The `rtest.py` options have changed. `rtest.py` is now run with at least either `--test|-t` or `--emulation|-e`, but not both options. -* The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.5.0. If it is not found, it will be downloaded from the NVIDIA CCCL repository. -* Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release. - -#### Known issues - -* When building on Microsoft Windows using HIP SDK for ROCm 6.4, ``hipMalloc`` returns ``hipSuccess`` even when the size passed to it is too large and the allocation fails. Because of this, limits have been set for the maximum test case sizes for some unit tests such as HipcubDeviceRadixSort's SortKeysLargeSizes . 
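-
-Given the Windows `hipMalloc` behavior described in the hipCUB known issue above, one cheap defensive pattern is to sanity-check the requested size against `hipMemGetInfo()` before allocating. This is only an illustrative sketch of that idea, not an official workaround.
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <cstdio>
-
-// Allocate device memory only if the request plausibly fits in free memory.
-// Returns nullptr when the request is refused or the allocation fails.
-void* checked_device_alloc(size_t bytes) {
-    size_t free_bytes = 0, total_bytes = 0;
-    if (hipMemGetInfo(&free_bytes, &total_bytes) != hipSuccess) return nullptr;
-
-    if (bytes > free_bytes) {
-        std::printf("refusing to allocate %zu bytes (only %zu of %zu free)\n",
-                    bytes, free_bytes, total_bytes);
-        return nullptr;
-    }
-
-    void* ptr = nullptr;
-    if (hipMalloc(&ptr, bytes) != hipSuccess) return nullptr;
-    return ptr;
-}
-
-int main() {
-    void* p = checked_device_alloc(size_t{1} << 20);  // 1 MiB
-    if (p) hipFree(p);
-    return 0;
-}
-```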
- -### **hipFFT** (1.0.18) - -#### Added - -* Implemented the `hipfftMpAttachComm`, `hipfftXtSetDistribution`, and `hipfftXtSetSubformatDefault` APIs to allow - computing FFTs that are distributed between multiple MPI (Message Passing Interface) processes. These APIs can be enabled - with the `HIPFFT_MPI_ENABLE` CMake option, which defaults to `OFF`. The backend FFT library called by hipFFT must support MPI for these APIs to work. - - The backend FFT library called by hipFFT must support MPI for these APIs to work. - -#### Changed - -* Building with the address sanitizer option sets xnack+ for the relevant GPU - architectures. -* Use the `find_package` CUDA toolkit instead of CUDA in CMake for modern CMake compatibility. -* The `AMDGPU_TARGETS` build variable should be replaced with `GPU_TARGETS`. `AMDGPU_TARGETS` is deprecated. - -#### Resolved issues - -* Fixed the client packages so they depend on hipRAND instead of rocRAND. - -### **hipfort** (0.6.0) - -#### Upcoming changes - -* The hipfc compiler wrapper has been deprecated and will be removed - in a future release. Users are encouraged to directly invoke their - Fortran or HIP compilers as appropriate for each source file. - -### **HIPIFY** (19.0.0) - -#### Added -* NVIDIA CUDA 12.6.3 support -* cuDNN 9.7.0 support -* cuTENSOR 2.0.2.1 support -* LLVM 19.1.7 support -* Full support for direct hipification of `cuRAND` into `rocRAND` under the `--roc` option. -* Support for `fp8` math device/host API. For more information see [#1617](https://github.com/ROCm/HIPIFY/issues/1617) in the HIPIFY Github repository. - -#### Resolved issues -* `MIOpen` support in hipify-perl under the `-miopen` option -* Use `const_cast` for the last arguments in the `hiprtcCreateProgram` and `hiprtcCompileProgram` function calls, as in CUDA, they are of the `const char* const*` type -* Support for `fp16` device/host API. For more information see [#1769](https://github.com/ROCm/HIPIFY/issues/1769) in the HIPIFY Github repository. -* Fixed instructions on building LLVM for HIPIFY on Linux. For more information see [#1800](https://github.com/ROCm/HIPIFY/issues/1800) in the HIPIFY Github repository. - -#### Known issues - -* `hipify-clang` build failure against LLVM 15-18 on `Ubuntu`, `CentOS`, and `Fedora`. For more information see [#833](https://github.com/ROCm/HIPIFY/issues/833) in the HIPIFY Github repository. - -### **hipRAND** (2.12.0) - -#### Changed - -* When building hipRAND on Windows, use `HIP_PATH` (instead of the former `HIP_DIR`) to specify the path to the HIP SDK installation. -* When building with the `rmake.py` script, `HIP_PATH` will default to `C:\hip` if it is not set. - -#### Resolved issues - -* Fixed an issue causing hipRAND build failures on Windows when the HIP SDK was installed in a location with a path that contains spaces. - -### **hipSOLVER** (2.4.0) - -#### Added - -* The `csrlsvqr` compatibility-only functions `hipsolverSpScsrlsvqr`, `hipsolverSpDcsrlsvqr`, `hipsolverSpCcsrlsvqr`, `hipsolverSpZcsrlsvqr` - -### **hipSPARSE** (3.2.0) - -#### Added - -* Added the `azurelinux` operating system name to correct the GFortran dependency. - -#### Optimized - -* Removed an unused `GTest` dependency from `hipsparse-bench`. - -### **hipSPARSELt** (0.2.3) - -#### Added - -* Support for alpha vector scaling - -#### Changed - -* The check mechanism of the inputs when using alpha vector scaling - -### **hipTensor** (1.5.0) - -#### Added - -* Added benchmarking suites for contraction, permutation, and reduction. 
YAML files are categorized into bench and validation folders for organization. -* Added emulation test suites for contraction, permutation, and reduction. -* Support has been added for changing the default data layout using the `HIPTENSOR_DEFAULT_STRIDES_COL_MAJOR` environment variable. - -#### Changed - -* `GPU_TARGETS` is now used instead of `AMDGPU_TARGETS` in `cmakelists.txt`. -* Binary sizes can be reduced on supported compilers by using the `--offload-compress` compiler flag. - -#### Optimized - -* Optimized the hyper-parameter selection algorithm for permutation. - -#### Resolved issues - -* For a CMake bug workaround, set `CMAKE_NO_BUILTIN_CHRPATH` when `BUILD_OFFLOAD_COMPRESS` is unset. - -#### Upcoming changes - -* hipTensor will enhance performance and usability while unifying the API design across all operations (elementwise, reductions, and tensor contractions), enabling consistent multi-stage execution and plan reuse. As part of this change, the API functions `hiptensorInitTensorDescriptor`, `hiptensorContractionDescriptor_t` , `hiptensorInitContractionDescriptor`, `hiptensorInitContractionFind`, `hiptensorContractionGetWorkspaceSize`, `hiptensorInitContractionPlan`, `hiptensorContraction`, `hiptensorElementwiseBinary`, `hiptensorElementwiseTrinary`, `hiptensorPermutation`, and `hiptensorReduction` will be deprecated in a future ROCm release. - -### **llvm-project** (19.0.0) - -#### Added - -* Support for `amdgpu_max_num_work_groups` in the compiler. This attribute - can be set by end users or library developers. It provides an upper limit - for workgroups as described in [AMD GPU Attributes](https://clang.llvm.org/docs/AttributeReference.html#amdgpu-max-num-work-groups). - When set, the AMDGPU target backend might produce better machine code. - -### **MIOpen** (3.4.0) - -#### Added - -* [Conv] Enabled tuning through the `miopenSetConvolutionFindMode` API. -* [RNN] Added the new algorithm type `miopenRNNroundedDynamic` for LSTM. -* [TunaNet] Enabled NHWC for AMD Instinct MI300. - -#### Optimized - -* Updated KernelTuningNet for CK solvers. - -#### Resolved issues - -* Fixed tuning timing results. -* Accuracy for ASM solvers. - -### **MIVisionX** (3.2.0) - -#### Changed - -* OpenCV is now installed with the package installer on Ubuntu. -* AMD Clang is now the default CXX and C compiler. -* The version of OpenMP included in the ROCm LLVM project is now used instead of `libomp-dev/devel`. - -#### Known issues - -* Installation on CentOS, RedHat, and SLES requires manually installing the `FFMPEG` and `OpenCV` dev packages. -* Hardware decode requires the ROCm `graphics` use case. - -#### Upcoming changes - -* Optimized audio augmentations support for VX_RPP - -### **rccl** (2.22.3) - -#### Added - -* Added the `RCCL_SOCKET_REUSEADDR` and `RCCL_SOCKET_LINGER` environment parameters. -* Setting `NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=VERBS` will generate traces for fifo and data `ibv_post_send` calls. -* Added the `--log-trace` flag to enable traces through the `install.sh` script (for example, `./install.sh --log-trace`). - -#### Changed - -* Changed compatibility to include NCCL 2.22.3. - -### **rocAL** (2.2.0) - -#### Changed - -* AMD Clang is now the default CXX and C compiler. - -#### Known issues - -* The package installation requires manually installing `TurboJPEG`. -* Installation on CentOS, RedHat, and SLES requires manually installing the `FFMPEG Dev` package. -* Hardware decode requires installing ROCm with the `graphics` use case. 
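-
-Tying back to the llvm-project entry above, the sketch below attaches `amdgpu_max_num_work_groups` to a HIP kernel so the backend knows the grid will never exceed 64 x 1 x 1 workgroups. The three-argument attribute spelling follows the Clang attribute reference; treat the exact syntax and any codegen benefit as assumptions to verify against your compiler version.
-
-```cpp
-#include <hip/hip_runtime.h>
-
-// Promise the compiler this kernel is never launched with more than
-// 64 x 1 x 1 workgroups, which may allow better machine code.
-__global__ void
-__attribute__((amdgpu_max_num_work_groups(64, 1, 1)))
-scale(float* data, float factor, int n) {
-    int i = blockIdx.x * blockDim.x + threadIdx.x;
-    if (i < n) data[i] *= factor;
-}
-
-int main() {
-    const int n = 1 << 16;
-    float* d = nullptr;
-    hipMalloc(&d, n * sizeof(float));
-    // Launch within the promised limit: 64 workgroups of 1024 threads.
-    hipLaunchKernelGGL(scale, dim3(64), dim3(1024), 0, 0, d, 2.0f, n);
-    hipDeviceSynchronize();
-    hipFree(d);
-    return 0;
-}
-```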
- -### **rocALUTION** (3.2.2) - -#### Changed - -* Improved documentation - -### **rocBLAS** (4.4.0) - -#### Added - -* Added ROC-TX support in rocBLAS (not available on Windows or in the static library version on Linux). -* On gfx12, all functions now support full `rocblas_int` dynamic range for `batch_count`. -* Added the `--ninja` build option. -* Added support for the `GPU_TARGETS` CMake variable. - -#### Changed - -* The rocblas-test client removes the stress tests unless YAML-based testing or `gtest_filter` adds them. -* OpenMP default threading for rocBLAS clients is reduced to less than the logical core count. -* `gemm_ex` testing and timing reuses device memory. -* `gemm_ex` timing initializes matrices on device. - -#### Optimized - -* Significantly reduced workspace memory requirements for Level 1 ILP64: `iamax` and `iamin`. -* Reduced the workspace memory requirements for Level 1 ILP64: `dot`, `asum`, and `nrm2`. -* Improved the performance of Level 2 gemv for the problem sizes (`TransA == N && m > 2*n`) and (`TransA == T`). -* Improved the performance of Level 3 syrk and herk for the problem size (`k > 500 && n < 4000`). - -#### Resolved issues - -* gfx12: `ger`, `geam`, `geam_ex`, `dgmm`, `trmm`, `symm`, `hemm`, ILP64 `gemm`, and larger data support. -* Added a `gfortran` package dependency for Azure Linux OS. -* Resolved outdated SLES operating system package dependencies (`cxxtools` and `joblib`) in `install.sh -d`. -* Fixed code object stripping for RPM packages. - -#### Upcoming changes - -* CMake variable `AMDGPU_TARGETS` is deprecated. Use `GPU_TARGETS` instead. - -### **ROCdbgapi** (0.77.2) - -#### Added - -* Support for generic code object targets: - - `gfx9-generic` - - `gfx9-4-generic` - - `gfx10-1-generic` - - `gfx10-3-generic` - - `gfx11-generic` - - `gfx12-generic` - -#### Changed - -* The name reported for detected agents is now based on the `amdgpu.ids` database provided by `libdrm`. - -### **rocDecode** (0.10.0) - -#### Added - -* The new bitstream reader feature has been added. The bitstream reader contains built-in stream file parsers, including an elementary stream file parser and an IVF container file parser. The reader can parse AVC, HEVC, and AV1 elementary stream files, and AV1 IVF container files. Additional supported formats will be added. -* VP9 support has been added. -* More CTests have been added: VP9 test and tests on video decode raw sample. -* Two new samples, videodecoderaw and videodecodepicfiles, have been added. videodecoderaw uses the bitstream reader instead of the FFMPEG demuxer to get picture data, and videodecodepicfiles shows how to decode an elementary video stream stored in multiple files, with each file containing bitstream data of a coded picture. - -#### Changed - -* AMD Clang++ is now the default CXX compiler. -* Moved MD5 code out of the RocVideoDecode utility. - -#### Removed - -* FFMPEG executable requirement for the package. - -### **rocFFT** (1.0.32) - -#### Changed - -* Building with the address sanitizer option sets xnack+ on the relevant GPU - architectures and adds address-sanitizer support to runtime-compiled - kernels. -* The `AMDGPU_TARGETS` build variable should be replaced with `GPU_TARGETS`. `AMDGPU_TARGETS` is deprecated. - -#### Removed - -* Ahead-of-time compiled kernels for the gfx906, gfx940, and gfx941 architectures. These architectures still work the same way, but their kernels are now compiled at runtime. -* Consumer GPU architectures from the precompiled kernel cache that ships with - rocFFT. 
rocFFT continues to ship with a cache of precompiled RTC kernels for data center - and workstation architectures. As before, user-level caches can be enabled by setting the - environment variable `ROCFFT_RTC_CACHE_PATH` to a writeable file location. - -#### Optimized - -* Improved MPI transform performance by using all-to-all communication for global transpose operations. - Point-to-point communications are still used when all-to-all is unavailable. -* Improved the performance of unit-strided, complex interleaved, forward, and inverse length (64,64,64) FFTs. - -#### Resolved issues - -* Fixed incorrect results from 2-kernel 3D FFT plans that used non-default output strides. For more information, see the [rocFFT GitHub issue](https://github.com/ROCm/rocFFT/issues/507). -* Plan descriptions can now be reused with different strides for different plans. For more information, see the [rocFFT GitHub issue](https://github.com/ROCm/rocFFT/issues/504). -* Fixed client packages to depend on hipRAND instead of rocRAND. -* Fixed potential integer overflows during large MPI transforms. - -### **ROCm Compute Profiler** (3.1.0) - -#### Added - -* Roofline support for Ubuntu 24.04. -* Experimental support `rocprofv3` (not enabled as default). - -#### Resolved issues - -* Fixed PoP of VALU Active Threads. -* Workaround broken mclk for old version of rocm-smi. - -### **ROCgdb** (15.2) - -#### Added - -- Support for debugging shaders compiled for the following generic targets: - - `gfx9-generic` - - `gfx9-4-generic` - - `gfx10-1-generic` - - `gfx10-3-generic` - - `gfx11-generic` - - `gfx12-generic` - -### **ROCm Data Center Tool** (0.3.0) - -#### Added - -* RDC policy feature -* Power and thermal throttling metrics -* RVS [IET](https://github.com/ROCm/ROCmValidationSuite/tree/a6177fc5e3f2679f98bbbc80dc536d535a43fb69/iet.so), [PEBB](https://github.com/ROCm/ROCmValidationSuite/tree/a6177fc5e3f2679f98bbbc80dc536d535a43fb69/pebb.so), and [memory bandwidth tests](https://github.com/ROCm/ROCmValidationSuite/tree/a6177fc5e3f2679f98bbbc80dc536d535a43fb69/babel.so) -* Link status -* RDC_FI_PROF_SM_ACTIVE metric - -#### Changed - -* Migrated from [ROCProfiler](https://github.com/ROCm/rocprofiler) to [ROCprofiler-SDK](https://github.com/ROCm/rocprofiler-sdk) -* Improved README.md for better usability -* Moved `rdc_options` into `share/rdc/conf/` - -#### Resolved issues - -* Fixed ABSL in clang18+ - -### **rocJPEG** (0.8.0) - -#### Changed - -* AMD Clang++ is now the default CXX compiler. -* The jpegDecodeMultiThreads sample has been renamed to jpegDecodePerf, and batch decoding has been added to this sample instead of single image decoding for improved performance. - -### **ROCm SMI** (7.5.0) - -#### Added - -- Added support for GPU metrics 1.7 to `rsmi_dev_gpu_metrics_info_get()`. - -- Added new GPU metrics 1.7 to `rocm-smi --showmetrics`. - -#### Resolved issues - -- Fixed `rsmi_dev_target_graphics_version_get`, `rocm-smi --showhw`, and `rocm-smi --showprod` not displaying graphics version correctly for Instinct MI200 Series, MI100 Series, and RDNA3-based GPUs. - -> [!NOTE] -> See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions. - -### **ROCm Systems Profiler** (1.0.0) - -#### Added - -- Support for VA-API and rocDecode tracing. -- Aggregation of MPI data collected across distributed nodes and ranks. The data is concatenated into a single proto file. 
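-
-As a small companion to the ROCm SMI entry above (GPU metrics 1.7 via `rsmi_dev_gpu_metrics_info_get()`), here is a hedged sketch that reads the metrics table for device 0. The `average_socket_power` field name is an assumption; check the `rsmi_gpu_metrics_t` definition in `rocm_smi.h` for the fields your library version exposes.
-
-```cpp
-#include <rocm_smi/rocm_smi.h>
-#include <cstdio>
-
-int main() {
-    if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;
-
-    uint32_t num_devices = 0;
-    rsmi_num_monitor_devices(&num_devices);
-
-    if (num_devices > 0) {
-        rsmi_gpu_metrics_t metrics{};
-        // Reads the full GPU metrics table (version 1.7 on supported ASICs).
-        if (rsmi_dev_gpu_metrics_info_get(0, &metrics) == RSMI_STATUS_SUCCESS) {
-            std::printf("average socket power: %u W\n",
-                        (unsigned)metrics.average_socket_power);
-        }
-    }
-
-    rsmi_shut_down();
-    return 0;
-}
-```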
- -#### Changed -- Backend refactored to use [ROCprofiler-SDK](https://github.com/ROCm/rocprofiler-sdk) rather than [ROCProfiler](https://github.com/ROCm/rocprofiler) and [ROCTracer](https://github.com/ROCm/ROCTracer). - -#### Resolved issues - -- Fixed hardware counter summary files not being generated after profiling. - -- Fixed an application crash when collecting performance counters with rocprofiler. - -- Fixed interruption in config file generation. - -- Fixed segmentation fault while running rocprof-sys-instrument. -- Fixed an issue where running `rocprof-sys-causal` or using the `-I all` option with `rocprof-sys-sample` caused the system to become non-responsive. - -- Fixed an issue where sampling multi-GPU Python workloads caused the system to stop responding. - -### **ROCm Validation Suite** (1.1.0) - -#### Added - -* Configuration files for MI210. -* Support for OCP fp8 data type. -* GPU index-based CLI execution. - -#### Changed - -* JSON logging with updated schema. - -### **rocPRIM** (3.4.0) - -#### Added - -* The parallel `find_first_of` device function with autotuned configurations has been added. This function is similar to `std::find_first_of`. It searches for the first occurrence of any of the provided elements. -* Tuned configurations for segmented radix sort for gfx942 have been added to improve performance on the gfx942 architecture. -* The parallel device-level function, `rocprim::adjacent_find`, which is similar to the C++ Standard Library `std::adjacent_find` algorithm, has been added. -* Configuration autotuning has been added to device adjacent find (`rocprim::adjacent_find`) for improved performance on selected architectures. -* `rocprim::numeric_limits` has been added. This is an extension of `std::numeric_limits` that supports 128-bit integers. -* `rocprim::int128_t` and `rocprim::uint128_t` have been added. -* The parallel `search` and `find_end` device functions have been added. These are similar to `std::search` and `std::find_end`. These functions search for the first and last occurrence of the sequence, respectively. -* A parallel device-level function, `rocprim::search_n`, has been added. `rocprim::search_n` is similar to the C++ Standard Library `std::search_n` algorithm. -* New constructors, a `base` function, and a `constexpr` specifier have been added to all functions in `rocprim::reverse_iterator` to improve parity with the C++17 `std::reverse_iterator`. -* hipGraph support has been added to the device run-length-encode for non-trivial runs (`rocprim::run_length_encode_non_trivial_runs`). -* Configuration autotuning has been added to the device run-length-encode for non-trivial runs (`rocprim::run_length_encode_non_trivial_runs`) for improved performance on selected architectures. -* Configuration autotuning has been added to the device run-length-encode for trivial runs (`rocprim::run_length_encode`) for improved performance on selected architectures. -* The `--emulation` option has been added to `rtest.py`. Unit tests can be run with `python rtest.py [--emulation|-e|--test|-t]=`. -* Extended and regression tests have been added to `rtest.py`. Extended tests are tests that don't fit the criteria of smoke or regression tests, and take longer than smoke or regression tests to run. Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run extended tests, and `python rtest.py [--emulation|-e|--test|-t]=regression` to run regression tests. 
-* Added a new type traits interface to enable users to provide additional type trait information to rocPRIM, facilitating better compatibility with custom types. - -#### Changed - -* Changed the subset of tests that are run for smoke tests such that the smoke test will complete faster and never exceed 2 GB of VRAM usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests. -* The `rtest.py` options have changed. `rtest.py` is now run with at least either `--test|-t` or `--emulation|-e`, but not both options. -* Changed the internal algorithm of block radix sort to use a rank match. This improves the performance of various radix sort-related algorithms. -* Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts. -* The C++ version has changed from 14 to 17. C++14 will be deprecated in the next major release. -* You can use CMake HIP language support with CMake 3.18 and later. To use HIP language support, run `cmake` with `-DUSE_HIPCXX=ON` instead of setting the `CXX` variable to the path to a HIP-aware compiler. - -#### Removed - -* HIP-CPU support - -#### Resolved issues - -* Fixed an issue where `rmake.py` generated incorrect cmake commands in a Linux environment. -* Fixed an issue where `rocprim::partial_sort_copy` would yield a compile error if the input iterator was a const. -* Fixed incorrect 128-bit signed and unsigned integer type traits. -* Fixed a compilation issue when `rocprim::radix_key_codec<...>` is specialized with a 128-bit integer. -* Fixed the warp-level reduction `rocprim::warp_reduce.reduce` DPP implementation to avoid undefined intermediate values during the reduction. -* Fixed an issue that caused a segmentation fault when `hipStreamLegacy` was passed to certain API functions. - -#### Upcoming changes - -* Using the initialization constructor of `rocprim::reverse_iterator` will throw a deprecation warning. It will be marked as explicit in the next major release. - -### **ROCProfiler** (2.0.0) - -#### Added -* Ops 16, 32, and 64 metrics for RDC. -* Tool deprecation message for ROCProfiler and ROCProfilerV2. - -#### Changed -* Updated README for kernel filtration. - -#### Resolved issues - -* Fixed the program crash issue due to invalid UTF-8 characters in a trace log. - -### **ROCprofiler-SDK** (0.6.0) - -#### Added - -* Support for `select()` operation in counter expression. -* `reduce()` operation for counter expression with respect to dimension. -* `--collection-period` feature in `rocprofv3` to enable filtering using time. -* `--collection-period-unit` feature in `rocprofv3` to control time units used in the collection period option. -* Deprecation notice for ROCProfiler and ROCProfilerV2. -* Support for rocDecode API Tracing. -* Usage documentation for ROCTx. -* Usage documentation for MPI applications. -* SDK: `rocprofiler_agent_v0_t` support for agent UUIDs. -* SDK: `rocprofiler_agent_v0_t` support for agent visibility based on gpu isolation environment variables such as `ROCR_VISIBLE_DEVICES` and so on. -* Accumulation VGPR support for `rocprofv3`. -* Host-trap based PC sampling support for `rocprofv3`. -* Support for OpenMP tool. - -### **rocPyDecode** (0.3.1) - -#### Added - -* VP9 support - -#### Changed - -* AMD Clang is now the default CXX and C compiler. - -#### Removed - -* All MD5 functionality, APIs, and sample code have been removed. - -#### Resolved issues - -* Ubuntu 24.04 compile failure with FFmpeg version 5.X and above has been fixed. 
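-
-Circling back to the rocPRIM additions listed earlier (`rocprim::int128_t`, `rocprim::uint128_t`, and the extended `rocprim::numeric_limits`), the snippet below shows the intended host-side usage. It assumes the umbrella `<rocprim/rocprim.hpp>` header pulls in the new types (compile with a HIP-aware compiler such as `amdclang++` or `hipcc`) and that the extended limits expose the usual `max()` member.
-
-```cpp
-#include <rocprim/rocprim.hpp>
-#include <cstdint>
-#include <iostream>
-#include <limits>
-
-int main() {
-    // 128-bit integer aliases introduced alongside the extended numeric_limits.
-    rocprim::int128_t  smax = rocprim::numeric_limits<rocprim::int128_t>::max();
-    rocprim::uint128_t umax = rocprim::numeric_limits<rocprim::uint128_t>::max();
-
-    // 128-bit values can't be streamed directly, so just demonstrate comparisons.
-    bool wider_than_64 =
-        smax > rocprim::int128_t(std::numeric_limits<std::int64_t>::max()) &&
-        umax > rocprim::uint128_t(std::numeric_limits<std::uint64_t>::max());
-
-    std::cout << "128-bit limits exceed 64-bit limits: "
-              << std::boolalpha << wider_than_64 << '\n';
-    return 0;
-}
-```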
- -### **rocRAND** (3.3.0) - -#### Added - -* Extended tests to `rtest.py`. These tests are extra tests that did not fit the criteria of smoke and regression tests. They take much longer to run relative to smoke and regression tests. Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests. -* Added regression tests to `rtest.py`. These tests recreate scenarios that have caused hardware problems in past emulation environments. Use `python rtest.py [--emulation|-e|--test|-t]=regression` to run these tests. -* Added smoke test options, which run a subset of the unit tests and ensure that less than 2 GB of VRAM will be used. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests. -* The `--emulation` option for `rtest.py`. - -#### Changed - -* `--test|-t` is no longer a required flag for `rtest.py`. Instead, the user can use either `--emulation|-e` or `--test|-t`, but not both. -* Removed the TBB dependency for multi-core processing of host-side generation. - -### **ROCr Debug Agent** (2.0.4) - -#### Added - -* Functionality to print the associated kernel name for each wave. - -### **ROCr Runtime** (1.15.0) - -#### Added - -* Support for asynchronous scratch reclaim on AMD Instinct MI300X GPUs. Asynchronous scratch reclaim allows scratch memory that was assigned to Command Processor(cp) queues to be reclaimed back in case the application runs out of device memory or if the `hsa_amd_agent_set_async_scratch_limit` API is called with the threshold parameter as 0. - -### **rocSOLVER** (3.28.0) - -#### Added - -* Application of a sequence of plane rotations to a given matrix for LASR -* Algorithm selection mechanism for hybrid computation -* Hybrid computation support for existing routines: - - BDSQR - - GESVD - -#### Optimized - -* Improved the performance of SYEVJ. -* Improved the performance of GEQRF. - -### **rocSPARSE** (3.4.0) - -#### Added - -* Added support for `rocsparse_matrix_type_triangular` in `rocsparse_spsv`. -* Added test filters `smoke`, `regression`, and `extended` for emulation tests. -* Added `rocsparse_[s|d|c|z]csritilu0_compute_ex` routines for iterative ILU. -* Added `rocsparse_[s|d|c|z]csritsv_solve_ex` routines for iterative triangular solve. -* Added `GPU_TARGETS` to replace the now deprecated `AMDGPU_TARGETS` in CMake files. -* Added BSR format to the SpMM generic routine `rocsparse_spmm`. - -#### Changed - -* By default, the rocSPARSE shared library is built using the `--offload-compress` compiler option which compresses the fat binary. This significantly reduces the shared library binary size. - -#### Optimized - -* Improved the performance of `rocsparse_spmm` when used with row order for `B` and `C` dense matrices and the row split algorithm `rocsparse_spmm_alg_csr_row_split`. -* Improved the adaptive CSR sparse matrix-vector multiplication algorithm when the sparse matrix has many empty rows at the beginning or at the end of the matrix. This improves the routines `rocsparse_spmv` and `rocsparse_spmv_ex` when the adaptive algorithm `rocsparse_spmv_alg_csr_adaptive` is used. -* Improved stream CSR sparse matrix-vector multiplication algorithm when the sparse matrix size (number of rows) decreases. This improves the routines `rocsparse_spmv` and `rocsparse_spmv_ex` when the stream algorithm `rocsparse_spmv_alg_csr_stream` is used. -* Compared to `rocsparse_[s|d|c|z]csritilu0_compute`, the routines `rocsparse_[s|d|c|z]csritilu0_compute_ex` introduce several free iterations. 
A free iteration is an iteration that does not compute the evaluation of the stopping criteria, if enabled. This allows the user to tune the algorithm for performance improvements. -* Compared to `rocsparse_[s|d|c|z]csritsv_solve`, the routines `rocsparse_[s|d|c|z]csritsv_solve_ex` introduce several free iterations. A free iteration is an iteration that does not compute the evaluation of the stopping criteria. This allows the user to tune the algorithm for performance improvements. -* Improved the user documentation. - -#### Resolved issues - -* Fixed an issue in `rocsparse_spgemm`, `rocsparse_[s|d|c|z]csrgemm`, and `rocsparse_[s|d|c|z]bsrgemm` where incorrect results could be produced when rocSPARSE was built with optimization level `O0`. This was caused by a bug in the hash tables that could allow keys to be inserted twice. -* Fixed an issue in the routine `rocsparse_spgemm` when using `rocsparse_spgemm_stage_symbolic` and `rocsparse_spgemm_stage_numeric`, where the routine would crash when `alpha` and `beta` were passed as host pointers and where `beta != 0`. -* Fixed an issue in `rocsparse_bsrilu0`, where the algorithm was running out of bounds of the `bsr_val` array. - -#### Upcoming changes - -* Deprecated the `rocsparse_[s|d|c|z]csritilu0_compute` routines. Users should use the newly added `rocsparse_[s|d|c|z]csritilu0_compute_ex` routines going forward. -* Deprecated the `rocsparse_[s|d|c|z]csritsv_solve` routines. Users should use the newly added `rocsparse_[s|d|c|z]csritsv_solve_ex` routines going forward. -* Deprecated the use of `AMDGPU_TARGETS` in CMake files. Users should use `GPU_TARGETS` going forward. - -### **ROCTracer** (4.1.0) - -#### Added - -* Tool deprecation message for ROCTracer. - -### **rocThrust** (3.3.0) - -#### Added - -* Added a section to install Thread Building Block (TBB) inside `cmake/Dependencies.cmake` if TBB is not already available. -* Made TBB an optional dependency with the new `BUILD_HIPSTDPAR_TEST_WITH_TBB` flag. When the flag is `OFF` and TBB is not already on the machine, it will compile without TBB. Otherwise, it will compile with TBB. -* Added extended tests to `rtest.py`. These tests are extra tests that did not fit the criteria of smoke and regression tests. These tests will take much longer than smoke and regression tests. Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests. -* Added regression tests to `rtest.py`. These tests recreate scenarios that have caused hardware problems in past emulation environments. Use `python rtest.py [--emulation|-e|--test|-t]=regression` to run these tests. -* Added smoke test options, which run a subset of the unit tests and ensure that less than 2 GB of VRAM will be used. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests. -* Added `--emulation` option for `rtest.py` -* Merged changes from upstream CCCL/thrust 2.4.0 and CCCL/thrust 2.5.0. -* Added `find_first_of`, `find_end`, `search`, and `search_n` to HIPSTDPAR. -* Updated HIPSTDPAR's `adjacent_find` to use the rocPRIM implementation. - -#### Changed - -* Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release. -* `--test|-t` is no longer a required flag for `rtest.py`. Instead, the user can use either `--emulation|-e` or `--test|-t`, but not both. -* Split HIPSTDPAR's forwarding header into several implementation headers. -* Fixed `copy_if` to work with large data types (512 bytes). 
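-
-Since the rocThrust entry above calls out the `copy_if` fix for large value types, here is a small self-contained example of the call pattern that the fix covers: a device-side `thrust::copy_if` with a stencil. The element type here is deliberately tiny; only the shape of the call matters.
-
-```cpp
-#include <thrust/copy.h>
-#include <thrust/device_vector.h>
-#include <thrust/execution_policy.h>
-#include <iostream>
-#include <vector>
-
-struct is_nonzero {
-    __host__ __device__ bool operator()(int flag) const { return flag != 0; }
-};
-
-int main() {
-    std::vector<int> h_values{10, 20, 30, 40, 50};
-    std::vector<int> h_flags{1, 0, 1, 0, 1};
-
-    thrust::device_vector<int> values(h_values.begin(), h_values.end());
-    thrust::device_vector<int> flags(h_flags.begin(), h_flags.end());
-    thrust::device_vector<int> out(values.size());
-
-    // Copy the values whose corresponding flag is nonzero.
-    auto end = thrust::copy_if(thrust::device,
-                               values.begin(), values.end(),
-                               flags.begin(),   // stencil
-                               out.begin(),
-                               is_nonzero{});
-
-    std::cout << "kept " << (end - out.begin()) << " elements\n";  // kept 3 elements
-    return 0;
-}
-```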
- -#### Known issues - -* `thrust::inclusive_scan_by_key` might produce incorrect results when it's used with -O2 or -O3 optimization. This is caused by a recent compiler change and a fix will be made available at a later date. - -### **rocWMMA** (1.7.0) - -#### Added - -* Added interleaved layouts that enhance the performance of GEMM operations. -* Emulation test suites. These suites are lightweight and well suited for execution on emulator platforms. - -#### Changed - -* Used `GPU_TARGETS` instead of `AMDGPU_TARGETS` in `cmakelists.txt`. -* Binary sizes can be reduced on supported compilers by using the `--offload-compress` compiler flag. - -#### Resolved issues - -* For a CMake bug workaround, set `CMAKE_NO_BUILTIN_CHRPATH` when `BUILD_OFFLOAD_COMPRESS` is unset. - -#### Upcoming changes - -* rocWMMA will augment the fragment API objects with additional meta-properties that improve API expressiveness and configurability of parameters including multiple-wave cooperation. As part of this change, cooperative rocWMMA API functions `load_matrix_coop_sync` and `store_matrix_coop_sync` will be deprecated in a future ROCm release. - -### **rpp** (1.9.10) - -#### Added - -* RPP Tensor Gaussian Filter and Tensor Box Filter support on HOST (CPU) backend. -* RPP Fog and Rain augmentation on HOST (CPU) and HIP backends. -* RPP Warp Perspective on HOST (CPU) and HIP backends. -* RPP Tensor Bitwise-XOR support on HOST (CPU) and HIP backends. -* RPP Threshold on HOST (CPU) and HIP backends. -* RPP Audio Support for Spectrogram and Mel Filter Bank on HIP backend. - -#### Changed - -* AMD Clang is now the default CXX and C compiler. -* AMD RPP can now pass HOST (CPU) build with g++. -* Test Suite case numbers have been replaced with ENUMs for all augmentations to enhance test suite readability. -* Test suite updated to return error codes from RPP API and display them. - -#### Resolved issues - -* CXX Compiler: Fixed HOST (CPU) g++ issues. -* Deprecation warning fixed for the `sprintf is deprecated` warning. -* Test suite build fix - RPP Test Suite Pre-requisite instructions updated to lock to a specific `nifti_clib` commit. -* Fixed broken image links for pixelate and jitter. - -### **Tensile** (4.43.0) - -#### Added - -- Nightly builds with performance statistics. -- ASM cache capabilities for reuse. -- Virtual environment (venv) for `TensileCreateLibrary` invocation on Linux. -- Flag to keep `build_tmp` when running Tensile. -- Generalized profiling scripts. -- Support for gfx1151. -- Single-threaded support in `TensileCreateLibrary`. -- Logic to remove temporary build artifacts. - -#### Changed - -- Disabled ASM cache for tests. -- Replaced Perl script with `hipcc.bat` as a compiler on Microsoft Windows. -- Improved CHANGELOG.md. -- Enabled external CI. -- Improved Tensile documentation. -- Refactored kernel source and header creation. -- Refactored `writeKernels` in `TensileCreateLibrary`. -- Suppressed developer warnings to simplify the Tensile output. -- Introduced an explicit cast when invoking `min`. -- Introduced cache abbreviations to compute kernel names. - -#### Removed - -- OCL backend -- Unsupported tests -- Deep copy in `TensileCreateLibrary` - -#### Optimized - -- Linearized ASM register search to reduce build time. - -#### Resolved issues - -- Fixed Stream-K dynamic grid model. -- Fixed logic related to caching ASM capabilities. -- Fixed `accvgpr` overflow. -- Fixed test failures in SLES containers when running `TensileTests`. 
-- Fixed a regression that prevents `TensileCreateLibrary` from completing when fallback logic is not available. - -## ROCm 6.3.3 - -See the [ROCm 6.3.3 release notes](https://rocm.docs.amd.com/en/docs-6.3.3/about/release-notes.html) -for a complete overview of this release. - -### **ROCm Systems Profiler** (0.1.2) - -#### Resolved issues - -* Fixed an error that prevented GPU hardware activity from being presented in certain workloads. - -## ROCm 6.3.2 - -See the [ROCm 6.3.2 release notes](https://rocm.docs.amd.com/en/docs-6.3.2/about/release-notes.html) -for a complete overview of this release. - -### **HIP** (6.3.2) - -#### Added - -* Tracking of Heterogeneous System Architecture (HSA) handlers: - - Adds an atomic counter to track the outstanding HSA handlers. - - Waits on CPU for the callbacks if the number exceeds the defined value. -* Codes to capture Architected Queueing Language (AQL) packets for HIP graph memory copy node between host and device. HIP enqueues AQL packets during graph launch. -* Control to use system pool implementation in runtime commands handling. By default, it is disabled. -* A new path to avoid `WaitAny` calls in `AsyncEventsLoop`. The new path is selected by default. -* Runtime control on decrement counter only if the event is popped. There is a new way to restore dead signals cleanup for the old path. -* A new logic in runtime to track the age of events from the kernel mode driver. - -#### Optimized - -* HSA callback performance. The HIP runtime creates and submits commands in the queue and interacts with HSA through a callback function. HIP waits for the CPU status from HSA to optimize the handling of events, profiling, commands, and HSA signals for higher performance. -* Runtime optimization which combines all logic of `WaitAny` in a single processing loop and avoids extra memory allocations or reference counting. The runtime won't spin on the CPU if all events are busy. -* Multi-threaded dispatches for performance improvement. -* Command submissions and processing between CPU and GPU by introducing a way to limit the software batch size. -* Switch to `std::shared_mutex` in book/keep logic in streams from multiple threads simultaneously, for performance improvement in specific customer applications. -* `std::shared_mutex` is used in memory object mapping, for performance improvement. - -#### Resolved issues - -* Race condition in multi-threaded producer/consumer scenario with `hipMallocFromPoolAsync`. -* Segmentation fault with `hipStreamLegacy` while using the API `hipStreamWaitEvent`. -* Usage of `hipStreamLegacy` in HIP event record. -* A soft hang in graph execution process from HIP user object. The fix handles the release of graph execution object properly considering synchronization on the device/stream. The user application now behaves the same with `hipUserObject` on both the AMD ROCm and NVIDIA CUDA platforms. - -### **hipfort** (0.5.1) - -#### Added - -* Support for building with LLVM Flang. - -#### Resolved issues - -* Fixed the exported `hipfort::hipsparse` CMake target. - -### **ROCm Systems Profiler** (0.1.1) - -#### Resolved issues - -* Fixed an error when building from source on some SUSE and RHEL systems when using the `ROCPROFSYS_BUILD_DYNINST` option. - -### **ROCProfiler** (2.0.0) - -#### Changed - -* Replaced `CU_UTILIZATION` metric with `SIMD_UTILIZATION` for better accuracy. - -#### Resolved issues - -* Fixed the `VALUBusy` and `SALUBusy` activity metrics for accuracy on MI300. 
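-
-For context on the `hipMallocFromPoolAsync` race fix mentioned in the HIP 6.3.2 notes above, this is the basic stream-ordered allocation pattern that the fix concerns: allocate from the device's default memory pool on a stream, use the memory, and free it on the same stream. A minimal sketch with error handling trimmed for brevity.
-
-```cpp
-#include <hip/hip_runtime.h>
-
-__global__ void fill(int* p, int v, size_t n) {
-    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
-    if (i < n) p[i] = v;
-}
-
-int main() {
-    hipStream_t stream;
-    hipStreamCreate(&stream);
-
-    hipMemPool_t pool;
-    hipDeviceGetDefaultMemPool(&pool, /*device=*/0);
-
-    const size_t n = 1 << 20;
-    int* data = nullptr;
-    // Stream-ordered allocation from the default pool.
-    hipMallocFromPoolAsync(reinterpret_cast<void**>(&data), n * sizeof(int), pool, stream);
-
-    hipLaunchKernelGGL(fill, dim3((n + 255) / 256), dim3(256), 0, stream, data, 42, n);
-
-    // Stream-ordered free; ordering against the kernel is guaranteed by the stream.
-    hipFreeAsync(data, stream);
-    hipStreamSynchronize(stream);
-    hipStreamDestroy(stream);
-    return 0;
-}
-```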
- -### **ROCprofiler-SDK** (0.5.0) - -#### Added - -* Support for system-wide collection of SQ counters across all HSA processes. - -#### Changed - -* `rocprofiler_sample_device_counting_service` API updated to return counter output immediately, when called in synchronous mode. - -## ROCm 6.3.1 - -See the [ROCm 6.3.1 release notes](https://rocm.docs.amd.com/en/docs-6.3.1/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (24.7.1) - -#### Changed - -* `amd-smi monitor` displays `VCLOCK` and `DCLOCK` instead of `ENC_CLOCK` and `DEC_CLOCK`. - -#### Resolved issues - -* Fixed `amd-smi monitor`'s reporting of encode and decode information. `VCLOCK` and `DCLOCK` are - now associated with both `ENC_UTIL` and `DEC_UTIL`. - -> [!NOTE] -> See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/6.3.x/CHANGELOG.md) for more details and examples. - -### **HIP** (6.3.1) - -#### Added - -* An activeQueues set that tracks only the queues that have a command submitted to them, which allows fast iteration in ``waitActiveStreams``. - -#### Resolved issues - -* A deadlock in a specific customer application by preventing hipLaunchKernel latency degradation with number of idle streams. - -### **HIPIFY** (18.0.0) - -#### Added - -* Support for: - * NVIDIA CUDA 12.6.2 - * cuDNN 9.5.1 - * LLVM 19.1.3 - * Full `hipBLAS` 64-bit APIs - * Full `rocBLAS` 64-bit APIs - -#### Resolved issues - -* Added missing support for device intrinsics and built-ins: `__all_sync`, `__any_sync`, `__ballot_sync`, `__activemask`, `__match_any_sync`, `__match_all_sync`, `__shfl_sync`, `__shfl_up_sync`, `__shfl_down_sync`, and `__shfl_xor_sync`. - -### **MIVisionX** (3.1.0) - -#### Changed - -* AMD Clang is now the default CXX and C compiler. -* The dependency on rocDecode has been removed and automatic rocDecode installation is now disabled in the setup script. - -#### Resolved issues - -* Canny failure on Instinct MI300 has been fixed. -* Ubuntu 24.04 CTest failures have been fixed. - -#### Known issues - -* CentOS, Red Hat, and SLES requires the manual installation of `OpenCV` and `FFMPEG`. -* Hardware decode requires that ROCm is installed with `--usecase=graphics`. - -#### Upcoming changes - -* Optimized audio augmentations support for VX_RPP. - -### **RCCL** (2.21.5) - -#### Changed - -* Enhanced the user documentation. - -#### Resolved Issues - -* Corrected some user help strings in `install.sh`. - -### **ROCm Compute Profiler** (3.0.0) - -#### Resolved issues - -* Fixed a minor issue for users upgrading to ROCm 6.3 from 6.2 post-rename from `omniperf`. - -### **ROCm Systems Profiler** (0.1.0) - -#### Added - -* Improvements to support OMPT target offload. - -#### Resolved issues - -* Fixed an issue with generated Perfetto files. See [issue #3767](https://github.com/ROCm/ROCm/issues/3767) for more information. - -* Fixed an issue with merging multiple `.proto` files. - -* Fixed an issue causing GPU resource data to be missing from traces of Instinct MI300A systems. - -* Fixed a minor issue for users upgrading to ROCm 6.3 from 6.2 post-rename from `omnitrace`. - -### **ROCprofiler-SDK** (0.5.0) - -#### Added - -* SIMD_UTILIZATION metric. -* New ROCm Data Center (RDC) ops metrics. - -## ROCm 6.3.0 - -See the [ROCm 6.3.0 release notes](https://rocm.docs.amd.com/en/docs-6.3.0/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (24.7.1) - -#### Added - -- Support for `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs. 
- -- Support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()` - -- New violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`. This feature is only available on MI300+ ASICs - -- Ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`. Partition-specific features are only available on MI300+ ASICs - -- Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value. This feature is only available on MI300+ ASICs - -- Ability to retrieve a set of GPUs that are nearest to a given device at a specific link type level - - Added `amdsmi_get_link_topology_nearest()` function to amd-smi C and Python Libraries. - -- More supported utilization count types to `amdsmi_get_utilization_count()` - -- `amd-smi set -L/--clk-limit ...` command. This is equivalent to rocm-smi's `--extremum` command which sets sclk's or mclk's soft minimum or soft maximum clock frequency. - -- Unittest functionality to test `amdsmi` API calls in Python - -- GPU memory overdrive percentage to `amd-smi metric -o` - - Added `amdsmi_get_gpu_mem_overdrive_level()` function to AMD SMI C and Python Libraries. - -- Ability to retrieve connection type and P2P capabilities between two GPUs - - Added `amdsmi_topo_get_p2p_status()` function to amd-smi C and Python Libraries. - - Added retrieving P2P link capabilities to CLI `amd-smi topology`. - -- New `amdsmi_kfd_info_t` type and added information under `amd-smi list` - -- Subsystem device ID to `amd-smi static --asic`. There are no underlying changes to `amdsmi_get_gpu_asic_info`. - -- `Target_Graphics_Version` to `amd-smi static --asic` and `amdsmi_get_gpu_asic_info()`. - -#### Changed - -- Updated BDF commands to use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`. This change aligns BDF output with ROCm SMI. - -- Moved Python tests directory path install location. - - `/opt//share/amd_smi/pytest/..` to `/opt//share/amd_smi/tests/python_unittest/..` - - Removed PyTest dependency. Python testing now depends on the unittest framework only. - -- Changed the `power` parameter in `amdsmi_get_energy_count()` to `energy_accumulator`. - - Changes propagate forwards into the Python interface as well. Backwards compatibility is maintained. - -- Updated Partition APIs and struct information and added `partition_id` to `amd-smi static --partition`. - - As part of an overhaul to partition information, some partition information will be made available in the `amdsmi_accelerator_partition_profile_t`. - - This struct will be filled out by a new API, `amdsmi_get_gpu_accelerator_partition_profile()`. - - Future data from these APIs will eventually be added to `amd-smi partition`. - -#### Removed - -- `amd-smi reset --compute-partition` and `... --memory-partition` and associated APIs - - This change is part of the partition redesign. Reset functionality will be reintroduced in a later update. - - Associated APIs include `amdsmi_reset_gpu_compute_partition()` and `amdsmi_reset_gpu_memory_partition()` - -- Usage of `_validate_positive` is removed in parser and replaced with `_positive_int` and `_not_negative_int` as appropriate. - - This will allow `0` to be a valid input for several options in setting CPUs where appropriate (for example, as a mode or NBIOID). 
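-
-To make the `amdsmi_get_energy_count()` parameter rename above concrete, here is a hedged helper that reads the accumulated energy for one device handle (obtained through the usual socket/processor enumeration, as in the earlier AMD SMI sketch). The signature and the accumulator-times-resolution interpretation are assumptions to verify against your installed `amdsmi.h`.
-
-```cpp
-#include <amd_smi/amdsmi.h>
-#include <cstdio>
-
-// Assumed signature: amdsmi_get_energy_count(handle, &energy_accumulator,
-//                                            &counter_resolution, &timestamp).
-void print_energy(amdsmi_processor_handle dev) {
-    uint64_t energy_accumulator = 0;
-    float counter_resolution = 0.0f;
-    uint64_t timestamp = 0;
-
-    amdsmi_status_t st = amdsmi_get_energy_count(dev, &energy_accumulator,
-                                                 &counter_resolution, &timestamp);
-    if (st == AMDSMI_STATUS_SUCCESS) {
-        // Energy is typically accumulator * resolution (in micro-joules); verify
-        // the units against the header for your release.
-        std::printf("energy: %.1f uJ (timestamp %llu)\n",
-                    energy_accumulator * counter_resolution,
-                    (unsigned long long)timestamp);
-    }
-}
-```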

#### Optimized

- Adjusted the ordering of `gpu_metrics` calls to ensure that `pcie_bw` values remain stable in `amd-smi metric` and `amd-smi monitor`.
  - With this change, additional padding was added to `PCIE_BW` in `amd-smi monitor --pcie`.

#### Known issues

- See [AMD SMI manual build issue](https://rocm.docs.amd.com/en/docs-6.3.0/about/release-notes.html#amd-smi-manual-build-issue).

#### Resolved issues

- Improved the offline install process and lowered the dependency requirement for PyYAML.

- Fixed CPX not showing the total number of logical GPUs.

- Fixed an incorrect implementation of the Python API `amdsmi_get_gpu_metrics_header_info()`.

- `amdsmitst` `TestGpuMetricsRead` now prints metrics in the correct units.

#### Upcoming changes

- The Python API for `amdsmi_get_energy_count()` will deprecate the `power` field in a future ROCm release and use the `energy_accumulator` field instead.

- New memory and compute partition APIs will be added in a future ROCm release.
  - These APIs will be updated to fully populate the CLI and allow compute (accelerator) partitions to be set by profile ID.
  - One API will be provided to reset both memory and compute (accelerator) partitions.
  - The following APIs will remain:

    ```C
    amdsmi_status_t
    amdsmi_set_gpu_compute_partition(amdsmi_processor_handle processor_handle,
                                     amdsmi_compute_partition_type_t compute_partition);
    amdsmi_status_t
    amdsmi_get_gpu_compute_partition(amdsmi_processor_handle processor_handle,
                                     char *compute_partition, uint32_t len);
    amdsmi_status_t
    amdsmi_get_gpu_memory_partition(amdsmi_processor_handle processor_handle,
                                    char *memory_partition, uint32_t len);
    amdsmi_status_t
    amdsmi_set_gpu_memory_partition(amdsmi_processor_handle processor_handle,
                                    amdsmi_memory_partition_type_t memory_partition);
    ```

- `amd-smi set --compute-partition "SPX/DPX/CPX..."` will no longer be supported in a future ROCm release.
  - This aligns with Host setups and provides more robust partition information through the APIs outlined above. Furthermore, the new APIs, which will be available on both bare metal and Host setups, can set partitions by profile ID.

- Added a preliminary `amd-smi partition` command.
  - The new partition command can display GPU information, including memory and accelerator partition information.
  - The command will be at full functionality once additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()` has been implemented.

> [!NOTE]
> See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/6.3.x/CHANGELOG.md) for more details and examples.

### **HIP** (6.3.0)

#### Added

* New HIP APIs:
  - `hipGraphExecGetFlags` returns the flags on an executable graph.
  - `hipGraphNodeSetParams` updates the parameters of a created node.
  - `hipGraphExecNodeSetParams` updates the parameters of a created node on an executable graph.
  - `hipDrvGraphMemcpyNodeGetParams` gets a memcpy node's parameters.
  - `hipDrvGraphMemcpyNodeSetParams` sets a memcpy node's parameters.
  - `hipDrvGraphAddMemFreeNode` creates a memory free node and adds it to a graph.
  - `hipDrvGraphExecMemcpyNodeSetParams` sets the parameters for a memcpy node in the given graphExec.
  - `hipDrvGraphExecMemsetNodeSetParams` sets the parameters for a memset node in the given graphExec.

#### Changed

* Un-deprecated HIP APIs:
  - `hipHostAlloc`
  - `hipFreeHost`

#### Optimized

* Disabled CPU wait in device synchronize to avoid idle time in applications such as Hugging Face models and PyTorch.
-* Optimized multi-threaded dispatches to improve performance. -* Limited the software batch size to control the number of command submissions for runtime to handle efficiently. -* Optimizes HSA callback performance when a large number of events are recorded by multiple threads and submitted to multiple GPUs. - -#### Resolved issues - -* Soft hang in runtime wait event when run TensorFlow. -* Memory leak in the API `hipGraphInstantiate` when kernel is launched using `hipExtLaunchKernelGGL` with event. -* Memory leak when the API `hipGraphAddMemAllocNode` is called. -* The `_sync()` version of crosslane builtins such as `shfl_sync()`, - `__all_sync()` and `__any_sync()`, continue to be hidden behind the - preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled - unconditionally in the next ROCm release. - -### **hipBLAS** (2.3.0) - -#### Added - -* Level 3 functions have an additional `ILP64` API for both C and Fortran (`_64` name suffix) with `int64_t` function arguments - -#### Changed - -* `amdclang` is used as the default compiler instead of `g++`. -* Added a dependency on the `hipblas-common` package. - -### **hipBLASLt** (0.10.0) - -#### Added - -* Support for the V2 CPP extension API for backward compatibility -* Support for data type `INT8` in with `INT8` out -* Support for data type `FP32`/`FP64` for gfx110x -* Extension API `hipblaslt_ext::matmulIsTuned` -* Output `atol` and `rtol` for `hipblaslt-bench` validation -* Output the bench command for the hipblaslt CPP ext API path if `HIPBLASLT_LOG_MASK=32` is set -* Support odd sizes for `FP8`/`BF8` GEMM - -#### Changed - -* Reorganized and added more sample code. -* Added a dependency with the `hipblas-common` package and removed the dependency with the `hipblas` package. - -#### Optimized - -* Support fused kernel for `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for the `FP8`/`BF8` data type -* Improved the library loading time. -* Improved the overall performance of the first returned solution. - -#### Upcoming changes - -* The V1 CPP extension API will be deprecated in a future release of hipBLASLt. - -### **hipCUB** (3.3.0) - -#### Added - -* Support for large indices in `hipcub::DeviceSegmentedReduce::*` has been added, with the exception - of `DeviceSegmentedReduce::Arg*`. Although rocPRIM's backend provides support for all reduce - variants, CUB does not support large indices in `DeviceSegmentedReduce::Arg*`. For this reason, - large index support is not available for `hipcub::DeviceSegmentedReduce::Arg*`. - -#### Changed - -* Changed the default value of `rmake.py -a` to `default_gpus`. This is equivalent to `gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102`. -* The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.3.2. - -#### Resolved issues - -* Fixed an issue in `rmake.py` where the list storing cmake options would contain individual characters instead of a full string of options. -* Fixed an issue where `config.hpp` was not included in all hipCUB headers, resulting in build errors. - -### **hipFFT** (1.0.17) - -#### Changed - -* The AMD backend is now compiled using amdclang++ instead of hipcc. The NVIDIA CUDA backend still uses hipcc-nvcc. -* CLI11 replaces Boost Program Options as the command line parser for clients. -* Building with the address sanitizer option sets xnack+ for the relevant GPU architectures. - -### **hipfort** (0.5.0) - -#### Added - -* Added ROC-TX to the hipfort interfaces. - -#### Changed - -* Updated the hipSOLVER bindings. 
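
For context on the hipCUB note above about large-index support in `hipcub::DeviceSegmentedReduce` (all variants except `Arg*`), the following hedged sketch shows the usual two-phase call pattern: one call to size the temporary storage, then the reduction itself. The input values and buffer names are illustrative only, and error checking is omitted.

```cpp
// Hedged sketch: a segmented sum with hipcub::DeviceSegmentedReduce::Sum.
#include <hipcub/hipcub.hpp>
#include <hip/hip_runtime.h>
#include <vector>

int main() {
    const int num_items = 8, num_segments = 2;
    std::vector<int> h_in{1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<int> h_offsets{0, 4, 8};   // segment i spans [offsets[i], offsets[i+1])

    int *d_in, *d_out, *d_offsets;
    hipMalloc(&d_in, num_items * sizeof(int));
    hipMalloc(&d_out, num_segments * sizeof(int));
    hipMalloc(&d_offsets, (num_segments + 1) * sizeof(int));
    hipMemcpy(d_in, h_in.data(), num_items * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(d_offsets, h_offsets.data(), (num_segments + 1) * sizeof(int),
              hipMemcpyHostToDevice);

    // Two-phase call: first query temporary storage, then run the reduction.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    hipcub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                       num_segments, d_offsets, d_offsets + 1);
    hipMalloc(&d_temp, temp_bytes);
    hipcub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                       num_segments, d_offsets, d_offsets + 1);

    std::vector<int> h_out(num_segments);
    hipMemcpy(h_out.data(), d_out, num_segments * sizeof(int), hipMemcpyDeviceToHost);
    // h_out is expected to be {10, 26} for this input.

    hipFree(d_temp); hipFree(d_offsets); hipFree(d_out); hipFree(d_in);
    return 0;
}
```
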
- -### **HIPIFY** (18.0.0) - -#### Added - -* CUDA 12.6.1 support -* cuDNN 9.5.0 support -* LLVM 19.1.1 support -* rocBLAS 64-bit APIs support -* Initial support for direct hipification of cuDNN into MIOpen under the `--roc` option -* Initial support for direct hipification of cuRAND into rocRAND under the `--roc` option -* Added a filtering ability for the supplementary hipification scripts - -#### Resolved issues - -* Correct `roc` header files support - -#### Known issues - -* Support for `fp8` data types - -### **hipRAND** (2.11.0) - -> [!NOTE] -> In ROCm 6.3.0, the hipRAND package version is incorrectly set to `2.11.0`. -> In ROCm 6.2.4, the hipRAND package version was `2.11.1`. -> The hipRAND version number will be corrected in a future ROCm release. - -#### Changed - -* Updated the default value for the `-a` argument from `rmake.py` to `gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102`. - -#### Resolved issues - -* Fixed an issue in `rmake.py` where the list storing the CMake options would contain individual characters instead of a full string of options. - -### **hipSOLVER** (2.3.0) - -#### Added - -* Auxiliary functions: - * `hipsolverSetDeterministicMode`, `hipsolverGetDeterministicMode` -* Compatibility-only functions: - * `potrf` - * `hipsolverDnXpotrf_bufferSize` - * `hipsolverDnXpotrf` - * `potrs` - * `hipsolverDnXpotrs` - * `geqrf` - * `hipsolverDnXgeqrf_bufferSize` - * `hipsolverDnXgeqrf` - -#### Changed - -* Binaries in debug builds no longer have a `-d` suffix. -* Changed rocSPARSE and SuiteSparse to be runtime dependencies by default. The `BUILD_WITH_SPARSE` CMake option can still be used - to convert them into build-time dependencies (now off by default). -* The `--no-sparse` option for the install script now only affects the hipSOLVER clients and their dependency on hipSPARSE. Use the - `BUILD_HIPSPARSE_TESTS` CMake option to enable tests for the `hipsolverSp` API (on by default). - -#### Upcoming changes - -* The Fortran bindings provided in `hipsolver_module.f90` have been deprecated. - The Fortran bindings provided by the hipfort project are recommended instead. - -### **hipSPARSE** (3.1.2) - -#### Added - -* Added an alpha version of the `hipsparse-bench` executable to facilitate comparing NVIDIA CUDA cuSPARSE and rocSPARSE backends. - -#### Changed - -* Changed the default compiler from hipcc to amdclang in the install script and CMake files. -* Improved the user documentation. - -#### Resolved issues - -* Fixed the gfortran dependency for the Azure Linux operating system. - -#### Known issues - -* In `hipsparseSpSM_solve()`, the external buffer is passed as a parameter. This does not match the NVIDIA CUDA cuSPARSE API. This extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and `nullptr` passed as it is unused internally by `hipsparseSpSM_solve()`. - -### **hipSPARSELt** (0.2.2) - -#### Added - -* Support for a new data type combination: `INT8` inputs, `BF16` output, and `INT32` Matrix Core accumulation -* Support for row-major memory order (`HIPSPARSE_ORDER_ROW`) - -#### Changed - -* Changed the default compiler to amdclang++. - -#### Upcoming changes - -* `hipsparseLtDatatype_t` is deprecated and will be removed in the next major release of ROCm. `hipDataType` should be used instead. - -### **hipTensor** (1.4.0) - -#### Added - -* Added support for tensor reduction, including APIs, CPU reference, unit tests, and documentation - -#### Changed - -* ASAN builds only support xnack+ targets. 
-* ASAN builds use `-mcmodel=large` to accommodate library sizes greater than 2GB. -* Updated the permute backend to accommodate changes to element-wise operations. -* Updated the actor-critic implementation. -* Various documentation formatting updates. - -#### Optimized - -* Split kernel instances to improve build times. - -#### Resolved issues - -* Fixed a bug in randomized tensor input data generation. -* Fixed the default strides calculation to be in column-major order. -* Fixed a small memory leak by properly destroying HIP event objects in tests. -* Default strides calculations now follow column-major convention. - -### **llvm-project** (18.0.0) - -#### Resolved issues - -* Fixed an issue where the compiler would incorrectly compile a program that used the `__shfl(var, - srcLane, width)` function when one of the parameters to the function is undefined along some path - to the function. See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub. - -### **MIGraphX** (2.11.0) - -#### Added - -* Initial code to run on Windows -* Support for `FP8` and `INT4` -* Support for the Log2 internal operator -* Support for the GCC 14 compiler -* The `BitwiseAnd`, `Scan`, `SoftmaxCrossEntropyLoss`, `GridSample`, and `NegativeLogLikelihoodLoss` ONNX operators -* The `MatMulNBits`, `QuantizeLinear`/`DequantizeLinear`, `GroupQueryAttention`, `SkipSimplifiedLayerNormalization`, and `SimpliedLayerNormalizationMicrosoft` Contrib operators -* Dynamic batch parameter support to `OneHot` operator -* Split-K as an optional performance improvement -* Scripts to validate ONNX models from the ONNX Model Zoo -* GPU Pooling Kernel -* `--mlir` flag the migraphx-driver program to offload entire module to MLIR -* Fusing split-reduce with MLIR -* Multiple outputs for the MLIR + Pointwise fusions -* Pointwise fusions with MLIR across reshape operations -* `MIGRAPHX_MLIR_DUMP` environment variable to dump MLIR modules to MXRs -* The `3` option to `MIGRAPHX_TRACE_BENCHMARKING` to print the MLIR program for improved debug output -* `MIGRAPHX_ENABLE_HIPBLASLT_GEMM` environment variable to call hipBLASLt libraries -* `MIGRAPHX_VERIFY_DUMP_DIFF` to improve the debugging of accuracy issues -* `reduce_any` and `reduce_all` options to the `Reduce` operation via Torch MIGraphX -* Examples for RNNT, and ControlNet - -#### Changed - -* Switched to MLIR's 3D Convolution operator. -* MLIR is now used for Attention operations by default on gfx942 and newer ASICs. -* Names and locations for VRM specific libraries have changed. -* Use random mode for benchmarking GEMMs and convolutions. -* Python version is now printed with an actual version number. - -#### Removed - -* Disabled requirements for MIOpen and rocBLAS when running on Windows. -* Removed inaccurate warning messages when using exhaustive-tune. -* Remove the hard coded path in `MIGRAPHX_CXX_COMPILER` allowing the compiler to be installed in different locations. 
- -#### Optimized - -* Improved: - * Infrastructure code to enable better Kernel fusions with all supported data types - * Subsequent model compile time by creating a cache for already performant kernels - * Use of Attention fusion with models - * Performance of the Softmax JIT kernel and of the Pooling operator - * Tuning operations through a new 50ms delay before running the next kernel - * Performance of several convolution-based models through an optimized NHWC layout - * Performance for the `FP8` datatype - * GPU utilization - * Verification tools - * Debug prints - * Documentation, including gpu-driver utility documentation - * Summary section of the `migraphx-driver perf` command -* Reduced model compilation time -* Reordered some compiler passes to allow for more fusions -* Preloaded tiles into LDS to improve performance of pointwise transposes -* Exposed the `external_data_path` property in `onnx_options` to set the path from `onnxruntime` - -#### Resolved issues - -* Fixed a bug with gfx1030 that overwrote `dpp_reduce`. -* Fixed a bug in 1-arg dynamic reshape that created a failure. -* Fixed a bug with `dot_broadcast` and `inner_broadcast` that caused compile failures. -* Fixed a bug where some configs were failing when using exhaustive-tune. -* Fixed the ROCm Install Guide URL. -* Fixed an issue while building a whl package due to an apostrophe. -* Fixed the BERT Squad example requirements file to support different versions of Python. -* Fixed a bug that stopped the Vicuna model from compiling. -* Fixed failures with the verify option of migraphx-driver that would cause the application to exit early. - - -### **MIOpen** (3.3.0) - -#### Added - -- [RNN] LSTM forward pass -- [Mha] Mask is added for forward pass -- [GLU] Gated Linear Unit (this is an experimental feature) -- [PReLU] Implemented PReLU backward pass (this is an experimental feature) - -#### Optimized - -- MI300 TunaNet Update: CK forward pass and WRW Solvers updated - -#### Resolved issues - -- Fixed unset stream when calling `hipMemsetAsync`. -- Fixed a memory leak issue caused by an incorrect transpose in find 2.0. See PR [#3285](https://github.com/ROCm/MIOpen/pull/3285) on GitHub. -- Fixed a `memcopy` data race by replacing `hipMemcpy` with `hipMemcpyWithStream`. - -### **MIVisionX** (3.1.0) - -#### Changed - -* rocDecode is no longer installed by the setup script. -* The rocDecode dependency has been removed from the package installation. - -#### Known issues - -* See [MIVisionX memory access fault in Canny edge detection](https://github.com/ROCm/ROCm/issues/4086). -* Package installation requires the manual installation of OpenCV. -* Installation on CentOS/RedHat/SLES requires the manual installation of the `FFMPEG Dev` package. -* Hardware decode requires installation with `--usecase=graphics` in addition to `--usecase=rocm`. 
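
Relating to the MIOpen fix above that replaced `hipMemcpy` with `hipMemcpyWithStream`: the sketch below is a hypothetical helper, not MIOpen code, showing the general pattern of keeping a device-to-host copy ordered on the stream that produced the data.

```cpp
// Illustrative sketch only: order a device-to-host copy on the producing stream.
#include <hip/hip_runtime.h>

hipError_t copy_result(float* dst_host, const float* src_dev,
                       size_t bytes, hipStream_t stream) {
    // hipMemcpyWithStream enqueues the copy on `stream`, so it runs after the
    // kernels already submitted to that stream; a plain hipMemcpy uses the
    // null-stream semantics instead, which is where the reported race came from.
    hipError_t err = hipMemcpyWithStream(dst_host, src_dev, bytes,
                                         hipMemcpyDeviceToHost, stream);
    if (err != hipSuccess) return err;
    // Conservative: make sure the copy has finished before the host reads dst_host.
    return hipStreamSynchronize(stream);
}
```
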
- -#### Upcoming changes - -* Optimized audio augmentations support for VX_RPP - -### **RCCL** (2.21.5) - -#### Added - -* MSCCL++ integration for specific contexts -* Performance collection to `rccl_replayer` -* Tuner Plugin example for Instinct MI300 -* Tuning table for a large number of nodes -* Support for amdclang++ -* New Rome model - -#### Changed - -* Compatibility with NCCL 2.21.5 -* Increased channel count for MI300X multi-node -* Enabled MSCCL for single-process multi-threaded contexts -* Enabled CPX mode for MI300X -* Enabled tracing with `rocprof` -* Improved version reporting -* Enabled GDRDMA for Linux kernel 6.4.0+ - -#### Resolved issues - -* Fixed an issue where, on systems running Linux kernel 6.8.0 such as Ubuntu 24.04, Direct Memory - Access (DMA) transfers between the GPU and NIC were disabled, impacting multi-node RCCL - performance. See [issue #3772](https://github.com/ROCm/ROCm/issues/3772) on GitHub. -* Fixed model matching with PXN enable - -#### Known issues - -* MSCCL is temporarily disabled for AllGather collectives. - - This can impact in-place messages (< 2 MB) with ~2x latency. - - Older RCCL versions are not impacted. - - This issue will be addressed in a future ROCm release. -* Unit tests do not exit gracefully when running on a single GPU. - - This issue will be addressed in a future ROCm release. - -### **rocAL** (2.1.0) - -#### Added - -* rocAL Pybind support for package installation has been added. To use the rocAL python module, set the `PYTHONPATH`: `export PYTHONPATH=/opt/rocm/lib:$PYTHONPATH` -* Last batch policy, pad last batch, stick to shard, and shard size support have been added for the coco, caffe, caffe2, mxnet, tf, and cifar10 image readers. - -#### Changed - -* rocDecode is no longer installed by the setup script. -* The rocDecode dependency has been removed from the package installation. - -#### Optimized - -* CTest has been updated. - -#### Resolved issues - -* Test failures have been fixed. - -#### Known issues - -* The package installation requires the manual installation of `TurboJPEG` and `RapidJSON`. -* CentOS/RedHat/SLES requires the manual installation of the `FFMPEG Dev` package. -* Hardware decode requires installation with `--usecase=graphics` in addition to `--usecase=rocm`. - -#### Upcoming changes - -* Optimized audio augmentations support. - -### **rocALUTION** (3.2.1) - -#### Changed - -* The default compiler has been changed from `hipcc` to `amdclang` in the installation script and cmake files. -* Changed the address sanitizer build targets. Now only `gfx908:xnack+`, `gfx90a:xnack+`, `gfx940:xnack+`, `gfx941:xnack+`, and `gfx942:xnack+` are built with `BUILD_ADDRESS_SANITIZER=ON`. - -#### Resolved issues - -* Fixed hang in `RS-AMG` for Navi on some specific matrix sparsity patterns. -* Fixed wrong results in `Apply` on multi-GPU setups. 
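
Looking back at the RCCL known issue above about in-place AllGather messages: "in place" here means each rank sends from its own rank-offset slice of the receive buffer, so no separate send buffer is allocated. The single-process sketch below is illustrative only (hypothetical buffer names, no error checking), and the `<rccl/rccl.h>` include path is an assumption about the install layout.

```cpp
// Hedged sketch: an in-place ncclAllGather across all visible GPUs in one process.
#include <rccl/rccl.h>
#include <hip/hip_runtime.h>
#include <vector>

int main() {
    int ndev = 0;
    hipGetDeviceCount(&ndev);

    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, nullptr);   // one communicator per visible GPU

    const size_t count = 1024;                      // elements contributed per rank
    std::vector<float*> bufs(ndev);
    std::vector<hipStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipMalloc(&bufs[i], ndev * count * sizeof(float));  // full gather buffer
        hipStreamCreate(&streams[i]);
    }

    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
        // In-place layout: rank i sends from its own slice of the receive buffer.
        ncclAllGather(bufs[i] + i * count, bufs[i], count, ncclFloat,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) { hipSetDevice(i); hipStreamSynchronize(streams[i]); }
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipFree(bufs[i]);
        hipStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```
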
- -### **rocBLAS** (4.3.0) - -#### Added - -* Level 3 and EX functions have an additional `ILP64` API for both C and Fortran (`_64` name suffix) with `int64_t` function arguments - -#### Changed - -* amdclang is used as the default compiler instead of hipcc -* Internal performance scripts use AMD SMI instead of the deprecated ROCm SMI - -#### Optimized - -* Improved performance of Level 2 gbmv -* Improved performance of Level 2 gemv for float and double precisions for problem sizes (`TransA == N && m==n && m % 128 == 0`) measured on a gfx942 GPU - -#### Resolved issues - -* Fixed the `stbsv_strided_batched_64` Fortran binding - -#### Upcoming changes - -* `rocblas_Xgemm_kernel_name` APIs are deprecated - -### **ROCdbgapi** (0.77.0) - -#### Added - -* Support for setting precise ALU exception reporting - -### **rocDecode** (0.8.0) - -#### Changed - -* Clang is now the default CXX compiler. -* The new minimum supported version of `va-api` is 1.16. -* New build and runtime options have been added to the `rocDecode-setup.py` setup script. - -#### Removed - -* Make tests have been removed. CTEST is now used for both Make tests and package tests. -* `mesa-amdgpu-dri-drivers` has been removed as a dependency on RHEL and SLES. - -#### Resolved issues - -* Fixed a bug in the size of output streams in the `videoDecodeBatch` sample. - -### **rocFFT** (1.0.31) - -#### Added - -* rocfft-test now includes a `--smoketest` option. -* Implemented experimental APIs to allow computing FFTs on data - distributed across multiple MPI ranks. These APIs can be enabled with the - `ROCFFT_MPI_ENABLE` CMake option. This option defaults to `OFF`. - - When `ROCFFT_MPI_ENABLE` is `ON`: - - * `rocfft_plan_description_set_comm` can be called to provide an - MPI communicator to a plan description, which can then be passed - to `rocfft_plan_create`. Each rank calls - `rocfft_field_add_brick` to specify the layout of data bricks on - that rank. - - * An MPI library with ROCm acceleration enabled is required at - build time and at runtime. - -#### Changed - -* Compilation uses amdclang++ instead of hipcc. -* CLI11 replaces Boost Program Options as the command line parser for clients and samples. -* Building with the address sanitizer option sets xnack+ on relevant GPU - architectures and address-sanitizer support is added to runtime-compiled - kernels. - -### **ROCgdb** (15.2) - -#### Added - -- Support for precise ALU exception reporting for supported architectures. Precise ALU exceptions reporting is controlled with the following commands: - - `set amdgpu precise-alu-exceptions` - - `show amdgpu precise-alu-exceptions` - -#### Changed - -- The `sysroot` or `solib-search-path` settings can now be used to locate files containing GPU code objects when opening a core dump. This allows opening GPU code objects on systems different from the one where the core dump was generated. - -#### Resolved issues - -- Fixed possible hangs when opening some AMDGPU core dumps in ROCgdb. -- Addressed cases where the `roccoremerge` utility improperly handled LOAD segment copy from the host core dump to the combined core dump. - -### **ROCm Compute Profiler** (3.0.0) - -#### Changed - -* Renamed to ROCm Compute Profiler from Omniperf. - * New package name: `rocprofiler-compute` - * New repository: [https://github.com/ROCm/rocprofiler-compute](https://github.com/ROCm/rocprofiler-compute) - * New binary name: `rocprof-compute` - -#### Known issues - -- See [ROCm Compute Profiler post-upgrade](https://github.com/ROCm/ROCm/issues/4082). 
- -- See [ROCm Compute Profiler CTest failure in CI](https://github.com/ROCm/ROCm/issues/4085). - -### **ROCm Data Center Tool** (0.3.0) - -#### Added - -* RVS integration -* Real time logging for diagnostic command -* `--version` command -* `XGMI_TOTAL_READ_KB` and `XGMI_TOTAL_WRITE_KB` monitoring metrics - -#### Known issues - -- See [ROCm Data Center Tool incorrect RHEL9 package version](https://github.com/ROCm/ROCm/issues/4089). - -### **ROCm SMI** (7.4.0) - -#### Added - -- **Added `rsmi_dev_memory_partition_capabilities_get` which returns driver memory partition capablities.** -Driver now has the ability to report what the user can set memory partition modes to. User can now see available -memory partition modes upon an invalid argument return from memory partition mode set (`rsmi_dev_memory_partition_set`). - -- Support for GPU metrics 1.6 to `rsmi_dev_gpu_metrics_info_get()`. Updated - `rsmi_dev_gpu_metrics_info_get()` and structure `rsmi_gpu_metrics_t` to include new fields for - PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and `pcie_lc_perf_other_end_recovery`. - -- Ability to view raw GPU metrics using `rocm-smi --showmetrics`. - -#### Changed - -- Added back in C++ tests for `memorypartition_read_write` - -- Updated `rsmi_dev_memory_partition_set` to not return until a successful restart of AMD GPU Driver. - -- All APIs now have the ability to catch driver reporting invalid arguments. - -#### Removals - -- Removed `--resetcomputepartition`, and `--resetmemorypartition` options and associated APIs. - - This change is part of the partition feature redesign. - - The related APIs `rsmi_dev_compute_partition_reset()` and `rsmi_dev_memory_partition_reset()`. - -#### Resolved issues - -- Fixed `rsmi_dev_target_graphics_version_get`, `rocm-smi --showhw`, and `rocm-smi --showprod` not displaying properly for MI2x or Navi 3x ASICs. - -#### Upcoming changes - -- C++ tests for `memorypartition_read_write` are to be re-enabled in a future ROCm release. - -> [!NOTE] -> See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/6.3.x/CHANGELOG.md) for more details and examples. - -### **ROCm Systems Profiler** (0.1.0) - -#### Changed - -* Renamed to ROCm Systems Profiler from Omnitrace. - * New package name: `rocprofiler-systems` - * New repository: [https://github.com/ROCm/rocprofiler-systems](https://github.com/ROCm/rocprofiler-systems) - * Reset the version to `0.1.0` - * New binary prefix: `rocprof-sys-*` - -#### Known issues - -- See [ROCm Systems Profiler post-upgrade](https://github.com/ROCm/ROCm/issues/4083). - -### **ROCm Validation Suite** (1.1.0) - -#### Added - -- Support for hipBLASLT blas library and option to select blas library in `conf` file. - -#### Changed - -- Babel parameters made runtime configurable. - -#### Known issues - -- See [ROCm Validation Suite needs specified configuration file](https://github.com/ROCm/ROCm/issues/4090). - -### **rocPRIM** (3.3.0) - -#### Added - -* The `--test smoke` option has been added to `rtest.py`. When `rtest.py` is called with this option it runs a subset of tests such that the total test time is 5 minutes. Use `python3 ./rtest.py --test smoke` or `python3 ./rtest.py -t smoke` to run the smoke test. -* The `--seed` option has been added to `run_benchmarks.py`. The `--seed` option specifies a seed for the generation of random inputs. When the option is omitted, the default behavior is to use a random seed for each benchmark measurement. 
* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), to device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and to device reduce by key (`rocprim::reduce_by_key`) to improve performance on selected architectures.
* Added `rocprim::uninitialized_array` to provide uninitialized storage in local memory for user-defined types.
* Added large segment support for `rocprim::segmented_reduce`.
* Added a parallel `nth_element` device function similar to `std::nth_element`. `nth_element` places elements that are smaller than the nth element before the nth element, and elements that are bigger than the nth element after the nth element.
* Added deterministic (bitwise reproducible) algorithm variants `rocprim::deterministic_inclusive_scan`, `rocprim::deterministic_exclusive_scan`, `rocprim::deterministic_inclusive_scan_by_key`, `rocprim::deterministic_exclusive_scan_by_key`, and `rocprim::deterministic_reduce_by_key`. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
* Added parallel `partial_sort` and `partial_sort_copy` device functions similar to `std::partial_sort` and `std::partial_sort_copy`. `partial_sort` and `partial_sort_copy` arrange elements such that the elements are in the same order as a sorted list up to and including the middle index.

#### Changed

* Changed the default value of `rmake.py -a` to `default_gpus`. This is equivalent to `gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102`.
* Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
* Changed the default seed for `device_benchmark_segmented_reduce`.

#### Removed

* `rocprim::thread_load()` and `rocprim::thread_store()` have been deprecated. Use `dereference()` instead.

#### Resolved issues

* Fixed an issue in `rmake.py` where the list storing cmake options would contain individual characters instead of a full string of options.
* Resolved an issue in `rtest.py` where it crashed if the `build` folder was created without `release` or `debug` subdirectories.
* Resolved an issue with `rtest.py` on Windows where passing an absolute path to `--install_dir` caused a `FileNotFound` error.
* rocPRIM functions are no longer forcefully inlined on Windows. This significantly reduces the build time of debug builds.
* `block_load`, `block_store`, `block_shuffle`, `block_exchange`, and `warp_exchange` now use placement `new` instead of copy assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
* Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would produce incorrect results.
* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not cast to the correct type, resulting in incorrect results.
* Resolved an issue where tests were failing when they were compiled with `-D_GLIBCXX_ASSERTIONS=ON`.
-* Resolved an issue where algorithms that used an internal serial merge routine caused a memory access fault that resulted in potential performance drops when using block sort, device merge sort (block merge), device merge, device partial sort, and device sort (merge sort). -* Fixed memory leaks in unit tests due to missing calls to `hipFree()` and the incorrect use of hipGraphs. -* Fixed an issue where certain inputs to `block_sort_merge()`, `device_merge_sort_merge_path()`, `device_merge()`, and `warp_sort_stable()` caused an assertion error during the call to `serial_merge()`. - -### **ROCProfiler** (2.0.0) - -#### Added - -- JSON output plugin for `rocprofv2`. The JSON file matches Google Trace Format making it easy to load on Perfetto, Chrome tracing, or Speedscope. For Speedscope, use `--disable-json-data-flows` option as speedscope doesn't work with data flows. -- `--no-serialization` flag to disable kernel serialization when `rocprofv2` is in counter collection mode. This allows `rocprofv2` to avoid deadlock when profiling certain programs in counter collection mode. -- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 accelerator -- New HIP APIs with struct defined inside union. -- Early checks to confirm the eligibility of ELF file in ATT plugin -- Support for kernel name filtering in `rocprofv2` -- Barrier bit to read and stop packets - -#### Changed - -- Extended lifetime for proxy queues -- Setting the `trace-start` option for `rocprof` to `off` now disables kernel tracing -- `libpciaccess-dev` functions now load with `dlopen` -- `PcieAccessApi*` api and `void* libpciaccess_handle` are now initialized to `nullptr` - -#### Removed - -- Obsolete BSD and GPL licenses -- `libsystemd-dev` from `CMakeLists.txt` - -#### Optimized - -- ROCProfiler Performance improved to reduce profiling time for large workloads of counter collection - -#### Resolved issues - -- Bandwidth measurement in AMD Instinct MI300 accelerator -- Perfetto plugin issue of `roctx` trace not getting displayed -- `--help` for counter collection -- Signal management issues in `queue.cpp` -- Perfetto tracks for multi-GPU -- Perfetto plugin usage with `rocsys` -- Incorrect number of columns in the output CSV files for counter collection and kernel tracing -- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 accelerator -- Build errors thrown during parsing of unions -- The system hang caused while running `--kernel-trace` with Perfetto for certain applications -- Missing profiler records issue caused while running `--trace-period` -- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 accelerator -- Segmentation fault on Navi32 - - -### **ROCprofiler-SDK** (0.5.0) - -#### Added - -- Start and end timestamp columns to the counter collection `csv` output -- Check to force tools to initialize context id with zero -- Support to specify hardware counters for collection using `rocprofv3` as `rocprofv3 --pmc [COUNTER [COUNTER ...]]` - -#### Changed - -- `--marker-trace` option for `rocprofv3` now supports the legacy ROC-TX library `libroctx64.so` when the application is linked against the new library `librocprofiler-sdk-roctx.so` -- Replaced deprecated `hipHostMalloc` and `hipHostFree` functions with `hipExtHostAlloc` and `hipFreeHost` for ROCm versions starting 6.3 -- Updated `rocprofv3` `--help` options -- Changed naming of "agent profiling" to a more descriptive "device counting service". 
To convert existing tool or user code to the new name, use the following sed:
  ```
  find . -type f -exec sed -i 's/rocprofiler_agent_profile_callback_t/rocprofiler_device_counting_service_callback_t/g; s/rocprofiler_configure_agent_profile_counting_service/rocprofiler_configure_device_counting_service/g; s/agent_profile.h/device_counting_service.h/g; s/rocprofiler_sample_agent_profile_counting_service/rocprofiler_sample_device_counting_service/g' {} +
  ```
- Changed naming of "dispatch profiling service" to a more descriptive "dispatch counting service". To convert existing tool or user code to the new names, the following sed can be used:
  ```
  find . -type f -exec sed -i -e 's/dispatch_profile_counting_service/dispatch_counting_service/g' -e 's/dispatch_profile.h/dispatch_counting_service.h/g' -e 's/rocprofiler_profile_counting_dispatch_callback_t/rocprofiler_dispatch_counting_service_callback_t/g' -e 's/rocprofiler_profile_counting_dispatch_data_t/rocprofiler_dispatch_counting_service_data_t/g' -e 's/rocprofiler_profile_counting_dispatch_record_t/rocprofiler_dispatch_counting_service_record_t/g' {} +
  ```
- `FETCH_SIZE` metric on gfx94x now uses `TCC_BUBBLE` for 128B reads
- PMC dispatch-based counter collection serialization is now per-device instead of being global across all devices

#### Removed

- `gfx8` metric definitions
- `rocprofv3` installation from the `sbin` directory

#### Resolved issues

- Introduced subdirectory creation when `rocprofv3 --output-file` is used to specify a folder path
- Fixed misaligned stores (undefined behavior) for buffer records
- Fixed a crash when only scratch reporting is enabled
- Fixed `MeanOccupancy` metrics
- Fixed the aborted-application validation test to properly check for the `hipExtHostAlloc` command
- Fixed implicit reduction of SQ and GRBM metrics
- Fixed support for derived counters in the reduce operation
- Fixed a bug in the max-in-reduce operation
- Introduced a fix to handle a range of values for the `select()` dimension in the expressions parser
- Fixed Navi3x kernel tracing issues by setting the conditional `aql::set_profiler_active_on_queue` only when counter collection is registered

### **rocPyDecode** (0.2.0)

#### Added

* RGB and YUV PyTorch tensors
* Python distribution wheel (`.whl`)
* Multiple use case samples

#### Changed

* Clang replaces `hipcc` as the default CXX compiler.

#### Removed

* Make tests have been removed. CTEST is now used for both Make tests and package tests.

#### Optimized

* Setup script - build and runtime install options
* Prerequisite installation helper Python scripts
* Same GPU memory viewed as a PyTorch tensor

#### Resolved issues

* Fixed setup issues.

### **rocRAND** (3.2.0)

#### Added

* Added a host generator for MT19937
* Support for `rocrand_generate_poisson` in hipGraphs
* Added `engine`, `distribution`, `mode`, `throughput_gigabytes_per_second`, and `lambda` columns for the csv format in
  `benchmark_rocrand_host_api` and `benchmark_rocrand_device_api`. To see these new columns, set `--benchmark_format=csv`
  or `--benchmark_out_format=csv --benchmark_out="outName.csv"`.

#### Changed

* Updated the default value for the `-a` argument from `rmake.py` to `gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102`.
* `rocrand_discrete` for the MTGP32, LFSR113, and ThreeFry generators now uses the alias method, which is faster than a binary search in the CDF.
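
To illustrate the rocRAND entry above about `rocrand_generate_poisson` in hipGraphs, here is a hedged sketch that records the generation into a HIP graph through stream capture and then replays it. The buffer names and lambda value are arbitrary, the `<rocrand/rocrand.h>` include path is an assumption about the install layout, and error checking is omitted.

```cpp
// Hedged sketch: capture a Poisson generation into a HIP graph and replay it.
#include <rocrand/rocrand.h>
#include <hip/hip_runtime.h>

int main() {
    const size_t n = 1 << 20;
    unsigned int* d_out = nullptr;
    hipMalloc(&d_out, n * sizeof(unsigned int));

    rocrand_generator gen;
    rocrand_create_generator(&gen, ROCRAND_RNG_PSEUDO_DEFAULT);

    hipStream_t stream;
    hipStreamCreate(&stream);
    rocrand_set_stream(gen, stream);   // generation must target the captured stream

    hipGraph_t graph;
    hipGraphExec_t graph_exec;
    hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
    rocrand_generate_poisson(gen, d_out, n, /*lambda=*/4.0);
    hipStreamEndCapture(stream, &graph);
    hipGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    hipGraphLaunch(graph_exec, stream);   // replay the captured generation
    hipStreamSynchronize(stream);

    hipGraphExecDestroy(graph_exec);
    hipGraphDestroy(graph);
    rocrand_destroy_generator(gen);
    hipStreamDestroy(stream);
    hipFree(d_out);
    return 0;
}
```
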

#### Resolved issues

* Fixed an issue in `rmake.py` where the list storing the CMake options would contain individual characters instead of a full string of options.

### **rocSOLVER** (3.27.0)

#### Added

* 64-bit APIs for existing functions:
    - `LACGV_64`
    - `LARF_64`
    - `LARFG_64`
    - `GEQR2_64` (with batched and strided\_batched versions)
    - `GEQRF_64` (with batched and strided\_batched versions)
    - `POTF2_64` (with batched and strided\_batched versions)
    - `POTRF_64` (with batched and strided\_batched versions)
    - `POTRS_64` (with batched and strided\_batched versions)

#### Changed

* The rocSPARSE library is now an optional dependency at runtime. If rocSPARSE
  is not available, rocSOLVER's sparse refactorization and solver functions
  will return `rocblas_status_not_implemented`.

#### Optimized

* Improved the performance of LARFG, LARF, and downstream functions such as GEQR2 and GEQRF on wave64 architectures
* Improved the performance of BDSQR and GESVD
* Improved the performance of STEDC and the divide-and-conquer eigensolvers

#### Resolved issues

* Fixed a memory allocation issue in SYEVJ that could cause failures on clients that manage their own memory.
* Fixed a synchronization issue in SYEVJ that could lead to a convergence failure for large matrices.
* Fixed a convergence issue in STEIN stemming from numerical orthogonality of the initial choice of eigenvectors.
* Fixed a synchronization issue in STEIN.

#### Known issues

* A known issue in STEBZ can lead to errors in routines based on bisection to compute eigenvalues for symmetric/Hermitian matrices (for example, SYEVX/HEEVX and SYGVX/HEGVX), as well as singular values (for example, BDSVDX and GESVDX).

### **rocSPARSE** (3.3.0)

#### Added

* `rocsparse_create_extract_descr`, `rocsparse_destroy_extract_descr`, `rocsparse_extract_buffer_size`, `rocsparse_extract_nnz`, and `rocsparse_extract` APIs to allow extraction of the upper or lower part of sparse CSR or CSC matrices.

#### Changed

* Changed the default compiler from hipcc to amdclang in the install script and CMake files.
* Changed the address sanitizer build targets so that only gfx908:xnack+, gfx90a:xnack+, gfx940:xnack+, gfx941:xnack+, and gfx942:xnack+ are built when `BUILD_ADDRESS_SANITIZER=ON` is configured.

#### Optimized

* Improved user documentation

#### Resolved issues

* Fixed the `csrmm` merge path algorithm so that the diagonal is clamped to the correct range.
* Fixed a race condition in `bsrgemm` that could on rare occasions cause incorrect results.
* Fixed an issue in `hyb2csr` where the CSR row pointer array was not being properly filled when `n=0`, `coo_nnz=0`, or `ell_nnz=0`.
* Fixed scaling in `rocsparse_Xhybmv` when only performing `y=beta*y`, for example, where `alpha==0` in `y=alpha*Ax+beta*y`.
* Fixed `rocsparse_Xgemmi` failures when the y grid dimension is too large. This occurred when `n >= 65536`.
* Fixed the gfortran dependency for the Azure Linux operating system.

### **rocThrust** (3.2.0)

#### Added

* Merged changes from upstream CCCL/thrust 2.3.2
  * Only the NVIDIA backend uses `tuple` and `pair` types from libcu++, other backends continue to use the original Thrust implementations and hence do not require libcu++ (CCCL) as a dependency.
* Added the `thrust::hip::par_det` execution policy to enable bitwise reproducibility on algorithms that are not bitwise reproducible by default.

#### Changed

* Changed the default value of `rmake.py -a` to `default_gpus`.
This is equivalent to `gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102`. -* Enabled the upstream (thrust) test suite for execution by default. It can be disabled by using the `-DENABLE_UPSTREAM_TESTS=OFF` cmake option. - -#### Resolved issues - -* Fixed an issue in `rmake.py` where the list storing cmake options would contain individual characters instead of a full string of options. -* Fixed the HIP backend not passing `TestCopyIfNonTrivial` from the upstream (thrust) test suite. -* Fixed tests failing when compiled with `-D_GLIBCXX_ASSERTIONS=ON`. - -### **rocWMMA** (1.6.0) - -#### Added - -* Added OCP `F8`/`BF8` datatype support - -#### Changed - -* Optimized some aos<->soa transforms with half-rotation offsets -* Refactored the rocBLAS reference entry point for validation and benchmarking -* `ROCWMMA_*` preprocessor configurations are now all assigned values -* Updated the default architecture targets for ASAN builds -* Updated the actor-critic implementation - -#### Resolved issues - -* Fixed a bug in `F64` validation due to faulty typecasting -* Fixed a bug causing runtime compilation errors with hipRTC -* Various documentation updates and fixes - -### **RPP** (1.9.1) - -#### Added - -* RPP Glitch and RPP Pixelate have been added to the HOST and HIP backend. -* The following audio support was added to the HIP backend: - * Resample - * Pre-emphasis filter - * Down-mixing - * To Decibels - * Non-silent region - -#### Changed - -* Test prerequisites have been updated. -* AMD advanced build flag. - -#### Removed - -* Older versions of TurboJPEG have been removed. - -#### Optimized - -* Updated the test suite. - -#### Resolved issues - -* macOS build -* RPP Test Suite: augmentations fix -* Copy: bugfix for `NCDHW` layout -* MIVisionX compatibility fix: Resample and pre-emphasis filter - -#### Known issues - -* Package installation only supports the HIP backend. 
- -#### Upcoming changes - -* Optimized audio augmentations - -### **Tensile** (4.42.0) - -#### Added - -- Testing and documentation for `MasterSolutionLibrary.ArchitectureIndexMap` and `remapSolutionIndicesStartingFrom` -- Functions for writing master file -- `tPrint` and reconcile printing options -- Python unit test coverage report -- Factor embed library logic into function and test -- `clang++` as `cxx` compiler option for Windows -- Logic to cope with different compilers --`toFile` function to include `generateManifest` and moved to utilities -- Profiling CI job -- Support for `amdclang` and use defaults -- Architecture management functions in `TensileCreateLibrary` -- `TensileCreateLibrary` CLI reference docs -- New documentation for sphinx prototype and build out skeleton -- Contributor and developer guide -- Prediction model for optimal number of Stream-K tiles to run - - Two-tile algorithm with Stream-K after DP - - Atomic two-tile Stream-K and clean-up tuning parameters - - Using glob to find logic files in `TensileCreateLibrary` - - Function to confirm supported compiler rather than raw logic - -#### Changed - -- Improved rocBLAS build output by allowing warning suppression, ignoring developer warnings, displaying progress bar and quiet printing -- Reordered extensions for Windows in `which` function -- updated `amdclang++` and `asm` directories -- Updated duplicate marking tests with mocks -- Restored print ordering -- Print option -- Bumped rocm-docs-core from 1.2.0 to 1.5.0 in `/docs/sphinx` -- Refactored kernel duplicate matching -- Refactored `generateLogicDataAndSolutions` -- Restricted XCC mapping to gfx942 -- Refactored argument parsing in `TensileCreateLibrary` -- Disabled failing rhel9 tests -- Changed line length to 100 characters for formatting -- Changed YAML operations to use C `libyaml` backend -- Improved warning text -- Updated clang support for Windows -- Updated `supportedCompiler` function -- Clang support on Windows to require use of conditional choices and defaults -- Refactored sanity check in `TensileCreateLibrary` -- Moved client config logic from `TensileCreateLibrary` main into `createClientConfig` -- Updated `verifyManifest` in `TensileCreateLibrary` -- Updated RTD configs -- Cleaned up CMake to avoid redundant work during client builds -- Updated Stream-K debug settings - -#### Removed - -- Deprecated flag from CI profiling job -- Diagnostic print -- Globals from `prepAsm` -- Deprecated `package-library` option -- Duplicate `which` function and minor cleanup - -#### Optimized - -- To optimize the performance of Stream-K kernels: - - Introduced analytical grid size prediction model - - Remapped XCC-based workgroup - -#### Resolved issues - -- Fixed stream-K XCC configs for gfx942 -- Updated WMMA capability command for ISA 10+ -- Fixed progress bar character encoding error on Windows -- Fixed solution redundancy removal -- Fixed tuning imports for `pyyaml` -- Fixed printing of ASM capabilities for ROCm versions prior to 6.3 -- Fixed code objects by filtering kernels with build errors and unprocessed kernels -- Fixed fully qualified `std::get` in contraction solutions -- Fixed `add -v flag` and change system invocation -- Used conditional imports for new dependencies to fix yaml `CSafe` load and dump import and rich terminal print import -- Fixed comments on `scalarStaticDivideAndRemainder` - -## ROCm 6.2.4 - -See the [ROCm 6.2.4 release notes](https://rocm.docs.amd.com/en/docs-6.2.4/about/release-notes.html) -for a complete overview of this release. 
- -### **AMD SMI** (24.6.3) - -#### Resolved issues - -* Fixed support for the API calls `amdsmi_get_gpu_process_isolation` and - `amdsmi_clean_gpu_local_data`, along with the `amd-smi set - --process-isolation <0 or 1>` command. See issue - [#3500](https://github.com/ROCm/ROCm/issues/3500) on GitHub. - -### **rocFFT** (1.0.30) - -#### Optimized - -* Implemented 1D kernels for factorizable sizes greater than 1024 and less than 2048. - -#### Resolved issues - -* Fixed plan creation failure on some even-length real-complex transforms that use Bluestein's algorithm. - -### **rocSOLVER** (3.26.2) - -#### Resolved issues - -* Fixed synchronization issue in STEIN. - -## ROCm 6.2.2 - -### **AMD SMI** (24.6.3) - -#### Changed - -* Added `amd-smi static --ras` on Guest VMs. Guest VMs can view enabled/disabled RAS features on Host cards. - -#### Removed - -* Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs. Guest VMs do not support getting current ECC counts from the Host cards. - -#### Resolved issues - -* Fixed TypeError in `amd-smi process -G`. -* Updated CLI error strings to handle empty and invalid GPU/CPU inputs. -* Fixed Guest VM showing passthrough options. -* Fixed firmware formatting where leading 0s were missing. - -### **HIP** (6.2.1) - -#### Resolved issues - -* Soft hang when using `AMD_SERIALIZE_KERNEL` -* Memory leak in `hipIpcCloseMemHandle` - -### **HIPIFY** (18.0.0) - -#### Added - -* Added CUDA 12.5.1 support -* Added cuDNN 9.2.1 support -* Added LLVM 18.1.8 support -* Added `hipBLAS` 64-bit APIs support -* Added Support for math constants `math_constants.h` - -### **Omnitrace** (1.11.2) - -#### Known issues - -Perfetto can no longer open Omnitrace proto files. Loading Perfetto trace output `.proto` files in the latest version of `ui.perfetto.dev` can result in a dialog with the message, "Oops, something went wrong! Please file a bug." The information in the dialog will refer to an "Unknown field type." The workaround is to open the files with the previous version of the Perfetto UI found at [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/). - -See [issue #3767](https://github.com/ROCm/ROCm/issues/3767) on GitHub. - -### **RCCL** (2.20.5) - -#### Known issues - -On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, Direct Memory Access (DMA) transfers between the GPU and NIC are disabled and impacts multi-node RCCL performance. -This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer. -Older RCCL versions are also impacted. - -This issue will be addressed in a future ROCm release. - -See [issue #3772](https://github.com/ROCm/ROCm/issues/3772) on GitHub. - -### **rocAL** (2.0.0) - -#### Changed - -* The new version of rocAL introduces many new features, but does not modify any of the existing public API functions.However, the version number was incremented from 1.3 to 2.0. - Applications linked to version 1.3 must be recompiled to link against version 2.0. -* Added development and test packages. -* Added C++ rocAL audio unit test and Python script to run and compare the outputs. -* Added Python support for audio decoders. -* Added Pytorch iterator for audio. -* Added Python audio unit test and support to verify outputs. -* Added rocDecode for HW decode. 
-* Added support for: - * Audio loader and decoder, which uses libsndfile library to decode wav files - * Audio augmentation - PreEmphasis filter, Spectrogram, ToDecibels, Resample, NonSilentRegionDetection, MelFilterBank - * Generic augmentation - Slice, Normalize - * Reading from file lists in file reader - * Downmixing audio channels during decoding - * TensorTensorAdd and TensorScalarMultiply operations - * Uniform and Normal distribution nodes -* Image to tensor updates -* ROCm install - use case graphics removed - -#### Known issues - -* Dependencies are not installed with the rocAL package installer. Dependencies must be installed with the prerequisite setup script provided. See the [rocAL README on GitHub](https://github.com/ROCm/rocAL/blob/docs/6.2.1/README.md#prerequisites-setup-script) for details. - -### **rocBLAS** (4.2.1) - -#### Removed - -* Removed Device_Memory_Allocation.pdf link in documentation. - -#### Resolved issues - -* Fixed error/warning message during `rocblas_set_stream()` call. - -### **rocFFT** (1.0.29) - -#### Optimized - -* Implemented 1D kernels for factorizable sizes less than 1024. - -### **ROCm SMI** (7.3.0) - -#### Optimized - -* Improved handling of UnicodeEncodeErrors with non UTF-8 locales. Non UTF-8 locales were causing crashes on UTF-8 special characters. - -#### Resolved issues - -* Fixed an issue where the Compute Partition tests segfaulted when AMDGPU was loaded with optional parameters. - -#### Known issues - -* When setting CPX as a partition mode, there is a DRM node limit of 64. This is a known limitation when multiple drivers are using the DRM nodes. The `ls /sys/class/drm` command can be used to see the number of DRM nodes, and the following steps can be used to remove unnecessary drivers: - - 1. Unload AMDGPU: `sudo rmmod amdgpu`. - 2. Remove any unnecessary drivers using `rmmod`. For example, to remove an AST driver, run `sudo rmmod ast`. - 3. Reload AMDGPU using `modprobe`: `sudo modprobe amdgpu`. - -### **rocPRIM** (3.2.1) - -#### Optimized - -* Improved performance of `block_reduce_warp_reduce` when warp size equals block size. - -## ROCm 6.2.1 - -See the [ROCm 6.2.1 release notes](https://rocm.docs.amd.com/en/docs-6.2.1/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (24.6.3) - -#### Changes - -* Added `amd-smi static --ras` on Guest VMs. Guest VMs can view enabled/disabled RAS features on Host cards. - -#### Removals - -* Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs. Guest VMs do not support getting current ECC counts from the Host cards. - -#### Resolved issues - -* Fixed TypeError in `amd-smi process -G`. -* Updated CLI error strings to handle empty and invalid GPU/CPU inputs. -* Fixed Guest VM showing passthrough options. -* Fixed firmware formatting where leading 0s were missing. - -### **HIP** (6.2.1) - -#### Resolved issues - -* Soft hang when using `AMD_SERIALIZE_KERNEL` -* Memory leak in `hipIpcCloseMemHandle` - -### **HIPIFY** (18.0.0) - -#### Changes - -* Added CUDA 12.5.1 support. -* Added cuDNN 9.2.1 support. -* Added LLVM 18.1.8 support. -* Added `hipBLAS` 64-bit APIs support. -* Added Support for math constants `math_constants.h`. - -### **Omniperf** (2.0.1) - -#### Changes - -* Enabled rocprofv1 for MI300 hardware. -* Added dependency checks on application launch. -* Updated Omniperf packaging. -* Rolled back Grafana version in Dockerfile for Angular plugin compatibility. -* Added GPU model distinction for MI300 systems. 
-* Refactored and updated documemtation. - -#### Resolved issues - -* Fixed an issue with analysis output. -* Fixed issues with profiling multi-process and multi-GPU applications. - -#### Optimizations - -* Reduced running time of Omniperf when profiling. -* Improved console logging. - -### **Omnitrace** (1.11.2) - -#### Known issues - -Perfetto can no longer open Omnitrace proto files. Loading Perfetto trace output `.proto` files in the latest version of `ui.perfetto.dev` can result in a dialog with the message, "Oops, something went wrong! Please file a bug." The information in the dialog will refer to an "Unknown field type." The workaround is to open the files with the previous version of the Perfetto UI found at [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/). - -See [issue #3767](https://github.com/ROCm/ROCm/issues/3767) on GitHub. - -### **RCCL** (2.20.5) - -#### Known issues - -On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, Direct Memory Access (DMA) transfers between the GPU and NIC are disabled and impacts multi-node RCCL performance. -This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer. -Older RCCL versions are also impacted. - -This issue will be addressed in a future ROCm release. - -See [issue #3772](https://github.com/ROCm/ROCm/issues/3772) on GitHub. - -### **rocAL** (2.0.0) - -#### Changed - -* The new version of rocAL introduces many new features, but does not modify any of the existing public API functions.However, the version number was incremented from 1.3 to 2.0. - Applications linked to version 1.3 must be recompiled to link against version 2.0. -* Added development and test packages. -* Added C++ rocAL audio unit test and Python script to run and compare the outputs. -* Added Python support for audio decoders. -* Added Pytorch iterator for audio. -* Added Python audio unit test and support to verify outputs. -* Added rocDecode for HW decode. -* Added support for: - * Audio loader and decoder, which uses libsndfile library to decode wav files - * Audio augmentation - PreEmphasis filter, Spectrogram, ToDecibels, Resample, NonSilentRegionDetection, MelFilterBank - * Generic augmentation - Slice, Normalize - * Reading from file lists in file reader - * Downmixing audio channels during decoding - * TensorTensorAdd and TensorScalarMultiply operations - * Uniform and Normal distribution nodes -* Image to tensor updates -* ROCm install - use case graphics removed - -#### Known issues - -* Dependencies are not installed with the rocAL package installer. Dependencies must be installed with the prerequisite setup script provided. See the [rocAL README on GitHub](https://github.com/ROCm/rocAL/blob/docs/6.2.1/README.md#prerequisites-setup-script) for details. - -### **rocBLAS** (4.2.1) - -#### Removed - -* Removed Device_Memory_Allocation.pdf link in documentation. - -#### Resolved issues - -* Fixed error/warning message during `rocblas_set_stream()` call. - -### **rocFFT** (1.0.29) - -#### Optimized - -* Implemented 1D kernels for factorizable sizes less than 1024. - -### **ROCm SMI** (7.3.0) - -#### Optimized - -* Improved handling of UnicodeEncodeErrors with non UTF-8 locales. Non UTF-8 locales were causing crashes on UTF-8 special characters. - -#### Resolved issues - -* Fixed an issue where the Compute Partition tests segfaulted when AMDGPU was loaded with optional parameters. 
- -#### Known issues - -* When setting CPX as a partition mode, there is a DRM node limit of 64. This is a known limitation when multiple drivers are using the DRM nodes. The `ls /sys/class/drm` command can be used to see the number of DRM nodes, and the following steps can be used to remove unnecessary drivers: - - 1. Unload AMDGPU: `sudo rmmod amdgpu`. - 2. Remove any unnecessary drivers using `rmmod`. For example, to remove an AST driver, run `sudo rmmod ast`. - 3. Reload AMDGPU using `modprobe`: `sudo modprobe amdgpu`. - -### **rocPRIM** (3.2.1) - -#### Optimized - -* Improved performance of `block_reduce_warp_reduce` when warp size equals block size. - -## ROCm 6.2.0 - -See the [ROCm 6.2.0 release notes](https://rocm.docs.amd.com/en/docs-6.2.0/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (24.6.2) - -#### Changed - -- Added the following functionality: - - `amd-smi dmon` is now available as an alias to `amd-smi monitor`. - - An optional process table under `amd-smi monitor -q`. - - Handling to detect VMs with passthrough configurations in CLI tool. - - Process Isolation and Clear SRAM functionality to the CLI tool for VMs. - - Added Ring Hang event. -- Added macros that were in `amdsmi.h` to the AMD SMI Python library `amdsmi_interface.py`. -- Renamed `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`. - -#### Removed - -- Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported. -- Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`. - -#### Optimized - -- Updated CLI error strings to specify invalid device type queried. -- Multiple structure updates in `amdsmi.h` and `amdsmi_interface.py` to align with host/guest. - - Added `amdsmi.h` and `amdsmi_interface.py`. - - `amdsmi_clk_info_t` struct - - Added `AMDSMI` prefix to multiple structures. -- Updated `dpm_policy` references to `soc_pstate`. -- Updated `amdsmi_get_gpu_board_info()` product_name to fallback to `pciids` file. -- Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`. -- Updated CLI voltage curve command output. - -#### Resolved issues - -- Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV format. -- `amdsmi_get_gpu_board_info()` no longer returns junk character strings. -- Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`. -- Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool. When running - `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any - other subcommand that access the GPUs) a mutex would lock and not return requiring either a - clear of the mutex in `/dev/shm` or rebooting the machine. - -#### Known issues - -- `amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not work. - They will be supported in a future release. - -See [issue #3500](https://github.com/ROCm/ROCm/issues/3500) on GitHub. - -> [!NOTE] -> See the [detailed AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/docs/6.2.0/CHANGELOG.md) on GitHub for more information. - -### **Composable Kernel** (1.1.0) - -#### Changed - -- Added support for: - - Permute scale for any dimension (#1198). - - Combined elementwise op (#1217). - - Multi D in grouped convolution backward weight (#1280). - - K or C equal to 1 for `fp16` in grouped convolution backward weight (#1280). - - Large batch in grouped convolution forward (#1332). 
-- Added `CK_TILE` layernorm example (#1339). -- `CK_TILE`-based Flash Attention 2 kernel is now merged into the upstream repository as ROCm backend. - -#### Optimized - -- Support universal GEMM in grouped convolution forward (#1320). -- Optimizations for low M and N in grouped convolution backward weight (#1303). -- Added a functional enhancement and compiler bug fix for FlashAttention Forward Kernel. -- `FP8` GEMM performance optimization and tuning (#1384). -- Added FlashAttention backward pass performance optimization (#1397). - -### **HIP** (6.2.0) - -#### Changed - -- Added the `_sync()` version of crosslane builtins such as `shfl_sync()`, `__all_sync()` and `__any_sync()`. These take - a 64-bit integer as an explicit mask argument. - - In HIP 6.2, these are hidden behind the preprocessor macro `HIP_ENABLE_WARP_SYNC_BUILTINS`, and will be enabled - unconditionally in a future HIP release. - -- Added new HIP APIs: - - `hipGetProcAddress` returns the pointer to driver function, corresponding to the defined driver function symbol. - - `hipGetFuncBySymbol` returns the pointer to device entry function that matches entry function `symbolPtr`. - - `hipStreamBeginCaptureToGraph` begins graph capture on a stream to an existing graph. - - `hipGraphInstantiateWithParams` creates an executable graph from a graph. - -- Added a new flag `integrated` -- supported in device property. - - - The integrated flag is added in the struct `hipDeviceProp_t`. On the integrated APU system, the runtime driver - detects and sets this flag to `1`, in which case the API `hipDeviceGetAttribute` returns enum `hipDeviceAttribute_t` for - `hipDeviceAttributeIntegrated` as value 1, for integrated GPU device. - -- Added initial support for 8-bit floating point datatype in `amd_hip_fp8.h`. These are accessible via `#include `. - -- Added UUID support for environment variable `HIP_VISIBLE_DEVICES`. - -#### Resolved issues - -- Fixed stream capture support in HIP graphs. Prohibited and unhandled operations are fixed during stream capture in the HIP runtime. -- Fixed undefined symbol error for `hipTexRefGetArray` and `hipTexRefGetBorderColor`. - -#### Upcoming changes - -- The `_sync()` version of crosslane builtins such as `shfl_sync()`, `__all_sync()`, and `__any_sync()` will be enabled unconditionally in a future HIP release. - -### **hipBLAS** (2.2.0) - -#### Changed - -* Added a new ILP64 API for level 2 functions for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments. -* Added a new ILP64 API for level 1 `_ex` functions. - -* The `install.sh` script now invokes the `rmake.py` script. Made other various improvements to the build scripts. -* Changed library dependencies in the `install.sh` script from `rocblas` and `rocsolver` to the development packages - `rocblas-dev` and `rocsolver-dev`. -* Updated Linux AOCL dependency to release 4.2 `gcc` build. -* Updated Windows `vcpkg` dependencies to release 2024.02.14. - -### **hipBLASLt** (0.8.0) - -#### Changed - -* Added extension APIs: - *`hipblasltExtAMaxWithScale`. - * `GemmTuning` extension parameter to set `wgm` by user. -* Added support for: - * `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for `FP8`/`BF8` datatype. - * `FP8`/`BF8` input, `FP32/FP16/BF16/F8/BF8` output (gfx94x platform only). - * `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for `FP16` input data type to use `FP8`/`BF8` MFMA. -* Added support for gfx110x. - -#### Optimized - -* Improved library loading time. 
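-
-The HIP 6.2.0 `integrated` device property described above is read through `hipDeviceGetAttribute`. The following is a minimal sketch (not part of the original release notes) that reports whether each visible device is an integrated APU; error handling is reduced to simple status checks for brevity:
-
-```{code-block} cpp
-#include <hip/hip_runtime.h>
-#include <cstdio>
-
-int main() {
-    int deviceCount = 0;
-    if (hipGetDeviceCount(&deviceCount) != hipSuccess || deviceCount == 0) {
-        std::fprintf(stderr, "No HIP devices found\n");
-        return 1;
-    }
-    for (int id = 0; id < deviceCount; ++id) {
-        int integrated = 0;
-        // hipDeviceAttributeIntegrated is reported as 1 on integrated (APU) devices, 0 otherwise.
-        if (hipDeviceGetAttribute(&integrated, hipDeviceAttributeIntegrated, id) == hipSuccess) {
-            std::printf("device %d: integrated = %d\n", id, integrated);
-        }
-    }
-    return 0;
-}
-```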
- -### **HIPCC** (1.1.1) - -#### Changed - -* Split `hipcc` package into two packages for different hardware platforms. - -* Cleaned up references to environment variables. - -* Enabled `hipcc` and `hipconfig` binaries (`hipcc.bin`, `hipconfig.bin`) by - default, instead of their Perl counterparts. - -* Enabled function calls. - -* Added support for generating packages for ROCm stack targeting static libraries. - -#### Resolved issues - -* Implemented numerous bug fixes and quality improvements. - -### **hipCUB** (3.2.0) - -#### Changed - -* Added `DeviceCopy` function for parity with CUB. -* Added `enum WarpExchangeAlgorithm` to the rocPRIM backend, which is used as - the new optional template argument for `WarpExchange`. - * The potential values for the enum are `WARP_EXCHANGE_SMEM` and - `WARP_EXCHANGE_SHUFFLE`. - * `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while - `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations. - * `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, - but the `ItemsPerThread` must be a divisor of `WarpSize`. -* Added `tuple.hpp` which defines templates `hipcub::tuple`, - `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`. -* Added new overloaded member functions to `BlockRadixSort` and - `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type - (`key_type`) can be sorted via these overloads, if an appropriate decomposer - is passed. The decomposer has to implement `operator(const key_type&)` which - returns a `hipcub::tuple` of references pointing to members of `key_type`. - -* On AMD GPUs (using the HIP backend), you can now issue hipCUB API calls inside of - HIP graphs, with several exceptions: - * `CachingDeviceAllocator` - * `GridBarrierLifetime` - * `DeviceSegmentedRadixSort` - * `DeviceRunLengthEncode` - Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of HIP graphs. - -#### Removed - -* Deprecated `debug_synchronous` in hipCUB-2.13.2, and it no longer has any effect. With this release, passing `debug_synchronous` - to the device functions results in a deprecation warning both at runtime and at compile time. - * The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time - by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition. - * The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition. - -#### Resolved issues - -* Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. - It now derives the accumulator type as the result of the binary operator. - -### **hipFFT** (1.0.15) - -#### Resolved issues - -* Added `hip::host` as a public link library, as `hipfft.h` includes HIP runtime headers. -* Prevented C++ exceptions leaking from public API functions. -* Made output of `hipfftXt` match `cufftXt` in geometry and alignment for 2D and 3D FFTs. - -### **HIPIFY** (18.0.0) - -#### Changed - -- Added support for: - - NVIDIA CUDA 12.4.1 - - cuDNN 9.1.1 - - LLVM 18.1.6 -- Added full hipBLASLt support. 
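-
-As an illustration of the full hipBLASLt support added to HIPIFY above, the short sketch below is not from the release notes: the CUDA calls shown in the comments are what `hipify-perl` or `hipify-clang` would translate into the hipBLASLt equivalents that follow (handle creation only; the matmul setup is left out):
-
-```{code-block} cpp
-// CUDA source (before hipification):
-//   #include <cublasLt.h>
-//   cublasLtHandle_t handle;
-//   cublasLtCreate(&handle);
-//   ...
-//   cublasLtDestroy(handle);
-
-// Hipified result using hipBLASLt:
-#include <hipblaslt/hipblaslt.h>
-
-int main() {
-    hipblasLtHandle_t handle;
-    if (hipblasLtCreate(&handle) != HIPBLAS_STATUS_SUCCESS) {
-        return 1;
-    }
-    // ... create matmul descriptors and call hipblasLtMatmul() here ...
-    hipblasLtDestroy(handle);
-    return 0;
-}
-```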
-
-#### Resolved issues
-
-- HIPIFY now applies `reinterpret_cast` for an explicit conversion between pointer-to-function and pointer-to-object;
-  affected functions: `hipFuncGetAttributes`, `hipFuncSetAttribute`, `hipFuncSetCacheConfig`, `hipFuncSetSharedMemConfig`, `hipLaunchKernel`, and `hipLaunchCooperativeKernel`.
-
-### **hipRAND** (2.11.0)
-
-#### Changed
-
-* Added support for setting generator output ordering in the C and C++ API.
-* `hiprandCreateGeneratorHost` dispatches to the host generator in the rocRAND backend instead of returning `HIPRAND_STATUS_NOT_IMPLEMENTED` (a short usage sketch follows after the hipSPARSELt entry below).
-* Added options to create:
-  * A host generator to the Fortran wrapper.
-  * A host generator to the Python wrapper.
-* Renamed the environment variable used for internal testing with HMM from `ROCRAND_USE_HMM` to `HIPRAND_USE_HMM`.
-* Static library -- moved all internal symbols to namespaces to avoid potential symbol name collisions when linking.
-* Device API documentation is improved in this version.
-
-#### Removed
-
-* Removed the option to build hipRAND as a submodule to rocRAND.
-* Removed references to, and workarounds for, the deprecated `hcc`.
-* Removed support for finding rocRAND based on the environment variable `ROCRAND_DIR`.
-  Use `ROCRAND_PATH` instead.
-
-#### Resolved issues
-
-* Fixed a build error when using Clang++ directly due to unsupported references to `amdgpu-target`.
-
-### **hipSOLVER** (2.2.0)
-
-#### Changed
-
-- Added compatibility-only functions:
-  - `auxiliary`
-    - `hipsolverDnCreateParams`, `hipsolverDnDestroyParams`, `hipsolverDnSetAdvOptions`
-  - `getrf`
-    - `hipsolverDnXgetrf_bufferSize`
-    - `hipsolverDnXgetrf`
-  - `getrs`
-    - `hipsolverDnXgetrs`
-- Added support for building on Ubuntu 24.04 and CBL-Mariner.
-- Added `hip::host` to `roc::hipsolver` usage requirements.
-- Added functions:
-  - `syevdx`/`heevdx`
-    - `hipsolverSsyevdx_bufferSize`, `hipsolverDsyevdx_bufferSize`, `hipsolverCheevdx_bufferSize`, `hipsolverZheevdx_bufferSize`
-    - `hipsolverSsyevdx`, `hipsolverDsyevdx`, `hipsolverCheevdx`, `hipsolverZheevdx`
-  - `sygvdx`/`hegvdx`
-    - `hipsolverSsygvdx_bufferSize`, `hipsolverDsygvdx_bufferSize`, `hipsolverChegvdx_bufferSize`, `hipsolverZhegvdx_bufferSize`
-    - `hipsolverSsygvdx`, `hipsolverDsygvdx`, `hipsolverChegvdx`, `hipsolverZhegvdx`
-
-- Updated `csrlsvchol` to perform numerical factorization on the GPU. The symbolic factorization is still performed on the CPU.
-- Renamed `hipsolver-compat.h` to `hipsolver-dense.h`.
-
-#### Removed
-
-- Removed dependency on `cblas` from the hipSOLVER test and benchmark clients.
-
-### **hipSPARSE** (3.1.1)
-
-#### Changed
-
-* Added the missing `hipsparseCscGet()` routine.
-
-* All internal hipSPARSE functions now exist inside a namespace.
-* Matched deprecations found in cuSPARSE 12.x.x when using the cuSPARSE backend.
-* Improved the user manual and contribution guidelines.
-
-#### Resolved issues
-
-* Fixed `SpGEMM` and `SpGEMM_reuse` routines that were not matching cuSPARSE behavior.
-
-#### Known issues
-
-* In `hipsparseSpSM_solve()`, the external buffer is currently passed as a parameter. This does not match the cuSPARSE API, and the extra external buffer parameter will be removed in a future release. For now, this extra parameter can be ignored and a `nullptr` passed, as it is unused internally by `hipsparseSpSM_solve()`.
-
-### **hipSPARSELt** (0.2.1)
-
-#### Optimized
-
-* Refined test cases.
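-
-As a usage sketch for the hipRAND host generator noted above (`hiprandCreateGeneratorHost`), the following example generates uniform floats on the CPU. It is illustrative only and not taken from the release notes:
-
-```{code-block} cpp
-#include <hiprand/hiprand.h>
-#include <cstdio>
-#include <vector>
-
-int main() {
-    const size_t n = 16;
-    std::vector<float> data(n);
-
-    hiprandGenerator_t gen;
-    // Create a host (CPU) generator; this now dispatches to the rocRAND host backend.
-    if (hiprandCreateGeneratorHost(&gen, HIPRAND_RNG_PSEUDO_XORWOW) != HIPRAND_STATUS_SUCCESS) {
-        std::fprintf(stderr, "failed to create host generator\n");
-        return 1;
-    }
-    hiprandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
-
-    // With a host generator, the output pointer is ordinary host memory.
-    hiprandGenerateUniform(gen, data.data(), n);
-
-    for (float v : data) std::printf("%f\n", v);
-
-    hiprandDestroyGenerator(gen);
-    return 0;
-}
-```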
- -### **hipTensor** (1.3.0) - -#### Changed - -* Added support for: - * Tensor permutation of ranks of 2, 3, 4, 5, and 6 - * Tensor contraction of M6N6K6: M, N, K up to rank 6 -* Added tests for: - * Tensor permutation of ranks of 2, 3, 4, 5, and 6 - * Tensor contraction of M6N6K6: M, N, K up to rank 6 - * YAML parsing to support sequential parameters ordering. -* Prefer `amd-llvm-devel` package before system LLVM library. -* Preferred compilers changed to `CC=amdclang` `CXX=amdclang++`. -* Updated actor-critic selection for new contraction kernel additions. -* Updated installation, programmer's guide, and API reference documentation. - -#### Resolved issues - -* Fixed LLVM parsing crash. -* Fixed memory consumption issue in complex kernels. -* Workaround implemented for compiler crash during debug build. -* Allow random modes ordering for tensor contractions. - -### **llvm-project** (18.0.0) - -#### Changed - -* LLVM IR - - * The `llvm.stacksave` and `llvm.stackrestore` intrinsics now use an overloaded pointer type to support non-0 address - spaces. - - * Added `llvm.exp10` intrinsic. - -* LLVM infrastructure - - * The minimum Clang version to build LLVM in C++20 configuration is now `clang-17.0.6`. - -* TableGen - - * Added constructs for debugging TableGen files: - - * `dump` keyword to dump messages to standard error. See [#68793](https://github.com/llvm/llvm-project/pull/68793). - - * `!repr` bang operator to inspect the content of values. See [#68716](https://github.com/llvm/llvm-project/pull/68716). - -* AArch64 backend - - * Added support for Cortex-A520, Cortex-A720 and Cortex-X4 CPUs. - -* AMDGPU backend - - * `llvm.sqrt.f32` is now lowered correctly. Use `llvm.amdgcn.sqrt.f32` for raw instruction access. - - * Implemented `llvm.stacksave` and `llvm.stackrestore` intrinsics. - - * Implemented `llvm.get.rounding`. - -* ARM backend - - * Added support for Cortex-M52 CPUs. - - * Added execute-only support for Armv6-M. - -* RISC-V backend - - * The `Zfa` extension version was upgraded to 1.0 and is no longer experimental. - - * `Zihintntl` extension version was upgraded to 1.0 and is no longer experimental. - - * Intrinsics were added for `Zk*`, `Zbb`, and `Zbc`. See - [Scalar Bit Manipulation Extension Intrinsics](https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/src/c-api.adoc#scalar-bit-manipulation-extension-intrinsics) in the RISC-V C API specification. - - * Default ABI with F but without D was changed to ilp32f for RV32 and to lp64f for RV64. - - * The `Zvbb`, `Zvbc`, `Zvkb`, `Zvkg`, `Zvkn`, `Zvknc`, `Zvkned`, `Zvkng`, `Zvknha`, `Zvknhb`, `Zvks`, `Zvksc`, - `Zvksed`, `Zvksg`, `Zvksh`, and `Zvkt` extension version was upgraded to 1.0 and is no longer experimental. However, - the C intrinsics for these extensions are still experimental. To use the C intrinsics for these extensions, - `-menable-experimental-extensions` needs to be passed to Clang. - - * `-mcpu=sifive-p450` was added. - - * CodeGen of `RV32E` and `RV64E` is supported experimentally. - - * CodeGen of `ilp32e` and `lp64e` is supported experimentally. - -* X86 backend - - * Added support for the RDMSRLIST and WRMSRLIST instructions. - - * Added support for the WRMSRNS instruction. - - * Support ISA of AMX-FP16 which contains `tdpfp16ps` instruction. - - * Support ISA of CMPCCXADD. - - * Support ISA of AVX-IFMA. - - * Support ISA of AVX-VNNI-INT8. - - * Support ISA of AVX-NE-CONVERT. - - * `-mcpu=raptorlake`, `-mcpu=meteorlake` and `-mcpu=emeraldrapids` are now supported. 
- - * `-mcpu=sierraforest`, `-mcpu=graniterapids` and `-mcpu=grandridge` are now supported. - - * `__builtin_unpredictable` (unpredictable metadata in LLVM IR), is handled by X86 Backend. X86CmovConversion pass now - respects this builtin and does not convert CMOVs to branches. - - * Add support for the PBNDKB instruction. - - * Support ISA of SHA512. - - * Support ISA of SM3. - - * Support ISA of SM4. - - * Support ISA of AVX-VNNI-INT16. - - * `-mcpu=graniterapids-d` is now supported. - - * The `i128` type now matches GCC and clang’s `__int128` type. This mainly benefits external projects such as Rust - which aim to be binary compatible with C, but also fixes code generation where LLVM already assumed that the type - matched and called into `libgcc` helper functions. - - * Support ISA of USER_MSR. - - * Support ISA of AVX10.1-256 and AVX10.1-512. - - * `-mcpu=pantherlake` and `-mcpu=clearwaterforest` are now supported. - - * `-mapxf` is supported. - - * Marking global variables with `code_model = "small"/"large"` in the IR now overrides the global code model to allow - 32-bit relocations or require 64-bit relocations to the global variable. - - * The medium code model’s code generation was audited to be more similar to the small code model where possible. - -* C API - - * Added `LLVMGetTailCallKind` and `LLVMSetTailCallKind` to allow getting and setting `tail`, `musttail`, and `notail` attributes on call instructions. - - * Added `LLVMCreateTargetMachineWithOptions`, along with helper functions for an opaque option structure, as an - alternative to `LLVMCreateTargetMachine`. The option structure exposes an additional setting (that is, the target - ABI) and provides default values for unspecified settings. - - * Added `LLVMGetNNeg` and `LLVMSetNNeg` for getting and setting the new `nneg` flag on zext instructions, and - `LLVMGetIsDisjoint` and `LLVMSetIsDisjoint` for getting and setting the new disjoint flag on or instructions. - - * Added the following functions for manipulating operand bundles, as well as building call and invoke instructions - that use operand bundles: - - * `LLVMBuildCallWithOperandBundles` - - * `LLVMBuildInvokeWithOperandBundles` - - * `LLVMCreateOperandBundle` - - * `LLVMDisposeOperandBundle` - - * `LLVMGetNumOperandBundles` - - * `LLVMGetOperandBundleAtIndex` - - * `LLVMGetNumOperandBundleArgs` - - * `LLVMGetOperandBundleArgAtIndex` - - * `LLVMGetOperandBundleTag` - - * Added `LLVMGetFastMathFlags` and `LLVMSetFastMathFlags` for getting and setting the fast-math flags of an - instruction, as well as `LLVMCanValueUseFastMathFlags` for checking if an instruction can use such flag. - -* CodeGen infrastructure - - * A new debug type `isel-dump` is added to show only the SelectionDAG dumps after each ISel phase (i.e. - `-debug-only=isel-dump`). This new debug type can be filtered by function names using - `-filter-print-funcs=`, the same flag used to filter IR dumps after each Pass. Note that the - existing `-debug-only=isel` will take precedence over the new behavior and print SelectionDAG dumps of every single - function regardless of `-filter-print-funcs`’s values. - -* Metadata info - - * Added a new loop metadata `!{!”llvm.loop.align”, i32 64}`. - -* LLVM tools - - * `llvm-symbolizer` now treats invalid input as an address for which source information is not found. - - * `llvm-readelf` now supports `--extra-sym-info` (-X) to display extra information (section name) when showing - symbols. 
-
-  * `llvm-readobj --elf-output-style=JSON` no longer prefixes each JSON object with the file name. Previously, each
-    object file’s output looked like `"main.o":{"FileSummary":{"File":"main.o"},...}` but is now
-    `{"FileSummary":{"File":"main.o"},...}`. This allows each JSON object to be parsed in the same way, since each
-    object no longer has a unique key. Tools that consume `llvm-readobj`’s JSON output should update their parsers
-    accordingly.
-
-  * `llvm-objdump` now uses `--print-imm-hex` by default, which brings its default behavior closer in line with `objdump`.
-
-  * `llvm-nm` now supports the `--line-numbers` (`-l`) option to use debugging information to print symbols’ filenames and line numbers.
-
-  * `llvm-symbolizer` and `llvm-addr2line` now support addresses specified as symbol names.
-
-  * `llvm-objcopy` now supports `--gap-fill` and `--pad-to` options, for ELF input and binary output files only.
-
-* LLDB
-
-  * The `SBType::FindDirectNestedType` function has been added. It’s useful for formatters to quickly find a directly nested type
-    when it’s known where to search for it, avoiding a more expensive global search via `SBTarget::FindFirstType`.
-
-  * Renamed `lldb-vscode` to `lldb-dap` and updated its installation instructions to reflect this. The underlying
-    functionality remains unchanged.
-
-  * The `mte_ctrl` register can now be read from AArch64 Linux core files.
-
-  * LLDB on AArch64 Linux now supports debugging the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2
-    (SME2) for both live processes and core files. For details refer to the AArch64 Linux documentation.
-
-  * LLDB now supports symbol and binary acquisition automatically using the DEBUGINFOD protocol. The standard mechanism
-    of specifying DEBUGINFOD servers in the DEBUGINFOD_URLS environment variable is used by default. In addition, users
-    can specify servers to request symbols from using the LLDB setting `plugin.symbol-locator.debuginfod.server_urls`,
-    overriding or adding to the environment variable.
-
-  * When running on AArch64 Linux, `lldb-server` now provides register field information for the following registers:
-    `cpsr`, `fpcr`, `fpsr`, `svcr` and `mte_ctrl`.
-
-* Sanitizers
-
-  * HWASan now defaults to detecting use-after-scope bugs.
-
-#### Removals
-
-* LLVM IR
-
-  * The constant expression variants of the following instructions have been removed:
-
-    * `and`
-
-    * `or`
-
-    * `lshr`
-
-    * `ashr`
-
-    * `zext`
-
-    * `sext`
-
-    * `fptrunc`
-
-    * `fpext`
-
-    * `fptoui`
-
-    * `fptosi`
-
-    * `uitofp`
-
-    * `sitofp`
-
-* RISC-V backend
-
-  * The XSfcie extension and the SiFive CSRs and instructions that were associated with it have been removed. None of these CSRs and
-    instructions were part of the “SiFive Custom Instruction Extension”. The LLVM project needs to work with
-    SiFive to define and document real extension names for individual CSRs and instructions.
-
-* Python bindings
-
-  * The Python bindings have been removed.
-
-* C API
-
-  * The following functions for creating constant expressions have been removed, because the underlying constant
-    expressions are no longer supported.
-    Instead, an instruction should be created using the `LLVMBuildXYZ` APIs, which
-    will constant fold the operands if possible and create an instruction otherwise:
-
-    * `LLVMConstAnd`
-
-    * `LLVMConstOr`
-
-    * `LLVMConstLShr`
-
-    * `LLVMConstAShr`
-
-    * `LLVMConstZExt`
-
-    * `LLVMConstSExt`
-
-    * `LLVMConstZExtOrBitCast`
-
-    * `LLVMConstSExtOrBitCast`
-
-    * `LLVMConstIntCast`
-
-    * `LLVMConstFPTrunc`
-
-    * `LLVMConstFPExt`
-
-    * `LLVMConstFPToUI`
-
-    * `LLVMConstFPToSI`
-
-    * `LLVMConstUIToFP`
-
-    * `LLVMConstSIToFP`
-
-    * `LLVMConstFPCast`
-
-* CodeGen infrastructure
-
-  * `PrologEpilogInserter` no longer supports register scavenging during forwards frame index elimination. Targets
-    should use backwards frame index elimination instead.
-
-  * `RegScavenger` no longer supports forwards register scavenging. Clients should use backwards register scavenging
-    instead, which is preferred because it does not depend on accurate kill flags.
-
-* LLDB
-
-  * `SBWatchpoint::GetHardwareIndex` is deprecated and now returns `-1` to indicate the index is unavailable.
-
-  * Methods in `SBHostOS` related to threads have had their implementations removed. These methods will return a value
-    indicating failure.
-
-#### Resolved issues
-
-* AArch64 backend
-
-  * Neoverse-N2 was incorrectly marked as an Armv8.5a core. This has been changed to an Armv9.0a core. However, crypto
-    options are not enabled by default for Armv9 cores, so `-mcpu=neoverse-n2+crypto` is now required to enable crypto for
-    this core. As far as the compiler is concerned, Armv9.0a has the same features enabled as Armv8.5a, with the
-    exception of crypto.
-
-* Windows target
-
-  * The LLVM filesystem class `UniqueID` and the function `equivalent()` no longer determine that different path
-    names for the same hard-linked file are equal. This is an intentional tradeoff in a bug fix, where the bug
-    used to cause distinct files to be considered equivalent on some file systems. This change fixed the GitHub issues
-    [#61401](https://github.com/llvm/llvm-project/issues/61401) and [#22079](https://github.com/llvm/llvm-project/issues/22079).
-
-#### Known issues
-
-The compiler may incorrectly compile a program that uses the
-`__shfl(var, srcLane, width)` function when one of its parameters is undefined along some path to the function. For most functions,
-uninitialized inputs cause undefined behavior.
-
-> [!NOTE]
-> The `-Wall` compilation flag prompts the compiler to generate a warning if a variable is uninitialized along some path.
-
-As a workaround, initialize the parameters to `__shfl`. For example:
-
-```{code-block} cpp
-unsigned long istring = 0; // Initialize the input to __shfl
-return __shfl(istring, 0, 64);
-```
-
-See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub.
-
-### **MIGraphX** (2.10.0)
-
-#### Changed
-
-- Added support for ONNX Runtime MIGraphX EP on Windows.
-- Added `FP8` Python API.
-- Added examples for SD 2.1 and SDXL.
-- Added support for BERT to Dynamic Batch.
-- Added a `--test` flag in `migraphx-driver` to validate the installation.
-- Added support for the ONNX operator Einsum.
-- Added `uint8` support in ONNX operators.
-- Added Split-k kernel configurations for performance improvements.
-- Added fusion for group convolutions.
-- Added rocMLIR conv3d support.
-- Added rocgdb to the Dockerfile.
-- Changed the default location of libraries with release-specific ABI changes.
-- Reorganized documentation in GitHub.
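-
-Several of the MIGraphX items above (the new ONNX operator and `uint8` support, and the added fusions) are exercised through the usual parse/compile/eval flow. The sketch below is illustrative only and is not part of the release notes; it assumes the C++ API exposed by `migraphx/migraphx.hpp` (`parse_onnx`, `target`, `compile`, `eval`) and a hypothetical `model.onnx` path:
-
-```{code-block} cpp
-#include <migraphx/migraphx.hpp>
-#include <iostream>
-
-int main() {
-    // Parse a (hypothetical) ONNX model and compile it for the GPU target.
-    migraphx::program prog = migraphx::parse_onnx("model.onnx");
-    prog.compile(migraphx::target("gpu"));
-
-    // Bind generated inputs for each model parameter, then evaluate.
-    migraphx::program_parameters params;
-    auto shapes = prog.get_parameter_shapes();
-    for (auto&& name : shapes.names()) {
-        params.add(name, migraphx::argument::generate(shapes[name]));
-    }
-    auto outputs = prog.eval(params);
-    std::cout << "outputs: " << outputs.size() << "\n";
-    return 0;
-}
-```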
-
-#### Removed
-
-- Removed the `--model` flag from `migraphx-driver`.
-
-#### Optimized
-
-- Improved ONNX Model Zoo coverage.
-- Reorganized `memcpys` with ONNX Runtime to improve performance.
-- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast.
-- Improved MLIR kernel selection for multibroadcasted GEMMs.
-- Improved details of the perf report.
-- Enabled MLIR by default for GEMMs with small K.
-- Allowed specifying dot or convolution fusion for MLIR with an environment flag.
-- Improved performance on small reductions by doing multiple reductions per wavefront.
-- Added additional algebraic simplifications for mul-add-dot sequences of operations involving constants.
-- Used MLIR attention kernels in more cases.
-- Enabled MIOpen and CK fusions for MI300 gfx architectures.
-- Added support for QDQ quantization patterns from Brevitas, which have explicit cast/convert nodes before and after QDQ pairs.
-- Added fusion of "contiguous + pointwise" and "layout + pointwise" operations, which may result in performance gains in certain cases.
-- Added fusion for "pointwise + layout" and "pointwise + contiguous" operations, which may result in performance gains when using the NHWC layout.
-- Added fusion for the "pointwise + concat" operation, which may help performance in certain cases.
-- Fixed a bug in "concat + pointwise" fusion where the output shape memory layout wasn't maintained.
-- Simplified the "slice + concat" pattern in SDXL UNet.
-- Removed ZeroPoint/Shift in QuantizeLinear or DequantizeLinear ops if zero-point values are zero.
-- Improved inference performance by fusing Reduce to Broadcast.
-- Added additional information when printing the perf report.
-- Improved scalar fusions when not all strides are 0.
-- Added support for multiple outputs in pointwise ops.
-- Improved reduction fusion with reshape operators.
-- Reused the quantized output when an operator is used again.
-- Enabled Split-k GEMM perf configs for rocMLIR-based GEMM kernels for better performance on all hardware.
-
-#### Resolved issues
-
-- Fixed Super Resolution model verification failure with `FP16`.
-- Fixed confusing messages by suppressing them when compiling the model.
-- Fixed an issue with the mod operator when used with `int8` and `int32` inputs.
-- Fixed an issue by preventing too many threads from being spawned for constant propagation when parallel STL is not enabled.
-- Fixed a bug when running `migraphx-driver` with the `--run 1` option.
-- Fixed LayerNorm accuracy by performing calculations in `FP32`.
-- Updated the Docker generator script to ROCm 6.1 to point at Jammy.
-- Fixed a floating point exception for `dim (-1)` in the reshape operator.
-- Fixed an issue with `int8` accuracy and models that were failing due to requiring a fourth bias input.
-- Fixed missing inputs that were not previously handled for quantized bias for the weights and data values of the input matrix.
-- Fixed the order of operations for `int8` quantization, which was causing inaccuracies and slowdowns.
-- Fixed an issue during compilation caused by the incorrect constructor being used at compile time.
-  Removed the list initializer of `prefix_scan_sum`, which was causing issues during compilation.
-- Fixed the `MIGRAPHX_GPU_COMPILE_PARALLEL` flag to enable users to control the number of threads used for parallel compilation.
-
-### **MIOpen** (3.2.0)
-
-#### Changed
-
-- Added:
-  - [Conv] bilinear (alpha beta) solvers.
-  - [Conv] enable bf16 for ck-based solvers.
-  - [Conv] Add split_k tuning to 2d wrw ck-based solver.
-  - [MHA] graph API fp8 fwd.
-  - [RNN] multi-stream as default solution.
-- Added TunaNetv2.0 for MI300. -- Added Adam and AMP Adam optimizer. - -#### Resolved issues - -- Memory access fault caused by `GemmBwdRest`. -- Context configuration in `GetWorkSpaceSize`. -- Fixes to support huge tensors. - -#### Optimized - -- Find: improved precision of benchmarking. - -### **MIVisionX** (3.0.0) - -#### Changed - -- Added support for: - - Advanced GPUs - - PreEmphasis Filter augmentation in openVX extensions - - Spectrogram augmentation in openVX extensions - - Downmix and ToDecibels augmentations in openVX extensions - - Resample augmentation and Operator overloading nodes in openVX extensions - - NonSilentRegion and Slice augmentations in openVX extensions - - Mel-Filter bank and Normalize augmentations in openVX extensions - -#### Removed - -- Deprecated the use of rocAL for processing. rocAL is available at [https://github.com/ROCm/rocAL](https://github.com/ROCm/rocAL). - -#### Resolved issues - -- Fixed issues with dependencies. - -#### Known issues - -- MIVisionX package install requires manual prerequisites installation. - -### **Omniperf** (2.0.1) - -#### Known issues - -- Error when running Omniperf with an application with command line arguments. As a workaround, create an - intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This - issue is fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/rocprofiler-compute/issues/347). - -- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: - "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the - environment variable `export ROCPROF=rocprofv2`. - -- Omniperf's Python dependencies may not be installed with your ROCm installation, resulting in the following message: - - "*[ERROR] The 'dash>=1.12.0' package was not found in the current execution environment.* - - *[ERROR] The 'dash-bootstrap-components' package was not found in the current execution environment.* - - *Please verify all of the Python dependencies called out in the requirements file are installed locally prior to running omniperf.* - - *See: /opt/rocm-6.2.0/libexec/omniperf/requirements.txt*" - - As a workaround, install these Python requirements manually: `pip install /opt/rocm-6.2.0/libexec/omniperf/requirements.txt`. - -See [issue #3498](https://github.com/ROCm/ROCm/issues/3498) on GitHub. - -### **OpenMP** (17.0.0) - -#### Changed - -- Added basic experimental support for ``libc`` functions on the GPU via the - LLVM C Library for GPUs. -- Added minimal support for calling host functions from the device using the - `libc` interface. -- Added vendor agnostic OMPT callback support for OpenMP-based device offload. - -#### Removed - -- Removed the "old" device plugins along with support for the `remote` and - `ve` plugins. - -#### Resolved issues - -- Fixed the implementation of `omp_get_wtime` for AMDGPU targets. - -### **RCCL** (2.20.5) - -#### Changed - -- Added support for `fp8` and `rccl_bfloat8`. -- Added support for using HIP contiguous memory. -- Added ROC-TX for host-side profiling. -- Added new rome model. -- Added `fp16` and `fp8` cases to unit tests. -- Added a new unit test for main kernel stack size. -- Added the new `-n` option for `topo_expl` to override the number of nodes. -- Improved debug messages of memory allocations. -- Enabled static build. -- Enabled compatibility with: - - NCCL 2.20.5. - - NCCL 2.19.4. 
-- Performance tuning for some collective operations on MI300. -- Enabled NVTX code in RCCL. -- Replaced `rccl_bfloat16` with hip_bfloat16. -- NPKit updates: - - Removed warm-up iteration removal by default, need to opt in now. - - Doubled the size of buffers to accommodate for more channels. -- Modified rings to be rail-optimized topology friendly. - -#### Resolved issues - -- Fixed a bug when configuring RCCL for only LL128 protocol. -- Fixed scratch memory allocation after API change for MSCCL. - -### **rocAL** (1.0.0) - -#### Changed - -- Added tests and samples. - -#### Removed - -- Removed CuPy from `setup.py`. - - -#### Optimized - -- Added setup and install updates. - -#### Resolved issues - -- Minor bug fixes. - -### **rocALUTION** (3.2.0) - -#### Changed - -* Added new file I/O based on rocSPARSE I/O format. -* Added `GetConvergenceHistory` for ItILU0 preconditioner. - -#### Removed - -* Deprecated the following: - * `LocalMatrix::ReadFileCSR` - * `LocalMatrix::WriteFileCSR` - * `GlobalMatrix::ReadFileCSR` - * `GlobalMatrix::WriteFileCSR` - -### **rocBLAS** (4.2.0) - -#### Changed - -* Added Level 2 functions and level 3 `trsm` have additional ILP64 API for both C and FORTRAN (`_64` name suffix) with `int64_t` function arguments. -* Added cache flush timing for `gemm_batched_ex`, `gemm_strided_batched_ex`, and `axpy`. -* Added Benchmark class for common timing code. -* Added an environment variable `ROCBLAS_DEFAULT_ATOMICS_MODE`; to set default atomics mode during creation of `rocblas_handle`. -* Added support for single-precision (`fp32_r`) input and double-precision (`fp64_r`) output and compute types by extending `dot_ex`. - -* Updated Linux AOCL dependency to release 4.2 gcc build. -* Updated Windows vcpkg dependencies to release 2024.02.14. -* Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40. - -#### Optimized - -* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator. - -#### Removed - -* Deprecated `rocblas_gemm_ex3`, `gemm_batched_ex3` and `gemm_strided_batched_ex3`. They will be removed in the next - major release of rocBLAS. Refer to [hipBLASLt](https://github.com/ROCm/hipBLASLt) for future 8-bit float usage. - -### **ROCdbgapi** (0.76.0) - -#### Removed - -- Renamed `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_APERTURE_VIOLATION` to `(AMD_DBGAPI_EXCEPTION_WAVE,AMD_DBGAPI_WAVE_STOP_REASON)_ADDRESS_ERROR`. - The old names are still accessible but deprecated. - -### **rocDecode** (0.6.0) - -#### Changed - -- Added full H.264 support and bug fixes. - -### **rocFFT** (1.0.28) - -#### Changed - -* Randomly generated accuracy tests are now disabled by default. They can be enabled using - the `--nrand` option (which defaults to 0). - -#### Optimized - -* Implemented multi-device transform for 3D pencil decomposition. Contiguous dimensions on input and output bricks - are transformed locally, with global transposes to make remaining dimensions contiguous. - -### **rocm-cmake** (0.13.0) - -#### Changed - -- `ROCmCreatePackage` now accepts a suffix parameter, automatically generating it for static or ASAN builds. - - Package names are no longer pulled from `CPACK__PACKAGE_NAME`. - - Runtime packages will no longer be generated for static builds. - -### **ROCm Data Center Tool** (1.0.0) - -#### Changed - -- Added ROCProfiler `dmon` metrics. -- Added new ECC metrics. 
-- Added the ROCm Validation Suite diagnostic command.
-- Fully migrated to AMD SMI.
-
-#### Removed
-
-- Removed the RASLIB dependency and blobs.
-- Removed the `rocm_smi_lib` dependency due to the migration to AMD SMI.
-
-### **ROCm Debugger (ROCgdb)** (14.2)
-
-#### Changed
-
-- Introduced the coremerge utility to merge a host core dump and a GPU-only AMDGPU core dump into a unified AMDGPU corefile.
-- Added support for generating and opening core files for heterogeneous processes.
-
-### **ROCm SMI** (7.3.0)
-
-#### Changed
-
-- Added the Partition ID API (`rsmi_dev_partition_id_get(..)`).
-
-#### Resolved issues
-
-- Fixed Partition ID CLI output.
-
-> [!NOTE]
-> See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.2.0/CHANGELOG.md) on GitHub for more information.
-
-### **ROCm Validation Suite** (1.0.0)
-
-#### Changed
-
-* Added stress tests:
-
-  * IET (power) stress test for MI300A.
-
-  * IET (power transition) test for MI300X.
-
-* Added support:
-
-  * GEMM self-check and accuracy-check support for checking consistency and accuracy of GEMM output.
-
-  * Trigonometric float and random integer matrix data initialization support.
-
-* Updated the GST performance benchmark test for better numbers.
-
-### **rocPRIM** (3.2.0)
-
-#### Changed
-
-* Added new overloads for `warp_scan::exclusive_scan` that take no initial value. These new overloads will write an unspecified result to the first value of each warp.
-* The internal accumulator type of `inclusive_scan(_by_key)` and `exclusive_scan(_by_key)` is now exposed as an optional type parameter.
-  * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan).
-    This is the same behaviour as before this change.
-* Added a new overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element.
-* Added new public APIs for deriving the resulting type of device-only functions:
-  * `rocprim::invoke_result`
-  * `rocprim::invoke_result_t`
-  * `rocprim::invoke_result_binary_op`
-  * `rocprim::invoke_result_binary_op_t`
-* Added the new `rocprim::batch_copy` function. It is similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
-* Added more test cases, to better cover supported data types.
-* Added an optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
-  To sort keys of a user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
-  of references to arithmetic types from the key.
-* Added `rocprim::predicate_iterator`, which acts as a proxy for an underlying iterator based on a predicate.
-  It iterates over proxies that hold references to the underlying values, but only allow reading and writing if the predicate is `true`.
-  It can be instantiated with:
-  * `rocprim::make_predicate_iterator`
-  * `rocprim::make_mask_iterator`
-* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4; it can be a number between 0 and 32.
-* Added `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
-* Updated some tests to work with supported data types.
-
-#### Removed
-
-* Deprecated the internal header `detail/match_result_type.hpp`.
-* Deprecated `TwiddleIn` and `TwiddleOut` in favor of `radix_key_codec`.
-* Deprecated the internal `::rocprim::detail::radix_key_codec` in favor of a new public utility with the same name. - -#### Optimized - -* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`. -* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warpsize is equal to the items per thread. - -#### Resolved issues - -* Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is - larger than the logical warp size. The test suite has been updated with such cases. -* Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`. -* Fixed incorrect output type used in `device_adjacent_difference`. -* Fixed an issue causing incorrect results on the GFX10 (RDNA1, RDNA2) ISA and GFX11 ISA on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types. -* Fixed an issue with `device_adjacent_difference`. It now considers both the - input and the output type for selecting the appropriate kernel launch config. - Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage. -* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`. -* Fixed a compilation failure in the host compiler when instantiating various block and device algorithms with block sizes not divisible by 64. - -### **ROCProfiler** (2.0.0) - -#### Removed - -- Removed `pcsampler` sample code due to deprecation from version 2. - -### **rocRAND** (3.1.0) - -#### Changed - -* Added `rocrand_create_generator_host`. - * The following generators are supported: - * `ROCRAND_RNG_PSEUDO_MRG31K3P` - * `ROCRAND_RNG_PSEUDO_MRG32K3A` - * `ROCRAND_RNG_PSEUDO_PHILOX4_32_10` - * `ROCRAND_RNG_PSEUDO_THREEFRY2_32_20` - * `ROCRAND_RNG_PSEUDO_THREEFRY2_64_20` - * `ROCRAND_RNG_PSEUDO_THREEFRY4_32_20` - * `ROCRAND_RNG_PSEUDO_THREEFRY4_64_20` - * `ROCRAND_RNG_PSEUDO_XORWOW` - * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL32` - * `ROCRAND_RNG_QUASI_SCRAMBLED_SOBOL64` - * `ROCRAND_RNG_QUASI_SOBOL32` - * `ROCRAND_RNG_QUASI_SOBOL64` - * The host-side generators support multi-core processing. On Linux, this requires the TBB (Thread Building Blocks) development package to be installed on the system when building rocRAND (`libtbb-dev` on Ubuntu and derivatives). - * If TBB is not found when configuring rocRAND, the configuration is still successful, and the host generators are executed on a single CPU thread. -* Added the option to create a host generator to the Python wrapper. -* Added the option to create a host generator to the Fortran wrapper -* Added dynamic ordering. This ordering is free to rearrange the produced numbers, - which can be specific to devices and distributions. It is implemented for: - * XORWOW, MRG32K3A, MTGP32, Philox 4x32-10, MRG31K3P, LFSR113, and ThreeFry -* Added support for using Clang as the host compiler for alternative platform compilation. -* C++ wrapper: - * Added support for `lfsr113_engine` being constructed with a seed of type `unsigned long long`, not only `uint4`. - * Added optional order parameter to the constructor of `mt19937_engine`. 
-* Added the following functions for the `ROCRAND_RNG_PSEUDO_MTGP32` generator: - * `rocrand_normal2` - * `rocrand_normal_double2` - * `rocrand_log_normal2` - * `rocrand_log_normal_double2` -* Added `rocrand_create_generator_host_blocking` which dispatches without stream semantics. -* Added host-side generator for `ROCRAND_RNG_PSEUDO_MTGP32`. -* Added offset and skipahead functionality to LFSR113 generator. -* Added dynamic ordering for architecture `gfx1102`. - -* For device-side generators, you can now wrap calls to `rocrand_generate_*` inside of a hipGraph. There are a few - things to be aware of: - - Generator creation (`rocrand_create_generator`), initialization (`rocrand_initialize_generator`), and destruction (`rocrand_destroy_generator`) must still happen outside the hipGraph. - - After the generator is created, you may call API functions to set its seed, offset, and order. - - After the generator is initialized (but before stream capture or manual graph creation begins), use `rocrand_set_stream` to set the stream the generator will use within the graph. - - A generator's seed, offset, and stream may not be changed from within the hipGraph. Attempting to do so may result in unpredictable behaviour. - - API calls for the poisson distribution (for example, `rocrand_generate_poisson`) are not yet supported inside of hipGraphs. - - For sample usage, see the unit tests in `test/test_rocrand_hipgraphs.cpp` -* Building rocRAND now requires a C++17 capable compiler, as the internal library sources now require it. However consuming rocRAND is still possible from C++11 as public headers don't make use of the new features. -* Building rocRAND should be faster on machines with multiple CPU cores as the library has been - split to multiple compilation units. -* C++ wrapper: the `min()` and `max()` member functions of the generators and distributions are now `static constexpr`. -* Renamed and unified the existing `ROCRAND_DETAIL_.*_BM_NOT_IN_STATE` to `ROCRAND_DETAIL_BM_NOT_IN_STATE` -* Static and dynamic library: moved all internal symbols to namespaces to avoid potential symbol name collisions when linking. - -#### Removed - -* Deprecated the following typedefs. Please use the unified `state_type` alias instead. - * `rocrand_device::threefry2x32_20_engine::threefry2x32_20_state` - * `rocrand_device::threefry2x64_20_engine::threefry2x64_20_state` - * `rocrand_device::threefry4x32_20_engine::threefry4x32_20_state` - * `rocrand_device::threefry4x64_20_engine::threefry4x64_20_state` -* Deprecated the following internal headers: - * `src/rng/distribution/distributions.hpp`. - * `src/rng/device_engines.hpp`. -* Removed references to and workarounds for deprecated hcc. -* Removed support for HIP-CPU. - -#### Known issues - -- `SOBOL64` and `SCRAMBLED_SOBOL64` generate poisson-distributed `unsigned long long int` numbers instead of `unsigned int`. This will be fixed in a future release. - -### **ROCr Runtime** (1.14.0) - -#### Changed - -- Added PC sampling feature (experimental feature). - -### **rocSOLVER** (3.26.0) - -#### Changed - -- Added 64-bit APIs for existing functions: - - GETF2_64 (with `batched` and `strided_batched` versions) - - GETRF_64 (with `batched` and `strided_batched` versions) - - GETRS_64 (with `batched` and `strided_batched` versions) -- Added gfx900 to default build targets. 
-- Added partial eigenvalue decomposition routines for symmetric/hermitian matrices using Divide & Conquer and Bisection:
-  - SYEVDX (with `batched` and `strided_batched` versions)
-  - HEEVDX (with `batched` and `strided_batched` versions)
-- Added partial generalized symmetric/hermitian-definite eigenvalue decomposition using Divide & Conquer and Bisection:
-  - SYGVDX (with `batched` and `strided_batched` versions)
-  - HEGVDX (with `batched` and `strided_batched` versions)
-- Renamed install script arguments of the form `*_dir` to `*-path`. Arguments of the form `*_dir` remain functional for
-  backwards compatibility.
-- Functions working with arrays of size n - 1 can now accept null pointers when n = 1.
-
-#### Optimized
-
-- Improved performance of Cholesky factorization.
-- Improved performance of `splitlu` to extract the L and U triangular matrices from the result of the sparse factorization matrix M, where M = (L - eye) + U.
-
-#### Resolved issues
-
-- Fixed potential accuracy degradation in SYEVJ/HEEVJ for inputs with small eigenvalues.
-
-### **rocSPARSE** (3.2.0)
-
-#### Changed
-
-* Added a new Merge-Path algorithm to SpMM, supporting CSR format.
-* Added support for row order to SpSM.
-* Added rocsparseio I/O functionality to the library.
-* Added `rocsparse_set_identity_permutation`.
-
-* Adjusted rocSPARSE dependencies to related HIP packages.
-* Binary size has been reduced.
-* A namespace has been wrapped around internal rocSPARSE functions and kernels.
-* `rocsparse_csr_set_pointers`, `rocsparse_csc_set_pointers`, and `rocsparse_bsr_set_pointers` now allow the column indices and values arrays to be nullptr if `nnz` is 0.
-* The gfx803 target has been removed from address sanitizer builds.
-
-#### Optimized
-
-* SpMV adaptive and LRB algorithms have been further optimized on CSR format.
-* Improved performance of SpMV adaptive with symmetrically stored matrices on CSR format.
-* Improved documentation and contribution guidelines.
-
-#### Resolved issues
-
-* Fixed compilation errors with `BUILD_ROCSPARSE_ILP64=ON`.
-
-### **rocThrust** (3.1.0)
-
-#### Changed
-
-* Added changes from upstream CCCL/thrust 2.2.0.
-  * Updated the contents of `system/hip` and `test` with the upstream changes.
-* Updated internal calls to `rocprim::detail::invoke_result` to use the public API `rocprim::invoke_result`.
-* Updated to use `rocprim::device_adjacent_difference` for the `adjacent_difference` API call.
-* Updated internal use of the custom iterator in `thrust::detail::unique_by_key` to use rocPRIM's `rocprim::unique_by_key`.
-* Updated `adjacent_difference` to make use of `rocprim::adjacent_difference` when iterators are comparable and not equal; otherwise, `rocprim::adjacent_difference_inplace` is used.
-
-#### Known issues
-
-* `thrust::reduce_by_key` outputs are not bit-wise reproducible, as run-to-run results for pseudo-associative reduction operators (e.g. floating-point arithmetic operators) are not deterministic on the same device.
-* Note that currently, rocThrust memory allocation is performed in such a way that most algorithmic API functions cannot be called from within hipGraphs.
-
-### **rocWMMA** (1.5.0)
-
-#### Changed
-
-* Added internal utilities for:
-  * Element-wise vector transforms.
-  * Cross-lane vector transforms.
-* Added internal aos<->soa transforms for block sizes of 16, 32, 64, 128 and 256 and vector widths of 2, 4, 8 and 16.
-* Added tests for new internal transforms.
-
-* Improved loading layouts by increasing vector width for fragments with `blockDim > 32`.
-* API `applyDataLayout` transform now accepts WaveCount template argument for cooperative fragments. -* API `applyDataLayout` transform now physically applies aos<->soa transform as necessary. -* Refactored entry-point of std library usage to improve hipRTC support. -* Updated installation, programmer's guide and API reference documentation. - -#### Resolved issues - -* Fixed the ordering of some header includes to improve portability. - -### **RPP** (1.8.0) - -#### Changed - -- Prerequisites - ROCm install requires only `--usecase=rocm`. -- Use pre-allocated common scratchBufferHip everywhere in Tensor code for scratch HIP memory. -- Use `CHECK_RETURN_STATUS` everywhere to adhere to C++17 for HIP. -- RPP Tensor Audio support on HOST for Spectrogram. -- RPP Tensor Audio support on HOST/HIP for Slice, by modifying voxel slice kernels to now accept anchor and shape params for a more generic version. -- RPP Tensor Audio support on HOST for Mel Filter Bank. -- RPP Tensor Normalize ND support on HOST and `HIP`. - -### **Tensile** (4.41.0) - -#### Changed - -- New tuning script to summarize rocBLAS log file -- New environment variable to test fixed grid size with Stream-K kernels -- New Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power -- Add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile -- Add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported (instead of rejecting kernel generation)) -- Support NT flag for global load and store for gfx94x -- New Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload) -- Add option for rotating buffers timing with cache eviction -- Add predicate for arithmetic intensity -- Add DirectToVgpr + packing for f8/f16 + TLU cases -- Enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV -- Add test cases for DirectToVgpr + packing -- Add batch support for Stream-K kernels and new test cases -- New tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic -- Enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1) -- Support VectorWidthB (new parameter VectorWidthB) -- Support VectorWidth + non SourceSwap -- Add test cases for VectorWidthB, VectorWidth + non SourceSwap -- Add code owners file -- New environment variables to dynamically adjust number of CUs used in Stream-K -- Add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B (effective with GlobalReadVectorWidth=-1)) -- Add xf32 option to rocblas-bench input creator - -- Update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag) -- Limit build threads based on CPUs/RAM available on system (for tests) -- Update required workspace size for Stream-K, skip kernel initialization when possible -- Use fallback libraries for archs without optimized logic -- Use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator) -- Remove OCL tests -- Disable HostLibraryTests -- Reduce extended test time by removing extra parameters in the test config files -- Disable InitAccVgprOpt for Stream-K -- Skip sgemm 64bit offset tests for gfx94x -- Skip DTV, DTL, LSU+MFMA tests for 
gfx908
-- Increase extended test timeout to 720 min
-- Update xfail test (1sum tests only failing on gfx90a)
-- Update lib logic convertor script
-- Test limiting CI threads for only gfx11
-- wGM related kernargs are removed if they are not needed (WGM=-1,0,1)
-- Cleanup on unused old code, mostly related to old client
-- Change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
-- Update efficiency script for new architecture and xf32 datatype
-- Re-enable negative values for WorkGroupMapping (asm kernel only)
-- Disable HW monitor for aquavanjaram941
-- Pre-apply offsets for strided batch kernels
-- Update tensile build with 16 threads
-
-#### Optimized
-
-- Made initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py).
-- Optimized temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype).
-- Reversed MFMA order in inner loop for odd outer iteration.
-- Optimized waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync).
-- Enhanced maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU).
-
-#### Resolved issues
-
-- Fixed `WorkspaceCheck` implementation when used in rocBLAS.
-- Fixed Stream-K partials cache behavior.
-- Fixed `MasterSolutionLibrary` indexing for multiple architecture build.
-- Fixed memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size).
-- Fixed BufferLoad=False with Stream-K.
-- Fixed mismatch issue with `GlobalReadCoalesceGroup`.
-- Fixed rocBLAS build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"]).
-- Fixed LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2).
-- Fixed inaccurate vgpr allocation for ClusterLocalRead.
-- Fixed mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2.
-- Fixed mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0).
-- Fixed HostLibraryTests on gfx942 and gfx941.
-- Fixed LLVM crash issue.
-- Fixed the package name for msgpack with newer Windows vcpkg versions.
-- Fixed an error with DisableKernelPieces + 32bit ShadowLimit.
-- Ignore asm cap check for kernel arg preload for ROCm 6.0 and older.
-
-## ROCm 6.1.2
-
-See the [ROCm 6.1.2 release notes](https://rocm.docs.amd.com/en/docs-6.1.2/about/release-notes.html)
-for a complete overview of this release.
-
-### **AMD SMI** (24.5.1)
-
-#### Added
-
-* Added process isolation and clean shader APIs and CLI commands.
-  * `amdsmi_get_gpu_process_isolation()`
-  * `amdsmi_set_gpu_process_isolation()`
-  * `amdsmi_set_gpu_clear_sram_data()`
-* Added the `MIN_POWER` metric to output provided by `amd-smi static --limit`.
-
-#### Changed
-
-* Updated `amdsmi_get_power_cap_info` to return values in uW instead of W (see the usage sketch below).
-* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`.
-* Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks.
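-
-As a usage sketch for the `amdsmi_get_power_cap_info` change above (values are now reported in uW), the snippet below is illustrative and not from the release notes. It assumes the standard AMD SMI discovery flow (`amdsmi_init`, socket and processor handle enumeration); check the fields of `amdsmi_power_cap_info_t` against the headers shipped with your installation:
-
-```{code-block} cpp
-#include <amd_smi/amdsmi.h>
-#include <cstdio>
-#include <vector>
-
-int main() {
-    if (amdsmi_init(AMDSMI_INIT_AMD_GPUS) != AMDSMI_STATUS_SUCCESS) return 1;
-
-    uint32_t socket_count = 0;
-    amdsmi_get_socket_handles(&socket_count, nullptr);
-    std::vector<amdsmi_socket_handle> sockets(socket_count);
-    amdsmi_get_socket_handles(&socket_count, sockets.data());
-
-    for (auto socket : sockets) {
-        uint32_t dev_count = 0;
-        amdsmi_get_processor_handles(socket, &dev_count, nullptr);
-        std::vector<amdsmi_processor_handle> devices(dev_count);
-        amdsmi_get_processor_handles(socket, &dev_count, devices.data());
-
-        for (auto dev : devices) {
-            amdsmi_power_cap_info_t info = {};
-            // Sensor index 0; as of 24.5.1 the values are reported in microwatts (uW).
-            if (amdsmi_get_power_cap_info(dev, 0, &info) == AMDSMI_STATUS_SUCCESS) {
-                std::printf("power cap: %llu uW\n",
-                            static_cast<unsigned long long>(info.power_cap));
-            }
-        }
-    }
-
-    amdsmi_shut_down();
-    return 0;
-}
-```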
- -#### Removed - -* Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release. - -#### Optimized - -* Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command. - -#### Resolved issues - -* `amdsmi_get_gpu_board_info()` no longer returns junk character strings. -* `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices. -* Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices. -* Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls. - -> [!NOTE] -> See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information. - -### **RCCL** (2.18.6) - -#### Changed - -* Reduced `NCCL_TOPO_MAX_NODES` to limit stack usage and avoid stack overflow. - -### **rocBLAS** (4.1.2) - -#### Optimized - -* Tuned BBS TN and TT operations on the CDNA3 architecture. - -#### Resolved issues - -* Fixed an issue related to obtaining solutions for BF16 TT operations. - -### **rocDecode** (0.6.0) - -#### Added - -* Added support for FFmpeg v5.x. - -#### Changed - -* Updated core dependencies. -* Updated to support the use of public LibVA headers. - -#### Optimized - -* Updated error checking in the `rocDecode-setup.py` script. - -#### Resolved issues - -* Fixed some package dependencies. - -### **ROCm SMI** (7.2.0) - -#### Added - -* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum. - -#### Resolved issues - -* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112). -* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-Series hardware. - -## ROCm 6.1.1 - -See the [ROCm 6.1.1 release notes](https://rocm.docs.amd.com/en/docs-6.1.1/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (24.5.1) - -#### Added - -- Added deferred error correctable counts to `amd-smi metric -ecc -ecc-blocks`. - -#### Changed - -- Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks. -- Updated the output of `amd-smi metric --clock` to reflect each engine. -- Updated the output of `amd-smi topology --json` to align with output reported by host and guest systems. - -#### Removed - -- Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release. - -#### Resolved issues - -- Fixed `amd-smi metric --clock`'s clock lock and deep sleep status. -- Fixed an issue that would cause an error when resetting non-AMD GPUs. -- Fixed `amd-smi metric --pcie` and `amdsmi_get_pcie_info()` when using RDNA3 (Navi 32 and Navi 31) hardware to prevent "UNKNOWN" reports. -- Fixed the output results of `amd-smi process` when getting processes running on a device. - -#### Known issues - -- `amd-smi bad-pages` can result in a `ValueError: Null pointer access` error when using some PMU firmware versions. - -> [!NOTE] -> See the [detailed changelog](https://github.com/ROCm/amdsmi/blob/docs/6.1.1/CHANGELOG.md) with code samples for more information. - -### **hipBLASLt** (0.7.0) - -#### Added - -- Added `hipblasltExtSoftmax` extension API. -- Added `hipblasltExtLayerNorm` extension API. -- Added `hipblasltExtAMax` extension API. 
- Added `GemmTuning` extension parameter so users can set split-k.
- Added support for mixed-precision datatypes: fp16/fp8 inputs with fp16 output.

#### Upcoming changes

- `algoGetHeuristic()` ext API for GroupGemm will be deprecated in a future release of hipBLASLt.

### **HIPCC** (1.0.0)

#### Changed

- **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users. You can continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`.
- **Upcoming:** a subsequent release will remove the high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable and rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig`, respectively. No action is needed by users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly.
- **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`.

### **hipSOLVER** (2.1.1)

#### Changed

- By default, `BUILD_WITH_SPARSE` is now set to OFF on Microsoft Windows.

#### Resolved issues

- Fixed benchmark client build when `BUILD_WITH_SPARSE` is OFF.

### **rocFFT** (1.0.27)

#### Added

- Enabled multi-GPU testing on systems without direct GPU interconnects.

#### Resolved issues

- Fixed kernel launch failure on execution of very large odd-length real-complex transforms.

### **ROCm SMI** (7.0.0)

#### Added

* Added the capability to unlock the mutex when a process is dead, with related debug output.
* Added the `Partition ID` field to the `rocm-smi` CLI.
* Added `NODE`, `GUID`, and `GFX Version` fields to the CLI.
* Documentation now includes C++ and Python tutorials, API guides, and reference material.

#### Changed

* Some `rocm-smi` fields now display `N/A` instead of `unknown/unsupported` for consistency.
* Changed stacked ID formatting in the `rocm-smi` CLI to make it easier to spot identifiers.

#### Resolved issues

* Fixed HIP and ROCm SMI mismatch on GPU bus assignments.
* Fixed memory leaks caused by not closing directories and by creating map nodes instead of using `.at()`.
* Fixed initializing calls which reuse `rocmsmi.initializeRsmi()` bindings in the `rocmsmi` Python API.
* Fixed an issue causing `rsmi_dev_activity_metric_get` gfx/memory to not update with GPU activity.

#### Known issues

- ROCm SMI reports GPU utilization incorrectly for RDNA3 GPUs in some situations. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112).

> [!NOTE]
> See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.1.1/CHANGELOG.md) with code samples for more information.

## ROCm 6.1.0

See the [ROCm 6.1.0 release notes](https://rocm.docs.amd.com/en/docs-6.1.0/about/release-notes.html)
for a complete overview of this release.

### **AMD SMI** (24.4.1)

#### Added

* New monitor command for GPU metrics.
  Use the monitor command to customize, capture, collect, and observe GPU metrics on
  target devices.

* Integration with E-SMI.
  The EPYC™ System Management Interface In-band Library is a Linux C-library that provides in-band
  user space software APIs to monitor and control your CPU’s power, energy, performance, and other
  system management functionality.
This integration enables access to CPU metrics and telemetry
  through the AMD SMI API and CLI tools.

### **Composable Kernel** (1.1.0)

#### Added

* New architecture support.
  CK now supports the following AMD GPU architectures to enable efficient image denoising:
  gfx1030, gfx1100, gfx1031, gfx1101, gfx1032, gfx1102, gfx1034, gfx1103, gfx1035, gfx1036

#### Changed

* FP8 rounding logic is replaced with stochastic rounding.
  Stochastic rounding mimics a more realistic data behavior and improves model convergence.

### **HIP** (6.1)

#### Added

* New environment variable to enable kernel run serialization.
  The default `HIP_LAUNCH_BLOCKING` value is `0` (disabled), which causes kernels to run as defined in
  the queue. When set to `1` (enabled), the HIP runtime serializes the kernel queue, which behaves the
  same as `AMD_SERIALIZE_KERNEL`.

### **hipBLASLt** (0.7.0)

#### Added

* New GemmTuning extension parameter. GemmTuning allows you to set a split-k value for each solution,
  which makes performance tuning more practical.

### **hipFFT** (1.0.14)

#### Added

* New multi-GPU support for single-process transforms. Multiple GPUs can be used to perform a
  transform in a single process. Note that this initial implementation is a functional preview.

### **HIPIFY** (17.0.0)

#### Changed

* Skipped code blocks: Code blocks that are skipped by the preprocessor are no longer hipified under the
  `--default-preprocessor` option. To hipify everything, despite conditional preprocessor directives
  (`#if`, `#ifdef`, `#ifndef`, `#elif`, or `#else`), don't use the `--default-preprocessor` or `--amap` options.

### **hipSPARSELt** (0.1.0)

#### Added

* Structured sparsity matrix support extensions.
  Structured sparsity matrices help speed up deep-learning workloads. We now support `B` as the
  sparse matrix and `A` as the dense matrix in Sparse Matrix-Matrix Multiplication (SPMM). Prior to this
  release, we only supported sparse (matrix A) x dense (matrix B) matrix multiplication.

### **hipTensor** (1.2.0)

#### Added

* 4D tensor permutation and contraction support.
  You can now perform tensor permutation on 4D tensors and 4D contractions for F16, BF16, and
  Complex F32/F64 datatypes.

### **llvm-project** (17.0.0)

#### Changed

* Combined projects. ROCm Device-Libs, ROCm Compiler Support, and hipCC are now located in
  the `llvm-project/amd` subdirectory of AMD's fork of the LLVM project. Previously, these projects
  were maintained in separate repositories. Note that the projects themselves will continue to be
  packaged separately.

* Split the `rocm-llvm` package. This package has been split into a required and an optional package:

  * **rocm-llvm (required)**: A package containing the essential binaries needed for compilation.

  * **rocm-llvm-dev (optional)**: A package containing binaries for compiler and application developers.

### **MIGraphX** (2.9.0)

#### Added

* Improved performance for transformer-based models.
  We added support for FlashAttention, which benefits models like BERT, GPT, and Stable Diffusion.

* New Torch-MIGraphX driver.
  This driver calls MIGraphX directly from PyTorch. It provides an `mgx_module` object that you can
  invoke like any other Torch module, but which utilizes the MIGraphX inference engine internally.
  Torch-MIGraphX supports FP32, FP16, and INT8 datatypes.

* FP8 support.
We now offer functional support for inference in the FP8E4M3FNUZ datatype. You
  can load an ONNX model in FP8E4M3FNUZ using C++ or Python APIs, or `migraphx-driver`.
  You can quantize a floating point model to FP8 format by using the `--fp8` flag with `migraphx-driver`.
  To accelerate inference, MIGraphX uses hardware acceleration on MI300 for FP8 by leveraging FP8
  support in various backend kernel libraries.

### **MIOpen** (3.1.0)

#### Added

* Improved performance for inference and convolutions.
  Inference support is now provided for Find 2.0 fusion plans. Additionally, we've enhanced the Number of
  samples, Height, Width, and Channels (NHWC) convolution kernels for heuristics. NHWC stores data
  in a format where the height and width dimensions come first, followed by channels.

### **OpenMP** (17.60.0)

#### Added

* New MI300 FP atomics. Application performance can now improve by leveraging fast floating-point atomics on MI300 (gfx942).

#### Changed

* Implicit zero-copy is triggered automatically in XNACK-enabled MI300A systems.
  Implicit zero-copy behavior in non-`unified_shared_memory` programs is triggered automatically in
  XNACK-enabled MI300A systems (for example, when using the `HSA_XNACK=1` environment
  variable). OpenMP supports the 'requires `unified_shared_memory`' directive to support programs
  that don’t want to copy data explicitly between the CPU and GPU. However, this requires that you add
  these directives to every translation unit of the program (a short offload sketch follows the rocSOLVER entry below).

### **RCCL** (2.18.6)

#### Changed

* NCCL 2.18.6 compatibility.
  RCCL is now compatible with NCCL 2.18.6, which includes increasing the maximum number of IB network
  interfaces to 32 and fixing network device ordering when creating communicators with only one GPU
  per node.

* Doubled simultaneous communication channels.
  We improved MI300X performance by increasing the maximum number of simultaneous
  communication channels from 32 to 64.

### **rocALUTION** (3.1.1)

#### Added

* New multiple node and GPU support.
  Unsmoothed and smoothed aggregations and Ruge-Stueben AMG now work with multiple nodes
  and GPUs. For more information, refer to the
  [API documentation](https://rocm.docs.amd.com/projects/rocALUTION/en/docs-6.1.0/usermanual/solvers.html#unsmoothed-aggregation-amg).

### **rocDecode** (0.5.0)

#### Added

* New ROCm component.
  rocDecode is ROCm's newest component, providing high-performance video decode support for AMD
  GPUs. To learn more, refer to the
  [documentation](https://rocm.docs.amd.com/projects/rocDecode/en/latest/).

### **ROCm Data Center Tool** (0.3.0)

#### Changed

* C++ upgrades.
  RDC was upgraded from C++11 to C++17 to enable a more modern C++ standard when writing RDC plugins.

### **RPP** (1.5.0)

#### Added

* New backend support.
  Audio processing support added for the `HOST` backend and 3D Voxel kernels support
  for the `HOST` and `HIP` backends.

### **ROCm Validation Suite** (1.0)

#### Added

* New datatype support.
  Added BF16 and FP8 datatypes based on General Matrix Multiply (GEMM) operations in the GPU Stress Test (GST) module. This provides additional performance benchmarking and stress testing based on the newly supported datatypes.

### **rocSOLVER** (3.25.0)

#### Added

* New EigenSolver routine.
  Based on the Jacobi algorithm, a new EigenSolver routine was added to the library. This routine computes the eigenvalues and eigenvectors of a matrix with improved performance.
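To make the OpenMP implicit zero-copy note above concrete, here is a minimal offload sketch (my own illustration, not from the release notes): a plain `map`-based target region that, on an XNACK-enabled MI300A system run with `HSA_XNACK=1`, can be satisfied zero-copy instead of via explicit transfers. The build command in the comment is an assumption about a typical ROCm OpenMP toolchain invocation.

```cpp
// Build sketch (assumption): amdclang++ -fopenmp --offload-arch=gfx942 zero_copy.cpp
#include <cstdio>
#include <vector>

int main() {
    constexpr int n = 1 << 20;
    std::vector<double> x(n, 1.0);
    double* data = x.data();

    // A non unified_shared_memory program: the map clause requests copies, but on an
    // XNACK-enabled MI300A system (HSA_XNACK=1) the runtime can access the host
    // allocation directly. Adding `#pragma omp requires unified_shared_memory` to
    // every translation unit is the portable alternative described above.
    #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
    for (int i = 0; i < n; ++i) {
        data[i] *= 2.0;
    }

    std::printf("x[0] = %f\n", x[0]);
    return 0;
}
```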
- -### **ROCTracer** (4.1) - -#### Changed - -* New versioning and callback enhancements. - Improved to match versioning changes in HIP Runtime and supports runtime API callbacks and activity record logging. The APIs of different runtimes at different levels are considered different API domains with assigned domain IDs. - -## ROCm 6.0.2 - -See the [ROCm 6.0.2 release notes](https://rocm.docs.amd.com/en/docs-6.0.2/about/release-notes.html) -for a complete overview of this release. - -### **hipFFT** (1.0.13) - -#### Changed - -* Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files - over (this should help simplify downstream builds and packaging) - -## ROCm 6.0.0 - -See the [ROCm 6.0.0 release notes](https://rocm.docs.amd.com/en/docs-6.0.0/about/release-notes.html) -for a complete overview of this release. - -### **AMD SMI** (23.4.2) - -#### Added - -* Integrated the E-SMI (EPYC-SMI) library. - You can now query CPU-related information directly through AMD SMI. Metrics include power, - energy, performance, and other system details. - -* Added support for gfx942 metrics. - You can now query MI300 device metrics to get real-time information. Metrics include power, - temperature, energy, and performance. - -### **HIP** (6.0.0) - -#### Added - -* New features to improve resource interoperability. - * For external resource interoperability, we've added new structs and enums. - * We've added new members to HIP struct `hipDeviceProp_t` for surfaces, textures, and device - identifiers. - -#### Changed - -* Changes impacting backward compatibility. - There are several changes impacting backward compatibility: we changed some struct members and - some enum values, and removed some deprecated flags. For additional information, please refer to - the Changelog. - -### **hipCUB** (3.0.0) - -#### Changed - -* Additional CUB API support. - The hipCUB backend is updated to CUB and Thrust 2.1. - -### **HIPIFY** (17.0.0) - -#### Added - -* Hipified rocSPARSE. - We've implemented support for the direct hipification of additional cuSPARSE APIs into rocSPARSE - APIs under the `--roc` option. This covers a major milestone in the roadmap towards complete - cuSPARSE-to-rocSPARSE hipification. - -#### Optimized - -* Enhanced CUDA2HIP document generation. - API versions are now listed in the CUDA2HIP documentation. To see if the application binary - interface (ABI) has changed, refer to the - [*C* column](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Runtime_API_functions_supported_by_HIP.html) - in our API documentation. - -### **hipRAND** (2.10.16) - -* Official release. - hipRAND is now a *standalone project*--it's no longer available as a submodule for rocRAND. - -### **hipTensor** (1.1.0) - -#### Added - -* Added architecture support. - We've added contraction support for gfx942 architectures, and f32 and f64 data - types. - -#### Optimized - -* Upgraded testing infrastructure. - hipTensor will now support dynamic parameter configuration with input YAML config. - -### **llvm-project** (17.0.0) - -#### Added - -* Added kernel argument optimization on gfx942. - With the new feature, you can preload kernel arguments into Scalar General-Purpose Registers - (SGPRs) rather than pass them in memory. This feature is enabled with a compiler option, which also - controls the number of arguments to pass in SGPRs. 
For more information, see: - [https://llvm.org/docs/AMDGPUUsage.html#preloaded-kernel-arguments](https://llvm.org/docs/AMDGPUUsage.html#preloaded-kernel-arguments) - -#### Optimized - -* Improved register allocation at -O0. - We've improved the register allocator used at -O0 to avoid compiler crashes (when the signature is - 'ran out of registers during register allocation'). - -* Improved generation of debug information. - We've improved compile time when generating debug information for certain corner cases. We've - also improved the compiler to eliminate compiler crashes when generating debug information. - -### **MIGraphX** (2.8.0) - -#### Added - -* Added TorchMIGraphX. - We introduced a Dynamo backend for Torch, which allows PyTorch to use MIGraphX directly - without first requiring a model to be converted to the ONNX model format. With a single line of - code, PyTorch users can utilize the performance and quantization benefits provided by MIGraphX. - -* Added INT8 support across the MIGraphX portfolio. - We now support the INT8 data type. MIGraphX can perform the quantization or ingest - prequantized models. INT8 support extends to the MIGraphX execution provider for ONNX Runtime. - -* Boosted overall performance with rocMLIR. - We've integrated the rocMLIR library for ROCm-supported RDNA and CDNA GPUs. This - technology provides MLIR-based convolution and GEMM kernel generation. - -### **ROCgdb** (13.2) - -#### Added - -* Added support for additional GPU architectures. - * Navi 3 Series: gfx1100, gfx1101, and gfx1102. - * MI300 Series: gfx942. - -### **ROCm SMI** (6.0.0) - -#### Added - -* Improved accessibility to GPU partition nodes. - You can now view, set, and reset the compute and memory partitions. You'll also get notifications of - a GPU busy state, which helps you avoid partition set or reset failure. - -#### Optimized - -* Upgraded GPU metrics version 1.4. - The upgraded GPU metrics binary has an improved metric version format with a content version - appended to it. You can read each metric within the binary without the full `rsmi_gpu_metric_t` data - structure. - -* Updated GPU index sorting. - We made GPU index sorting consistent with other ROCm software tools by optimizing it to use - `Bus:Device.Function` (BDF) instead of the card number. - -### **ROCm Validation Suite** (1.0) - -#### Added - -* Added GPU and operating system support. - We added support for MI300X GPU in GPU Stress Test (GST). - -### **ROCProfiler** (2.0) - -#### Added - -* Added option to specify desired ROCProfiler version. - You can now use rocProfV1 or rocProfV2 by specifying your desired version, as the legacy rocProf - (`rocprofv1`) provides the option to use the latest version (`rocprofv2`). - -* Added ATT support for parallel kernels. - The automatic ISA dumping process also helps ATT successfully parse multiple kernels running in - parallel, and provide cycle-accurate occupancy information for multiple kernels at the same time. - -#### Changed - -* Automated the ISA dumping process by Advance Thread Tracer. - Advance Thread Tracer (ATT) no longer depends on user-supplied Instruction Set Architecture (ISA) - and compilation process (using ``hipcc --save-temps``) to dump ISA from the running kernels. - -### **ROCr Runtime** (1.12.0) - -#### Added - -* Support for SDMA link aggregation. - If multiple XGMI links are available when making SDMA copies between GPUs, the copy is - distributed over multiple links to increase peak bandwidth. 
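Several entries in this release, such as the new `hipDeviceProp_t` members noted under HIP above, come down to querying device properties at runtime. Below is a small, generic query sketch of my own; it prints long-standing members only, and the comment about the newer fields is a pointer to the HIP headers rather than a claim about exact member names.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) return 1;

    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        // prop also carries the newer surface/texture limits and device identifiers
        // mentioned above; consult hip_runtime_api.h for the full member list.
        std::printf("device %d: %s (%s), %zu MiB, %d compute units\n",
                    i, prop.name, prop.gcnArchName,
                    prop.totalGlobalMem / (1024 * 1024), prop.multiProcessorCount);
    }
    return 0;
}
```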
- -### **rocThrust** (3.0.0) - -#### Added - -* Added Thrust 2.1 API support. - rocThrust backend is updated to Thrust and CUB 2.1. - -### **rocWMMA** (1.3.0) - -#### Added - -* **Added new architecture support**. - We added support for gfx942 architectures. - -* **Added data type support**. - We added support for f8, bf8, xf32 data types on supporting architectures, and for bf16 in the HIP RTC - environment. - -* **Added support for the PyTorch kernel plugin**. - We added awareness of `__HIP_NO_HALF_CONVERSIONS__` to support PyTorch users. - -### **TransferBench** (beta) - -#### Optimized - -* Improved ordering control. - You can now set the thread block size (`BLOCK_SIZE`) and the thread block order (`BLOCK_ORDER`) - in which thread blocks from different transfers are run when using a single stream. - -* Added comprehensive reports. - We modified individual transfers to report X Compute Clusters (XCC) ID when `SHOW_ITERATIONS` - is set to 1. - -* Improved accuracy in result validation. - You can now validate results for each iteration instead of just once for all iterations. - -## ROCm 5.7.1 - -See the [ROCm 5.7.1 release notes](https://github.com/ROCm/ROCm/blob/docs/5.7.1/RELEASE.md) -on GitHub for a complete overview of this release. - -### **HIP** (5.7.1) - -#### Resolved issues - -* The *hipPointerGetAttributes* API returns the correct HIP memory type as *hipMemoryTypeManaged* for managed memory. - -### **hipSOLVER** (1.8.2) - -#### Resolved issues - -- Fixed conflicts between the hipsolver-dev and -asan packages by excluding - hipsolver_module.f90 from the latter - -### **rocBLAS** (3.1.0) - -#### Added - -* A new functionality `rocblas-gemm-tune` and an environment variable `ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH`. - For more details, refer to the [rocBLAS Programmer's Guide.](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune) - -## ROCm 5.7.0 - -See the [ROCm 5.7.0 release notes](https://github.com/ROCm/ROCm/blob/docs/5.7.0/RELEASE.md) -on GitHub for a complete overview of this release. - -### **HIP** (5.7.0) - -#### Added - -- Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition - -- Added new APIs supporting Windows only, under development on Linux - - - `hipMallocMipmappedArray` for allocating a mipmapped array on the device - - - `hipFreeMipmappedArray` for freeing a mipmapped array on the device - - - `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array - - - `hipMipmappedArrayCreate` for creating a mipmapped array - - - `hipMipmappedArrayDestroy` for destroy a mipmapped array - - - `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level - -#### Known issues - -- HIP memory type enum values currently don't support equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility. -- HIP API `hipPointerGetAttributes` could return invalid value in case the input memory pointer was not allocated through any HIP API on device or host. 
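A small sketch of the `hipPointerGetAttributes` query that the 5.7.1 fix and the known issue above refer to, assuming ROCm 5.7-era field names (the `memoryType` member is slated to be renamed to `type`, per the upcoming changes below); this is an illustration, not code from the release.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    void* managed = nullptr;
    if (hipMallocManaged(&managed, 1024) != hipSuccess) return 1;

    hipPointerAttribute_t attr{};
    if (hipPointerGetAttributes(&attr, managed) == hipSuccess) {
        // With the 5.7.1 fix, managed allocations report hipMemoryTypeManaged here.
        // Only pass pointers that came from a HIP allocation: per the known issue
        // above, unregistered pointers may yield invalid attribute values.
        std::printf("memory type: %d (managed is %d)\n",
                    static_cast<int>(attr.memoryType),
                    static_cast<int>(hipMemoryTypeManaged));
    }

    hipFree(managed);
    return 0;
}
```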
- -#### Upcoming changes - -- Removal of gcnarch from hipDeviceProp_t structure - -- Addition of new fields in hipDeviceProp_t structure - - - maxTexture1D - - - maxTexture2D - - - maxTexture1DLayered - - - maxTexture2DLayered - - - sharedMemPerMultiprocessor - - - deviceOverlap - - - asyncEngineCount - - - surfaceAlignment - - - unifiedAddressing - - - computePreemptionSupported - - - hostRegisterSupported - - - uuid - -- Removal of deprecated code -hip-hcc codes from hip code tree - -- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA - -- HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D() - -- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' - -- Correct hipGetLastError to return the last error instead of last API call's return code - -- Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]" - -- Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess - -- Remove hiparray* and make it opaque with hipArray_t - -### **hipBLAS** (1.1.0) - -#### Changed - -- updated documentation requirements -- dependency rocSOLVER now depends on rocSPARSE - -### **hipCUB** (2.13.1) - -#### Changed - -- CUB backend references CUB and Thrust version 2.0.1. -- Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one. -- Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`. - -#### Known issues - -- `debug_synchronous` no longer works on CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks. -- `DeviceReduce::Sum` does not compile on CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types. -- `DeviceHistogram::HistogramEven` fails on CUDA platform for `[LevelT, SampleIteratorT] = [int, int]`. -- `DeviceHistogram::MultiHistogramEven` fails on CUDA platform for `[LevelT, SampleIteratorT] = [int, int/unsigned short/float/double]` and `[LevelT, SampleIteratorT] = [float, double]`. - -### **hipFFT** (1.0.12) - -#### Added - -- Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms. - -#### Changed - -- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. - -### **hipSOLVER** (1.8.1) - -#### Changed - -- Changed hipsolver-test sparse input data search paths to be relative to the test executable - -### **hipSPARSE** (2.3.8) - -#### Optimized - -- Fix compilation failures when using cusparse 12.1.0 backend -- Fix compilation failures when using cusparse 12.0.0 backend -- Fix compilation failures when using cusparse 10.1 (non-update versions) as backend -- Minor improvements - -### **MIOpen** (2.19.0) - -#### Added - -- ROCm 5.5 support for gfx1101 (Navi32) - -#### Changed - -- Tuning results for MLIR on ROCm 5.5 -- Bumping MLIR commit to 5.5.0 release tag - -#### Resolved issues - -- Fix 3d convolution Host API bug -- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. 
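The hipCUB entries above (the `DeviceSegmentedReduce` fixes and the `DeviceReduce::Sum` known issue) all concern the usual two-phase device-reduction calling convention, sketched below for a plain integer sum. This is an illustration of the pattern under my own variable names, not code taken from the release.

```cpp
#include <hipcub/hipcub.hpp>
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int num_items = 1 << 20;
    std::vector<int> h_in(num_items, 1);

    int *d_in = nullptr, *d_out = nullptr;
    hipMalloc(&d_in, num_items * sizeof(int));
    hipMalloc(&d_out, sizeof(int));
    hipMemcpy(d_in, h_in.data(), num_items * sizeof(int), hipMemcpyHostToDevice);

    // Two-phase pattern: the first call sizes the temporary storage,
    // the second call performs the reduction.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    hipcub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    hipMalloc(&d_temp, temp_bytes);
    hipcub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);

    int h_out = 0;
    hipMemcpy(&h_out, d_out, sizeof(int), hipMemcpyDeviceToHost);
    std::printf("sum = %d\n", h_out);

    hipFree(d_temp);
    hipFree(d_out);
    hipFree(d_in);
    return 0;
}
```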
- -### **RCCL** (2.17.1-1) - -#### Added - -- Minor improvements to MSCCL codepath -- NCCL_NCHANNELS_PER_PEER support -- Improved compilation performance -- Support for gfx94x - -#### Changed - -- Compatibility with NCCL 2.17.1-1 -- Performance tuning for some collective operations - -#### Resolved issues - -- Potential race-condition during ncclSocketClose() - -### **rocALUTION** (2.1.11) - -#### Added - -- Added support for gfx940, gfx941 and gfx942 - -#### Optimized - -- Fixed OpenMP runtime issue with Windows toolchain - -### **rocBLAS** (3.1.0) - -#### Added - -- yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details. -- rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems. - -#### Changed - -- dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory -- enhanced reporting of installation issues caused by runtime libraries (Tensile) -- standardized internal rocblas C++ interface across most functions -- Dependencies: - - optional use of AOCL BLIS 4.0 on Linux for clients - - optional build tool only dependency on python psutil - -#### Resolved issues - -- make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimensions or increments potentially causing overflow: - - Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2 - - Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv - - Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam - - General: set_vector, get_vector, set_matrix, get_matrix - - Related fixes: internal scalar loads with > 32bit offsets - - fix in-place functionality for all trtri sizes - -#### Upcoming changes - -- Removal of __STDC_WANT_IEC_60559_TYPES_EXT__ define in future release - -### **rocFFT** (1.0.24) - -#### Added - -- Implemented a solution map version converter and finish the first conversion from ver.0 to ver.1. Where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds) - -#### Changed - -- Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux. -- Moved library kernel cache to lib/rocFFT directory. - -#### Optimized - -- Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm. - -### **rocm-cmake** (0.10.0) - -#### Added - -- Added ROCMTest module -- ROCMCreatePackage: Added support for ASAN packages - -### **rocPRIM** (2.13.1) - -#### Changed - -- Deprecated configuration `radix_sort_config` for device-level radix sort as it no longer matches the algorithm's parameters. New configuration `radix_sort_config_v2` is preferred instead. -- Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan now is the only available implementation. -- The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`. -- Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. New configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead. 
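For context on the scan-related changes above, this is the device-level `rocprim::inclusive_scan` call they apply to, using the usual size-query-then-run pattern. The snippet is an illustrative sketch of my own, not code from the release.

```cpp
#include <rocprim/rocprim.hpp>
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const size_t size = 1 << 20;
    std::vector<int> h_in(size, 1);

    int *d_in = nullptr, *d_out = nullptr;
    hipMalloc(&d_in, size * sizeof(int));
    hipMalloc(&d_out, size * sizeof(int));
    hipMemcpy(d_in, h_in.data(), size * sizeof(int), hipMemcpyHostToDevice);

    // First call with a null temporary buffer only queries the required storage size.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    rocprim::inclusive_scan(d_temp, temp_bytes, d_in, d_out, size, rocprim::plus<int>());
    hipMalloc(&d_temp, temp_bytes);
    rocprim::inclusive_scan(d_temp, temp_bytes, d_in, d_out, size, rocprim::plus<int>());

    int last = 0;
    hipMemcpy(&last, d_out + size - 1, sizeof(int), hipMemcpyDeviceToHost);
    std::printf("inclusive scan, last element = %d\n", last);

    hipFree(d_temp);
    hipFree(d_out);
    hipFree(d_in);
    return 0;
}
```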
- -#### Resolved issues - -- Fixed build issue caused by missing header in `thread/thread_search.hpp`. - -### **rocRAND** (2.10.17) - -#### Added - -- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. -- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. -- experimental HIP-CPU feature -- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". - -#### Changed - -- Python 2.7 is no longer officially supported. - -### **rocSOLVER** (3.23.0) - -#### Added - -- LU factorization without pivoting for block tridiagonal matrices: - - GEBLTTRF_NPVT now supports interleaved\_batched format -- Linear system solver without pivoting for block tridiagonal matrices: - - GEBLTTRS_NPVT now supports interleaved\_batched format - -#### Changed - -- Changed rocsolver-test sparse input data search paths to be relative to the test executable -- Changed build scripts to default to compressed debug symbols in Debug builds - -#### Resolved issues - -- Fixed stack overflow in sparse tests on Windows - -### **rocSPARSE** (2.5.4) - -#### Added - -- Added more mixed precisions for SpMV, (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex) -- Added support for gfx940, gfx941 and gfx942 - -#### Optimized - -- Fixed a bug in csrsm and bsrsm - -#### Known issues - -In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split. - -### **rocThrust** (2.18.0) - -#### Changed - -- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). -- Removed references to and workarounds for deprecated hcc - -#### Resolved issues - -- `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types. -- Fixed issue where `transform_iterator` would not compile with `__device__`-only operators. 
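As an illustration of the `transform_iterator` fix above, a functor with a `__device__`-only call operator can now be composed with rocThrust algorithms. A minimal sketch of my own follows; it is not code from the release.

```cpp
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <cstdio>

// A __device__-only functor, the kind of operator the fix above is about.
struct square {
    __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> v(1000, 2.0f);

    auto first = thrust::make_transform_iterator(v.begin(), square{});
    auto last  = thrust::make_transform_iterator(v.end(), square{});

    // The reduction runs on the device, so the __device__-only operator is fine.
    float sum_sq = thrust::reduce(first, last, 0.0f);
    std::printf("sum of squares = %f\n", sum_sq);
    return 0;
}
```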
- -### **rocWMMA** (1.2.0) - -#### Changed - -- Fixed a bug with synchronization -- Updated rocWMMA cmake versioning - -### **Tensile** (4.38.0) - -#### Added - -- Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation) -- Added user-driven solution selection feature - -#### Changed - -- Removed DGEMM NT custom kernels and related test cases -- Changed noTailLoop logic to apply noTailLoop only for NT -- Changed the range of AssertFree0ElementMultiple and Free1 -- Unified aStr, bStr generation code in mfmaIter - -#### Optimized - -- Enabled LocalSplitU with MFMA for I8 data type -- Optimized K mask code in mfmaIter -- Enabled TailLoop code in NoLoadLoop to prefetch global/local read -- Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations -- Optimized DirectToLds test cases to reduce the test duration - -#### Resolved issues - -- Fixed LocalSplitU mismatch issue for SGEMM -- Fixed BufferStore=0 and Ldc != Ldd case -- Fixed mismatch issue with TailLoop + MatrixInstB > 1 - -## ROCm 5.6.1 - -See the [ROCm 5.6.1 release notes](https://github.com/ROCm/ROCm/blob/docs/5.6.1/RELEASE.md) -on GitHub for a complete overview of this release. - -### **HIP** (5.6.1) - -#### Resolved issues - -- *hipMemcpy* device-to-device (intra device) is now asynchronous with respect to the host -- Enabled xnack+ check in HIP catch2 tests hang when executing tests -- Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs -- Using *hipGraphAddMemFreeNode* no longer results in a crash - -## ROCm 5.6.0 - -See the [ROCm 5.6.0 release notes](https://github.com/ROCm/ROCm/blob/docs/5.6.0/RELEASE.md) -on GitHub for a complete overview of this release. - -### **AMD SMI** (1.0.0) - -#### Added - -- AMDSMI CLI tool enabled for Linux Bare Metal & Guest - -- Package: amd-smi-lib - -#### Known issues - -- not all Error Correction Code (ECC) fields are currently supported - -- RHEL 8 & SLES 15 have extra install steps - -### **HIP** (5.6.0) - -#### Added - -- Added hipRTC support for amd_hip_fp16 -- Added hipStreamGetDevice implementation to get the device associated with the stream -- Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats -- hipArrayGetInfo for getting information about the specified array -- hipArrayGetDescriptor for getting 1D or 2D array descriptor -- hipArray3DGetDescriptor to get 3D array descriptor - -#### Changed - -- hipMallocAsync to return success for zero size allocation to match hipMalloc -- Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package -- Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide -- Removed hipBusBandwidth and hipCommander samples from hip-tests - -#### Optimized - -- Consolidation of hipamd, rocclr and OpenCL projects in clr -- Optimized lock for graph global capture mode - -#### Resolved issues - -- Fixed regression in hipMemCpyParam3D when offset is applied - -#### Known issues - -- Limited testing on xnack+ configuration - - Multiple HIP tests failures (gpuvm fault or hangs) -- hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU -- Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. 
Issue will be fixed in a future ROCm release - -#### Upcoming changes - -- Removal of gcnarch from hipDeviceProp_t structure -- Addition of new fields in hipDeviceProp_t structure - - maxTexture1D - - maxTexture2D - - maxTexture1DLayered - - maxTexture2DLayered - - sharedMemPerMultiprocessor - - deviceOverlap - - asyncEngineCount - - surfaceAlignment - - unifiedAddressing - - computePreemptionSupported - - uuid -- Removal of deprecated code - - hip-hcc codes from hip code tree -- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA -- HIPMEMCPY_3D fields correction (unsigned int -> size_t) -- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type' - -### **ROCgdb** (13.1) - -#### Optimized - -- Improved performances when handling the end of a process with a large number of threads. - -#### Known issues - -- On certain configurations, ROCgdb can show the following warning message: - - `warning: Probes-based dynamic linker interface failed. Reverting to original interface.` - - This does not affect ROCgdb's functionalities. - -### **hipBLAS** (1.0.0) - -#### Changed - -- added const qualifier to hipBLAS functions (swap, sbmv, spmv, symv, trsm) where missing - -#### Removed - -- removed support for deprecated hipblasInt8Datatype_t enum -- removed support for deprecated hipblasSetInt8Datatype and hipblasGetInt8Datatype functions -- in-place trmm is deprecated. It will be replaced by trmm which includes both in-place and - out-of-place functionality - -### **hipCUB** (2.13.1) - -#### Added - -- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. - -#### Changed - -- CUB backend references CUB and Thrust version 1.17.2. -- Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. -- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). - -#### Known issues - -- `BlockRadixRankMatch` is currently broken under the rocPRIM backend. -- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. - -### **hipFFT** (1.0.12) - -#### Added - -- Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms. - -#### Changed - -- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. - -### **hipSOLVER** (1.8.0) - -#### Added - -- Added compatibility API with hipsolverRf prefix - -### **hipSPARSE** (2.3.6) - -#### Added - -- Added SpGEMM algorithms - -#### Changed - -- For hipsparseXbsr2csr and hipsparseXcsr2bsr, blockDim == 0 now returns HIPSPARSE_STATUS_INVALID_SIZE - -### **MIOpen** (2.19.0) - -#### Added - -- ROCm 5.5 support for gfx1101 (Navi32) - -#### Changed - -- Tuning results for MLIR on ROCm 5.5 -- Bumping MLIR commit to 5.5.0 release tag - -#### Resolved issues - -- Fix 3d convolution Host API bug -- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. - -### **RCCL** (2.15.5) - -#### Added - -- HW-topology aware binary tree implementation -- Experimental support for MSCCL -- New unit tests for hipGraph support -- NPKit integration - -#### Changed - -- Compatibility with NCCL 2.15.5 -- Unit test executable renamed to rccl-UnitTests - -#### Removed - -- Removed TransferBench from tools. 
Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench - -#### Resolved issues - -- rocm-smi ID conversion -- Support for HIP_VISIBLE_DEVICES for unit tests -- Support for p2p transfers to non (HIP) visible devices - -### **rocALUTION** (2.1.9) - -#### Optimized - -- Fixed synchronization issues in level 1 routines - -### **rocBLAS** (3.0.0) - -#### Added - -- Added bf16 inputs and f32 compute support to Level 1 rocBLAS Extension functions axpy_ex, scal_ex and nrm2_ex. - -#### Changed - -- refactor rotg test code -- Dependencies: - - build only dependency on python joblib added as used by Tensile build - - fix for cmake install on some OS when performed by install.sh -d --cmake_install - -#### Removed - -- is_complex helper was deprecated and now removed. Use rocblas_is_complex instead. -- The enum truncate_t and the value truncate was deprecated and now removed from. It was replaced by rocblas_truncate_t and rocblas_truncate, respectively. -- rocblas_set_int8_type_for_hipblas was deprecated and is now removed. -- rocblas_get_int8_type_for_hipblas was deprecated and is now removed. -- trmm inplace is deprecated. It will be replaced by trmm that has both inplace and out-of-place functionality -- rocblas_query_int8_layout_flag() is deprecated and will be removed in a future release -- rocblas_gemm_flags_pack_int8x4 enum is deprecated and will be removed in a future release -- rocblas_set_device_memory_size() is deprecated and will be replaced by a future function rocblas_increase_device_memory_size() -- rocblas_is_user_managing_device_memory() is deprecated and will be removed in a future release - -#### Optimized - -- Improved performance of Level 2 rocBLAS GEMV on gfx90a GPU for non-transposed problems having small matrices and larger batch counts. Performance enhanced for problem sizes when m and n <= 32 and batch_count >= 256. -- Improved performance of rocBLAS syr2k for single, double, and double-complex precision, and her2k for double-complex precision. Slightly improved performance for general sizes on gfx90a. - -#### Resolved issues - -- make trsm offset calculations 64 bit safe - -### **rocFFT** (1.0.23) - -#### Added - -- Implemented half-precision transforms, which can be requested by passing rocfft_precision_half to rocfft_plan_create. -- Implemented a hierarchical solution map which saves how to decompose a problem and the kernels to be used. -- Implemented a first version of offline-tuner to support tuning kernels for C2C/Z2Z problems. - -#### Changed - -- Replaced std::complex with hipComplex data types for data generator. -- FFT plan dimensions are now sorted to be row-major internally where possible, which produces better plans if the dimensions were accidentally specified in a different order (column-major, for example). -- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform. - -#### Resolved issues - -- Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch failure. - -### **rocm-cmake** (0.9.0) - -#### Added - -- Added the option ROCM_HEADER_WRAPPER_WERROR - - Compile-time C macro in the wrapper headers causes errors to be emitted instead of warnings. - - Configure-time CMake option sets the default for the C macro. - -### **rocPRIM** (2.13.0) - -#### Added - -- New block level `radix_rank` primitive. -- New block level `radix_rank_match` primitive. -- Added a stable block sorting implementation. 
This can be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm.

#### Changed

- Improved the performance of `block_radix_sort` and `device_radix_sort`.
- Improved the performance of `device_merge_sort`.
- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). Contributed by: [v01dXYZ](https://github.com/v01dXYZ).

#### Known issues

- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
- When `ROCPRIM_DISABLE_LOOKBACK_SCAN` is set, `device_scan` fails for input sizes bigger than `scan_config::size_limit`, which defaults to `std::numeric_limits<unsigned int>::max()`.

### **ROCprofiler**

In ROCm 5.6 the `rocprofilerv1` and `rocprofilerv2` include and library files of
ROCm 5.5 are split into separate files. The `rocmtools` files that were
deprecated in ROCm 5.5 have been removed.

  | ROCm 5.6 | rocprofilerv1 | rocprofilerv2 |
  |-----------------|-------------------------------------|----------------------------------------|
  | **Tool script** | `bin/rocprof` | `bin/rocprofv2` |
  | **API include** | `include/rocprofiler/rocprofiler.h` | `include/rocprofiler/v2/rocprofiler.h` |
  | **API library** | `lib/librocprofiler.so.1` | `lib/librocprofiler.so.2` |

The ROCm Profiler Tool that uses `rocprofilerV1` can be invoked using the
following command:

```sh
$ rocprof …
```

To write a custom tool based on the `rocprofilerV1` API do the following:

```C
// main.c
#include <rocprofiler/rocprofiler.h> // Use the rocprofilerV1 API
int main() {
  // Use the rocprofilerV1 API
  return 0;
}
```

This can be built in the following manner:

```sh
$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64
```

The resulting `a.out` will depend on
`/opt/rocm-5.6.0/lib/librocprofiler64.so.1`.

The ROCm Profiler Tool that uses the `rocprofilerV2` API can be invoked using the
following command:

```sh
$ rocprofv2 …
```

To write a custom tool based on the `rocprofilerV2` API do the following:

```C
// main.c
#include <rocprofiler/v2/rocprofiler.h> // Use the rocprofilerV2 API
int main() {
  // Use the rocprofilerV2 API
  return 0;
}
```

This can be built in the following manner:

```sh
$ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2
```

The resulting `a.out` will depend on
`/opt/rocm-5.6.0/lib/librocprofiler64.so.2`.

#### Added

- 'end_time' needs to be disabled in roctx_trace.txt

#### Optimized

- Improved Test Suite

#### Resolved issues

- Fixed the rocprof GPU selector, which was broken in ROCm 5.4.0.
- Fixed rocprof failing to generate kernel info in ROCm 5.4.1.
- Fixed rocprof clobbering LD_PRELOAD.

### **rocRAND** (2.10.17)

#### Added

- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator (a minimal host-API sketch follows this list).
- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
- Experimental HIP-CPU feature.
- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
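A minimal host-API sketch of selecting the new MT19937 generator, written as an illustration only: the header path and the `ROCRAND_RNG_PSEUDO_MT19937` enum name are assumptions based on rocRAND's usual naming conventions, so check `rocrand.h` in your installation.

```cpp
#include <rocrand/rocrand.h>   // may be <rocrand.h> depending on the ROCm release
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    float* d_out = nullptr;
    hipMalloc(&d_out, n * sizeof(float));

    rocrand_generator gen;
    // Assumed enum name for the Mersenne Twister generator added above.
    rocrand_create_generator(&gen, ROCRAND_RNG_PSEUDO_MT19937);
    rocrand_set_seed(gen, 1234ULL);
    rocrand_generate_uniform(gen, d_out, n);

    std::vector<float> h(4);
    hipMemcpy(h.data(), d_out, 4 * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("first values: %f %f %f %f\n", h[0], h[1], h[2], h[3]);

    rocrand_destroy_generator(gen);
    hipFree(d_out);
    return 0;
}
```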
- -#### Changed - -- Python 2.7 is no longer officially supported. - -### **rocSOLVER** (3.22.0) - -#### Added - -- LU refactorization for sparse matrices - - CSRRF_ANALYSIS - - CSRRF_SUMLU - - CSRRF_SPLITLU - - CSRRF_REFACTLU -- Linear system solver for sparse matrices - - CSRRF_SOLVE -- Added type `rocsolver_rfinfo` for use with sparse matrix routines - -#### Optimized - -- Improved the performance of BDSQR and GESVD when singular vectors are requested - -#### Resolved issues - -- BDSQR and GESVD should no longer hang when the input contains `NaN` or `Inf` - -### **rocSPARSE** (2.5.2) - -#### Optimized - -- Fixed a memory leak in csritsv -- Fixed a bug in csrsm and bsrsm - -### **rocThrust** (2.18.0) - -#### Changed - -- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). - -#### Resolved issues - -- `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types. - -### **rocWMMA** (1.1.0) - -#### Added - -- Added cross-lane operation backends (Blend, Permute, Swizzle and Dpp) -- Added GPU kernels for rocWMMA unit test pre-process and post-process operations (fill, validation) -- Added performance gemm samples for half, single and double precision -- Added rocWMMA cmake versioning -- Added vectorized support in coordinate transforms -- Included ROCm smi for runtime clock rate detection -- Added fragment transforms for transpose and change data layout - -#### Changed - -- Default to GPU rocBLAS validation against rocWMMA -- Re-enabled int8 gemm tests on gfx9 -- Upgraded to C++17 -- Restructured unit test folder for consistency -- Consolidated rocWMMA samples common code - -### **Tensile** (4.37.0) - -#### Added - -- Added user driven tuning API -- Added decision tree fallback feature -- Added SingleBuffer + AtomicAdd option for GlobalSplitU -- DirectToVgpr support for fp16 and Int8 with TN orientation -- Added new test cases for various functions -- Added SingleBuffer algorithm for ZGEMM/CGEMM -- Added joblib for parallel map calls -- Added support for MFMA + LocalSplitU + DirectToVgprA+B -- Added asmcap check for MIArchVgpr -- Added support for MFMA + LocalSplitU -- Added frequency, power, and temperature data to the output - -#### Changed - -- Updated custom kernels with 64-bit offsets -- Adapted 64-bit offset arguments for assembly kernels -- Improved temporary register re-use to reduce max sgpr usage -- Removed some restrictions on VectorWidth and DirectToVgpr -- Updated the dependency requirements for Tensile -- Changed the range of AssertSummationElementMultiple -- Modified the error messages for more clarity -- Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used -- Removed dummy vgpr for vectorStaticRemainder -- Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder -- Removed qReg parameter from vectorStaticRemainder - -#### Optimized - -- Improved the performance of GlobalSplitU with SingleBuffer algorithm -- Reduced the running time of the extended and pre_checkin tests -- Optimized the Tailloop section of the assembly kernel -- Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch) -- Improved the performance of the second kernel of MultipleBuffer algorithm - -#### Resolved issues - -- Fixed tmp sgpr allocation to avoid over-writing values (alpha) -- 64-bit offset parameters for post kernels -- Fixed gfx908 CI test failures -- Fixed offset calculation to prevent overflow for large 
offsets -- Fixed issues when BufferLoad and BufferStore are equal to zero -- Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch -- Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch -- Fixed the memory access error related to StaggerU + large stride -- Fixed ZGEMM 4x4 MatrixInst mismatch -- Fixed DGEMM 4x4 MatrixInst mismatch -- Fixed ASEM + GSU + NoTailLoop opt mismatch -- Fixed AssertSummationElementMultiple + GlobalSplitU issues -- Fixed ASEM + GSU + TailLoop inner unroll - -## ROCm 5.5.1 - -See the [ROCm 5.5.1 changelog](https://github.com/ROCm/ROCm/blob/docs/5.5.1/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.5.0 - -See the [ROCm 5.5.0 changelog](https://github.com/ROCm/ROCm/blob/docs/5.5.0/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **hipBLAS** (0.54.0) - -#### Added - -- added option to opt-in to use __half for hipblasHalf type in the API for c++ users who define HIPBLAS_USE_HIP_HALF -- added scripts to plot performance for multiple functions -- data driven hipblas-bench and hipblas-test execution via external yaml format data files -- client smoke test added for quick validation using command hipblas-test --yaml hipblas_smoke.yaml - -#### Changed - -- changed reference code for Windows to OpenBLAS -- hipblas client executables all now begin with hipblas- prefix - -#### Resolved issues - -- fixed datatype conversion functions to support more rocBLAS/cuBLAS datatypes -- fixed geqrf to return successfully when nullptrs are passed in with n == 0 || m == 0 -- fixed getrs to return successfully when given nullptrs with corresponding size = 0 -- fixed getrs to give info = -1 when transpose is not an expected type -- fixed gels to return successfully when given nullptrs with corresponding size = 0 -- fixed gels to give info = -1 when transpose is not in ('N', 'T') for real cases or not in ('N', 'C') for complex cases - -### **hipCUB** (2.13.1) - -#### Added - -- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`. - -#### Changed - -- CUB backend references CUB and Thrust version 1.17.2. -- Improved benchmark coverage of `BlockScan` by adding `ExclusiveScan`, benchmark coverage of `BlockRadixSort` by adding `SortBlockedToStriped`, and benchmark coverage of `WarpScan` by adding `Broadcast`. - -#### Resolved issues - -- Windows HIP SDK support - -#### Known Issues - -- `BlockRadixRankMatch` is currently broken under the rocPRIM backend. -- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken under the CUB backend. - -### **hipFFT** (1.0.11) - -#### Resolved issues - -- Fixed old version rocm include/lib folders not removed on upgrade. 
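Relating to the hipBLAS 0.54.0 item above about opting in to `__half` for `hipblasHalf`: the sketch below shows the opt-in as I understand it from the changelog wording. The header path varies by release (it may be `<hipblas.h>` on older installs), and the exact aliasing behavior should be confirmed against the hipBLAS header.

```cpp
// Define the opt-in macro before including the hipBLAS header.
#define HIPBLAS_USE_HIP_HALF
#include <hipblas/hipblas.h>   // may be <hipblas.h> on ROCm 5.x installs
#include <hip/hip_fp16.h>

int main() {
    // With the macro defined, hipblasHalf is HIP's __half, so the fp16 helpers
    // apply directly to values passed through the hipBLAS API.
    hipblasHalf alpha = __float2half(1.0f);
    (void)alpha;
    return 0;
}
```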
- -### **hipSOLVER** (1.7.0) - -#### Added - -- Added functions - - gesvdj - - hipsolverSgesvdj_bufferSize, hipsolverDgesvdj_bufferSize, hipsolverCgesvdj_bufferSize, hipsolverZgesvdj_bufferSize - - hipsolverSgesvdj, hipsolverDgesvdj, hipsolverCgesvdj, hipsolverZgesvdj - - gesvdjBatched - - hipsolverSgesvdjBatched_bufferSize, hipsolverDgesvdjBatched_bufferSize, hipsolverCgesvdjBatched_bufferSize, hipsolverZgesvdjBatched_bufferSize - - hipsolverSgesvdjBatched, hipsolverDgesvdjBatched, hipsolverCgesvdjBatched, hipsolverZgesvdjBatched - -### **hipSPARSE** (2.3.5) - -#### Optimized - -- Fixed an issue, where the rocm folder was not removed on upgrade of meta packages -- Fixed a compilation issue with cusparse backend -- Added more detailed messages on unit test failures due to missing input data -- Improved documentation -- Fixed a bug with deprecation messages when using gcc9 (Thanks @Maetveis) - -### **MIOpen** (2.19.0) - -#### Added - -- ROCm 5.5 support for gfx1101 (Navi32) - -#### Changed - -- Tuning results for MLIR on ROCm 5.5 -- Bumping MLIR commit to 5.5.0 release tag - -#### Resolved issues - -- Fix 3d convolution Host API bug -- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required. - -### **RCCL** (2.15.5) - -#### Added - -- HW-topology aware binary tree implementation -- Experimental support for MSCCL -- New unit tests for hipGraph support -- NPKit integration - -#### Changed - -- Compatibility with NCCL 2.15.5 -- Unit test executable renamed to rccl-UnitTests - -#### Removed - -- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCmSoftwarePlatform/TransferBench - -#### Resolved issues - -- rocm-smi ID conversion -- Support for HIP_VISIBLE_DEVICES for unit tests -- Support for p2p transfers to non (HIP) visible devices - -### **rocALUTION** (2.1.8) - -#### Added - -- Added build support for Navi32 - -#### Changed - -- LocalVector::GetIndexValues(ValueType\*) is deprecated, use LocalVector::GetIndexValues(const LocalVector&, LocalVector\*) instead -- LocalVector::SetIndexValues(const ValueType\*) is deprecated, use LocalVector::SetIndexValues(const LocalVector&, const LocalVector&) instead -- LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix\*) instead -- LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, LocalMatrix\*) instead -- LocalMatrix::RugeStueben() is deprecated -- LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, LocalMatrix\*, int) is deprecated, use LocalMatrix::AMGAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix\*, int) instead -- LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*, LocalMatrix\*) is deprecated, use LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix\*) instead - -#### Optimized - -- Fixed a typo in MPI backend -- Fixed a bug with the backend when HIP support is disabled -- Fixed a bug in SAAMG hierarchy building on HIP backend -- Improved SAAMG hierarchy build performance on HIP backend - -### **rocBLAS** (2.47.0) - -#### Added - -- added functionality rocblas_geam_ex for matrix-matrix minimum operations -- added HIP Graph support as beta feature for rocBLAS Level 1, Level 2, 
and Level 3(pointer mode host) functions -- added beta features API. Exposed using compiler define ROCBLAS_BETA_FEATURES_API -- added support for vector initialization in the rocBLAS test framework with negative increments -- added windows build documentation for forthcoming support using ROCm HIP SDK -- added scripts to plot performance for multiple functions - -#### Changed - -- install.sh internally runs rmake.py (also used on windows) and rmake.py may be used directly by developers on linux (use --help) -- rocblas client executables all now begin with rocblas- prefix - -#### Removed - -- install.sh removed options -o --cov as now Tensile will use the default COV format, set by cmake define Tensile_CODE_OBJECT_VERSION=default - -#### Optimized - -- improved performance of Level 2 rocBLAS GEMV for float and double precision. Performance enhanced by 150-200% for certain problem sizes when (m==n) measured on a gfx90a GPU. -- improved performance of Level 2 rocBLAS GER for float, double and complex float precisions. Performance enhanced by 5-7% for certain problem sizes measured on a gfx90a GPU. -- improved performance of Level 2 rocBLAS SYMV for float and double precisions. Performance enhanced by 120-150% for certain problem sizes measured on both gfx908 and gfx90a GPUs. - -#### Resolved issues - -- fixed setting of executable mode on client script rocblas_gentest.py to avoid potential permission errors with clients rocblas-test and rocblas-bench -- fixed deprecated API compatibility with Visual Studio compiler -- fixed test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory - -### **rocFFT** (1.0.22) - -#### Added - -- Added gfx1101 to default AMDGPU_TARGETS. - -#### Changed - -- Moved client programs to C++17. -- Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled. -- Moved transpose, real-complex, Bluestein, and Stockham kernels to library kernel cache. - -#### Optimized - -- Improved performance of 1D lengths < 2048 that use Bluestein's algorithm. -- Reduced time for generating code during plan creation. -- Optimized 3D R2C/C2R lengths 32, 84, 128. -- Optimized batched small 1D R2C/C2R cases. - -#### Resolved issues - -- Removed zero-length twiddle table allocations, which fixes errors from hipMallocManaged. -- Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are present. - -### **rocm-cmake** (0.8.1) - -#### Changed - -- ROCMHeaderWrapper: The wrapper header deprecation message is now a deprecation warning. - -#### Resolved issues - -- ROCMInstallTargets: Added compatibility symlinks for included cmake files in `<ROCM>/lib/cmake/<PACKAGE>`. - -### **rocPRIM** (2.13.0) - -#### Added - -- New block level `radix_rank` primitive. -- New block level `radix_rank_match` primitive. - -#### Changed - -- Improved the performance of `block_radix_sort` and `device_radix_sort`. - -#### Resolved issues - -- Fixed benchmark build on Windows - -#### Known issues - -- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows. - -### **rocRAND** (2.10.17) - -#### Added - -- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. 
-- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`. -- experimental HIP-CPU feature -- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3". - -#### Changed - -- Python 2.7 is no longer officially supported. - -#### Fixed - -- Windows HIP SDK support - -### **rocSOLVER** (3.21.0) - -#### Added - -- SVD for general matrices using Jacobi algorithm: - - GESVDJ (with batched and strided\_batched versions) -- LU factorization without pivoting for block tridiagonal matrices: - - GEBLTTRF_NPVT (with batched and strided\_batched versions) -- Linear system solver without pivoting for block tridiagonal matrices: - - GEBLTTRS_NPVT (with batched and strided\_batched, versions) -- Product of triangular matrices - - LAUUM -- Added experimental hipGraph support for rocSOLVER functions - -#### Optimized - -- Improved the performance of SYEVJ/HEEVJ. - -#### Changed - -- STEDC, SYEVD/HEEVD and SYGVD/HEGVD now use fully implemented Divide and Conquer approach. - -#### Fixed - -- SYEVJ/HEEVJ should now be invariant under matrix scaling. -- SYEVJ/HEEVJ should now properly output the eigenvalues when no sweeps are executed. -- Fixed GETF2\_NPVT and GETRF\_NPVT input data initialization in tests and benchmarks. -- Fixed rocblas missing from the dependency list of the rocsolver deb and rpm packages. - -### **rocSPARSE** (2.5.1) - -#### Added - -- Added bsrgemm and spgemm for BSR format -- Added bsrgeam -- Added build support for Navi32 -- Added experimental hipGraph support for some rocSPARSE routines -- Added csritsv, spitsv csr iterative triangular solve -- Added mixed precisions for SpMV -- Added batched SpMM for transpose A in COO format with atomic atomic algorithm - -#### Improved - -- Optimization to csr2bsr -- Optimization to csr2csr_compress -- Optimization to csr2coo -- Optimization to gebsr2csr -- Optimization to csr2gebsr -- Fixes to documentation -- Fixes a bug in COO SpMV gridsize -- Fixes a bug in SpMM gridsize when using very large matrices - -#### Known issues - -- In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split. 
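-
-As a quick illustration of the experimental hipGraph support mentioned in the rocSOLVER and rocSPARSE entries above, the feature is exercised through ordinary HIP stream capture. The following is a minimal sketch only: the header path, the choice of `rocsolver_dgetrf` as the captured routine, and the assumption that its workspace handling is compatible with capture are not spelled out by the changelog.
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <rocsolver/rocsolver.h> // assumption: wrapped header path; <rocsolver.h> also works in this era
-
-int main()
-{
-    rocblas_handle handle;
-    rocblas_create_handle(&handle);
-    hipStream_t stream;
-    hipStreamCreate(&stream);
-    rocblas_set_stream(handle, stream); // rocSOLVER work is submitted to this stream
-
-    const rocblas_int n = 64;
-    double*      dA    = nullptr;
-    rocblas_int* dIpiv = nullptr;
-    rocblas_int* dInfo = nullptr;
-    hipMalloc(&dA, sizeof(double) * n * n);
-    hipMalloc(&dIpiv, sizeof(rocblas_int) * n);
-    hipMalloc(&dInfo, sizeof(rocblas_int));
-    // ... copy the matrix to factorize into dA ...
-
-    // Capture the library call into a graph, then replay it.
-    hipGraph_t     graph = nullptr;
-    hipGraphExec_t exec  = nullptr;
-    hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
-    rocsolver_dgetrf(handle, n, n, dA, n, dIpiv, dInfo); // hypothetical choice of captured routine
-    hipStreamEndCapture(stream, &graph);
-    hipGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
-    hipGraphLaunch(exec, stream);
-    hipStreamSynchronize(stream);
-
-    hipGraphExecDestroy(exec);
-    hipGraphDestroy(graph);
-    hipFree(dInfo); hipFree(dIpiv); hipFree(dA);
-    hipStreamDestroy(stream);
-    rocblas_destroy_handle(handle);
-    return 0;
-}
-```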
- -### **rocWMMA** (1.0) - -#### Added - -- Added support for wave32 on gfx11+ -- Added infrastructure changes to support hipRTC -- Added performance tracking system - -#### Changed - -- Modified the assignment of hardware information -- Modified the data access for unsigned datatypes -- Added library config to support multiple architectures - -### **Tensile** (4.36.0) - -#### Added - -- Add functions for user-driven tuning -- Add GFX11 support: HostLibraryTests yamls, rearragne FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimize, separate vscnt and vmcnt, dual mac -- Add binary search for Grid-Based algorithm -- Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2) -- Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4) -- Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1) -- Add GSU SingleBuffer algorithm for HSS/BSS -- Add gfx900:xnack-, gfx1032, gfx1034, gfx1035 -- Enable gfx1031 support - -#### Changed - -- Use global_atomic for GSU instead of flat and global_store for debug code -- Replace flat_load/store with global_load/store -- Use global_load/store for BufferLoad/Store=0 and enable scheduling -- LocalSplitU support for HGEMM+HPA when MFMA disabled -- Update Code Object Version -- Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss -- Update asm cap cache arguments -- Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead -- Change checks, error messages, assembly syntax, and coverage for DirectToLds -- Remove unused cmake file -- Clean up the LLVM dependency code -- Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2 -- Update sgemm/hgemm test cases for DirectToLds and ThreadSepareteGlobalRead - -#### Optimized - -- Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1 -- Improve InitAccVgprOpt - -#### Resolved issues - -- Add build-id to header of compiled source kernels -- Fix solution index collisions -- Fix h beta vectorwidth4 correctness issue for WMMA -- Fix an error with BufferStore=0 -- Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2) -- Fix MoveMIoutToArch bug -- Fix flat load correctness issue on I8 and flat store correctness issue -- Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes -- Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop -- Fix issues with DirectToVgpr + ScheduleIterAlg<3 -- Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2 -- Fix mismatch issue with PrefetchGlobalRead=2 -- Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size -- Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1 -- Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case -- Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical -- Fix for failing CI tests due to CpuThreads=0 -- Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2 -- Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only) -- Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only) - -## ROCm 5.4.3 - -See the [ROCm 5.4.3 changelog](https://github.com/ROCm/ROCm/blob/docs/5.4.3/CHANGELOG.md) -on GitHub for a 
complete overview of this release. - -### **rocFFT** (1.0.21) - -#### Resolved issues - -- Removed source directory from rocm_install_targets call to prevent installation of rocfft.h in an unintended location. - -## ROCm 5.4.2 - -See the [ROCm 5.4.2 changelog](https://github.com/ROCm/ROCm/blob/docs/5.4.2/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.4.1 - -See the [ROCm 5.4.1 changelog](https://github.com/ROCm/ROCm/blob/docs/5.4.1/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **rocFFT** (1.0.20) - -#### Fixed - -- Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride. - -## ROCm 5.4.0 - -See the [ROCm 5.4.0 changelog](https://github.com/ROCm/ROCm/blob/docs/5.4.0/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **hipBLAS** (0.53.0) - -#### Added - -- Allow for selection of int8 datatype -- Added support for hipblasXgels and hipblasXgelsStridedBatched operations (with s,d,c,z precisions), - only supported with rocBLAS backend -- Added support for hipblasXgelsBatched operations (with s,d,c,z precisions) - -### **hipCUB** (2.13.0) - -#### Added - -- CMake functionality to improve build parallelism of the test suite that splits compilation units by -function or by parameters. -- New overload for `BlockAdjacentDifference::SubtractLeftPartialTile` that takes a predecessor item. - -#### Changed - -- Improved build parallelism of the test suite by splitting up large compilation units for `DeviceRadixSort`, -`DeviceSegmentedRadixSort` and `DeviceSegmentedSort`. -- CUB backend references CUB and thrust version 1.17.1. - -### **hipFFT** (1.0.10) - -#### Added - -- Added hipfftExtPlanScaleFactor API to efficiently multiply each output element of a FFT by a given scaling factor. Result scaling must be supported in the backend FFT library. - -#### Changed - -- When hipFFT is built against the rocFFT backend, rocFFT 1.0.19 or higher is now required. 
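-
-The `hipfftExtPlanScaleFactor` call added above folds a user-supplied scale into the transform itself, so no separate scaling kernel is needed. A minimal sketch, assuming the wrapped `<hipfft/hipfft.h>` header path and that the scale factor must be registered after `hipfftCreate` but before the plan is built:
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <hipfft/hipfft.h>
-
-int main()
-{
-    const int n = 4096;
-    hipfftDoubleComplex* data = nullptr;
-    hipMalloc(&data, sizeof(hipfftDoubleComplex) * n);
-    // ... fill data with the input signal ...
-
-    hipfftHandle plan;
-    hipfftCreate(&plan);
-    hipfftExtPlanScaleFactor(plan, 1.0 / n); // ask the backend to scale every output element by 1/n
-    size_t worksize = 0;
-    hipfftMakePlan1d(plan, n, HIPFFT_Z2Z, 1, &worksize);
-
-    hipfftExecZ2Z(plan, data, data, HIPFFT_FORWARD); // in-place, already normalized on output
-    hipDeviceSynchronize();
-
-    hipfftDestroy(plan);
-    hipFree(data);
-    return 0;
-}
-```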
- -### **hipSOLVER** (1.6.0) - -#### Added - -- Added compatibility-only functions - - gesvdaStridedBatched - - hipsolverDnSgesvdaStridedBatched_bufferSize, hipsolverDnDgesvdaStridedBatched_bufferSize, hipsolverDnCgesvdaStridedBatched_bufferSize, hipsolverDnZgesvdaStridedBatched_bufferSize - - hipsolverDnSgesvdaStridedBatched, hipsolverDnDgesvdaStridedBatched, hipsolverDnCgesvdaStridedBatched, hipsolverDnZgesvdaStridedBatched - -### **hipSPARSE** (2.3.3) - -#### Added - -- Added hipsparseCsr2cscEx2_bufferSize and hipsparseCsr2cscEx2 routines - -#### Changed - -- HIPSPARSE_ORDER_COLUMN has been renamed to HIPSPARSE_ORDER_COL to match cusparse - -### **RCCL** (2.13.4) - -#### Changed - -- Compatibility with NCCL 2.13.4 -- Improvements to RCCL when running with hipGraphs -- RCCL_ENABLE_HIPGRAPH environment variable is no longer necessary to enable hipGraph support -- Minor latency improvements - -#### Resolved issues - -- Resolved potential memory access error due to asynchronous memset - -### **rocALUTION** (2.1.3) - -#### Added - -- Added build support for Navi31 and Navi33 -- Added support for non-squared global matrices - -#### Changed - -- Switched GTest death test style to 'threadsafe' -- GlobalVector::GetGhostSize() is deprecated and will be removed -- ParallelManager::GetGlobalSize(), ParallelManager::GetLocalSize(), ParallelManager::SetGlobalSize() and ParallelManager::SetLocalSize() are deprecated and will be removed -- Vector::GetGhostSize() is deprecated and will be removed -- Multigrid::SetOperatorFormat(unsigned int) is deprecated and will be removed, use Multigrid::SetOperatorFormat(unsigned int, int) instead -- RugeStuebenAMG::SetCouplingStrength(ValueType) is deprecated and will be removed, use SetStrengthThreshold(float) instead - -#### Optimized - -- Fixed a memory leak in MatrixMult on HIP backend -- Global structures can now be used with a single process - -### **rocBLAS** (2.46.0) - -#### Added - -- client smoke test dataset added for quick validation using command rocblas-test --yaml rocblas_smoke.yaml -- Added stream order device memory allocation as a non-default beta option. - -#### Changed - -- Level 2, Level 1, and Extension functions: argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour. -- Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm -- Allow hipBLAS to select int8 datatype -- Disallow B == C && ldb != ldc in rocblas_xtrmm_outofplace - -#### Optimized - -- Improved trsm performance for small sizes by using a substitution method technique -- Improved syr2k and her2k performance significantly by using a block-recursive algorithm - -#### Fixed - -- FORTRAN interfaces generalized for FORTRAN compilers other than gfortran -- fix for trsm_strided_batched rocblas-bench performance gathering -- Fix for rocm-smi path in commandrunner.py script to match ROCm 5.2 and above - -### **rocFFT** (1.0.19) - -#### Added - -- Added rocfft_plan_description_set_scale_factor API to efficiently multiply each output element of a FFT by a given scaling factor. -- Created a rocfft_kernel_cache.db file next to the installed library. 
SBCC kernels are moved to this file when built with the library, and are runtime-compiled for new GPU architectures. -- Added gfx1100 and gfx1102 to default AMDGPU_TARGETS. - -#### Changed - -- Moved runtime compilation cache to in-memory by default. A default on-disk cache can encounter contention problems -on multi-node clusters with a shared filesystem. rocFFT can still be told to use an on-disk cache by setting the -ROCFFT_RTC_CACHE_PATH environment variable. - -#### Optimized - -- Optimized some strided large 1D plans. - -### **rocPRIM** (2.12.0) - -#### Changed - -- `device_partition`, `device_unique`, and `device_reduce_by_key` now support problem - sizes larger than 2^32 items. - -#### Removed - -- `block_sort::sort()` overload for keys and values with a dynamic size. This overload was documented but the - implementation is missing. To avoid further confusion the documentation is removed until a decision is made on - implementing the function. - -#### Resolved issues - -- Fixed the compilation failure in `device_merge` if the two key iterators don't match. - -### **rocRAND** (2.10.16) - -#### Added - -- MRG31K3P pseudorandom number generator based on L'Ecuyer and Touzin, 2000, "Fast combined multiple recursive generators with multipliers of the form a = ±2q ±2r". -- LFSR113 pseudorandom number generator based on L'Ecuyer, 1999, "Tables of maximally equidistributed combined LFSR generators". -- SCRAMBLED_SOBOL32 and SCRAMBLED_SOBOL64 quasirandom number generators. The Scrambled Sobol sequences are generated by scrambling the output of a Sobol sequence. - -#### Changed - -- The `mrg_<distribution>_distribution` structures, which provided numbers based on MRG32K3A, are now replaced by `mrg_engine_<distribution>_distribution`, where `<distribution>` is `log_normal`, `normal`, `poisson`, or `uniform`. These structures provide numbers for MRG31K3P (with template type `rocrand_state_mrg31k3p`) and MRG32K3A (with template type `rocrand_state_mrg32k3a`). - -#### Resolved issues - -- Sobol64 now returns 64 bits random numbers, instead of 32 bits random numbers. As a result, the performance of this generator has regressed. -- Fixed a bug that prevented compiling code in C++ mode (with a host compiler) when it included the rocRAND headers on Windows. - -### **rocSOLVER** (3.20.0) - -#### Added - -- Partial SVD for bidiagonal matrices: - - BDSVDX -- Partial SVD for general matrices: - - GESVDX (with batched and strided\_batched versions) - -#### Changed - -- Changed `ROCSOLVER_EMBED_FMT` default to `ON` for users building directly with CMake. - This matches the existing default when building with install.sh or rmake.py. - -### **rocSPARSE** (2.4.0) - -#### Added - -- Added rocsparse_spmv_ex routine -- Added rocsparse_bsrmv_ex_analysis and rocsparse_bsrmv_ex routines -- Added csritilu0 routine -- Added build support for Navi31 and Navi 33 - -#### Optimized - -- Optimization to segmented algorithm for COO SpMV by performing analysis -- Improve performance when generating random matrices. -- Fixed bug in ellmv -- Optimized bsr2csr routine -- Fixed integer overflow bugs - -### **rocThrust** (2.17.0) - -#### Added - -- Updated to match upstream Thrust 1.17.0 - -### **rocWMMA** (0.9) - -#### Added - -- Added gemm driver APIs for flow control builtins -- Added benchmark logging systems -- Restructured tests to follow naming convention. 
Added macros for test generation - -#### Changed - -- Changed CMake to accomodate the modified test infrastructure -- Fine tuned the multi-block kernels with and without lds -- Adjusted Maximum Vector Width to dWordx4 Width -- Updated Efficiencies to display as whole number percentages -- Updated throughput from GFlops/s to TFlops/s -- Reset the ad-hoc tests to use smaller sizes -- Modified the output validation to use CPU-based implementation against rocWMMA -- Modified the extended vector test to return error codes for memory allocation failures - -### **Tensile** (4.35.0) - -#### Added - -- Async DMA support for Transpose Data Layout (ThreadSeparateGlobalReadA/B) -- Option to output library logic in dictionary format -- No solution found error message for benchmarking client -- Exact K check for StoreCInUnrollExact -- Support for CGEMM + MIArchVgpr -- client-path parameter for using prebuilt client -- CleanUpBuildFiles global parameter -- Debug flag for printing library logic index of winning solution -- NumWarmups global parameter for benchmarking -- Windows support for benchmarking client -- DirectToVgpr support for CGEMM -- TensileLibLogicToYaml for creating tuning configs from library logic solutions - -#### Changed - -- Re-enable HardwareMonitor for gfx90a -- Decision trees use MLFeatures instead of Properties - -#### Optimized - -- Put beta code and store separately if StoreCInUnroll = x4 store -- Improved performance for StoreCInUnroll + b128 store - -#### Resolved issues - -- Reject DirectToVgpr + MatrixInstBM/BN > 1 -- Fix benchmark timings when using warmups and/or validation -- Fix mismatch issue with DirectToVgprB + VectorWidth > 1 -- Fix mismatch issue with DirectToLds + NumLoadsCoalesced > 1 + TailLoop -- Fix incorrect reject condition for DirectToVgpr -- Fix reject condition for DirectToVgpr + MIWaveTile < VectorWidth -- Fix incorrect instruction generation with StoreCInUnroll - -## ROCm 5.3.3 - -See the [ROCm 5.3.3 changelog](https://github.com/ROCm/ROCm/blob/docs/5.3.3/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.3.2 - -See the [ROCm 5.3.2 changelog](https://github.com/ROCm/ROCm/blob/docs/5.3.2/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.3.0 - -See the [ROCm 5.3.0 changelog](https://github.com/ROCm/ROCm/blob/docs/5.3.0/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **hipBLAS** (0.52.0) - -#### Added - -- Added --cudapath option to install.sh to allow user to specify which cuda build they would like to use. -- Added --installcuda option to install.sh to install cuda via a package manager. Can be used with new --installcudaversion - option to specify which version of cuda to install. - -#### Resolved issues - -- Fixed #includes to support a compiler version. -- Fixed client dependency support in install.sh - -### **hipCUB** (2.12.0) - -#### Added - -- UniqueByKey device algorithm -- SubtractLeft, SubtractLeftPartialTile, SubtractRight, SubtractRightPartialTile overloads in BlockAdjacentDifference. - - The old overloads (FlagHeads, FlagTails, FlagHeadsAndTails) are deprecated. -- DeviceAdjacentDifference algorithm. -- Extended benchmark suite of `DeviceHistogram`, `DeviceScan`, `DevicePartition`, `DeviceReduce`, -`DeviceSegmentedReduce`, `DeviceSegmentedRadixSort`, `DeviceRadixSort`, `DeviceSpmv`, `DeviceMergeSort`, -`DeviceSegmentedSort` - -#### Changed - -- Obsolated type traits defined in util_type.hpp. Use the standard library equivalents instead. 
-- CUB backend references CUB and thrust version 1.16.0. -- DeviceRadixSort's num_items parameter's type is now templated instead of being an int. - - If an integral type with a size at most 4 bytes is passed (i.e. an int), the former logic applies. - - Otherwise the algorithm uses a larger indexing type that makes it possible to sort input data over 2**32 elements. -- Improved build parallelism of the test suite by splitting up large compilation units - -### **hipFFT** (1.0.9) - -#### Changed - -- Clean up build warnings. -- GNUInstall Dir enhancements. -- Requires gtest 1.11. - -### **hipSOLVER** (1.5.0) - -#### Added - -- Added functions - - syevj - - hipsolverSsyevj_bufferSize, hipsolverDsyevj_bufferSize, hipsolverCheevj_bufferSize, hipsolverZheevj_bufferSize - - hipsolverSsyevj, hipsolverDsyevj, hipsolverCheevj, hipsolverZheevj - - syevjBatched - - hipsolverSsyevjBatched_bufferSize, hipsolverDsyevjBatched_bufferSize, hipsolverCheevjBatched_bufferSize, hipsolverZheevjBatched_bufferSize - - hipsolverSsyevjBatched, hipsolverDsyevjBatched, hipsolverCheevjBatched, hipsolverZheevjBatched - - sygvj - - hipsolverSsygvj_bufferSize, hipsolverDsygvj_bufferSize, hipsolverChegvj_bufferSize, hipsolverZhegvj_bufferSize - - hipsolverSsygvj, hipsolverDsygvj, hipsolverChegvj, hipsolverZhegvj -- Added compatibility-only functions - - syevdx/heevdx - - hipsolverDnSsyevdx_bufferSize, hipsolverDnDsyevdx_bufferSize, hipsolverDnCheevdx_bufferSize, hipsolverDnZheevdx_bufferSize - - hipsolverDnSsyevdx, hipsolverDnDsyevdx, hipsolverDnCheevdx, hipsolverDnZheevdx - - sygvdx/hegvdx - - hipsolverDnSsygvdx_bufferSize, hipsolverDnDsygvdx_bufferSize, hipsolverDnChegvdx_bufferSize, hipsolverDnZhegvdx_bufferSize - - hipsolverDnSsygvdx, hipsolverDnDsygvdx, hipsolverDnChegvdx, hipsolverDnZhegvdx -- Added --mem_query option to hipsolver-bench, which will print the amount of device memory workspace required by the function. - -#### Changed - -- The rocSOLVER backend will now set `info` to zero if rocSOLVER does not reference `info`. (Applies to orgbr/ungbr, orgqr/ungqr, orgtr/ungtr, ormqr/unmqr, ormtr/unmtr, gebrd, geqrf, getrs, potrs, and sytrd/hetrd). -- gesvdj will no longer require extra workspace to transpose `V` when `jobz` is `HIPSOLVER_EIG_MODE_VECTOR` and `econ` is 1. - -#### Fixed - -- Fixed Fortran return value declarations within hipsolver_module.f90 -- Fixed gesvdj_bufferSize returning `HIPSOLVER_STATUS_INVALID_VALUE` when `jobz` is `HIPSOLVER_EIG_MODE_NOVECTOR` and 1 <= `ldv` < `n` -- Fixed gesvdj returning `HIPSOLVER_STATUS_INVALID_VALUE` when `jobz` is `HIPSOLVER_EIG_MODE_VECTOR`, `econ` is 1, and `m` < `n` - -### **hipSPARSE** (2.3.1) - -#### Added - -- Add SpMM and SpMM batched for CSC format - -### **rocALUTION** (2.1.0) - -#### Added - -- Benchmarking tool -- Ext+I Interpolation with sparsify strategies added for RS-AMG - -#### Optimized - -- ParallelManager - -### **rocBLAS** (2.45.0) - -#### Added - -- install.sh option --upgrade_tensile_venv_pip to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is TENSILE_VENV_UPGRADE_PIP. -- install.sh option --relocatable or -r adds rpath and removes ldconf entry on rocBLAS build. -- install.sh option --lazy-library-loading to enable on-demand loading of tensile library files at runtime to speedup rocBLAS initialization. -- Support for RHEL9 and CS9. 
-- Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value. - -#### Changed - -- Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now. -- Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour. -- Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset. -- Add quick-return for internal trmm and gemm template functions. -- Moved function block sizes to a shared header file. -- Level 1, 2, and 3 functions use rocblas_stride datatype for offset. -- Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions. -- Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure. -- Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions. - -#### Removed - -- install.sh options --hip-clang , --no-hip-clang, --merge-files, --no-merge-files are removed. -- is_complex helper is now deprecated. Use rocblas_is_complex instead. -- The enum truncate_t and the value truncate is now deprecated and will removed from the ROCm release 6.0. It is replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate could be used from this ROCm release for an easy transition. - -#### Optimized - -- trmm_outofplace performance improvements for all sizes and data types using block-recursive algorithm. -- herkx performance improvements for all sizes and data types using block-recursive algorithm. -- syrk/herk performance improvements by utilising optimised syrkx/herkx code. -- symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm. - -#### Resolved issues - -- Improved logic to #include <filesystem> vs <experimental/filesystem>. -- install.sh -s option to build rocblas as a static library. -- dot function now sets the device results asynchronously for N <= 0 - -### **rocFFT** (1.0.18) - -#### Changed - -- Runtime compilation cache now looks for environment variables XDG_CACHE_HOME (on Linux) and LOCALAPPDATA (on - Windows) before falling back to HOME. - -#### Optimized - -- Optimized 2D R2C/C2R to use 2-kernel plans where possible. -- Improved performance of the Bluestein algorithm. -- Optimized sbcc-168 and 100 by using half-lds. - -#### Resolved issues - -- Fixed occasional failures to parallelize runtime compilation of kernels. - Failures would be retried serially and ultimately succeed, but this would take extra time. -- Fixed failures of some R2C 3D transforms that use the unsupported TILE_UNALGNED SBRC kernels. - An example is 98^3 R2C out-of-place. -- Fixed bugs in SBRC_ERC type. 
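-
-The in-memory RTC cache change above means runtime-compiled kernels are no longer persisted between runs unless you opt back in. A minimal sketch of opting in from the application itself (the cache file location is only an example; exporting the variable in the shell works equally well):
-
-```cpp
-#include <cstdlib>
-#include <rocfft/rocfft.h> // assumption: wrapped header path; this era also installs <rocfft.h>
-
-int main()
-{
-    // Re-enable the on-disk cache before the library is initialized.
-    setenv("ROCFFT_RTC_CACHE_PATH", "/tmp/rocfft_rtc_cache.db", /*overwrite=*/1);
-
-    rocfft_setup();
-    // ... create and execute plans as usual; runtime-compiled kernels are now
-    // written to the file above instead of being kept only in memory ...
-    rocfft_cleanup();
-    return 0;
-}
-```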
-
-### **rocm-cmake** (0.8.0)
-
-#### Changed
-
-- `ROCM_USE_DEV_COMPONENT` set to on by default for all platforms. This means that Windows will now generate runtime and devel packages by default.
-- ROCMInstallTargets now defaults `CMAKE_INSTALL_LIBDIR` to `lib` if not otherwise specified.
-- Changed default Debian compression type to xz and enabled multi-threaded package compression.
-- `rocm_create_package` will no longer warn upon failure to determine version of program rpmbuild.
-
-#### Resolved issues
-
-- Fixed error in prerm scripts created by `rocm_create_package` that could break uninstall for packages using the `PTH` option.
-
-### **rocPRIM** (2.11.0)
-
-#### Added
-
-- New functions `subtract_left` and `subtract_right` in `block_adjacent_difference` to apply functions on pairs of adjacent items distributed between threads in a block.
-- New device level `adjacent_difference` primitives.
-- Added experimental tooling for automatic kernel configuration tuning for various architectures
-- Benchmarks collect and output more detailed system information
-- CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.
-- Reverse iterator.
-
-### **rocRAND** (2.10.15)
-
-#### Changed
-
-- Increased number of warmup iterations for rocrand_benchmark_generate from 5 to 15 to eliminate corner cases that would generate artificially high benchmark scores.
-
-### **rocSOLVER** (3.19.0)
-
-#### Added
-
-- Partial eigensolver routines for symmetric/hermitian matrices:
-  - SYEVX (with batched and strided\_batched versions)
-  - HEEVX (with batched and strided\_batched versions)
-- Generalized symmetric- and hermitian-definite partial eigensolvers:
-  - SYGVX (with batched and strided\_batched versions)
-  - HEGVX (with batched and strided\_batched versions)
-- Eigensolver routines for symmetric/hermitian matrices using Jacobi algorithm:
-  - SYEVJ (with batched and strided\_batched versions)
-  - HEEVJ (with batched and strided\_batched versions)
-- Generalized symmetric- and hermitian-definite eigensolvers using Jacobi algorithm:
-  - SYGVJ (with batched and strided\_batched versions)
-  - HEGVJ (with batched and strided\_batched versions)
-- Added --profile_kernels option to rocsolver-bench, which will include kernel calls in the profile log (if profile logging is enabled with --profile).
-
-#### Changed
-
-- Changed rocsolver-bench result labels `cpu_time` and `gpu_time` to `cpu_time_us` and `gpu_time_us`, respectively.
-
-#### Removed
-
-- Removed dependency on cblas from the rocsolver test and benchmark clients.
-
-#### Resolved issues
-
-- Fixed incorrect SYGS2/HEGS2, SYGST/HEGST, SYGV/HEGV, and SYGVD/HEGVD results for batch counts larger than 32.
-- Fixed STEIN memory access fault when nev is 0.
-- Fixed incorrect STEBZ results for close eigenvalues when range = index.
-- Fixed git unsafe repository error when building with `./install.sh -cd` as a non-root user.
-
-### **rocThrust** (2.16.0)
-
-#### Changed
-
-- rocThrust functionality that depends on device malloc is functional again now that ROCm 5.2 re-enabled device malloc. Device-launched `thrust::sort` and `thrust::sort_by_key` are available for use.
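-
-A minimal sketch of what "device launched" means in the rocThrust note above: `thrust::sort` is invoked from inside a kernel with the device execution policy, and its temporary storage comes from device-side malloc (the re-enabled feature this entry depends on). Whether a given sort needs that temporary storage, and the exact policy spelling, are assumptions here rather than something the changelog states.
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <thrust/execution_policy.h>
-#include <thrust/sort.h>
-
-// One thread launches the sort over the whole array from device code.
-__global__ void sort_on_device(int* data, int n)
-{
-    if (blockIdx.x == 0 && threadIdx.x == 0) {
-        thrust::sort(thrust::device, data, data + n);
-    }
-}
-
-int main()
-{
-    const int n = 1 << 12;
-    int* d = nullptr;
-    hipMalloc(&d, n * sizeof(int));
-    // ... fill d, e.g. with hipMemcpy from a host buffer ...
-
-    hipLaunchKernelGGL(sort_on_device, dim3(1), dim3(1), 0, 0, d, n);
-    hipDeviceSynchronize();
-
-    hipFree(d);
-    return 0;
-}
-```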
- -### **Tensile** (4.34.0) - -#### Added - -- Lazy loading of solution libraries and code object files -- Support for dictionary style logic files -- Support for decision tree based logic files using dictionary format -- DecisionTreeLibrary for solution selection -- DirectToLDS support for HGEMM -- DirectToVgpr support for SGEMM -- Grid based distance metric for solution selection -- Support for gfx11xx -- Support for DirectToVgprA/B + TLU=False -- ForkParameters Groups as a way of specifying solution parameters -- Support for a new Tensile yaml config format -- TensileClientConfig for generating Tensile client config files -- Options for TensileCreateLibrary to build client and create client config file - -#### Changed - -- Default MACInstruction to FMA - -#### Optimized - -- Solution generation is now cached and is not repeated if solution parameters are unchanged - -#### Resolved issues - -- Accept StaggerUStride=0 as valid -- Reject invalid data types for UnrollLoopEfficiencyEnable -- Fix invalid code generation issues related to DirectToVgpr -- Return hipErrorNotFound if no modules are loaded -- Fix performance drop for NN ZGEMM with 96x64 macro tile -- Fix memory violation for general batched kernels when alpha/beta/K = 0 - -## ROCm 5.2.3 - -See the [ROCm 5.2.3 changelog](https://github.com/ROCm/ROCm/blob/docs/5.2.3/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **RCCL** (2.12.10) - -#### Added - -- Compatibility with NCCL 2.12.10 -- Packages for test and benchmark executables on all supported OSes using CPack. -- Adding custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1 - - Additional details provided if Binary File Descriptor library (BFD) is pre-installed -- Adding support for reusing ports in NET/IB channels - - Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1 - - When "Call to bind failed : Address already in use" error happens in large-scale AlltoAll - (e.g., >=64 MI200 nodes), users are suggested to opt-in either one or both of the options - to resolve the massive port usage issue - - Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1 - -#### Removed - -- Removed experimental clique-based kernels - -## ROCm 5.2.1 - -See the [ROCm 5.2.1 changelog](https://github.com/ROCm/ROCm/blob/docs/5.2.1/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.2.0 - -See the [ROCm 5.2.0 changelog](https://github.com/ROCm/ROCm/blob/docs/5.2.0/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **hipBLAS** (0.51.0) - -#### Added - -- Packages for test and benchmark executables on all supported OSes using CPack. -- Added File/Folder Reorg Changes with backward compatibility support enabled using ROCM-CMAKE wrapper functions -- Added user-specified initialization option to hipblas-bench - -#### Resolved issues - -- Fixed version gathering in performance measuring script - -### **hipCUB** (2.11.1) - -#### Added - -- Packages for tests and benchmark executable on all supported OSes using CPack. - -### **hipFFT** (1.0.8) - -#### Added - -- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. -- Packages for test and benchmark executables on all supported OSes using CPack. - -### **hipSOLVER** (1.4.0) - -#### Added - -- Package generation for test and benchmark executables on all supported OSes using CPack. 
-- File/Folder Reorg - - Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. - -#### Resolved issues - -- Fixed the ReadTheDocs documentation generation. - -### **hipSPARSE** (2.2.0) - -#### Added - -- Packages for test and benchmark executables on all supported OSes using CPack. - -### **rocALUTION** (2.0.3) - -#### Added - -- Packages for test and benchmark executables on all supported OSes using CPack. - -### **rocBLAS** (2.44.0) - -#### Added - -- Packages for test and benchmark executables on all supported OSes using CPack. -- Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions. -- Added Denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions. -- Added NaN initialization tests to the yaml files of Level 2 rocBLAS batched and strided-batched functions for testing purposes. -- Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests. - -#### Changed - -- Modifying gemm_ex for HBH (High-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16. -- Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions. -- For gemm, gemm_ex, gemm_ex2 internal API use rocblas_stride datatype for offset. -- For symm, hemm, syrk, herk, dgmm, geam internal API use rocblas_stride datatype for offset. -- AMD copyright year for all rocBLAS files. -- For gemv (transpose-case), typecasted the 'lda'(offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions. - -#### Removed - -- Remove Navi12 (gfx1011) from fat binary. - -#### Optimized - -- Improved performance of non-batched and batched her2 for all sizes and data types. -- Improved performance of non-batched and batched amin for all data types using shuffle reductions. -- Improved performance of non-batched and batched amax for all data types using shuffle reductions. -- Improved performance of trsv for all sizes and data types. - -#### Resolved issues - -- For function her2 avoid overflow in offset calculation. -- For trsm when alpha == 0 and on host, allow A to be nullptr. -- Fixed memory access issue in trsv. -- Fixed git pre-commit script to update only AMD copyright year. -- Fixed dgmm, geam test functions to set correct stride values. -- For functions ssyr2k and dsyr2k allow trans == rocblas_operation_conjugate_transpose. -- Fixed compilation error for clients-only build. - -### **rocFFT** (1.0.17) - -#### Added - -- Packages for test and benchmark executables on all supported OSes using CPack. -- Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. - -#### Changed - -- Improved reuse of twiddle memory between plans. -- Set a default load/store callback when only one callback - type is set via the API for improved performance. - -#### Optimized - -- Introduced a new access pattern of lds (non-linear) and applied it on - sbcc kernels len 64 to get performance improvement. 
- -#### Resolved issues - -- Fixed plan creation failure in cases where SBCC kernels would need to write to non-unit-stride buffers. - -### **rocPRIM** (2.10.14) - -#### Added - -- Packages for tests and benchmark executable on all supported OSes using CPack. -- Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers. - -### **rocRAND** (2.10.14) - -#### Added - -- Backward compatibility for deprecated `#include <rocrand.h>` using wrapper header files. -- Packages for test and benchmark executables on all supported OSes using CPack. - -### **rocSOLVER** (3.18.0) - -#### Added - -- Partial eigenvalue decomposition routines: - - STEBZ - - STEIN -- Package generation for test and benchmark executables on all supported OSes using CPack. -- Added tests for multi-level logging -- Added tests for rocsolver-bench client -- File/Folder Reorg - - Added File/Folder Reorg Changes with backward compatibility support using ROCM-CMAKE wrapper functions. - -#### Resolved issues - -- Fixed compatibility with libfmt 8.1 - -### **rocSPARSE** (2.2.0) - -#### Added - -- batched SpMM for CSR, COO and Blocked ELL formats. -- Packages for test and benchmark executables on all supported OSes using CPack. -- Clients file importers and exporters. - -#### Changed - -- Test adjustments due to roundoff errors. -- Fixing API calls compatiblity with rocPRIM. - -#### Optimized - -- Clients code size reduction. -- Clients error handling. -- Clients benchmarking for performance tracking. - -### **rocThrust** (2.15.0) - -#### Added - -- Packages for tests and benchmark executable on all supported OSes using CPack. - -### **rocWMMA** (0.7) - -#### Added - -- Added unit tests for DLRM kernels -- Added GEMM sample -- Added DLRM sample -- Added SGEMV sample -- Added unit tests for cooperative wmma load and stores -- Added unit tests for IOBarrier.h -- Added wmma load/ store tests for different matrix types (A, B and Accumulator) -- Added more block sizes 1, 2, 4, 8 to test MmaSyncMultiTest -- Added block sizes 4, 8 to test MmaSynMultiLdsTest -- Added support for wmma load / store layouts with block dimension greater than 64 -- Added IOShape structure to define the attributes of mapping and layouts for all wmma matrix types -- Added CI testing for rocWMMA - -#### Changed - -- Renamed wmma to rocwmma in cmake, header files and documentation -- Renamed library files -- Modified Layout.h to use different matrix offset calculations (base offset, incremental offset and cumulative offset) -- Opaque load/store continue to use incrementatl offsets as they fill the entire block -- Cooperative load/store use cumulative offsets as they fill only small portions for the entire block -- Increased Max split counts to 64 for cooperative load/store -- Moved all the wmma definitions, API headers to rocwmma namespace -- Modified wmma fill unit tests to validate all matrix types (A, B, Accumulator) - -### **Tensile** (4.33.0) - -#### Added - -- TensileUpdateLibrary for updating old library logic files -- Support for TensileRetuneLibrary to use sizes from separate file -- ZGEMM DirectToVgpr/DirectToLds/StoreCInUnroll/MIArchVgpr support -- Tests for denorm correctness -- Option to write different architectures to different TensileLibrary files - -#### Optimizations - -- Optimize MessagePackLoadLibraryFile by switching to fread -- DGEMM tail loop optimization for PrefetchAcrossPersistentMode=1/DirectToVgpr - -#### Changed - -- Alpha/beta datatype remains as F32 for HPA HGEMM -- Force assembly kernels to not 
flush denorms -- Use hipDeviceAttributePhysicalMultiProcessorCount as multiProcessorCount - -#### Resolved issues - -- Fix segmentation fault when run i8 datatype with TENSILE_DB=0x80 - -## ROCm 5.1.3 - -See the [ROCm 5.1.3 changelog](https://github.com/ROCm/ROCm/blob/docs/5.1.3/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.1.1 - -See the [ROCm 5.1.1 changelog](https://github.com/ROCm/ROCm/blob/docs/5.1.1/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.1.0 - -See the [ROCm 5.1.0 changelog](https://github.com/ROCm/ROCm/blob/docs/5.1.0/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **hipBLAS** (0.50.0) - -#### Added - -- Added library version and device information to hipblas-test output -- Added --rocsolver-path command line option to choose path to pre-built rocSOLVER, as - absolute or relative path -- Added --cmake_install command line option to update cmake to minimum version if required -- Added cmake-arg parameter to pass in cmake arguments while building -- Added infrastructure to support readthedocs hipBLAS documentation. - -#### Fixed - -- Added hipblasVersionMinor define. hipblaseVersionMinor remains defined - for backwards compatibility. -- Doxygen warnings in hipblas.h header file. - -#### Changed - -- rocblas-path command line option can be specified as either absolute or relative path -- Help message improvements in install.sh and rmake.py -- Updated googletest dependency from 1.10.0 to 1.11.0 - -### **hipCUB** (2.11.0) - -#### Added - -- Device segmented sort -- Warp merge sort, WarpMask and thread sort from cub 1.15.0 supported in hipCUB -- Device three way partition - -#### Changed - -- Device_scan and device_segmented_scan: inclusive_scan now uses the input-type as accumulator-type, exclusive_scan uses initial-value-type. - - This particularly changes behaviour of small-size input types with large-size output types (e.g. short input, int output). - - And low-res input with high-res output (e.g. float input, double output) - - Block merge sort no longer supports non power of two blocksizes - -### **hipFFT** (1.0.7) - -#### Changed - -- Use fft_params struct for accuracy and benchmark clients. - -### **hipSOLVER** (1.3.0) - -#### Added - -- Added functions - - gels - - hipsolverSSgels_bufferSize, hipsolverDDgels_bufferSize, hipsolverCCgels_bufferSize, hipsolverZZgels_bufferSize - - hipsolverSSgels, hipsolverDDgels, hipsolverCCgels, hipsolverZZgels -- Added library version and device information to hipsolver-test output. -- Added compatibility API with hipsolverDn prefix. 
-- Added compatibility-only functions - - gesvdj - - hipsolverDnSgesvdj_bufferSize, hipsolverDnDgesvdj_bufferSize, hipsolverDnCgesvdj_bufferSize, hipsolverDnZgesvdj_bufferSize - - hipsolverDnSgesvdj, hipsolverDnDgesvdj, hipsolverDnCgesvdj, hipsolverDnZgesvdj - - gesvdjBatched - - hipsolverDnSgesvdjBatched_bufferSize, hipsolverDnDgesvdjBatched_bufferSize, hipsolverDnCgesvdjBatched_bufferSize, hipsolverDnZgesvdjBatched_bufferSize - - hipsolverDnSgesvdjBatched, hipsolverDnDgesvdjBatched, hipsolverDnCgesvdjBatched, hipsolverDnZgesvdjBatched - - syevj - - hipsolverDnSsyevj_bufferSize, hipsolverDnDsyevj_bufferSize, hipsolverDnCheevj_bufferSize, hipsolverDnZheevj_bufferSize - - hipsolverDnSsyevj, hipsolverDnDsyevj, hipsolverDnCheevj, hipsolverDnZheevj - - syevjBatched - - hipsolverDnSsyevjBatched_bufferSize, hipsolverDnDsyevjBatched_bufferSize, hipsolverDnCheevjBatched_bufferSize, hipsolverDnZheevjBatched_bufferSize - - hipsolverDnSsyevjBatched, hipsolverDnDsyevjBatched, hipsolverDnCheevjBatched, hipsolverDnZheevjBatched - - sygvj - - hipsolverDnSsygvj_bufferSize, hipsolverDnDsygvj_bufferSize, hipsolverDnChegvj_bufferSize, hipsolverDnZhegvj_bufferSize - - hipsolverDnSsygvj, hipsolverDnDsygvj, hipsolverDnChegvj, hipsolverDnZhegvj - -#### Changed - -- The rocSOLVER backend now allows hipsolverXXgels and hipsolverXXgesv to be called in-place when B == X. -- The rocSOLVER backend now allows rwork to be passed as a null pointer to hipsolverXgesvd. - -#### Resolved issues - -- bufferSize functions will now return HIPSOLVER_STATUS_NOT_INITIALIZED instead of HIPSOLVER_STATUS_INVALID_VALUE when both handle and lwork are null. -- Fixed rare memory allocation failure in syevd/heevd and sygvd/hegvd caused by improper workspace array allocation outside of rocSOLVER. - -### **hipSPARSE** (2.1.0) - -#### Added - -- Added gtsv_interleaved_batch and gpsv_interleaved_batch routines -- Add SpGEMM_reuse - -#### Changed - -- Changed BUILD_CUDA with USE_CUDA in install script and cmake files -- Update googletest to 11.1 - -#### Resolved issues - -- Fixed a bug in SpMM Alg versioning - -### **RCCL** (2.11.4) - -#### Added - -- Compatibility with NCCL 2.11.4 - -#### Known issues - -- Managed memory is not currently supported for clique-based kernels - -### **rocALUTION** (2.0.2) - -#### Added - -- Added out-of-place matrix transpose functionality -- Added LocalVector<bool> - -### **rocBLAS** (2.43.0) - -#### Added - -- Option to install script for number of jobs to use for rocBLAS and Tensile compilation (-j, --jobs) -- Option to install script to build clients without using any Fortran (--clients_no_fortran) -- rocblas_client_initialize function, to perform rocBLAS initialize for clients(benchmark/test) and report the execution time. -- Added tests for output of reduction functions when given bad input -- Added user specified initialization (rand_int/trig_float/hpl) for initializing matrices and vectors in rocblas-bench - -#### Changed - -- For syrkx and trmm internal API use rocblas_stride datatype for offset -- For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match -- Test client dependencies updated to GTest 1.11 -- non-global false positives reported by cppcheck from file based suppression to inline suppression. File based suppression will only be used for global false positives. 
-
-- Help menu messages in install.sh
-- For ger function, typecast the 'lda' (offset) datatype to size_t during offset calculation to avoid overflow and remove duplicate template functions.
-- Modified default initialization from rand_int to hpl for initializing matrices and vectors in rocblas-bench
-
-#### Optimized
-
-- Improved performance of trsm with side == left and n == 1
-- Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32
-
-#### Resolved issues
-
-- For function trmv (non-transposed cases) avoid overflow in offset calculation
-- Fixed cppcheck errors/warnings
-- Fixed doxygen warnings
-
-### **rocFFT** (1.0.16)
-
-#### Changed
-
-- Supported unaligned tile dimension for SBRC_2D kernels.
-- Improved (more RAII) test and benchmark infrastructure.
-- Enabled runtime compilation of length-2304 FFT kernel during plan creation.
-
-#### Removed
-
-- The hipFFT API (header) has been removed from rocFFT after a long deprecation period. Please use the [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package/repository to obtain the hipFFT API.
-
-#### Optimized
-
-- Optimized more large 1D cases by using L1D_CC plan.
-- Optimized 3D 200^3 C2R case.
-- Optimized 1D 2^30 double precision on MI200.
-
-#### Resolved issues
-
-- Fixed correctness of some R2C transforms with unusual strides.
-
-### **rocPRIM** (2.10.13)
-
-#### Added
-
-- Future value
-- Added device partition_three_way to partition input to three output iterators based on two predicates
-
-#### Changed
-
-- The reduce/scan algorithm precision issues in the tests have been resolved for half types.
-
-#### Resolved issues
-
-- Fixed radix sort int64_t bug introduced in [2.10.11]
-
-#### Known issues
-
-- device_segmented_radix_sort unit test is failing for HIP on Windows
-
-### **rocRAND** (2.10.13)
-
-#### Added
-
-- Generating a random sequence of different sizes now produces the same sequence without gaps, independent of how many values are generated per call.
-  - Only in the case of XORWOW, MRG32K3A, PHILOX4X32_10, SOBOL32 and SOBOL64
-  - This only holds true if the size in each call is a divisor of the distribution's `output_width`, for performance reasons
-  - Similarly, the output pointer has to be aligned to `output_width * sizeof(output_type)`
-
-#### Changed
-
-- [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND.git) split into a separate package
-- Header file installation location changed to match other libraries.
-  - Using the `rocrand.h` header file should now use `#include <rocrand/rocrand.h>`, rather than `#include <rocrand.h>`
-- rocRAND still includes hipRAND using a submodule
-  - The rocRAND package also sets the provides field with hipRAND, so projects which require hipRAND can begin to specify it.
-
-#### Resolved issues
-
-- Fixed offset behaviour for the XORWOW, MRG32K3A and PHILOX4X32_10 generators; setting an offset now correctly generates the same sequence starting from the offset.
-  - Only uniform int and float will work, as these can be generated with a single call to the generator
-
-#### Known issues
-
-- kernel_xorwow unit test is failing for certain GPU architectures.
-
-### **rocSOLVER** (3.17.0)
-
-#### Optimized
-
-- Optimized non-pivoting and batch cases of the LU factorization
-
-#### Resolved issues
-
-- Fixed missing synchronization in SYTRF with `rocblas_fill_lower` that could potentially result in incorrect pivot values.
-- Fixed multi-level logging output to file with the `ROCSOLVER_LOG_PATH`, - `ROCSOLVER_LOG_TRACE_PATH`, `ROCSOLVER_LOG_BENCH_PATH` and `ROCSOLVER_LOG_PROFILE_PATH` - environment variables. -- Fixed performance regression in the batched LU factorization of tiny matrices - -### **rocSPARSE** (2.1.0) - -#### Added - -- gtsv_interleaved_batch -- gpsv_interleaved_batch -- SpGEMM_reuse -- Allow copying of mat info struct - -#### Optimized - -- Optimization for SDDMM -- Allow unsorted matrices in csrgemm multipass algorithm - -### **rocThrust** (2.14.0) - -rocThrust 2.14.0 for ROCm 5.1.0 - -#### Added - -- Updated to match upstream Thrust 1.15.0 - -#### Known issues - -- async_copy, partition, and stable_sort_by_key unit tests are failing on HIP on Windows. - -### **Tensile** (4.32.0) - -Tensile 4.32.0 for ROCm 5.1.0 - -#### Added - -- Better control of parallelism to control memory usage -- Support for multiprocessing on Windows for TensileCreateLibrary -- New JSD metric and metric selection functionality -- Initial changes to support two-tier solution selection - -#### Changed - -- Update Googletest to 1.11.0 - -##### Removed - -- Removed no longer supported benchmarking steps - -#### Optimized - -- Optimized runtime of TensileCreateLibraries by reducing max RAM usage -- StoreCInUnroll additional optimizations plus adaptive K support -- DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support - -## ROCm 5.0.2 - -See the [ROCm 5.0.2 changelog](https://github.com/ROCm/ROCm/blob/docs/5.0.2/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.0.1 - -See the [ROCm 5.0.1 changelog](https://github.com/ROCm/ROCm/blob/docs/5.0.1/CHANGELOG.md) -on GitHub for a complete overview of this release. - -## ROCm 5.0.0 - -See the [ROCm 5.0.0 changelog](https://github.com/ROCm/ROCm/blob/docs/5.0.0/CHANGELOG.md) -on GitHub for a complete overview of this release. - -### **hipBLAS** (0.49.0) - -#### Added - -- Added rocSOLVER functions to hipblas-bench -- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex -- Added compilation warning for future trmm changes -- Added documentation to hipblas.h -- Added option to forgo pivoting for getrf and getri when ipiv is nullptr -- Added code coverage option - -#### Resolved issues - -- Fixed use of incorrect 'HIP_PATH' when building from source. -- Fixed windows packaging -- Allowing negative increments in hipblas-bench -- Removed boost dependency - -### **hipCUB** (2.10.13) - -#### Added - -- Bfloat16 support to test cases (device_reduce & device_radix_sort) -- Device merge sort -- Block merge sort -- API update to CUB 1.14.0 - -#### Changed - -- The SetupNVCC.cmake automatic target selector select all of the capabalities of all available card for NVIDIA backend. - -#### Resolved issues - -- Added missing includes to hipcub.hpp - -### **hipFFT** (1.0.4) - -#### Fixed - -- Add calls to rocFFT setup/cleanup. -- Cmake fixes for clients and backend support. - -#### Added - -- Added support for Windows 10 as a build target. - -### **hipSOLVER** (1.2.0) - -#### Added - -- Added functions - - sytrf - - hipsolverSsytrf_bufferSize, hipsolverDsytrf_bufferSize, hipsolverCsytrf_bufferSize, hipsolverZsytrf_bufferSize - - hipsolverSsytrf, hipsolverDsytrf, hipsolverCsytrf, hipsolverZsytrf - -#### Resolved issues - -- Fixed use of incorrect `HIP_PATH` when building from source (#40). 
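-
-The `sytrf` routines added above follow the usual hipSOLVER workspace pattern: query the size, allocate, then factorize. A minimal sketch for the single-precision case, assuming the wrapped `<hipsolver/hipsolver.h>` header path and the `HIPSOLVER_FILL_MODE_LOWER` fill-mode enum:
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <hipsolver/hipsolver.h>
-
-int main()
-{
-    hipsolverHandle_t handle;
-    hipsolverCreate(&handle);
-
-    const int n = 32, lda = 32;
-    float* dA    = nullptr;
-    int*   dIpiv = nullptr;
-    int*   dInfo = nullptr;
-    hipMalloc(&dA, sizeof(float) * lda * n);
-    hipMalloc(&dIpiv, sizeof(int) * n);
-    hipMalloc(&dInfo, sizeof(int));
-    // ... copy a symmetric matrix into dA (column major, lower triangle referenced) ...
-
-    // Workspace query, allocation, and the Bunch-Kaufman factorization A = L*D*L^T.
-    int lwork = 0;
-    hipsolverSsytrf_bufferSize(handle, n, dA, lda, &lwork);
-    float* dWork = nullptr;
-    hipMalloc(&dWork, sizeof(float) * lwork);
-    hipsolverSsytrf(handle, HIPSOLVER_FILL_MODE_LOWER, n, dA, lda, dIpiv, dWork, lwork, dInfo);
-    hipDeviceSynchronize();
-
-    hipFree(dWork); hipFree(dInfo); hipFree(dIpiv); hipFree(dA);
-    hipsolverDestroy(handle);
-    return 0;
-}
-```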
- -### **hipSPARSE** (2.0.0) - -#### Added - -- Added (conjugate) transpose support for csrmv, hybmv and spmv routines - -### **RCCL** (2.10.3) - -#### Added - -- Compatibility with NCCL 2.10.3 - -#### Known issues - -- Managed memory is not currently supported for clique-based kernels - -### **rocALUTION** (2.0.1) - -#### Changed - -- Removed deprecated GlobalPairwiseAMG class, please use PairwiseAMG instead. -- Changed to C++ 14 Standard - -#### Optimized - -- Added sanitizer option -- Improved documentation - -### **rocBLAS** (2.42.0) - -#### Added - -- Added rocblas_get_version_string_size convenience function -- Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm -- Added hpl and trig initialization for gemm_ex to rocblas-bench -- Added source code gemm. It can be used as an alternative to Tensile for debugging and development -- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex - -#### Changed - -- Instantiate templated rocBLAS functions to reduce size of librocblas.so -- Removed static library dependency on msgpack -- Removed boost dependencies for clients - -#### Optimized - -- Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU. -- Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU. - -#### Resolved issues - -- Option to install script to build only rocBLAS clients with a pre-built rocBLAS library -- Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input -- Fix for dgmm with side == rocblas_side_left and a negative incx -- Fixed out-of-bounds read for small trsm -- Fixed numerical checking for tbmv_strided_batched - -### **rocFFT** (1.0.13) - -#### Added - -- Added new kernel generator for select fused-2D transforms. - -#### Optimized - -- Improved many plans by removing unnecessary transpose steps. -- Optimized scheme selection for 3D problems. - - Imposed less restrictions on 3D_BLOCK_RC selection. More problems can use 3D_BLOCK_RC and - have some performance gain. - - Enabled 3D_RC. Some 3D problems with SBCC-supported z-dim can use less kernels and get benefit. - - Force --length 336 336 56 (dp) use faster 3D_RC to avoid it from being skipped by conservative - threshold test. -- Optimized some even-length R2C/C2R cases by doing more operations - in-place and combining pre/post processing into Stockham kernels. -- Added radix-17. - -#### Resolved issues - -- Improved large 1D transform decompositions. - -### **rocPRIM** (2.10.12) - -#### Added - -- Added scan size limit feature -- Added reduce size limit feature -- Added transform size limit feature -- Add block_load_striped and block_store_striped -- Add gather_to_blocked to gather values from other threads into a blocked arrangement -- The block sizes for device merge sorts initial block sort and its merge steps are now separate in its kernel config - - the block sort step supports multiple items per thread - -#### Changed - -- size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter -- Device_scan and device_segmented_scan: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` uses initial-value-type. - - This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output). 
- - And low-res input with high-res output (e.g. `float` input, `double` output) -- Revert old Fiji workaround, because they solved the issue at compiler side -- Update README cmake minimum version number -- Block sort support multiple items per thread - - currently only powers of two block sizes, and items per threads are supported and only for full blocks -- Bumped the minimum required version of CMake to 3.16 - -#### Resolved issues - -- Enable bfloat16 tests and reduce threshold for bfloat16 -- Fix device scan limit_size feature -- Non-optimized builds no longer trigger local memory limit errors - -#### Known issues - -- Unit tests may soft hang on MI200 when running in hipMallocManaged mode. -- device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows -- ReduceEmptyInput cause random faulire with bfloat16 - -### **rocSOLVER** (3.16.0) - -#### Added - -- Symmetric matrix factorizations: - - LASYF - - SYTF2, SYTRF (with batched and strided\_batched versions) -- Added `rocsolver_get_version_string_size` to help with version string queries -- Added `rocblas_layer_mode_ex` and the ability to print kernel calls in the trace and profile logs -- Expanded batched and strided\_batched sample programs. - -#### Changed - -- The rocsolver-test client now prints the rocSOLVER version used to run the tests, - rather than the version used to build them -- The rocsolver-bench client now prints the rocSOLVER version used in the benchmark - -#### Optimized - -- Improved general performance of LU factorization -- Increased parallelism of specialized kernels when compiling from source, reducing build times on multi-core systems. - -#### Resolved issues - -- Added missing `stdint.h` include to `rocsolver.h` - -### **rocSPARSE** (2.0.0) - -#### Added - -- csrmv, coomv, ellmv, hybmv for (conjugate) transposed matrices -- csrmv for symmetric matrices - -#### Changed - -- spmm\_ex is now deprecated and will be removed in the next major release -- Optimization for gtsv - -### **rocThrust** (2.13.0) - -#### Added - -- Updated to match upstream Thrust 1.13.0 -- Updated to match upstream Thrust 1.14.0 -- Added async scan - -#### Changed - -- Scan algorithms: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` uses initial-value-type. - - This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output). - - And low-res input with high-res output (e.g. 
`float` input, `double` output) - -### **Tensile** (4.31.0) - -#### Added - -- DirectToLds support (x2/x4) -- DirectToVgpr support for DGEMM -- Parameter to control number of files kernels are merged into to better parallelize kernel compilation -- FP16 alternate implementation for HPA HGEMM on aldebaran - -#### Changed - -- Update tensile_client executable to std=c++14 - -#### Removed - -- Remove unused old Tensile client code - -#### Optimized - -- Add DGEMM NN custom kernel for HPL on aldebaran - -#### Resolved issues - -- Fixed `hipErrorInvalidHandle` during benchmarks -- Fixed `addrVgpr` for atomic GSU -- Fixed for Python 3.8: add case for Constant nodeType -- Fixed architecture mapping for gfx1011 and gfx1012 -- Fixed `PrintSolutionRejectionReason` verbiage in `KernelWriter.py` -- Fixed vgpr alignment problem when enabling flat buffer load diff --git a/docs/compatibility/compatibility-matrix-historical-6.0.csv b/docs/compatibility/compatibility-matrix-historical-6.0.csv deleted file mode 100644 index 4868282cb..000000000 --- a/docs/compatibility/compatibility-matrix-historical-6.0.csv +++ /dev/null @@ -1,137 +0,0 @@ -ROCm Version,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.5, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0 - :ref:`Operating systems & kernels `,Ubuntu 24.04.3,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,, - ,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2" - ,,,,,,,,,,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5" - ,"RHEL 10.0 [#rhel-10-702-past-60]_, 9.6 [#rhel-10-702-past-60]_, 9.4 [#rhel-94-702-past-60]_","RHEL 9.6 [#rhel-10-702-past-60], 9.4 [#rhel-94-702-past-60]_","RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2" - ,RHEL 8.10 [#rhel-700-past-60]_,RHEL 8.10 [#rhel-700-past-60]_,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8" - ,SLES 15 SP7 [#sles-db-700-past-60]_,SLES 15 SP7 [#sles-db-700-past-60]_,"SLES 15 SP7, SP6","SLES 15 SP7, SP6",SLES 15 SP6,SLES 15 SP6,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4" - ,,,,,,,,,,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9 - ,"Oracle Linux 10, 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 
[#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,, - ,"Debian 13 [#db-mi300x-past-60]_, 12 [#sles-db-700-past-60]_",Debian 12 [#sles-db-700-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,,,,,,,,,,, - ,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,,,,,,,,,,,, - ,Rocky Linux 9 [#rl-700-past-60]_,Rocky Linux 9 [#rl-700-past-60]_,,,,,,,,,,,,,,,,,, - ,.. _architecture-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`Architecture `,CDNA4,CDNA4,,,,,,,,,,,,,,,,,, - ,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3 - ,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2 - ,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA - ,RDNA4,RDNA4,RDNA4,RDNA4,RDNA4,,,,,,,,,,,,,,, - ,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3 - ,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2 - ,.. 
_gpu-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`GPU / LLVM target `,gfx950 [#mi350x-os-past-60]_,gfx950 [#mi350x-os-past-60]_,,,,,,,,,,,,,,,,,, - ,gfx1201 [#RDNA-OS-700-past-60]_,gfx1201 [#RDNA-OS-700-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, - ,gfx1200 [#RDNA-OS-700-past-60]_,gfx1200 [#RDNA-OS-700-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, - ,gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_,gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,, - ,gfx1100 [#RDNA-OS-700-past-60]_,gfx1100 [#RDNA-OS-700-past-60]_,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100 - ,gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_,gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030 - ,gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_,gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_ - ,gfx90a [#mi200x-os-past-60]_,gfx90a [#mi200x-os-past-60]_,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a - ,gfx908 [#mi100-os-past-60]_,gfx908 [#mi100-os-past-60]_,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908 - ,,,,,,,,,,,,,,,,,,,, - FRAMEWORK SUPPORT,.. 
_framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13" - :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1" - :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26 - :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,1.8.0b1,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_,N/A,b6356,b6356,b6356,b6356,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_,N/A,N/A,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - `ONNX Runtime `_,1.22.0,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1 - ,,,,,,,,,,,,,,,,,,,, - ,,,,,,,,,,,,,,,,,,,, - THIRD PARTY COMMS,.. 
_thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - `UCC `_,>=1.4.0,>=1.4.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0 - `UCX `_,>=1.17.0,>=1.17.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1 - ,,,,,,,,,,,,,,,,,,,, - THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - Thrust,2.6.0,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1 - CUB,2.6.0,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1 - ,,,,,,,,,,,,,,,,,,,, - DRIVER & USER SPACE [#kfd_support-past-60]_,.. _kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`AMD GPU Driver `,"30.10.2, 30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x","30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x" - ,,,,,,,,,,,,,,,,,,,, - ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`Composable Kernel `,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0 - :doc:`MIGraphX `,2.13.0,2.13.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.0,2.11.0,2.11.0,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0 - :doc:`MIOpen `,3.5.0,3.5.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 - :doc:`MIVisionX `,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0 - :doc:`rocAL `,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 - :doc:`rocDecode `,1.0.0,1.0.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A - :doc:`rocJPEG `,1.1.0,1.1.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`rocPyDecode `,0.6.0,0.6.0,0.3.1,0.3.1,0.3.1,0.3.1,0.2.0,0.2.0,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`RPP `,2.0.0,2.0.0,1.9.10,1.9.10,1.9.10,1.9.10,1.9.1,1.9.1,1.9.1,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0 - ,,,,,,,,,,,,,,,,,,,, - COMMUNICATION,.. _commlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`RCCL `,2.26.6,2.26.6,2.22.3,2.22.3,2.22.3,2.22.3,2.21.5,2.21.5,2.21.5,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3 - :doc:`rocSHMEM `,3.0.0,3.0.0,2.0.1,2.0.1,2.0.0,2.0.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A - ,,,,,,,,,,,,,,,,,,,, - MATH LIBS,.. 
_mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - `half `_ ,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0 - :doc:`hipBLAS `,3.0.2,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0 - :doc:`hipBLASLt `,1.0.0,1.0.0,0.12.1,0.12.1,0.12.1,0.12.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0 - :doc:`hipFFT `,1.0.20,1.0.20,1.0.18,1.0.18,1.0.18,1.0.18,1.0.17,1.0.17,1.0.17,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13 - :doc:`hipfort `,0.7.0,0.7.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.1,0.5.1,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0 - :doc:`hipRAND `,3.0.0,3.0.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.1,2.11.1,2.11.1,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16 - :doc:`hipSOLVER `,3.0.0,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0 - :doc:`hipSPARSE `,4.0.1,4.0.1,3.2.0,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0 - :doc:`hipSPARSELt `,0.2.4,0.2.4,0.2.3,0.2.3,0.2.3,0.2.3,0.2.2,0.2.2,0.2.2,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0 - :doc:`rocALUTION `,4.0.0,4.0.0,3.2.3,3.2.3,3.2.3,3.2.2,3.2.1,3.2.1,3.2.1,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3 - :doc:`rocBLAS `,5.0.2,5.0.0,4.4.1,4.4.1,4.4.0,4.4.0,4.3.0,4.3.0,4.3.0,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0 - :doc:`rocFFT `,1.0.34,1.0.34,1.0.32,1.0.32,1.0.32,1.0.32,1.0.31,1.0.31,1.0.31,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23 - :doc:`rocRAND `,4.0.0,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17 - :doc:`rocSOLVER `,3.30.1,3.30.0,3.28.2,3.28.2,3.28.0,3.28.0,3.27.0,3.27.0,3.27.0,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0 - :doc:`rocSPARSE `,4.0.2,4.0.2,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2 - :doc:`rocWMMA `,2.0.0,2.0.0,1.7.0,1.7.0,1.7.0,1.7.0,1.6.0,1.6.0,1.6.0,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0 - :doc:`Tensile `,4.44.0,4.44.0,4.43.0,4.43.0,4.43.0,4.43.0,4.42.0,4.42.0,4.42.0,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0 - ,,,,,,,,,,,,,,,,,,,, - PRIMITIVES,.. 
_primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`hipCUB `,4.0.0,4.0.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 - :doc:`hipTensor `,2.0.0,2.0.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0 - :doc:`rocPRIM `,4.0.1,4.0.0,3.4.1,3.4.1,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0 - :doc:`rocThrust `,4.0.0,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0 - ,,,,,,,,,,,,,,,,,,,, - SUPPORT LIBS,,,,,,,,,,,,,,,,,,,, - `hipother `_,7.0.51830,7.0.51830,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 - `rocm-core `_,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0 - `ROCT-Thunk-Interface `_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245 - ,,,,,,,,,,,,,,,,,,,, - SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`AMD SMI `,26.0.2,26.0.0,25.5.1,25.5.1,25.4.2,25.3.0,24.7.1,24.7.1,24.7.1,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2 - :doc:`ROCm Data Center Tool `,1.1.0,1.1.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0 - :doc:`rocminfo `,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 - :doc:`ROCm SMI `,7.8.0,7.8.0,7.7.0,7.5.0,7.5.0,7.5.0,7.4.0,7.4.0,7.4.0,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0 - :doc:`ROCm Validation Suite `,1.2.0,1.2.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60105,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000 - ,,,,,,,,,,,,,,,,,,,, - PERFORMANCE TOOLS,,,,,,,,,,,,,,,,,,,, - :doc:`ROCm Bandwidth Test `,2.6.0,2.6.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0 - :doc:`ROCm Compute Profiler `,3.2.3,3.2.3,3.1.1,3.1.1,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`ROCm Systems Profiler `,1.1.1,1.1.0,1.0.2,1.0.2,1.0.1,1.0.0,0.1.2,0.1.1,0.1.0,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`ROCProfiler `,2.0.70002,2.0.70000,2.0.60403,2.0.60402,2.0.60401,2.0.60400,2.0.60303,2.0.60302,2.0.60301,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60105,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000 - :doc:`ROCprofiler-SDK `,1.0.0,1.0.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A,N/A - :doc:`ROCTracer `,4.1.70002,4.1.70000,4.1.60403,4.1.60402,4.1.60401,4.1.60400,4.1.60303,4.1.60302,4.1.60301,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60105,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000 - ,,,,,,,,,,,,,,,,,,,, - DEVELOPMENT TOOLS,,,,,,,,,,,,,,,,,,,, - :doc:`HIPIFY 
`,20.0.0,20.0.0,19.0.0,19.0.0,19.0.0,19.0.0,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - :doc:`ROCm CMake `,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0 - :doc:`ROCdbgapi `,0.77.4,0.77.3,0.77.2,0.77.2,0.77.2,0.77.2,0.77.0,0.77.0,0.77.0,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0 - :doc:`ROCm Debugger (ROCgdb) `,16.3.0,16.3.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0 - `rocprofiler-register `_,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,0.3.0,N/A,N/A - :doc:`ROCr Debug Agent `,2.1.0,2.1.0,2.0.4,2.0.4,2.0.4,2.0.4,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3 - ,,,,,,,,,,,,,,,,,,,, - COMPILERS,.. _compilers-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - `clang-ocl `_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0 - :doc:`hipCC `,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0 - `Flang `_,20.0.0.25385,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - :doc:`llvm-project `,20.0.0.25385,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - `OpenMP `_,20.0.0.25385,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483 - ,,,,,,,,,,,,,,,,,,,, - RUNTIMES,.. _runtime-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,, - :doc:`AMD CLR `,7.0.51831,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 - :doc:`HIP `,7.0.51831,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830 - `OpenCL Runtime `_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0 - :doc:`ROCr Runtime `,1.18.0,1.18.0,1.15.0,1.15.0,1.15.0,1.15.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0 diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst deleted file mode 100644 index da80c0ff6..000000000 --- a/docs/compatibility/compatibility-matrix.rst +++ /dev/null @@ -1,301 +0,0 @@ -.. 
meta:: - :description: ROCm compatibility matrix - :keywords: GPU, architecture, hardware, compatibility, system, requirements, components, libraries - -************************************************************************************** -Compatibility matrix -************************************************************************************** - -Use this matrix to view the ROCm compatibility and system requirements across successive major and minor releases. - -You can also refer to the :ref:`past versions of ROCm compatibility matrix`. - -Accelerators and GPUs listed in the following table support compute workloads (no display -information or graphics). If you’re using ROCm with AMD Radeon GPUs or Ryzen APUs for graphics -workloads, see the `Use ROCm on Radeon and Ryzen -`_ to verify -compatibility and system requirements. - -.. |br| raw:: html - -
- -.. container:: format-big-table - - .. csv-table:: - :header: "ROCm Version", "7.0.2", "7.0.1/7.0.0", "6.4.0" - :stub-columns: 1 - - :ref:`Operating systems & kernels `,Ubuntu 24.04.3,Ubuntu 24.04.3,Ubuntu 24.04.2 - ,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5 - ,"RHEL 10.0 [#rhel-10-702]_, 9.6 [#rhel-10-702]_, 9.4 [#rhel-94-702]_","RHEL 9.6 [#rhel-10-702]_, 9.4 [#rhel-94-702]_","RHEL 9.5, 9.4" - ,RHEL 8.10 [#rhel-700]_,RHEL 8.10 [#rhel-700]_,RHEL 8.10 - ,SLES 15 SP7 [#sles-db-700]_,SLES 15 SP7 [#sles-db-700]_,SLES 15 SP6 - ,"Oracle Linux 10, 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-mi300x]_" - ,"Debian 13 [#db-mi300x]_, 12 [#sles-db-700]_",Debian 12 [#sles-db-700]_,Debian 12 [#single-node]_ - ,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_ - ,Rocky Linux 9 [#rl-700]_,Rocky Linux 9 [#rl-700]_, - ,.. _architecture-support-compatibility-matrix:,, - :doc:`Architecture `,CDNA4,CDNA4, - ,CDNA3,CDNA3,CDNA3 - ,CDNA2,CDNA2,CDNA2 - ,CDNA,CDNA,CDNA - ,RDNA4,RDNA4, - ,RDNA3,RDNA3,RDNA3 - ,RDNA2,RDNA2,RDNA2 - ,.. _gpu-support-compatibility-matrix:,, - :doc:`GPU / LLVM target `,gfx950 [#mi350x-os]_,gfx950 [#mi350x-os]_, - ,gfx1201 [#RDNA-OS-700]_,gfx1201 [#RDNA-OS-700]_, - ,gfx1200 [#RDNA-OS-700]_,gfx1200 [#RDNA-OS-700]_, - ,gfx1101 [#RDNA-OS-700]_ [#rd-v710]_,gfx1101 [#RDNA-OS-700]_ [#rd-v710]_, - ,gfx1100 [#RDNA-OS-700]_,gfx1100 [#RDNA-OS-700]_,gfx1100 - ,gfx1030 [#RDNA-OS-700]_ [#rd-v620]_,gfx1030 [#RDNA-OS-700]_ [#rd-v620]_,gfx1030 - ,gfx942 [#mi325x-os]_ [#mi300x-os]_ [#mi300A-os]_,gfx942 [#mi325x-os]_ [#mi300x-os]_ [#mi300A-os]_,gfx942 - ,gfx90a [#mi200x-os]_,gfx90a [#mi200x-os]_,gfx90a - ,gfx908 [#mi100-os]_,gfx908 [#mi100-os]_,gfx908 - ,,, - FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix:,, - :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3" - :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1, 2.17.1 [#tf-mi350]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350]_","2.18.1, 2.17.1, 2.16.2" - :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.6.0,0.4.35 - :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,2.4.0 - :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat]_,N/A,b6356,b5997 - `ONNX Runtime `_,1.22.0,1.22.0,1.20.0 - ,,, - THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix:,, - `UCC `_,>=1.4.0,>=1.4.0,>=1.3.0 - `UCX `_,>=1.17.0,>=1.17.0,>=1.15.0 - ,,, - THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix:,, - Thrust,2.6.0,2.6.0,2.5.0 - CUB,2.6.0,2.6.0,2.5.0 - ,,, - DRIVER & USER SPACE [#kfd_support]_,.. _kfd-userspace-support-compatibility-matrix:,, - :doc:`AMD GPU Driver `,"30.10.2, 30.10.1 [#driver_patch]_, |br| 30.10, 6.4.x, 6.3.x","30.10.1 [#driver_patch]_, 30.10, |br| 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x" - ,,, - ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix:,, - :doc:`Composable Kernel `,1.1.0,1.1.0,1.1.0 - :doc:`MIGraphX `,2.13.0,2.13.0,2.12.0 - :doc:`MIOpen `,3.5.0,3.5.0,3.4.0 - :doc:`MIVisionX `,3.3.0,3.3.0,3.2.0 - :doc:`rocAL `,2.3.0,2.3.0,2.2.0 - :doc:`rocDecode `,1.0.0,1.0.0,0.10.0 - :doc:`rocJPEG `,1.1.0,1.1.0,0.8.0 - :doc:`rocPyDecode `,0.6.0,0.6.0,0.3.1 - :doc:`RPP `,2.0.0,2.0.0,1.9.10 - ,,, - COMMUNICATION,.. 
_commlibs-support-compatibility-matrix:,, - :doc:`RCCL `,2.26.6,2.26.6,2.22.3 - :doc:`rocSHMEM `,3.0.0,3.0.0,2.0.0 - ,,, - MATH LIBS,.. _mathlibs-support-compatibility-matrix:,, - `half `_ ,1.12.0,1.12.0,1.12.0 - :doc:`hipBLAS `,3.0.2,3.0.0,2.4.0 - :doc:`hipBLASLt `,1.0.0,1.0.0,0.12.0 - :doc:`hipFFT `,1.0.20,1.0.20,1.0.18 - :doc:`hipfort `,0.7.0,0.7.0,0.6.0 - :doc:`hipRAND `,3.0.0,3.0.0,2.12.0 - :doc:`hipSOLVER `,3.0.0,3.0.0,2.4.0 - :doc:`hipSPARSE `,4.0.1,4.0.1,3.2.0 - :doc:`hipSPARSELt `,0.2.4,0.2.4,0.2.3 - :doc:`rocALUTION `,4.0.0,4.0.0,3.2.2 - :doc:`rocBLAS `,5.0.2,5.0.0,4.4.0 - :doc:`rocFFT `,1.0.34,1.0.34,1.0.32 - :doc:`rocRAND `,4.0.0,4.0.0,3.3.0 - :doc:`rocSOLVER `,3.30.1,3.30.0,3.28.0 - :doc:`rocSPARSE `,4.0.2,4.0.2,3.4.0 - :doc:`rocWMMA `,2.0.0,2.0.0,1.7.0 - :doc:`Tensile `,4.44.0,4.44.0,4.43.0 - ,,, - PRIMITIVES,.. _primitivelibs-support-compatibility-matrix:,, - :doc:`hipCUB `,4.0.0,4.0.0,3.4.0 - :doc:`hipTensor `,2.0.0,2.0.0,1.5.0 - :doc:`rocPRIM `,4.0.1,4.0.0,3.4.0 - :doc:`rocThrust `,4.0.0,4.0.0,3.3.0 - ,,, - SUPPORT LIBS,,, - `hipother `_,7.0.51830,7.0.51830,6.4.43482 - `rocm-core `_,7.0.2,7.0.1/7.0.0,6.4.0 - `ROCT-Thunk-Interface `_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_ - ,,, - SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix:,, - :doc:`AMD SMI `,26.0.2,26.0.0,25.3.0 - :doc:`ROCm Data Center Tool `,1.1.0,1.1.0,0.3.0 - :doc:`rocminfo `,1.0.0,1.0.0,1.0.0 - :doc:`ROCm SMI `,7.8.0,7.8.0,7.5.0 - :doc:`ROCm Validation Suite `,1.2.0,1.2.0,1.1.0 - ,,, - PERFORMANCE TOOLS,,, - :doc:`ROCm Bandwidth Test `,2.6.0,2.6.0,1.4.0 - :doc:`ROCm Compute Profiler `,3.2.3,3.2.3,3.1.0 - :doc:`ROCm Systems Profiler `,1.1.1,1.1.0,1.0.0 - :doc:`ROCProfiler `,2.0.70002,2.0.70000,2.0.60400 - :doc:`ROCprofiler-SDK `,1.0.0,1.0.0,0.6.0 - :doc:`ROCTracer `,4.1.70002,4.1.70000,4.1.60400 - ,,, - DEVELOPMENT TOOLS,,, - :doc:`HIPIFY `,20.0.0,20.0.0,19.0.0 - :doc:`ROCm CMake `,0.14.0,0.14.0,0.14.0 - :doc:`ROCdbgapi `,0.77.4,0.77.3,0.77.2 - :doc:`ROCm Debugger (ROCgdb) `,16.3.0,16.3.0,15.2.0 - `rocprofiler-register `_,0.5.0,0.5.0,0.4.0 - :doc:`ROCr Debug Agent `,2.1.0,2.1.0,2.0.4 - ,,, - COMPILERS,.. _compilers-support-compatibility-matrix:,, - `clang-ocl `_,N/A,N/A,N/A - :doc:`hipCC `,1.1.1,1.1.1,1.1.1 - `Flang `_,20.0.0.25385,20.0.0.25314,19.0.0.25133 - :doc:`llvm-project `,20.0.0.25385,20.0.0.25314,19.0.0.25133 - `OpenMP `_,20.0.0.25385,20.0.0.25314,19.0.0.25133 - ,,, - RUNTIMES,.. _runtime-support-compatibility-matrix:,, - :doc:`AMD CLR `,7.0.51831,7.0.51830,6.4.43482 - :doc:`HIP `,7.0.51831,7.0.51830,6.4.43482 - `OpenCL Runtime `_,2.0.0,2.0.0,2.0.0 - :doc:`ROCr Runtime `,1.18.0,1.18.0,1.15.0 - -.. rubric:: Footnotes - -.. [#rhel-10-702] RHEL 10.0 and RHEL 9.6 are supported on all listed :ref:`supported_GPUs` except AMD Radeon PRO V620 GPU. -.. [#rhel-94-702] RHEL 9.4 is supported on all AMD Instinct GPUs listed under :ref:`supported_GPUs`. -.. [#rhel-700] RHEL 8.10 is supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, MI210, and MI100 GPUs. -.. [#ol-700-mi300x] **For ROCm 7.0.x** - Oracle Linux 10 and 9 are supported only on AMD Instinct MI355X, MI350X, and MI300X GPUs. Oracle Linux 8 is supported only on AMD Instinct MI300X GPU. -.. [#ol-mi300x] **Prior ROCm 7.0.0** - Oracle Linux is supported only on AMD Instinct MI300X GPUs. -.. [#db-mi300x] **For ROCm 7.0.2** - Debian 13 is supported only on AMD Instinct MI300X GPUs. -.. [#sles-db-700] **For ROCm 7.0.x** - SLES 15 SP7 and Debian 12 are supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, and MI210 GPUs. -.. 
[#az-mi300x] Starting ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710 GPUs. -.. [#rl-700] Rocky Linux 9 is supported only on AMD Instinct MI300X and MI300A GPUs. -.. [#single-node] **Prior to ROCm 7.0.0** - Debian 12 is supported only on AMD Instinct MI300X GPUs for single-node functionality. -.. [#mi350x-os] AMD Instinct MI355X (gfx950) and MI350X(gfx950) GPUs are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, Oracle Linux 10, and Oracle Linux 9. -.. [#RDNA-OS-700] **For ROCm 7.0.x** - AMD Radeon PRO AI PRO R9700 (gfx1201), AMD Radeon RX 9070 XT (gfx1201), AMD Radeon RX 9070 GRE (gfx1201), AMD Radeon RX 9070 (gfx1201), AMD Radeon RX 9060 XT (gfx1200), AMD Radeon RX 9060 (gfx1200), AMD Radeon RX 7800 XT (gfx1101), AMD Radeon RX 7700 XT (gfx1101), AMD Radeon PRO W7700 (gfx1101), and AMD Radeon PRO W6800 (gfx1030) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, and RHEL 9.6. -.. [#rd-v710] **For ROCm 7.0.x** - AMD Radeon PRO V710 (gfx1101) GPUs are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, and Azure Linux 3.0. -.. [#rd-v620] **For ROCm 7.0.x** - AMD Radeon PRO V620 (gfx1030) GPUs are supported only on Ubuntu 24.04.3 and Ubuntu 22.04.5. -.. [#mi325x-os] **For ROCm 7.0.x** - AMD Instinct MI325X GPUs (gfx942) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4. -.. [#mi300x-os] **For ROCm 7.0.x** - AMD Instinct MI300X GPUs (gfx942) are supported on all listed :ref:`supported_distributions`. -.. [#mi300A-os] **For ROCm 7.0.x** - AMD Instinct MI300A GPUs (gfx942) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, Debian 12, and Rocky Linux 9. -.. [#mi200x-os] **For ROCm 7.0.x** - AMD Instinct MI200 Series GPUs (gfx90a) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, and Debian 12. -.. [#mi100-os] **For ROCm 7.0.x** - AMD Instinct MI100 GPUs (gfx908) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, and RHEL 8.10. -.. [#tf-mi350] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead. -.. [#dgl_compat] DGL is supported only on ROCm 6.4.0. -.. [#llama-cpp_compat] llama.cpp is supported only on ROCm 7.0.0 and ROCm 6.4.x. -.. [#driver_patch] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0. -.. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix `_. -.. [#ROCT-rocr] Starting from ROCm 6.3.0, the ROCT Thunk Interface is included as part of the ROCr runtime package. - - -.. _OS-kernel-versions: - -Operating systems, kernel and Glibc versions -********************************************* - -Use this lookup table to confirm which operating system and kernel versions are supported with ROCm. - -.. 
csv-table:: - :header: "OS", "Version", "Kernel", "Glibc" - :widths: 40, 20, 30, 20 - :stub-columns: 1 - - `Ubuntu `_, 24.04.3, "6.8 [GA], 6.14 [HWE]", 2.39 - ,, - `Ubuntu `_, 24.04.2, "6.8 [GA], 6.11 [HWE]", 2.39 - ,, - `Ubuntu `_, 22.04.5, "5.15 [GA], 6.8 [HWE]", 2.35 - ,, - `Red Hat Enterprise Linux (RHEL 10) `_, 10.0, 6.12.0-55, 2.39 - ,, - `Red Hat Enterprise Linux (RHEL 9) `_, 9.6, 5.14.0-570, 2.34 - ,9.5, 5.14+, 2.34 - ,9.4, 5.14.0-427, 2.34 - ,, - `Red Hat Enterprise Linux (RHEL 8) `_, 8.10, 4.18.0-553, 2.28 - ,, - `SUSE Linux Enterprise Server (SLES) `_, 15 SP7, 6.40-150700.51, 2.38 - ,15 SP6, "6.5.0+, 6.4.0", 2.38 - ,15 SP5, 5.14.21, 2.31 - ,, - `Rocky Linux `_, 9, 5.14.0-570, 2.34 - ,, - `Oracle Linux `_, 10, 6.12.0 (UEK), 2.39 - ,9, 6.12.0 (UEK), 2.34 - ,8, 5.15.0 (UEK), 2.28 - ,, - `Debian `_,13, 6.12, 2.35 - ,12, 6.1.0, 2.36 - ,, - `Azure Linux `_,3.0, 6.6.92, 2.38 - ,, - -.. note:: - - * See `Red Hat Enterprise Linux Release Dates `_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL). - * See `List of SUSE Linux Enterprise Server kernel `_ to learn about the specific kernel version supported on SUSE Linux Enterprise Server (SLES). -.. - Footnotes and ref anchors in below historical tables should be appended with "-past-60", to differentiate from the - footnote references in the above, latest, compatibility matrix. It also allows to easily find & replace. - An easy way to work is to download the historical.CSV file, and update open it in excel. Then when content is ready, - delete the columns you don't need, to build the current compatibility matrix to use in above table. Find & replace all - instances of "-past-60" to make it ready for above table. - - -.. _past-rocm-compatibility-matrix: - -Past versions of ROCm compatibility matrix -*************************************************** - -Expand for full historical view of: - -.. dropdown:: ROCm 6.0 - Present - - You can `download the entire .csv <../downloads/compatibility-matrix-historical-6.0.csv>`_ for offline reference. - - .. csv-table:: - :file: compatibility-matrix-historical-6.0.csv - :header-rows: 1 - :stub-columns: 1 - - .. rubric:: Footnotes - - .. [#rhel-10-702-past-60] RHEL 10.0 and RHEL 9.6 are supported on all listed :ref:`supported_GPUs` except AMD Radeon PRO V620 GPU. - .. [#rhel-94-702-past-60] RHEL 9.4 is supported on all AMD Instinct GPUs listed under :ref:`supported_GPUs`. - .. [#rhel-700-past-60] **For ROCm 7.0.x** - RHEL 8.10 is supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, MI210, and MI100 GPUs. - .. [#ol-700-mi300x-past-60] **For ROCm 7.0.x** - Oracle Linux 10 and 9 are supported only on AMD Instinct MI355X, MI350X, and MI300X GPUs. Oracle Linux 8 is supported only on AMD Instinct MI300X GPU. - .. [#mi300x-past-60] **Prior ROCm 7.0.0** - Oracle Linux is supported only on AMD Instinct MI300X GPUs. - .. [#db-mi300x-past-60] **For ROCm 7.0.2** - Debian 13 is supported only on AMD Instinct MI300X GPUs. - .. [#sles-db-700-past-60] **For ROCm 7.0.x** - SLES 15 SP7 and Debian 12 are supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, and MI210 GPUs. - .. [#single-node-past-60] **Prior to ROCm 7.0.0** - Debian 12 is supported only on AMD Instinct MI300X GPUs for single-node functionality. - .. [#az-mi300x-past-60] Starting from ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710 GPUs. - .. [#az-mi300x-630-past-60] **Prior ROCm 6.4.0**- Azure Linux 3.0 is supported only on AMD Instinct MI300X GPUs. - .. 
[#rl-700-past-60] Rocky Linux 9 is supported only on AMD Instinct MI300X and MI300A GPUs. - .. [#mi350x-os-past-60] AMD Instinct MI355X (gfx950) and MI350X(gfx950) GPUs are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.4, and Oracle Linux 9. - .. [#RDNA-OS-700-past-60] **For ROCm 7.0.x** AMD Radeon PRO AI PRO R9700 (gfx1201), AMD Radeon RX 9070 XT (gfx1201), AMD Radeon RX 9070 GRE (gfx1201), AMD Radeon RX 9070 (gfx1201), AMD Radeon RX 9060 XT (gfx1200), AMD Radeon RX 9060 (gfx1200), AMD Radeon RX 7800 XT (gfx1101), AMD Radeon RX 7700 XT (gfx1101), AMD Radeon PRO W7700 (gfx1101), and AMD Radeon PRO W6800 (gfx1030) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, Oracle Linux 10, and Oracle Linux 9. - .. [#RDNA-OS-past-60] **Prior ROCm 7.0.0** - Radeon AI PRO R9700, Radeon RX 9070 XT (gfx1201), Radeon RX 9060 XT (gfx1200), Radeon PRO W7700 (gfx1101), and Radeon RX 7800 XT (gfx1101) are supported only on Ubuntu 24.04.2, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4. - .. [#rd-v710-past-60] **For ROCm 7.0.x** - AMD Radeon PRO V710 (gfx1101) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, and Azure Linux 3.0. - .. [#rd-v620-past-60] **For ROCm 7.0.x** - AMD Radeon PRO V620 (gfx1030) is supported only on Ubuntu 24.04.3 and Ubuntu 22.04.5. - .. [#mi325x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI325X GPU (gfx942) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4. - .. [#mi300x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI300X GPU (gfx942) is supported on all listed :ref:`supported_distributions`. - .. [#mi300A-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI300A GPU (gfx942) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, Debian 12, and Rocky Linux 9. - .. [#mi200x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI200 Series GPUs (gfx90a) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, and Debian 12. - .. [#mi100-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI100 GPU (gfx908) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, and RHEL 8.10. - .. [#7700XT-OS-past-60] **Prior to ROCm 7.0.0** - Radeon RX 7700 XT (gfx1101) is supported only on Ubuntu 24.04.2 and RHEL 9.6. - .. [#mi300_624-past-60] **For ROCm 6.2.4** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE]. - .. [#mi300_622-past-60] **For ROCm 6.2.2** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE]. - .. [#mi300_621-past-60] **For ROCm 6.2.1** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE]. - .. [#mi300_620-past-60] **For ROCm 6.2.0** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE]. - .. [#mi300_612-past-60] **For ROCm 6.1.2** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.4 and Oracle Linux. - .. [#mi300_611-past-60] **For ROCm 6.1.1** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.4 and Oracle Linux. - .. 
[#mi300_610-past-60] **For ROCm 6.1.0** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.4. - .. [#mi300_602-past-60] **For ROCm 6.0.2** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.3. - .. [#mi300_600-past-60] **For ROCm 6.0.0** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.3. - .. [#tf-mi350-past-60] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead. - .. [#verl_compat-past-60] verl is supported only on ROCm 6.2.0. - .. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is supported only on ROCm 6.3.0. - .. [#dgl_compat-past-60] DGL is supported only on ROCm 6.4.0. - .. [#megablocks_compat-past-60] Megablocks is supported only on ROCm 6.3.0. - .. [#taichi_compat-past-60] Taichi is supported only on ROCm 6.3.2. - .. [#ray_compat-past-60] Ray is supported only on ROCm 6.4.1. - .. [#llama-cpp_compat-past-60] llama.cpp is supported only on ROCm 7.0.0 and 6.4.x. - .. [#flashinfer_compat-past-60] FlashInfer is supported only on ROCm 6.4.1. - .. [#driver_patch-past-60] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0. - .. [#kfd_support-past-60] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix `_. - .. [#ROCT-rocr-past-60] Starting from ROCm 6.3.0, the ROCT Thunk Interface is included as part of the ROCr runtime package. - diff --git a/docs/compatibility/frameworks.rst b/docs/compatibility/frameworks.rst deleted file mode 100644 index ee790c155..000000000 --- a/docs/compatibility/frameworks.rst +++ /dev/null @@ -1,5 +0,0 @@ -************************************** -Deep learning frameworks compatibility -************************************** - -Basdflkj; jaksldf;jkasl;d jkl;fdksalsdfhguieqwrasdf .asdf diff --git a/docs/compatibility/ml-compatibility/dgl-compatibility.rst b/docs/compatibility/ml-compatibility/dgl-compatibility.rst deleted file mode 100644 index 7c61515ec..000000000 --- a/docs/compatibility/ml-compatibility/dgl-compatibility.rst +++ /dev/null @@ -1,255 +0,0 @@ -:orphan: - -.. meta:: - :description: Deep Graph Library (DGL) compatibility - :keywords: GPU, DGL compatibility - -.. version-set:: rocm_version latest - -******************************************************************************** -DGL compatibility -******************************************************************************** - -Deep Graph Library `(DGL) `_ is an easy-to-use, high-performance and scalable -Python package for deep learning on graphs. DGL is framework agnostic, meaning -if a deep graph model is a component in an end-to-end application, the rest of -the logic is implemented using PyTorch. 
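As a minimal sketch of that split (the graph structure lives in DGL, everything else is ordinary PyTorch), the following example builds a toy graph and applies one graph convolution; the feature sizes and the ``cuda`` device string (which PyTorch on ROCm maps to the HIP device) are illustrative assumptions rather than requirements.

.. code-block:: python

   import torch
   import dgl
   from dgl.nn import GraphConv  # resolves to the PyTorch-backend module

   # A toy directed 3-node cycle graph with random 16-dimensional node features.
   g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0])))
   g.ndata["feat"] = torch.randn(3, 16)

   # Everything outside the graph object is plain PyTorch.
   device = "cuda" if torch.cuda.is_available() else "cpu"
   g = g.to(device)
   conv = GraphConv(16, 8).to(device)

   h = conv(g, g.ndata["feat"])  # -> tensor of shape (3, 8)
   print(h.shape)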
- -* ROCm support for DGL is hosted in the `https://github.com/ROCm/dgl `_ repository. -* Due to independent compatibility considerations, this location differs from the `https://github.com/dmlc/dgl `_ upstream repository. -* Use the prebuilt :ref:`Docker images ` with DGL, PyTorch, and ROCm preinstalled. -* See the :doc:`ROCm DGL installation guide ` - to install and get started. - - -Supported devices -================================================================================ - -- **Officially Supported**: TF32 with AMD Instinct MI300X (through hipblaslt) -- **Partially Supported**: TF32 with AMD Instinct MI250X - - -.. _dgl-recommendations: - -Use cases and recommendations -================================================================================ - -DGL can be used for Graph Learning, and building popular graph models like -GAT, GCN and GraphSage. Using these we can support a variety of use-cases such as: - -- Recommender systems -- Network Optimization and Analysis -- 1D (Temporal) and 2D (Image) Classification -- Drug Discovery - -Multiple use cases of DGL have been tested and verified. -However, a recommended example follows a drug discovery pipeline using the ``SE3Transformer``. -Refer to the `AMD ROCm blog `_, -where you can search for DGL examples and best practices to optimize your training workflows on AMD GPUs. - -Coverage includes: - -- Single-GPU training/inference -- Multi-GPU training - - -.. _dgl-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes `DGL images `_ -with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated -inventories were tested on `ROCm 6.4.0 `_. -Click the |docker-icon| to view the image on Docker Hub. - -.. list-table:: DGL Docker image components - :header-rows: 1 - :class: docker-image-compatibility - - * - Docker - - DGL - - PyTorch - - Ubuntu - - Python - - * - .. raw:: html - - - - - `2.4.0 `_ - - `2.6.0 `_ - - 24.04 - - `3.12.9 `_ - - * - .. raw:: html - - - - - `2.4.0 `_ - - `2.4.1 `_ - - 24.04 - - `3.12.9 `_ - - - * - .. raw:: html - - - - - `2.4.0 `_ - - `2.4.1 `_ - - 22.04 - - `3.10.16 `_ - - - * - .. raw:: html - - - - - `2.4.0 `_ - - `2.3.0 `_ - - 22.04 - - `3.10.16 `_ - - -Key ROCm libraries for DGL -================================================================================ - -DGL on ROCm depends on specific libraries that affect its features and performance. -Using the DGL Docker container or building it with the provided docker file or a ROCm base image is recommended. -If you prefer to build it yourself, ensure the following dependencies are installed: - -.. list-table:: - :header-rows: 1 - - * - ROCm library - - Version - - Purpose - * - `Composable Kernel `_ - - :version-ref:`"Composable Kernel" rocm_version` - - Enables faster execution of core operations like matrix multiplication - (GEMM), convolutions and transformations. - * - `hipBLAS `_ - - :version-ref:`hipBLAS rocm_version` - - Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for - matrix and vector operations. - * - `hipBLASLt `_ - - :version-ref:`hipBLASLt rocm_version` - - hipBLASLt is an extension of the hipBLAS library, providing additional - features like epilogues fused into the matrix multiplication kernel or - use of integer tensor cores. 
- * - `hipCUB `_ - - :version-ref:`hipCUB rocm_version` - - Provides a C++ template library for parallel algorithms for reduction, - scan, sort and select. - * - `hipFFT `_ - - :version-ref:`hipFFT rocm_version` - - Provides GPU-accelerated Fast Fourier Transform (FFT) operations. - * - `hipRAND `_ - - :version-ref:`hipRAND rocm_version` - - Provides fast random number generation for GPUs. - * - `hipSOLVER `_ - - :version-ref:`hipSOLVER rocm_version` - - Provides GPU-accelerated solvers for linear systems, eigenvalues, and - singular value decompositions (SVD). - * - `hipSPARSE `_ - - :version-ref:`hipSPARSE rocm_version` - - Accelerates operations on sparse matrices, such as sparse matrix-vector - or matrix-matrix products. - * - `hipSPARSELt `_ - - :version-ref:`hipSPARSELt rocm_version` - - Accelerates operations on sparse matrices, such as sparse matrix-vector - or matrix-matrix products. - * - `hipTensor `_ - - :version-ref:`hipTensor rocm_version` - - Optimizes for high-performance tensor operations, such as contractions. - * - `MIOpen `_ - - :version-ref:`MIOpen rocm_version` - - Optimizes deep learning primitives such as convolutions, pooling, - normalization, and activation functions. - * - `MIGraphX `_ - - :version-ref:`MIGraphX rocm_version` - - Adds graph-level optimizations, ONNX models and mixed precision support - and enable Ahead-of-Time (AOT) Compilation. - * - `MIVisionX `_ - - :version-ref:`MIVisionX rocm_version` - - Optimizes acceleration for computer vision and AI workloads like - preprocessing, augmentation, and inferencing. - * - `rocAL `_ - - :version-ref:`rocAL rocm_version` - - Accelerates the data pipeline by offloading intensive preprocessing and - augmentation tasks. rocAL is part of MIVisionX. - * - `RCCL `_ - - :version-ref:`RCCL rocm_version` - - Optimizes for multi-GPU communication for operations like AllReduce and - Broadcast. - * - `rocDecode `_ - - :version-ref:`rocDecode rocm_version` - - Provides hardware-accelerated data decoding capabilities, particularly - for image, video, and other dataset formats. - * - `rocJPEG `_ - - :version-ref:`rocJPEG rocm_version` - - Provides hardware-accelerated JPEG image decoding and encoding. - * - `RPP `_ - - :version-ref:`RPP rocm_version` - - Speeds up data augmentation, transformation, and other preprocessing steps. - * - `rocThrust `_ - - :version-ref:`rocThrust rocm_version` - - Provides a C++ template library for parallel algorithms like sorting, - reduction, and scanning. - * - `rocWMMA `_ - - :version-ref:`rocWMMA rocm_version` - - Accelerates warp-level matrix-multiply and matrix-accumulate to speed up matrix - multiplication (GEMM) and accumulation operations with mixed precision - support. - - -Supported features -================================================================================ - -Many functions and methods available in DGL Upstream are also supported in DGL ROCm. -Instead of listing them all, support is grouped into the following categories to provide a general overview. 
- -* DGL Base -* DGL Backend -* DGL Data -* DGL Dataloading -* DGL DGLGraph -* DGL Function -* DGL Ops -* DGL Sampling -* DGL Transforms -* DGL Utils -* DGL Distributed -* DGL Geometry -* DGL Mpops -* DGL NN -* DGL Optim -* DGL Sparse - - -Unsupported features -================================================================================ - -* Graphbolt -* Partial TF32 Support (MI250x only) -* Kineto/ ROCTracer integration - - -Unsupported functions -================================================================================ - -* ``more_nnz`` -* ``format`` -* ``multiprocess_sparse_adam_state_dict`` -* ``record_stream_ndarray`` -* ``half_spmm`` -* ``segment_mm`` -* ``gather_mm_idx_b`` -* ``pgexplainer`` -* ``sample_labors_prob`` -* ``sample_labors_noprob`` diff --git a/docs/compatibility/ml-compatibility/flashinfer-compatibility.rst b/docs/compatibility/ml-compatibility/flashinfer-compatibility.rst deleted file mode 100644 index 45ecc6a75..000000000 --- a/docs/compatibility/ml-compatibility/flashinfer-compatibility.rst +++ /dev/null @@ -1,107 +0,0 @@ -:orphan: - -.. meta:: - :description: FlashInfer deep learning framework compatibility - :keywords: GPU, LLM, FlashInfer, compatibility - -.. version-set:: rocm_version latest - -******************************************************************************** -FlashInfer compatibility -******************************************************************************** - -`FlashInfer `__ is a library and kernel generator -for Large Language Models (LLMs) that provides high-performance implementation of graphics -processing units (GPUs) kernels. FlashInfer focuses on LLM serving and inference, as well -as advanced performance across diverse scenarios. - -FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized -techniques, while supporting customized attention variants. It’s compatible with ``torch.compile``, and -offers high-performance LLM-specific operators, with easy integration through PyTorch, and C++ APIs. - -.. note:: - - The ROCm port of FlashInfer is under active development, and some features are not yet available. - For the latest feature compatibility matrix, refer to the ``README`` of the - `https://github.com/ROCm/flashinfer `__ repository. - -Support for the ROCm port of FlashInfer is available as follows: - -- ROCm support for FlashInfer is hosted in the `https://github.com/ROCm/flashinfer - `__ repository. This location differs from the - `https://github.com/flashinfer-ai/flashinfer `_ - upstream repository. - -- To install FlashInfer, use the prebuilt :ref:`Docker image `, - which includes ROCm, FlashInfer, and all required dependencies. - - - See the :doc:`ROCm FlashInfer installation guide ` - to install and get started. - - - See the `Installation guide `__ - in the upstream FlashInfer documentation. - -.. note:: - - Flashinfer is supported on ROCm 6.4.1. - -Supported devices -================================================================================ - -**Officially Supported**: AMD Instinct™ MI300X - - -.. _flashinfer-recommendations: - -Use cases and recommendations -================================================================================ - -This release of FlashInfer on ROCm provides the decode functionality for LLM inferencing. -In the decode phase, tokens are generated sequentially, with the model predicting each new -token based on the previously generated tokens and the input context. 
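As a schematic illustration of that decode loop (not FlashInfer's API; ``model`` is a placeholder for any causal language model that returns next-token logits), the generation step can be sketched as:

.. code-block:: python

   import torch

   @torch.no_grad()
   def greedy_decode(model, input_ids, max_new_tokens=32, eos_id=None):
       # Decode phase: emit one token at a time, each prediction conditioned
       # on the prompt plus all previously generated tokens.
       for _ in range(max_new_tokens):
           logits = model(input_ids)                       # (batch, seq, vocab)
           next_id = logits[:, -1, :].argmax(-1, keepdim=True)
           input_ids = torch.cat([input_ids, next_id], dim=-1)
           if eos_id is not None and bool((next_id == eos_id).all()):
               break
       return input_ids

Attention kernels such as those provided by FlashInfer accelerate the attention computed inside ``model`` at every iteration of this loop.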
- -FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense -attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs. - -Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm -also implements cascade attention from upstream to reduce memory usage. - -For currently supported use cases and recommendations, refer to the `AMD ROCm blog `__, -where you can search for examples and best practices to optimize your workloads on AMD GPUs. - -.. _flashinfer-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes `ROCm FlashInfer images `__ -with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated -inventories represent the FlashInfer version from the official Docker Hub. -The Docker images have been validated for `ROCm 6.4.1 `__. -Click |docker-icon| to view the image on Docker Hub. - -.. list-table:: - :header-rows: 1 - :class: docker-image-compatibility - - * - Docker image - - ROCm - - FlashInfer - - PyTorch - - Ubuntu - - Python - - * - .. raw:: html - - rocm/flashinfer - - `6.4.1 `__ - - `v0.2.5 `__ - - `2.7.1 `__ - - 24.04 - - `3.12 `__ - - diff --git a/docs/compatibility/ml-compatibility/jax-compatibility.rst b/docs/compatibility/ml-compatibility/jax-compatibility.rst deleted file mode 100644 index 479256c48..000000000 --- a/docs/compatibility/ml-compatibility/jax-compatibility.rst +++ /dev/null @@ -1,303 +0,0 @@ -:orphan: - -.. meta:: - :description: JAX compatibility - :keywords: GPU, JAX compatibility - -.. version-set:: rocm_version latest - -******************************************************************************* -JAX compatibility -******************************************************************************* - -JAX provides a NumPy-like API, which combines automatic differentiation and the -Accelerated Linear Algebra (XLA) compiler to achieve high-performance machine -learning at scale. - -JAX uses composable transformations of Python and NumPy through just-in-time -(JIT) compilation, automatic vectorization, and parallelization. To learn about -JAX, including profiling and optimizations, see the official `JAX documentation -`_. - -ROCm support for JAX is upstreamed, and users can build the official source code -with ROCm support: - -- ROCm JAX release: - - - Offers AMD-validated and community :ref:`Docker images ` - with ROCm and JAX preinstalled. - - - ROCm JAX repository: `ROCm/rocm-jax `_ - - - See the :doc:`ROCm JAX installation guide ` - to get started. - -- Official JAX release: - - - Official JAX repository: `jax-ml/jax `_ - - - See the `AMD GPU (Linux) installation section - `_ in - the JAX documentation. - -.. note:: - - AMD releases official `ROCm JAX Docker images `_ - quarterly alongside new ROCm releases. These images undergo full AMD testing. - `Community ROCm JAX Docker images `_ - follow upstream JAX releases and use the latest available ROCm version. - -Use cases and recommendations -================================================================================ - -* The `nanoGPT in JAX `_ - blog explores the implementation and training of a Generative Pre-trained - Transformer (GPT) model in JAX, inspired by Andrej Karpathy’s JAX-based - nanoGPT. 
-  Comparing how essential GPT components, such as self-attention
-  mechanisms and optimizers, are realized in PyTorch and JAX also highlights
-  JAX's unique features.
-
-* The `Optimize GPT Training: Enabling Mixed Precision Training in JAX using
-  ROCm on AMD GPUs `_
-  blog post provides a comprehensive guide on enhancing the training efficiency
-  of GPT models by implementing mixed precision techniques in JAX, specifically
-  tailored for AMD GPUs utilizing the ROCm platform.
-
-* The `Supercharging JAX with Triton Kernels on AMD GPUs `_
-  blog demonstrates how to develop a custom fused dropout-activation kernel for
-  matrices using Triton, integrate it with JAX, and benchmark its performance
-  using ROCm.
-
-* The `Distributed fine-tuning with JAX on AMD GPUs `_
-  blog post outlines the process of fine-tuning a Bidirectional Encoder
-  Representations from Transformers (BERT)-based large language model (LLM)
-  using JAX for a text classification task. The blog post discusses techniques
-  for parallelizing the fine-tuning across multiple AMD GPUs and assesses the
-  model's performance on a holdout dataset. During the fine-tuning, a
-  BERT-base-cased transformer model and the General Language Understanding
-  Evaluation (GLUE) benchmark dataset were used on a multi-GPU setup.
-
-* The `MI300X workload optimization guide `_
-  provides detailed guidance on optimizing workloads for the AMD Instinct MI300X
-  accelerator using ROCm. The page helps users achieve optimal performance for
-  deep learning and other high-performance computing tasks on the MI300X GPU.
-
-For more use cases and recommendations, see `ROCm JAX blog posts `_.
-
-.. _jax-docker-compat:
-
-Docker image compatibility
-================================================================================
-
-AMD provides preconfigured Docker images with JAX and the ROCm backend.
-These images are published on `Docker Hub `__ and are the
-recommended way to get started with deep learning with JAX on ROCm.
-For ``jax-community`` images, see `rocm/jax-community
-`__ on Docker Hub.
-
-To find the right image tag, see the :ref:`JAX on ROCm installation
-documentation ` for a list of
-available ``rocm/jax`` images.
-
-.. _key_rocm_libraries:
-
-Key ROCm libraries for JAX
-================================================================================
-
-The following ROCm libraries represent potential targets that could be utilized
-by JAX on ROCm for various computational tasks. The actual libraries used will
-depend on the specific implementation and operations performed.
-
-.. list-table::
-   :header-rows: 1
-
-   * - ROCm library
-     - Version
-     - Purpose
-   * - `hipBLAS `_
-     - :version-ref:`hipBLAS rocm_version`
-     - Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for
-       matrix and vector operations.
-   * - `hipBLASLt `_
-     - :version-ref:`hipBLASLt rocm_version`
-     - hipBLASLt is an extension of hipBLAS, providing additional
-       features like epilogues fused into the matrix multiplication kernel or
-       use of integer tensor cores.
-   * - `hipCUB `_
-     - :version-ref:`hipCUB rocm_version`
-     - Provides a C++ template library for parallel algorithms for reduction,
-       scan, sort and select.
-   * - `hipFFT `_
-     - :version-ref:`hipFFT rocm_version`
-     - Provides GPU-accelerated Fast Fourier Transform (FFT) operations.
-   * - `hipRAND `_
-     - :version-ref:`hipRAND rocm_version`
-     - Provides fast random number generation for GPUs.
- * - `hipSOLVER `_ - - :version-ref:`hipSOLVER rocm_version` - - Provides GPU-accelerated solvers for linear systems, eigenvalues, and - singular value decompositions (SVD). - * - `hipSPARSE `_ - - :version-ref:`hipSPARSE rocm_version` - - Accelerates operations on sparse matrices, such as sparse matrix-vector - or matrix-matrix products. - * - `hipSPARSELt `_ - - :version-ref:`hipSPARSELt rocm_version` - - Accelerates operations on sparse matrices, such as sparse matrix-vector - or matrix-matrix products. - * - `MIOpen `_ - - :version-ref:`MIOpen rocm_version` - - Optimized for deep learning primitives such as convolutions, pooling, - normalization, and activation functions. - * - `RCCL `_ - - :version-ref:`RCCL rocm_version` - - Optimized for multi-GPU communication for operations like all-reduce, - broadcast, and scatter. - * - `rocThrust `_ - - :version-ref:`rocThrust rocm_version` - - Provides a C++ template library for parallel algorithms like sorting, - reduction, and scanning. - -.. note:: - - This table shows ROCm libraries that could potentially be utilized by JAX. Not - all libraries may be used in every configuration, and the actual library usage - will depend on the specific operations and implementation details. - -Supported data types and modules -=============================================================================== - -The following tables lists the supported public JAX API data types and modules. - -Supported data types --------------------------------------------------------------------------------- - -ROCm supports all the JAX data types of `jax.dtypes `_ -module, `jax.numpy.dtype `_ -and `default_dtype `_ . -The ROCm supported data types in JAX are collected in the following table. - -.. list-table:: - :header-rows: 1 - - * - Data type - - Description - - * - ``bfloat16`` - - 16-bit bfloat (brain floating point). - - * - ``bool`` - - Boolean. - - * - ``complex128`` - - 128-bit complex. - - * - ``complex64`` - - 64-bit complex. - - * - ``float16`` - - 16-bit (half precision) floating-point. - - * - ``float32`` - - 32-bit (single precision) floating-point. - - * - ``float64`` - - 64-bit (double precision) floating-point. - - * - ``half`` - - 16-bit (half precision) floating-point. - - * - ``int16`` - - Signed 16-bit integer. - - * - ``int32`` - - Signed 32-bit integer. - - * - ``int64`` - - Signed 64-bit integer. - - * - ``int8`` - - Signed 8-bit integer. - - * - ``uint16`` - - Unsigned 16-bit (word) integer. - - * - ``uint32`` - - Unsigned 32-bit (dword) integer. - - * - ``uint64`` - - Unsigned 64-bit (qword) integer. - - * - ``uint8`` - - Unsigned 8-bit (byte) integer. - -.. note:: - - JAX data type support is effected by the :ref:`key_rocm_libraries` and it's - collected on :doc:`ROCm data types and precision support ` - page. - -Supported modules --------------------------------------------------------------------------------- - -For a complete and up-to-date list of JAX public modules (for example, ``jax.numpy``, -``jax.scipy``, ``jax.lax``), their descriptions, and usage, please refer directly to the -`official JAX API documentation `_. - -.. note:: - - Since version 0.1.56, JAX has full support for ROCm, and the - :ref:`Known issues and important notes ` section - contains details about limitations specific to the ROCm backend. The list of - JAX API modules are maintained by the JAX project and is subject to change. - Refer to the official Jax documentation for the most up-to-date information. 
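-
-As a quick, hedged sanity check of the data types listed above, the snippet
-below creates arrays with a few of the supported dtypes and verifies that JAX
-can see the ROCm GPUs. Note that 64-bit floats are only honored once x64 mode
-is enabled; otherwise JAX downcasts the array to 32 bits and emits a warning.
-The exact device listing depends on your installation.
-
-.. code-block:: python
-
-   import jax
-   import jax.numpy as jnp
-
-   # On a working ROCm install this should list the available GPU devices.
-   print(jax.devices())
-
-   # bfloat16 and float32 work with the default configuration.
-   x = jnp.arange(8, dtype=jnp.bfloat16)
-   print(x.dtype)  # bfloat16
-
-   # Enable 64-bit mode before requesting float64, or the array is downcast.
-   jax.config.update("jax_enable_x64", True)
-   y = jnp.ones(4, dtype=jnp.float64)
-   print(y.dtype)  # float64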
- -Key features and enhancements for ROCm 7.0 -=============================================================================== - -- Upgraded XLA backend: Integrates a newer XLA version, enabling better - optimizations, broader operator support, and potential performance gains. - -- RNN support: Native RNN support (including LSTMs via ``jax.experimental.rnn``) - now available on ROCm, aiding sequence model development. - -- Comprehensive linear algebra capabilities: Offers robust ``jax.linalg`` - operations, essential for scientific and machine learning tasks. - -- Expanded AMD GPU architecture support: Provides ongoing support for gfx1101 - GPUs and introduces support for gfx950 and gfx12xx GPUs. - -- Mixed FP8 precision support: Enables ``lax.dot_general`` operations with mixed FP8 - types, offering pathways for memory and compute efficiency. - -- Streamlined PyPi packaging: Provides reliable PyPi wheels for JAX on ROCm, - simplifying the installation process. - -- Pallas experimental kernel development: Continued Pallas framework - enhancements for custom GPU kernels, including new intrinsics (specific - kernel behaviors under review). - -- Improved build system and CI: Enhanced ROCm build system and CI for greater - reliability and maintainability. - -- Enhanced distributed computing setup: Improved JAX setup in multi-GPU - distributed environments. - -.. _jax_comp_known_issues: - -Known issues and notes for ROCm 7.0 -=============================================================================== - -- ``nn.dot_product_attention``: Certain configurations of ``jax.nn.dot_product_attention`` - may cause segmentation faults, though the majority of use cases work correctly. - -- SVD with dynamic shapes: SVD on inputs with dynamic/symbolic shapes might result in an error. - SVD with static shapes is unaffected. - -- QR decomposition with symbolic shapes: QR decomposition operations may fail when using - symbolic/dynamic shapes in shape polymorphic contexts. - -- Pallas kernels: Specific advanced Pallas kernels may exhibit variations in - numerical output or resource usage. These are actively reviewed as part of - Pallas's experimental development. diff --git a/docs/compatibility/ml-compatibility/llama-cpp-compatibility.rst b/docs/compatibility/ml-compatibility/llama-cpp-compatibility.rst deleted file mode 100644 index 902c61a2a..000000000 --- a/docs/compatibility/ml-compatibility/llama-cpp-compatibility.rst +++ /dev/null @@ -1,275 +0,0 @@ -:orphan: - -.. meta:: - :description: llama.cpp deep learning framework compatibility - :keywords: GPU, GGML, llama.cpp compatibility - -.. version-set:: rocm_version latest - -******************************************************************************** -llama.cpp compatibility -******************************************************************************** - -`llama.cpp `__ is an open-source framework -for Large Language Model (LLM) inference that runs on both central processing units -(CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing -a simple, dependency-free setup. - -The framework supports multiple quantization options, from 1.5-bit to 8-bit integers, -to accelerate inference and reduce memory usage. Originally built as a CPU-first library, -llama.cpp is easy to integrate with other programming environments and is widely -adopted across diverse platforms, including consumer devices. 
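-
-As a rough, back-of-the-envelope illustration of why these low-bit formats
-matter, weight memory scales with the number of bits stored per parameter. The
-figures below assume a hypothetical 7-billion-parameter model, use nominal
-bit-widths (real GGUF quantization formats add a small amount of scale
-metadata), and ignore the KV cache and activations.
-
-.. code-block:: python
-
-   # Approximate weight-only memory for a hypothetical 7B-parameter model.
-   PARAMS = 7e9
-
-   for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_0", 4), ("Q2_K", 2)]:
-       gib = PARAMS * bits / 8 / 2**30
-       print(f"{name:>5}: ~{gib:.1f} GiB")
-
-At FP16 the weights alone need roughly 13 GiB, while a nominal 4-bit
-quantization brings that down to roughly 3.3 GiB, which is what makes hybrid
-CPU and GPU inference of larger models practical on limited VRAM.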
- -ROCm support for llama.cpp is upstreamed, and you can build the official source code -with ROCm support: - -- ROCm support for llama.cpp is hosted in the official `https://github.com/ROCm/llama.cpp - `_ repository. - -- Due to independent compatibility considerations, this location differs from the - `https://github.com/ggml-org/llama.cpp `_ upstream repository. - -- To install llama.cpp, use the prebuilt :ref:`Docker image `, - which includes ROCm, llama.cpp, and all required dependencies. - - - See the :doc:`ROCm llama.cpp installation guide ` - to install and get started. - - - See the `Installation guide `__ - in the upstream llama.cpp documentation. - -.. note:: - - llama.cpp is supported on ROCm 7.0.0 and ROCm 6.4.x. - -Supported devices -================================================================================ - -**Officially Supported**: AMD Instinct™ MI300X, MI325X, MI210 - - -Use cases and recommendations -================================================================================ - -llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements: - -- Plain C/C++ implementation with no external dependencies -- Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage -- Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running large language models (LLMs) on AMD GPUs (graphics processing units) -- CPU (central processing unit) + GPU (graphics processing unit) hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory) - -llama.cpp is also used in a range of real-world applications, including: - -- Games such as `Lucy's Labyrinth `__: - A simple maze game where AI-controlled agents attempt to trick the player. -- Tools such as `Styled Lines `__: - A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example. -- Various other AI applications use llama.cpp as their inference engine; - for a detailed list, see the `user interfaces (UIs) section `__. - -For more use cases and recommendations, refer to the `AMD ROCm blog `__, -where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs. - -- The `Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration `__ - blog post outlines how the open-source llama.cpp framework enables efficient LLM inference—including interactive inference with ``llama-cli``, - server deployment with ``llama-server``, GGUF model preparation and quantization, performance benchmarking, and optimizations tailored for - AMD Instinct GPUs within the ROCm ecosystem. - -.. _llama-cpp-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes `ROCm llama.cpp Docker images `__ -with ROCm backends on Docker Hub. The following Docker image tags and associated -inventories represent the available llama.cpp versions from the official Docker Hub. -Click |docker-icon| to view the image on Docker Hub. - -.. 
important:: - - Tag endings of ``_full``, ``_server``, and ``_light`` serve different purposes for entrypoints as follows: - - - Full: This image includes both the main executable file and the tools to convert ``LLaMA`` models into ``ggml`` and convert into 4-bit quantization. - - Server: This image only includes the server executable file. - - Light: This image only includes the main executable file. - -.. list-table:: - :header-rows: 1 - :class: docker-image-compatibility - - * - Full Docker - - Server Docker - - Light Docker - - llama.cpp - - ROCm - - Ubuntu - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `7.0.0 `__ - - 24.04 - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `7.0.0 `__ - - 22.04 - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `6.4.3 `__ - - 24.04 - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `6.4.3 `__ - - 22.04 - - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `6.4.2 `__ - - 24.04 - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `6.4.2 `__ - - 22.04 - - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `6.4.1 `__ - - 24.04 - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b6356 `__ - - `6.4.1 `__ - - 22.04 - - * - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - .. raw:: html - - rocm/llama.cpp - - `b5997 `__ - - `6.4.0 `__ - - 24.04 - - -Key ROCm libraries for llama.cpp -================================================================================ - -llama.cpp functionality on ROCm is determined by its underlying library -dependencies. These ROCm components affect the capabilities, performance, and -feature set available to developers. Ensure you have the required libraries for -your corresponding ROCm version. - -.. list-table:: - :header-rows: 1 - - * - ROCm library - - ROCm 7.0.0 version - - ROCm 6.4.x version - - Purpose - - Usage - * - `hipBLAS `__ - - 3.0.0 - - 2.4.0 - - Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for - matrix and vector operations. - - Supports operations such as matrix multiplication, matrix-vector - products, and tensor contractions. Utilized in both dense and batched - linear algebra operations. - * - `hipBLASLt `__ - - 1.0.0 - - 0.12.0 - - hipBLASLt is an extension of the hipBLAS library, providing additional - features like epilogues fused into the matrix multiplication kernel or - use of integer tensor cores. - - By setting the flag ``ROCBLAS_USE_HIPBLASLT``, you can dispatch hipblasLt - kernels where possible. - * - `rocWMMA `__ - - 2.0.0 - - 1.7.0 - - Accelerates warp-level matrix-multiply and matrix-accumulate to speed up matrix - multiplication (GEMM) and accumulation operations with mixed precision - support. - - Can be used to enhance the flash attention performance on AMD compute, by enabling - the flag during compile time. 
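-
-The ``ROCBLAS_USE_HIPBLASLT`` dispatch mentioned in the table above is
-controlled through an environment variable at run time. The sketch below shows
-one hedged way to set it before launching ``llama-cli``; the binary path, model
-file, and prompt are placeholders, setting the variable to ``1`` is assumed to
-opt in to the hipBLASLt path, and the exact CLI flags can vary between
-llama.cpp builds.
-
-.. code-block:: python
-
-   import os
-   import subprocess
-
-   # Hypothetical paths; substitute your own build directory and GGUF model.
-   LLAMA_CLI = "/opt/llama.cpp/build/bin/llama-cli"
-   MODEL = "/models/model-q4_0.gguf"
-
-   env = dict(os.environ)
-   env["ROCBLAS_USE_HIPBLASLT"] = "1"  # assumed to enable hipBLASLt dispatch
-
-   subprocess.run(
-       [LLAMA_CLI,
-        "-m", MODEL,
-        "-p", "Explain mixture-of-experts in one sentence.",
-        "-ngl", "99"],  # offload as many layers as possible to the GPU
-       env=env,
-       check=True,
-   )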
- -Previous versions -=============================================================================== -See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/llama-cpp-history` to find documentation for previous releases -of the ``ROCm/llama.cpp`` Docker image. \ No newline at end of file diff --git a/docs/compatibility/ml-compatibility/megablocks-compatibility.rst b/docs/compatibility/ml-compatibility/megablocks-compatibility.rst deleted file mode 100644 index 50c2c3821..000000000 --- a/docs/compatibility/ml-compatibility/megablocks-compatibility.rst +++ /dev/null @@ -1,93 +0,0 @@ -:orphan: - -.. meta:: - :description: Megablocks compatibility - :keywords: GPU, megablocks, compatibility - -.. version-set:: rocm_version latest - -******************************************************************************** -Megablocks compatibility -******************************************************************************** - -Megablocks is a light-weight library for mixture-of-experts (MoE) training. -The core of the system is efficient "dropless-MoE" and standard MoE layers. -Megablocks is integrated with `https://github.com/stanford-futuredata/Megatron-LM `_, -where data and pipeline parallel training of MoEs is supported. - -* ROCm support for Megablocks is hosted in the official `https://github.com/ROCm/megablocks `_ repository. -* Due to independent compatibility considerations, this location differs from the `https://github.com/stanford-futuredata/Megatron-LM `_ upstream repository. -* Use the prebuilt :ref:`Docker image ` with ROCm, PyTorch, and Megablocks preinstalled. -* See the :doc:`ROCm Megablocks installation guide ` to install and get started. - -.. note:: - - Megablocks is supported on ROCm 6.3.0. - -Supported devices -================================================================================ - -- **Officially Supported**: AMD Instinct MI300X -- **Partially Supported** (functionality or performance limitations): AMD Instinct MI250X, MI210 - -Supported models and features -================================================================================ - -This section summarizes the Megablocks features supported by ROCm. - -* Distributed Pre-training -* Activation Checkpointing and Recomputation -* Distributed Optimizer -* Mixture-of-Experts -* dropless-Mixture-of-Experts - - -.. _megablocks-recommendations: - -Use cases and recommendations -================================================================================ - -The `ROCm Megablocks blog posts `_ -guide how to leverage the ROCm platform for pre-training using the Megablocks framework. -It features how to pre-process datasets and how to begin pre-training on AMD GPUs through: - -* Single-GPU pre-training -* Multi-GPU pre-training - - -.. _megablocks-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes `ROCm Megablocks images `_ -with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated -inventories represent the latest Megatron-LM version from the official Docker Hub. -The Docker images have been validated for `ROCm 6.3.0 `_. -Click |docker-icon| to view the image on Docker Hub. - -.. list-table:: - :header-rows: 1 - :class: docker-image-compatibility - - * - Docker image - - ROCm - - Megablocks - - PyTorch - - Ubuntu - - Python - - * - .. 
raw:: html - - rocm/megablocks - - `6.3.0 `_ - - `0.7.0 `_ - - `2.4.0 `_ - - 24.04 - - `3.12.9 `_ - - diff --git a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst b/docs/compatibility/ml-compatibility/pytorch-compatibility.rst deleted file mode 100644 index 245296532..000000000 --- a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst +++ /dev/null @@ -1,477 +0,0 @@ -:orphan: - -.. meta:: - :description: PyTorch compatibility - :keywords: GPU, PyTorch compatibility - -.. version-set:: rocm_version latest - -******************************************************************************** -PyTorch compatibility -******************************************************************************** - -`PyTorch `__ is an open-source tensor library designed for -deep learning. PyTorch on ROCm provides mixed-precision and large-scale training -using `MIOpen `__ and -`RCCL `__ libraries. - -ROCm support for PyTorch is upstreamed into the official PyTorch repository. Due -to independent compatibility considerations, this results in two distinct -release cycles for PyTorch on ROCm: - -- ROCm PyTorch release: - - - Provides the latest version of ROCm but might not necessarily support the - latest stable PyTorch version. - - - Offers :ref:`Docker images ` with ROCm and PyTorch - preinstalled. - - - ROCm PyTorch repository: ``__ - - - See the :doc:`ROCm PyTorch installation guide ` - to get started. - -- Official PyTorch release: - - - Provides the latest stable version of PyTorch but might not necessarily - support the latest ROCm version. - - - Official PyTorch repository: ``__ - - - See the `Nightly and latest stable version installation guide `__ - or `Previous versions `__ - to get started. - -PyTorch includes tooling that generates HIP source code from the CUDA backend. -This approach allows PyTorch to support ROCm without requiring manual code -modifications. For more information, see :doc:`HIPIFY `. - -ROCm development is aligned with the stable release of PyTorch, while upstream -PyTorch testing uses the stable release of ROCm to maintain consistency. - -.. _pytorch-recommendations: - -Use cases and recommendations -================================================================================ - -* :doc:`Using ROCm for AI: training a model ` - guides how to leverage the ROCm platform for training AI models. It covers the - steps, tools, and best practices for optimizing training workflows on AMD GPUs - using PyTorch features. - -* :doc:`Single-GPU fine-tuning and inference ` - describes and demonstrates how to use the ROCm platform for the fine-tuning - and inference of machine learning models, particularly large language models - (LLMs), on systems with a single GPU. This topic provides a detailed guide for - setting up, optimizing, and executing fine-tuning and inference workflows in - such environments. - -* :doc:`Multi-GPU fine-tuning and inference optimization ` - describes and demonstrates the fine-tuning and inference of machine learning - models on systems with multiple GPUs. - -* The :doc:`Instinct MI300X workload optimization guide ` - provides detailed guidance on optimizing workloads for the AMD Instinct MI300X - accelerator using ROCm. This guide helps users achieve optimal performance for - deep learning and other high-performance computing tasks on the MI300X - accelerator. 
- -* The :doc:`Inception with PyTorch documentation ` - describes how PyTorch integrates with ROCm for AI workloads It outlines the - use of PyTorch on the ROCm platform and focuses on efficiently leveraging AMD - GPU hardware for training and inference tasks in AI applications. - -For more use cases and recommendations, see `ROCm PyTorch blog posts `__. - -.. _pytorch-docker-compat: - -Docker image compatibility -================================================================================ - -AMD provides preconfigured Docker images with PyTorch and the ROCm backend. -These images are published on `Docker Hub `__ and are the -recommended way to get started with deep learning with PyTorch on ROCm. - -To find the right image tag, see the :ref:`PyTorch on ROCm installation -documentation ` for a list of -available ``rocm/pytorch`` images. - -Key ROCm libraries for PyTorch -================================================================================ - -PyTorch functionality on ROCm is determined by its underlying library -dependencies. These ROCm components affect the capabilities, performance, and -feature set available to developers. - -.. list-table:: - :header-rows: 1 - - * - ROCm library - - Version - - Purpose - - Used in - * - `Composable Kernel `__ - - :version-ref:`"Composable Kernel" rocm_version` - - Enables faster execution of core operations like matrix multiplication - (GEMM), convolutions and transformations. - - Speeds up ``torch.permute``, ``torch.view``, ``torch.matmul``, - ``torch.mm``, ``torch.bmm``, ``torch.nn.Conv2d``, ``torch.nn.Conv3d`` - and ``torch.nn.MultiheadAttention``. - * - `hipBLAS `__ - - :version-ref:`hipBLAS rocm_version` - - Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for - matrix and vector operations. - - Supports operations such as matrix multiplication, matrix-vector - products, and tensor contractions. Utilized in both dense and batched - linear algebra operations. - * - `hipBLASLt `__ - - :version-ref:`hipBLASLt rocm_version` - - hipBLASLt is an extension of the hipBLAS library, providing additional - features like epilogues fused into the matrix multiplication kernel or - use of integer tensor cores. - - Accelerates operations such as ``torch.matmul``, ``torch.mm``, and the - matrix multiplications used in convolutional and linear layers. - * - `hipCUB `__ - - :version-ref:`hipCUB rocm_version` - - Provides a C++ template library for parallel algorithms for reduction, - scan, sort and select. - - Supports operations such as ``torch.sum``, ``torch.cumsum``, - ``torch.sort`` irregular shapes often involve scanning, sorting, and - filtering, which hipCUB handles efficiently. - * - `hipFFT `__ - - :version-ref:`hipFFT rocm_version` - - Provides GPU-accelerated Fast Fourier Transform (FFT) operations. - - Used in functions like the ``torch.fft`` module. - * - `hipRAND `__ - - :version-ref:`hipRAND rocm_version` - - Provides fast random number generation for GPUs. - - The ``torch.rand``, ``torch.randn``, and stochastic layers like - ``torch.nn.Dropout`` rely on hipRAND. - * - `hipSOLVER `__ - - :version-ref:`hipSOLVER rocm_version` - - Provides GPU-accelerated solvers for linear systems, eigenvalues, and - singular value decompositions (SVD). - - Supports functions like ``torch.linalg.solve``, - ``torch.linalg.eig``, and ``torch.linalg.svd``. - * - `hipSPARSE `__ - - :version-ref:`hipSPARSE rocm_version` - - Accelerates operations on sparse matrices, such as sparse matrix-vector - or matrix-matrix products. 
- - Sparse tensor operations ``torch.sparse``. - * - `hipSPARSELt `__ - - :version-ref:`hipSPARSELt rocm_version` - - Accelerates operations on sparse matrices, such as sparse matrix-vector - or matrix-matrix products. - - Sparse tensor operations ``torch.sparse``. - * - `hipTensor `__ - - :version-ref:`hipTensor rocm_version` - - Optimizes for high-performance tensor operations, such as contractions. - - Accelerates tensor algebra, especially in deep learning and scientific - computing. - * - `MIOpen `__ - - :version-ref:`MIOpen rocm_version` - - Optimizes deep learning primitives such as convolutions, pooling, - normalization, and activation functions. - - Speeds up convolutional neural networks (CNNs), recurrent neural - networks (RNNs), and other layers. Used in operations like - ``torch.nn.Conv2d``, ``torch.nn.ReLU``, and ``torch.nn.LSTM``. - * - `MIGraphX `__ - - :version-ref:`MIGraphX rocm_version` - - Adds graph-level optimizations, ONNX models and mixed precision support - and enable Ahead-of-Time (AOT) Compilation. - - Speeds up inference models and executes ONNX models for - compatibility with other frameworks. - ``torch.nn.Conv2d``, ``torch.nn.ReLU``, and ``torch.nn.LSTM``. - * - `MIVisionX `__ - - :version-ref:`MIVisionX rocm_version` - - Optimizes acceleration for computer vision and AI workloads like - preprocessing, augmentation, and inferencing. - - Faster data preprocessing and augmentation pipelines for datasets like - ImageNet or COCO and easy to integrate into PyTorch's ``torch.utils.data`` - and ``torchvision`` workflows. - * - `rocAL `__ - - :version-ref:`rocAL rocm_version` - - Accelerates the data pipeline by offloading intensive preprocessing and - augmentation tasks. rocAL is part of MIVisionX. - - Easy to integrate into PyTorch's ``torch.utils.data`` and - ``torchvision`` data load workloads. - * - `RCCL `__ - - :version-ref:`RCCL rocm_version` - - Optimizes for multi-GPU communication for operations like AllReduce and - Broadcast. - - Distributed data parallel training (``torch.nn.parallel.DistributedDataParallel``). - Handles communication in multi-GPU setups. - * - `rocDecode `__ - - :version-ref:`rocDecode rocm_version` - - Provides hardware-accelerated data decoding capabilities, particularly - for image, video, and other dataset formats. - - Can be integrated in ``torch.utils.data``, ``torchvision.transforms`` - and ``torch.distributed``. - * - `rocJPEG `__ - - :version-ref:`rocJPEG rocm_version` - - Provides hardware-accelerated JPEG image decoding and encoding. - - GPU accelerated ``torchvision.io.decode_jpeg`` and - ``torchvision.io.encode_jpeg`` and can be integrated in - ``torch.utils.data`` and ``torchvision``. - * - `RPP `__ - - :version-ref:`RPP rocm_version` - - Speeds up data augmentation, transformation, and other preprocessing steps. - - Easy to integrate into PyTorch's ``torch.utils.data`` and - ``torchvision`` data load workloads to speed up data processing. - * - `rocThrust `__ - - :version-ref:`rocThrust rocm_version` - - Provides a C++ template library for parallel algorithms like sorting, - reduction, and scanning. - - Utilized in backend operations for tensor computations requiring - parallel processing. - * - `rocWMMA `__ - - :version-ref:`rocWMMA rocm_version` - - Accelerates warp-level matrix-multiply and matrix-accumulate to speed up matrix - multiplication (GEMM) and accumulation operations with mixed precision - support. 
- - Linear layers (``torch.nn.Linear``), convolutional layers - (``torch.nn.Conv2d``), attention layers, general tensor operations that - involve matrix products, such as ``torch.matmul``, ``torch.bmm``, and - more. - -Supported modules and data types -================================================================================ - -The following section outlines the supported data types, modules, and domain -libraries available in PyTorch on ROCm. - -Supported data types --------------------------------------------------------------------------------- - -The tensor data type is specified using the ``dtype`` attribute or argument. -PyTorch supports many data types for different use cases. - -The following table lists `torch.Tensor `__ -single data types: - -.. list-table:: - :header-rows: 1 - - * - Data type - - Description - * - ``torch.float8_e4m3fn`` - - 8-bit floating point, e4m3 - * - ``torch.float8_e5m2`` - - 8-bit floating point, e5m2 - * - ``torch.float16`` or ``torch.half`` - - 16-bit floating point - * - ``torch.bfloat16`` - - 16-bit floating point - * - ``torch.float32`` or ``torch.float`` - - 32-bit floating point - * - ``torch.float64`` or ``torch.double`` - - 64-bit floating point - * - ``torch.complex32`` or ``torch.chalf`` - - 32-bit complex numbers - * - ``torch.complex64`` or ``torch.cfloat`` - - 64-bit complex numbers - * - ``torch.complex128`` or ``torch.cdouble`` - - 128-bit complex numbers - * - ``torch.uint8`` - - 8-bit integer (unsigned) - * - ``torch.uint16`` - - 16-bit integer (unsigned); - Not natively supported in ROCm - * - ``torch.uint32`` - - 32-bit integer (unsigned); - Not natively supported in ROCm - * - ``torch.uint64`` - - 64-bit integer (unsigned); - Not natively supported in ROCm - * - ``torch.int8`` - - 8-bit integer (signed) - * - ``torch.int16`` or ``torch.short`` - - 16-bit integer (signed) - * - ``torch.int32`` or ``torch.int`` - - 32-bit integer (signed) - * - ``torch.int64`` or ``torch.long`` - - 64-bit integer (signed) - * - ``torch.bool`` - - Boolean - * - ``torch.quint8`` - - Quantized 8-bit integer (unsigned) - * - ``torch.qint8`` - - Quantized 8-bit integer (signed) - * - ``torch.qint32`` - - Quantized 32-bit integer (signed) - * - ``torch.quint4x2`` - - Quantized 4-bit integer (unsigned) - -.. note:: - - Unsigned types, except ``uint8``, have limited support in eager mode. They - primarily exist to assist usage with ``torch.compile``. - - See :doc:`ROCm precision support ` for the - native hardware support of data types. - -Supported modules --------------------------------------------------------------------------------- - -For a complete and up-to-date list of PyTorch core modules (for example., ``torch``, -``torch.nn``, ``torch.cuda``, ``torch.backends.cuda`` and -``torch.backends.cudnn``), their descriptions, and usage, please refer directly -to the `official PyTorch documentation `_. - -Core PyTorch functionality on ROCm includes tensor operations, neural network -layers, automatic differentiation, distributed training, mixed-precision -training, compilation features, and domain-specific libraries for audio, vision, -text processing, and more. - -Supported domain libraries --------------------------------------------------------------------------------- - -PyTorch offers specialized `domain libraries `_ with -GPU acceleration that build on its core features to support specific application -areas. The table below lists the PyTorch domain libraries that are compatible -with ROCm. - -.. 
list-table:: - :header-rows: 1 - - * - Library - - Description - - * - `torchaudio `_ - - Audio and signal processing library for PyTorch. Provides utilities for - audio I/O, signal and data processing functions, datasets, model - implementations, and application components for audio and speech - processing tasks. - - **Note:** To ensure GPU-acceleration with ``torchaudio.transforms``, - you need to explicitly move audio data (waveform tensor) to GPU using - ``.to('cuda')``. - - * - `torchtune `_ - - PyTorch-native library designed for fine-tuning large language models - (LLMs). Provides supports the full fine-tuning workflow and offers - compatibility with popular production inference systems. - - **Note:** Only official release exists. - - * - `torchvision `_ - - Computer vision library that is part of the PyTorch project. Provides - popular datasets, model architectures, and common image transformations - for computer vision applications. - - * - `torchtext `_ - - Text processing library for PyTorch. Provides data processing utilities - and popular datasets for natural language processing, including - tokenization, vocabulary management, and text embeddings. - - **Note:** ``torchtext`` does not implement ROCm-specific kernels. - ROCm acceleration is provided through the underlying PyTorch framework - and ROCm library integration. Only official release exists. - - * - `torchdata `_ - - Beta library of common modular data loading primitives for easily - constructing flexible and performant data pipelines, with features still - in prototype stage. - - * - `torchrec `_ - - PyTorch domain library for common sparsity and parallelism primitives - needed for large-scale recommender systems, enabling authors to train - models with large embedding tables shared across many GPUs. - - **Note:** ``torchrec`` does not implement ROCm-specific kernels. ROCm - acceleration is provided through the underlying PyTorch framework and - ROCm library integration. - - * - `torchserve `_ - - Performant, flexible and easy-to-use tool for serving PyTorch models in - production, providing features for model management, batch processing, - and scalable deployment. - - **Note:** `torchserve `_ is no longer - actively maintained. Last official release is sent out with PyTorch 2.4. - - * - `torchrl `_ - - Open-source, Python-first Reinforcement Learning library for PyTorch - with a focus on high modularity and good runtime performance, providing - low and high-level RL abstractions and reusable functionals for cost - functions, returns, and data processing. - - **Note:** Only official release exists. - - * - `tensordict `_ - - Dictionary-like class that simplifies operations on batches of tensors, - enhancing code readability, compactness, and modularity by abstracting - tailored operations and reducing errors through automatic operation - dispatching. - - **Note:** Only official release exists. - -Key features and enhancements for PyTorch 2.7 with ROCm 7.0 -================================================================================ - -- Enhanced TunableOp framework: Introduces ``tensorfloat32`` support for - TunableOp operations, improved offline tuning for ScaledGEMM operations, - submatrix offline tuning capabilities, and better logging for BLAS operations - without bias vectors. - -- Expanded GPU architecture support: Provides optimized support for newer GPU - architectures, including gfx1200 and gfx1201 with preferred hipBLASLt backend - selection, along with improvements for gfx950 and gfx1100 series GPUs. 
- -- Advanced Triton Integration: AOTriton 0.10b introduces official support for - gfx950 and gfx1201, along with experimental support for gfx1101, gfx1151, - gfx1150, and gfx1200. - -- Improved element-wise kernel performance: Delivers enhanced vectorized - element-wise kernels with better support for heterogeneous tensor types and - optimized input vectorization for tensors with mixed data types. - -- MIOpen deep learning optimizations: Enables NHWC BatchNorm by default on - ROCm 7.0+, provides ``maxpool`` forward and backward performance improvements - targeting ResNet scenarios, and includes updated launch configurations for - better performance. - -- Enhanced memory and tensor operations: Features fixes for in-place ``aten`` - sum operations with specialized templated kernels, improved 3D tensor - performance with NHWC format, and better handling of memory-bound matrix - multiplication operations. - -- Robust testing and quality improvements: Includes comprehensive test suite - updates with improved tolerance handling for Navi3x architectures, generalized - ROCm-specific test conditions, and enhanced unit test coverage for Flash - Attention and Memory Efficient operations. - -- Build system and infrastructure improvements: Provides updated CentOS Stream 9 - support, improved Docker configuration, migration to public MAGMA repository, - and enhanced QA automation scripts for PyTorch unit testing. - -- Composable Kernel (CK) updates: Features updated CK submodule integration with - the latest optimizations and performance improvements for core mathematical - operations. - -- Development and debugging enhancements: Includes improved source handling for - dynamic compilation, better error handling for atomic operations, and enhanced - state checking for trace operations. - -- Integrate APEX fused layer normalization, which can have positive impact on - text-to-video models. - -- Integrate APEX distributed fused LAMB and distributed fused ADAM, which can - have positive impact on BERT-L and Llama2-SFT. - -- FlashAttention v3 has been integrated for AMD GPUs. - -- `Pytorch C++ extensions `_ - provide a mechanism for compiling custom operations that can be used during - network training or inference. For AMD platforms, ``amdclang++`` has been - validated as the supported compiler for building these extensions. - -Known issues and notes for PyTorch 2.7 with ROCm 7.0 -================================================================================ - -- The ``matmul.allow_fp16_reduced_precision_reduction`` and - ``matmul.allow_bf16_reduced_precision_reduction`` options under - ``torch.backends.cuda`` are not supported. As a result, - reduced-precision reductions using FP16 or BF16 accumulation types are not - available. diff --git a/docs/compatibility/ml-compatibility/ray-compatibility.rst b/docs/compatibility/ml-compatibility/ray-compatibility.rst deleted file mode 100644 index 2f5c83589..000000000 --- a/docs/compatibility/ml-compatibility/ray-compatibility.rst +++ /dev/null @@ -1,111 +0,0 @@ -:orphan: - -.. meta:: - :description: Ray deep learning framework compatibility - :keywords: GPU, Ray compatibility - -.. version-set:: rocm_version latest - -******************************************************************************* -Ray compatibility -******************************************************************************* - -Ray is a unified framework for scaling AI and Python applications from your laptop -to a full cluster, without changing your code. 
Ray consists of `a core distributed -runtime `_ and a set of -`AI libraries `_ for -simplifying machine learning computations. - -Ray is a general-purpose framework that runs many types of workloads efficiently. -Any Python application can be scaled with Ray, without extra infrastructure. - -ROCm support for Ray is upstreamed, and you can build the official source code -with ROCm support: - -- ROCm support for Ray is hosted in the official `https://github.com/ROCm/ray - `_ repository. - -- Due to independent compatibility considerations, this location differs from the - `https://github.com/ray-project/ray `_ upstream repository. - -- To install Ray, use the prebuilt :ref:`Docker image ` - which includes ROCm, Ray, and all required dependencies. - - - See the :doc:`ROCm Ray installation guide ` - for instructions to get started. - - - See the `Installation section `_ - in the upstream Ray documentation. - - - The Docker image provided is based on the upstream Ray `Daily Release (Nightly) wheels `__ - corresponding to commit `005c372 `__. - -.. note:: - - Ray is supported on ROCm 6.4.1. - -Supported devices -================================================================================ - -**Officially Supported**: AMD Instinct™ MI300X, MI210 - - -Use cases and recommendations -================================================================================ - -* The `Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm - Integration `__ - blog provides an overview of Volcano Engine Reinforcement Learning (verl) - for large language models (LLMs) and discusses its benefits in large-scale - reinforcement learning from human feedback (RLHF). It uses Ray as part of a - hybrid orchestration engine to schedule and coordinate training and inference - tasks in parallel, enabling optimized resource utilization and potential overlap - between these phases. This dynamic resource allocation strategy significantly - improves overall system efficiency. The blog presents verl’s performance results, - focusing on throughput and convergence accuracy achieved on AMD Instinct™ MI300X - GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and - accelerate your RLHF training with ROCm-optimized performance. - -* The `Exploring Use Cases for Scalable AI: Implementing Ray with ROCm Support for Efficient ML Workflows - `__ - blog post describes key use cases such as training and inference for large language models (LLMs), - model serving, hyperparameter tuning, reinforcement learning, and the orchestration of large-scale - workloads using Ray in the ROCm environment. - -For more use cases and recommendations, see the AMD GPU tabs in the `Accelerator Support -topic `__ -of the Ray core documentation and refer to the `AMD ROCm blog `__, -where you can search for Ray examples and best practices to optimize your workloads on AMD GPUs. - -.. _ray-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes ready-made `ROCm Ray Docker images `__ -with ROCm backends on Docker Hub. The following Docker image tags and -associated inventories represent the latest Ray version from the official Docker Hub and are validated for -`ROCm 6.4.1 `_. Click the |docker-icon| -icon to view the image on Docker Hub. - -.. list-table:: - :header-rows: 1 - :class: docker-image-compatibility - - * - Docker image - - Ray - - Pytorch - - Ubuntu - - Python - - * - .. 
raw:: html - - rocm/ray - - `2.48.0.post0 `_ - - 2.6.0+git684f6f2 - - 24.04 - - `3.12.10 `_ diff --git a/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst b/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst deleted file mode 100644 index 1550a82d1..000000000 --- a/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst +++ /dev/null @@ -1,100 +0,0 @@ -:orphan: - -.. meta:: - :description: Stanford Megatron-LM compatibility - :keywords: Stanford, Megatron-LM, compatibility - -.. version-set:: rocm_version latest - -******************************************************************************** -Stanford Megatron-LM compatibility -******************************************************************************** - -Stanford Megatron-LM is a large-scale language model training framework developed by NVIDIA `https://github.com/NVIDIA/Megatron-LM `_. It is -designed to train massive transformer-based language models efficiently by model and data parallelism. - -* ROCm support for Stanford Megatron-LM is hosted in the official `https://github.com/ROCm/Stanford-Megatron-LM `_ repository. -* Due to independent compatibility considerations, this location differs from the `https://github.com/stanford-futuredata/Megatron-LM `_ upstream repository. -* Use the prebuilt :ref:`Docker image ` with ROCm, PyTorch, and Megatron-LM preinstalled. -* See the :doc:`ROCm Stanford Megatron-LM installation guide ` to install and get started. - -.. note:: - - Stanford Megatron-LM is supported on ROCm 6.3.0. - - -Supported Devices -================================================================================ - -- **Officially Supported**: AMD Instinct MI300X -- **Partially Supported** (functionality or performance limitations): AMD Instinct MI250X, MI210 - - -Supported models and features -================================================================================ - -This section details models & features that are supported by the ROCm version on Stanford Megatron-LM. - -Models: - -* Bert -* GPT -* T5 -* ICT - -Features: - -* Distributed Pre-training -* Activation Checkpointing and Recomputation -* Distributed Optimizer -* Mixture-of-Experts - -.. _megatron-lm-recommendations: - -Use cases and recommendations -================================================================================ - -See the `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs blog `_ post -to leverage the ROCm platform for pre-training by using the Stanford Megatron-LM framework of pre-processing datasets on AMD GPUs. -Coverage includes: - - * Single-GPU pre-training - * Multi-GPU pre-training - - -.. _megatron-lm-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes `Stanford Megatron-LM images `_ -with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated -inventories represent the latest Megatron-LM version from the official Docker Hub. -The Docker images have been validated for `ROCm 6.3.0 `_. -Click |docker-icon| to view the image on Docker Hub. - -.. list-table:: - :header-rows: 1 - :class: docker-image-compatibility - - * - Docker image - - Stanford Megatron-LM - - PyTorch - - Ubuntu - - Python - - * - .. 
raw:: html - - - - - `85f95ae `_ - - `2.4.0 `_ - - 24.04 - - `3.12.9 `_ - - - diff --git a/docs/compatibility/ml-compatibility/taichi-compatibility.rst b/docs/compatibility/ml-compatibility/taichi-compatibility.rst deleted file mode 100644 index 58bbbd4f5..000000000 --- a/docs/compatibility/ml-compatibility/taichi-compatibility.rst +++ /dev/null @@ -1,76 +0,0 @@ -:orphan: - -.. meta:: - :description: Taichi compatibility - :keywords: GPU, Taichi compatibility - -.. version-set:: rocm_version latest - -******************************************************************************* -Taichi compatibility -******************************************************************************* - -`Taichi `_ is an open-source, imperative, and parallel -programming language designed for high-performance numerical computation. -Embedded in Python, it leverages just-in-time (JIT) compilation frameworks such as LLVM to accelerate -compute-intensive Python code by compiling it to native GPU or CPU instructions. - -Taichi is widely used across various domains, including real-time physical simulation, -numerical computing, augmented reality, artificial intelligence, computer vision, robotics, -visual effects in film and gaming, and general-purpose computing. - -* ROCm support for Taichi is hosted in the official `https://github.com/ROCm/taichi `_ repository. -* Due to independent compatibility considerations, this location differs from the `https://github.com/taichi-dev `_ upstream repository. -* Use the prebuilt :ref:`Docker image ` with ROCm, PyTorch, and Taichi preinstalled. -* See the :doc:`ROCm Taichi installation guide ` to install and get started. - -.. note:: - - Taichi is supported on ROCm 6.3.2. - -Supported devices and features -=============================================================================== -There is support through the ROCm software stack for all Taichi GPU features on AMD Instinct MI250X and MI210X series GPUs with the exception of Taichi’s GPU rendering system, CGUI. -AMD Instinct MI300X series GPUs will be supported by November. - -.. _taichi-recommendations: - -Use cases and recommendations -================================================================================ -To fully leverage Taichi's performance capabilities in compute-intensive tasks, it is best to adhere to specific coding patterns and utilize Taichi decorators. -A collection of example use cases is available in the `https://github.com/ROCm/taichi_examples `_ repository, -providing practical insights and foundational knowledge for working with the Taichi programming language. -You can also refer to the `AMD ROCm blog `_ to search for Taichi examples and best practices to optimize your workflows on AMD GPUs. - -.. _taichi-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes ready-made `ROCm Taichi Docker images `_ -with ROCm backends on Docker Hub. The following Docker image tags and associated inventories -represent the latest Taichi version from the official Docker Hub. -The Docker images have been validated for `ROCm 6.3.2 `_. -Click |docker-icon| to view the image on Docker Hub. - -.. list-table:: - :header-rows: 1 - :class: docker-image-compatibility - - * - Docker image - - ROCm - - Taichi - - Ubuntu - - Python - - * - .. 
raw:: html - - rocm/taichi - - `6.3.2 `_ - - `1.8.0b1 `_ - - 22.04 - - `3.10.12 `_ \ No newline at end of file diff --git a/docs/compatibility/ml-compatibility/tensorflow-compatibility.rst b/docs/compatibility/ml-compatibility/tensorflow-compatibility.rst deleted file mode 100644 index 4f48d9f4c..000000000 --- a/docs/compatibility/ml-compatibility/tensorflow-compatibility.rst +++ /dev/null @@ -1,439 +0,0 @@ -:orphan: - -.. meta:: - :description: TensorFlow compatibility - :keywords: GPU, TensorFlow compatibility - -.. version-set:: rocm_version latest - -******************************************************************************* -TensorFlow compatibility -******************************************************************************* - -`TensorFlow `__ is an open-source library for -solving machine learning, deep learning, and AI problems. It can solve many -problems across different sectors and industries but primarily focuses on -neural network training and inference. It is one of the most popular and -in-demand frameworks and is very active in open-source contribution and -development. - -The `official TensorFlow repository `__ -includes full ROCm support. AMD maintains a TensorFlow `ROCm repository -`__ in order to quickly add bug -fixes, updates, and support for the latest ROCM versions. - -- ROCm TensorFlow release: - - - Offers :ref:`Docker images ` with - ROCm and TensorFlow pre-installed. - - - ROCm TensorFlow repository: ``__ - - - See the :doc:`ROCm TensorFlow installation guide ` - to get started. - -- Official TensorFlow release: - - - Official TensorFlow repository: ``__ - - - See the `TensorFlow API versions `__ list. - - .. note:: - - The official TensorFlow documentation does not cover ROCm support. Use the - ROCm documentation for installation instructions for Tensorflow on ROCm. - See :doc:`rocm-install-on-linux:install/3rd-party/tensorflow-install`. - -.. _tensorflow-docker-compat: - -Docker image compatibility -================================================================================ - -AMD provides preconfigured Docker images with TensorFlow and the ROCm backend. -These images are published on `Docker Hub `__ and are the -recommended way to get started with deep learning with TensorFlow on ROCm. - -To find the right image tag, see the :ref:`TensorFlow on ROCm installation -documentation ` for a list of -available ``rocm/tensorflow`` images. - - -Critical ROCm libraries for TensorFlow -=============================================================================== - -TensorFlow depends on multiple components and the supported features of those -components can affect the TensorFlow ROCm supported feature set. The versions -in the following table refer to the first TensorFlow version where the ROCm -library was introduced as a dependency. The versions described -are available in ROCm :version:`rocm_version`. - -.. list-table:: - :widths: 25, 10, 35, 30 - :header-rows: 1 - - * - ROCm library - - Version - - Purpose - - Used in - * - `hipBLAS `__ - - :version-ref:`hipBLAS rocm_version` - - Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for - matrix and vector operations. - - Accelerates operations like ``tf.matmul``, ``tf.linalg.matmul``, and - other matrix multiplications commonly used in neural network layers. - * - `hipBLASLt `__ - - :version-ref:`hipBLASLt rocm_version` - - Extends hipBLAS with additional optimizations like fused kernels and - integer tensor cores. 
- - Optimizes matrix multiplications and linear algebra operations used in - layers like dense, convolutional, and RNNs in TensorFlow. - * - `hipCUB `__ - - :version-ref:`hipCUB rocm_version` - - Provides a C++ template library for parallel algorithms for reduction, - scan, sort and select. - - Supports operations like ``tf.reduce_sum``, ``tf.cumsum``, ``tf.sort`` - and other tensor operations in TensorFlow, especially those involving - scanning, sorting, and filtering. - * - `hipFFT `__ - - :version-ref:`hipFFT rocm_version` - - Accelerates Fast Fourier Transforms (FFT) for signal processing tasks. - - Used for operations like signal processing, image filtering, and - certain types of neural networks requiring FFT-based transformations. - * - `hipSOLVER `__ - - :version-ref:`hipSOLVER rocm_version` - - Provides GPU-accelerated direct linear solvers for dense and sparse - systems. - - Optimizes linear algebra functions such as solving systems of linear - equations, often used in optimization and training tasks. - * - `hipSPARSE `__ - - :version-ref:`hipSPARSE rocm_version` - - Optimizes sparse matrix operations for efficient computations on sparse - data. - - Accelerates sparse matrix operations in models with sparse weight - matrices or activations, commonly used in neural networks. - * - `MIOpen `__ - - :version-ref:`MIOpen rocm_version` - - Provides optimized deep learning primitives such as convolutions, - pooling, - normalization, and activation functions. - - Speeds up convolutional neural networks (CNNs) and other layers. Used - in TensorFlow for layers like ``tf.nn.conv2d``, ``tf.nn.relu``, and - ``tf.nn.lstm_cell``. - * - `RCCL `__ - - :version-ref:`RCCL rocm_version` - - Optimizes for multi-GPU communication for operations like AllReduce and - Broadcast. - - Distributed data parallel training (``tf.distribute.MirroredStrategy``). - Handles communication in multi-GPU setups. - * - `rocThrust `__ - - :version-ref:`rocThrust rocm_version` - - Provides a C++ template library for parallel algorithms like sorting, - reduction, and scanning. - - Reduction operations like ``tf.reduce_sum``, ``tf.cumsum`` for computing - the cumulative sum of elements along a given axis or ``tf.unique`` to - finds unique elements in a tensor can use rocThrust. - -Supported and unsupported features -=============================================================================== - -The following section maps supported data types and GPU-accelerated TensorFlow -features to their minimum supported ROCm and TensorFlow versions. - -Data types -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The data type of a tensor is specified using the ``dtype`` attribute or -argument, and TensorFlow supports a wide range of data types for different use -cases. - -The basic, single data types of `tf.dtypes `__ -are as follows: - -.. list-table:: - :header-rows: 1 - - * - Data type - - Description - - Since TensorFlow - - Since ROCm - * - ``bfloat16`` - - 16-bit bfloat (brain floating point). - - 1.0.0 - - 1.7 - * - ``bool`` - - Boolean. - - 1.0.0 - - 1.7 - * - ``complex128`` - - 128-bit complex. - - 1.0.0 - - 1.7 - * - ``complex64`` - - 64-bit complex. - - 1.0.0 - - 1.7 - * - ``double`` - - 64-bit (double precision) floating-point. - - 1.0.0 - - 1.7 - * - ``float16`` - - 16-bit (half precision) floating-point. - - 1.0.0 - - 1.7 - * - ``float32`` - - 32-bit (single precision) floating-point. - - 1.0.0 - - 1.7 - * - ``float64`` - - 64-bit (double precision) floating-point. 
- - 1.0.0 - - 1.7 - * - ``half`` - - 16-bit (half precision) floating-point. - - 2.0.0 - - 2.0 - * - ``int16`` - - Signed 16-bit integer. - - 1.0.0 - - 1.7 - * - ``int32`` - - Signed 32-bit integer. - - 1.0.0 - - 1.7 - * - ``int64`` - - Signed 64-bit integer. - - 1.0.0 - - 1.7 - * - ``int8`` - - Signed 8-bit integer. - - 1.0.0 - - 1.7 - * - ``qint16`` - - Signed quantized 16-bit integer. - - 1.0.0 - - 1.7 - * - ``qint32`` - - Signed quantized 32-bit integer. - - 1.0.0 - - 1.7 - * - ``qint8`` - - Signed quantized 8-bit integer. - - 1.0.0 - - 1.7 - * - ``quint16`` - - Unsigned quantized 16-bit integer. - - 1.0.0 - - 1.7 - * - ``quint8`` - - Unsigned quantized 8-bit integer. - - 1.0.0 - - 1.7 - * - ``resource`` - - Handle to a mutable, dynamically allocated resource. - - 1.0.0 - - 1.7 - * - ``string`` - - Variable-length string, represented as byte array. - - 1.0.0 - - 1.7 - * - ``uint16`` - - Unsigned 16-bit (word) integer. - - 1.0.0 - - 1.7 - * - ``uint32`` - - Unsigned 32-bit (dword) integer. - - 1.5.0 - - 1.7 - * - ``uint64`` - - Unsigned 64-bit (qword) integer. - - 1.5.0 - - 1.7 - * - ``uint8`` - - Unsigned 8-bit (byte) integer. - - 1.0.0 - - 1.7 - * - ``variant`` - - Data of arbitrary type (known at runtime). - - 1.4.0 - - 1.7 - -Features -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This table provides an overview of key features in TensorFlow and their -availability in ROCm. - -.. list-table:: - :header-rows: 1 - - * - Module - - Description - - Since TensorFlow - - Since ROCm - * - ``tf.linalg`` (Linear Algebra) - - Operations for matrix and tensor computations, such as - ``tf.linalg.matmul`` (matrix multiplication), ``tf.linalg.inv`` - (matrix inversion) and ``tf.linalg.cholesky`` (Cholesky decomposition). - These leverage GPUs for high-performance linear algebra operations. - - 1.4 - - 1.8.2 - * - ``tf.nn`` (Neural Network Operations) - - GPU-accelerated building blocks for deep learning models, such as 2D - convolutions with ``tf.nn.conv2d``, max pooling operations with - ``tf.nn.max_pool``, activation functions like ``tf.nn.relu`` or softmax - for output layers with ``tf.nn.softmax``. - - 1.0 - - 1.8.2 - * - ``tf.image`` (Image Processing) - - GPU-accelerated functions for image preprocessing and augmentations, - such as resize images with ``tf.image.resize``, flip images horizontally - with ``tf.image.flip_left_right`` and adjust image brightness randomly - with ``tf.image.random_brightness``. - - 1.1 - - 1.8.2 - * - ``tf.keras`` (High-Level API) - - GPU acceleration for Keras layers and models, including dense layers - (``tf.keras.layers.Dense``), convolutional layers - (``tf.keras.layers.Conv2D``) and recurrent layers - (``tf.keras.layers.LSTM``). - - 1.4 - - 1.8.2 - * - ``tf.math`` (Mathematical Operations) - - GPU-accelerated mathematical operations, such as sum across dimensions - with ``tf.math.reduce_sum``, elementwise exponentiation with - ``tf.math.exp`` and sigmoid activation (``tf.math.sigmoid``). - - 1.5 - - 1.8.2 - * - ``tf.signal`` (Signal Processing) - - Functions for spectral analysis and signal transformations. - - 1.13 - - 2.1 - * - ``tf.data`` (Data Input Pipeline) - - GPU-accelerated data preprocessing for efficient input pipelines, - Prefetching with ``tf.data.experimental.AUTOTUNE``. GPU-enabled - transformations like map and batch. - - 1.4 - - 1.8.2 - * - ``tf.distribute`` (Distributed Training) - - Enabling to scale computations across multiple devices on a single - machine or across multiple machines. 
- - 1.13 - - 2.1 - * - ``tf.random`` (Random Number Generation) - - GPU-accelerated random number generation - - 1.12 - - 1.9.2 - * - ``tf.TensorArray`` (Dynamic Array Operations) - - Enables dynamic tensor manipulation on GPUs. - - 1.0 - - 1.8.2 - * - ``tf.sparse`` (Sparse Tensor Operations) - - GPU-accelerated sparse matrix manipulations. - - 1.9 - - 1.9.0 - * - ``tf.experimental.numpy`` - - GPU-accelerated NumPy-like API for numerical computations. - - 2.4 - - 4.1.1 - * - ``tf.RaggedTensor`` - - Handling of variable-length sequences and ragged tensors with GPU - support. - - 1.13 - - 2.1 - * - ``tf.function`` with XLA (Accelerated Linear Algebra) - - Enable GPU-accelerated functions in optimization. - - 1.14 - - 2.4 - * - ``tf.quantization`` - - Quantized operations for inference, accelerated on GPUs. - - 1.12 - - 1.9.2 - -Distributed library features -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Enables developers to scale computations across multiple devices on a single machine or -across multiple machines. - -.. list-table:: - :header-rows: 1 - - * - Feature - - Description - - Since TensorFlow - - Since ROCm - * - ``MultiWorkerMirroredStrategy`` - - Synchronous training across multiple workers using mirrored variables. - - 2.0 - - 3.0 - * - ``MirroredStrategy`` - - Synchronous training across multiple GPUs on one machine. - - 1.5 - - 2.5 - * - ``TPUStrategy`` - - Efficiently trains models on Google TPUs. - - 1.9 - - ❌ - * - ``ParameterServerStrategy`` - - Asynchronous training using parameter servers for variable management. - - 2.1 - - 4.0 - * - ``CentralStorageStrategy`` - - Keeps variables on a single device and performs computation on multiple - devices. - - 2.3 - - 4.1 - * - ``CollectiveAllReduceStrategy`` - - Synchronous training across multiple devices and hosts. - - 1.14 - - 3.5 - * - Distribution Strategies API - - High-level API to simplify distributed training configuration and - execution. - - 1.10 - - 3.0 - -Unsupported TensorFlow features -=============================================================================== - -The following are GPU-accelerated TensorFlow features not currently supported by -ROCm. - -.. list-table:: - :header-rows: 1 - - * - Feature - - Description - - Since TensorFlow - * - Mixed Precision with TF32 - - Mixed precision with TF32 is used for matrix multiplications, - convolutions, and other linear algebra operations, particularly in - deep learning workloads like CNNs and transformers. - - 2.4 - * - ``tf.distribute.TPUStrategy`` - - Efficiently trains models on Google TPUs. - - 1.9 - -Use cases and recommendations -=============================================================================== - -* The `Training a Neural Collaborative Filtering (NCF) Recommender on an AMD - GPU `__ - blog post discusses training an NCF recommender system using TensorFlow. It - explains how NCF improves traditional collaborative filtering methods by - leveraging neural networks to model non-linear user-item interactions. The - post outlines the implementation using the recommenders library, focusing on - the use of implicit data (for example, user interactions like viewing or - purchasing) and how it addresses challenges like the lack of negative values. - -* The `Creating a PyTorch/TensorFlow code environment on AMD GPUs - `__ - blog post provides instructions for creating a machine learning environment - for PyTorch and TensorFlow on AMD GPUs using ROCm. 
It covers steps like - installing the libraries, cloning code repositories, installing dependencies, - and troubleshooting potential issues with CUDA-based code. Additionally, it - explains how to HIPify code (port CUDA code to HIP) and manage Docker images - for a better experience on AMD GPUs. This guide aims to help data scientists - and ML practitioners adapt their code for AMD GPUs. - -For more use cases and recommendations, see the `ROCm Tensorflow blog posts `__. diff --git a/docs/compatibility/ml-compatibility/verl-compatibility.rst b/docs/compatibility/ml-compatibility/verl-compatibility.rst deleted file mode 100644 index 0351384e5..000000000 --- a/docs/compatibility/ml-compatibility/verl-compatibility.rst +++ /dev/null @@ -1,86 +0,0 @@ -:orphan: - -.. meta:: - :description: verl compatibility - :keywords: GPU, verl compatibility - -.. version-set:: rocm_version latest - -******************************************************************************* -verl compatibility -******************************************************************************* - -Volcano Engine Reinforcement Learning for LLMs (verl) is a reinforcement learning framework designed for large language models (LLMs). -verl offers a scalable, open-source fine-tuning solution optimized for AMD Instinct GPUs with full ROCm support. - -* See the `verl documentation `_ for more information about verl. -* The official verl GitHub repository is `https://github.com/volcengine/verl `_. -* Use the AMD-validated :ref:`Docker images ` with ROCm and verl preinstalled. -* See the :doc:`ROCm verl installation guide ` to install and get started. - -.. note:: - - verl is supported on ROCm 6.2.0. - -.. _verl-recommendations: - -Use cases and recommendations -================================================================================ - -The benefits of verl in large-scale reinforcement learning from human feedback (RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration `_ blog. - -.. _verl-supported_features: - -Supported features -=============================================================================== - -The following table shows verl on ROCm support for GPU-accelerated modules. - -.. list-table:: - :header-rows: 1 - - * - Module - - Description - - verl version - - ROCm version - * - ``FSDP`` - - Training engine - - 0.3.0.post0 - - 6.2.0 - * - ``vllm`` - - Inference engine - - 0.3.0.post0 - - 6.2.0 - -.. _verl-docker-compat: - -Docker image compatibility -================================================================================ - -.. |docker-icon| raw:: html - - - -AMD validates and publishes ready-made `ROCm verl Docker images `_ -with ROCm backends on Docker Hub. The following Docker image tags and associated inventories represent the available verl versions from the official Docker Hub. - -.. list-table:: - :header-rows: 1 - - * - Docker image - - ROCm - - verl - - Ubuntu - - Pytorch - - Python - - vllm - - * - .. raw:: html - - rocm/verl - - `6.2.0 `_ - - `0.3.0post0 `_ - - 20.04 - - `2.5.0 `_ - - `3.9.19 `_ - - `0.6.3 `_ diff --git a/docs/conceptual/ai-pytorch-inception.md b/docs/conceptual/ai-pytorch-inception.md deleted file mode 100644 index 9dc1b628c..000000000 --- a/docs/conceptual/ai-pytorch-inception.md +++ /dev/null @@ -1,1233 +0,0 @@ - - - - - - -# Deep learning: Inception V3 with PyTorch - -## Deep learning training - -Deep-learning models are designed to capture the complexity of the problem and the underlying data. 
These models are "deep," comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective. - -The training data consists of input features in supervised learning, similar to what the learned model is expected to see during the evaluation or inference phase. The target output is also included, which serves to teach the model. A loss metric is defined as part of training that evaluates the model's performance during the training process. - -Training also includes the choice of an optimization algorithm that reduces the loss by adjusting the model's parameters. Training is an iterative process where training data is fed in, usually split into different batches, with the entirety of the training data passed during one training epoch. Training usually is run for multiple epochs. - -## Training phases - -Training occurs in multiple phases for every batch of training data. the following table provides an explanation of the types of training phases. - -:::{table} Types of Training Phases -:name: training-phases -:widths: auto - -| Types of Phases | | -| ----------------- | --- | -| Forward Pass | The input features are fed into the model, whose parameters may be randomly initialized initially. Activations (outputs) of each layer are retained during this pass to help in the loss gradient computation during the backward pass. | -| Loss Computation | The output is compared against the target outputs, and the loss is computed. | -| Backward Pass | The loss is propagated backward, and the model's error gradients are computed and stored for each trainable parameter. | -| Optimization Pass | The optimization algorithm updates the model parameters using the stored error gradients. | -::: - -Training is different from inference, particularly from the hardware perspective. The following table shows the contrast between training and inference. - -:::{table} Training vs. Inference -:name: training-inference -:widths: auto - -| Training | Inference | -| ----------- | ----------- | -| Training is measured in hours/days. | The inference is measured in minutes. | -| Training is generally run offline in a data center or cloud setting. | The inference is made on edge devices. | -| The memory requirements for training are higher than inference due to storing intermediate data, such as activations and error gradients. | The memory requirements are lower for inference than training. | -| Data for training is available on the disk before the training process and is generally significant. The training performance is measured by how fast the data batches can be processed. | Inference data usually arrive stochastically, which may be batched to improve performance. Inference performance is generally measured in throughput speed to process the batch of data and the delay in responding to the input (latency). | -::: - -Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations from other data types, leading to improvement in performance if a faster datatype can be selected for the corresponding task. - -## Case studies - -The following sections contain case studies for the Inception V3 model. - -### Inception V3 with PyTorch - -Convolution Neural Networks are forms of artificial neural networks commonly used for image processing. 
One of the core layers of such a network is the convolutional layer, which convolves the input with a weight tensor and passes the result to the next layer. Inception V3 is an architectural development over the ImageNet competition-winning entry, AlexNet, using more profound and broader networks while attempting to meet computational and memory budgets. - -The implementation uses PyTorch as a framework. This case study utilizes [TorchVision](https://pytorch.org/vision/stable/index.html), a repository of popular datasets and model architectures, for obtaining the model. TorchVision also provides pre-trained weights as a starting point to develop new models or fine-tune the model for a new task. - -#### Evaluating a pre-trained model - -The Inception V3 model introduces a simple image classification task with the pre-trained model. This does not involve training but utilizes an already pre-trained model from TorchVision. - -This example is adapted from the PyTorch research hub page on [Inception V3](https://pytorch.org/vision/master/models/inception.html). - -Follow these steps: - -1. Run the PyTorch ROCm-based Docker image or refer to the section {doc}`Installing PyTorch ` for setting up a PyTorch environment on ROCm. - - ```dockerfile - docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest - ``` - -2. Run the Python shell and import packages and libraries for model creation. - - ```py - import torch - import torchvision - ``` - -3. Set the model in evaluation mode. Evaluation mode directs PyTorch not to store intermediate data, which would have been used in training. - - ```py - model = torch.hub.load('pytorch/vision:v0.10.0', 'inception_v3', pretrained=True) - model.eval() - ``` - -4. Download a sample image for inference. - - ```py - import urllib - url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg") - try: urllib.URLopener().retrieve(url, filename) - except: urllib.request.urlretrieve(url, filename) - ``` - -5. Import torchvision and PILImage support libraries. - - ```py - from PIL import Image - from torchvision import transforms - input_image = Image.open(filename) - ``` - -6. Apply preprocessing and normalization. - - ```py - preprocess = transforms.Compose([ - transforms.Resize(299), - transforms.CenterCrop(299), - transforms.ToTensor(), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) - ``` - -7. Use input tensors and unsqueeze them later. - - ```py - input_tensor = preprocess(input_image) - input_batch = input_tensor.unsqueeze(0) - if torch.cuda.is_available(): - input_batch = input_batch.to('cuda') - model.to('cuda') - ``` - -8. Find out probabilities. - - ```py - with torch.no_grad(): - output = model(input_batch) - print(output[0]) - probabilities = torch.nn.functional.softmax(output[0], dim=0) - print(probabilities) - ``` - -9. To understand the probabilities, download and examine the ImageNet labels. - - ```py - wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt - ``` - -10. Read the categories and show the top categories for the image. 
- - ```py - with open("imagenet_classes.txt", "r") as f: - categories = [s.strip() for s in f.readlines()] - top5_prob, top5_catid = torch.topk(probabilities, 5) - for i in range(top5_prob.size(0)): - print(categories[top5_catid[i]], top5_prob[i].item()) - ``` - -#### Training Inception V3 - -The previous section focused on downloading and using the Inception V3 model for a simple image classification task. This section walks through training the model on a new dataset. - -Follow these steps: - -1. Run the PyTorch ROCm Docker image or refer to the section {doc}`Installing PyTorch ` for setting up a PyTorch environment on ROCm. - - ```dockerfile - docker pull rocm/pytorch:latest - docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest - ``` - -2. Download an ImageNet database. For this example, the `tiny-imagenet-200`, a smaller ImageNet variant with 200 image classes and a training dataset with 100,000 images, was downsized to 64x64 color images. - - ```bash - wget http://cs231n.stanford.edu/tiny-imagenet-200.zip - ``` - -3. Process the database to set the validation directory to the format expected by PyTorch's `DataLoader`. - -4. Run the following script: - - ```py - import io - import glob - import os - from shutil import move - from os.path import join - from os import listdir, rmdir - target_folder = './tiny-imagenet-200/val/' - val_dict = {} - with open('./tiny-imagenet-200/val/val_annotations.txt', 'r') as f: - for line in f.readlines(): - split_line = line.split('\t') - val_dict[split_line[0]] = split_line[1] - - paths = glob.glob('./tiny-imagenet-200/val/images/*') - for path in paths: - file = path.split('/')[-1] - folder = val_dict[file] - if not os.path.exists(target_folder + str(folder)): - os.mkdir(target_folder + str(folder)) - os.mkdir(target_folder + str(folder) + '/images') - - for path in paths: - file = path.split('/')[-1] - folder = val_dict[file] - dest = target_folder + str(folder) + '/images/' + str(file) - move(path, dest) - - rmdir('./tiny-imagenet-200/val/images') - ``` - -5. Open a Python shell. - -6. Import dependencies, including Torch, OS, and [TorchVision](https://github.com/pytorch/vision). - - ```py - import torch - import os - import torchvision - from torchvision import transforms - from torchvision.transforms.functional import InterpolationMode - ``` - -7. Set parameters to guide the training process. - - :::{note} - The device is set to `"cuda"`. In PyTorch, `"cuda"` is a generic keyword to denote a GPU. - ::: - - ```py - device = "cuda" - ``` - -8. Set the data_path to the location of the training and validation data. In this case, the `tiny-imagenet-200` is present as a subdirectory to the current directory. - - ```py - data_path = "tiny-imagenet-200" - ``` - - The training image size is cropped for input into Inception V3. - - ```py - train_crop_size = 299 - ``` - -9. To smooth the image, use bilinear interpolation, a resampling method that uses the distance weighted average of the four nearest pixel values to estimate a new pixel value. - - ```py - interpolation = "bilinear" - ``` - - The next parameters control the size to which the validation image is cropped and resized. - - ```py - val_crop_size = 299 - val_resize_size = 342 - ``` - - The pre-trained Inception V3 model is chosen to be downloaded from torchvision. 
- - ```py - model_name = "inception_v3" - pretrained = True - ``` - - During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined. - - ```py - batch_size = 32 - ``` - - This refers to the number of CPU threads the data loader uses to perform efficient multi-process data loading. - - ```py - num_workers = 16 - ``` - - The `torch.optim` package provides methods to adjust the learning rate as the training progresses. This example uses the `StepLR` scheduler, which decays the learning rate by `lr_gamma` at every `lr_step_size` number of epochs. - - ```py - learning_rate = 0.1 - momentum = 0.9 - weight_decay = 1e-4 - lr_step_size = 30 - lr_gamma = 0.1 - ``` - - :::{note} - One training epoch is when the neural network passes an entire dataset forward and backward. - ::: - - ```py - epochs = 90 - ``` - - The train and validation directories are determined. - - ```py - train_dir = os.path.join(data_path, "train") - val_dir = os.path.join(data_path, "val") - ``` - -10. Set up the training and testing data loaders. - - ```py - interpolation = InterpolationMode(interpolation) - - TRAIN_TRANSFORM_IMG = transforms.Compose([ - Normalizaing and standardardizing the image - transforms.RandomResizedCrop(train_crop_size, interpolation=interpolation), - transforms.PILToTensor(), - transforms.ConvertImageDtype(torch.float), - transforms.Normalize(mean=[0.485, 0.456, 0.406], - std=[0.229, 0.224, 0.225] ) - ]) - dataset = torchvision.datasets.ImageFolder( - train_dir, - transform=TRAIN_TRANSFORM_IMG - ) - TEST_TRANSFORM_IMG = transforms.Compose([ - transforms.Resize(val_resize_size, interpolation=interpolation), - transforms.CenterCrop(val_crop_size), - transforms.PILToTensor(), - transforms.ConvertImageDtype(torch.float), - transforms.Normalize(mean=[0.485, 0.456, 0.406], - std=[0.229, 0.224, 0.225] ) - ]) - - dataset_test = torchvision.datasets.ImageFolder( - val_dir, - transform=TEST_TRANSFORM_IMG - ) - - print("Creating data loaders") - train_sampler = torch.utils.data.RandomSampler(dataset) - test_sampler = torch.utils.data.SequentialSampler(dataset_test) - - data_loader = torch.utils.data.DataLoader( - dataset, - batch_size=batch_size, - sampler=train_sampler, - num_workers=num_workers, - pin_memory=True - ) - - data_loader_test = torch.utils.data.DataLoader( - dataset_test, batch_size=batch_size, sampler=test_sampler, num_workers=num_workers, pin_memory=True - ) - ``` - - :::{note} - Use torchvision to obtain the Inception V3 model. Use the pre-trained model weights to speed up training. - ::: - - ```py - print("Creating model") - print("Num classes = ", len(dataset.classes)) - model = torchvision.models.__dict__[model_name](pretrained=pretrained) - ``` - -11. Adapt Inception V3 for the current dataset. `tiny-imagenet-200` contains only 200 classes, whereas Inception V3 is designed for 1,000-class output. The last layer of Inception V3 is replaced to match the output features required. - - ```py - model.fc = torch.nn.Linear(model.fc.in_features, len(dataset.classes)) - model.aux_logits = False - model.AuxLogits = None - ``` - -12. Move the model to the GPU device. - - ```py - model.to(device) - ``` - -13. Set the loss criteria. For this example, Cross Entropy Loss is used. - - ```py - criterion = torch.nn.CrossEntropyLoss() - ``` - -14. Set the optimizer to Stochastic Gradient Descent. 
- - ```py - optimizer = torch.optim.SGD( - model.parameters(), - lr=learning_rate, - momentum=momentum, - weight_decay=weight_decay - ) - ``` - -15. Set the learning rate scheduler. - - ```py - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_step_size, gamma=lr_gamma) - ``` - -16. Iterate over epochs. Each epoch is a complete pass through the training data. - - ```py - print("Start training") - for epoch in range(epochs): - model.train() - epoch_loss = 0 - len_dataset = 0 - ``` - -17. Iterate over steps. The data is processed in batches, and each step passes through a full batch. - - ```py - for step, (image, target) in enumerate(data_loader): - ``` - -18. Pass the image and target to the GPU device. - - ```py - image, target = image.to(device), target.to(device) - ``` - - The following is the core training logic: - - a. The image is fed into the model. - - b. The output is compared with the target in the training data to obtain the loss. - - c. This loss is back propagated to all parameters that require optimization. - - d. The optimizer updates the parameters based on the selected optimization algorithm. - - ```py - output = model(image) - loss = criterion(output, target) - optimizer.zero_grad() - loss.backward() - optimizer.step() - ``` - - The epoch loss is updated, and the step loss prints. - - ```py - epoch_loss += output.shape[0] * loss.item() - len_dataset += output.shape[0]; - if step % 10 == 0: - print('Epoch: ', epoch, '| step : %d' % step, '| train loss : %0.4f' % loss.item() ) - epoch_loss = epoch_loss / len_dataset - print('Epoch: ', epoch, '| train loss : %0.4f' % epoch_loss ) - ``` - - The learning rate is updated at the end of each epoch. - - ```py - lr_scheduler.step() - ``` - - After training for the epoch, the model evaluates against the validation dataset. - - ```py - model.eval() - with torch.inference_mode(): - running_loss = 0 - for step, (image, target) in enumerate(data_loader_test): - image, target = image.to(device), target.to(device) - - output = model(image) - loss = criterion(output, target) - - running_loss += loss.item() - running_loss = running_loss / len(data_loader_test) - print('Epoch: ', epoch, '| test loss : %0.4f' % running_loss ) - ``` - -19. Save the model for use in inferencing tasks. - -```py -# save model -torch.save(model.state_dict(), "trained_inception_v3.pt") -``` - -Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in the following image. - -![Inception V3 train and loss graph](../data/conceptual/inception-v3.png "Inception V3 train and loss") - -### Custom model with CIFAR-10 on PyTorch - -The Canadian Institute for Advanced Research (CIFAR)-10 dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below. - -Follow these steps: - -1. Import dependencies, including Torch, OS, and [TorchVision](https://github.com/pytorch/vision). 
- - ```py - import torch - import torchvision - import torchvision.transforms as transforms - import matplotlib.pyplot as plot - import numpy as np - ``` - -2. The output of torchvision datasets is `PILImage` images of range [0, 1]. Transform them to Tensors of normalized range [-1, 1]. - - ```py - transform = transforms.Compose( - [transforms.ToTensor(), - transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) - ``` - - During each training step, a batch of images is processed to compute the loss gradient and perform the optimization. In the following setting, the size of the batch is determined. - - ```py - batch_size = 4 - ``` - -3. Download the dataset train and test datasets as follows. Specify the batch size, shuffle the dataset once, and specify the number of workers to the number of CPU threads used by the data loader to perform efficient multi-process data loading. - - ```py - train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform) - train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2) - ``` - -4. Follow the same procedure for the testing set. - - ```py - test_set = TorchVision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform) - test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=2) - print ("teast set and test loader") - ``` - -5. Specify the defined classes of images belonging to this dataset. - - ```py - classes = ('Aeroplane', 'motorcar', 'bird', 'cat', 'deer', 'puppy', 'frog', 'stallion', 'cruise', 'truck') - print("defined classes") - ``` - -6. Denormalize the images and then iterate over them. - - ```py - global image_number - image_number = 0 - def show_image(img): - global image_number - image_number = image_number + 1 - img = img / 2 + 0.5 # de-normalizing input image - npimg = img.numpy() - plot.imshow(np.transpose(npimg, (1, 2, 0))) - plot.savefig("fig{}.jpg".format(image_number)) - print("fig{}.jpg".format(image_number)) - plot.show() - data_iter = iter(train_loader) - images, labels = data_iter.next() - show_image(torchvision.utils.make_grid(images)) - print(' '.join('%5s' % classes[labels[j]] for j in range(batch_size))) - print("image created and saved ") - ``` - -7. Import the `torch.nn` for constructing neural networks and `torch.nn.functional` to use the convolution functions. - - ```py - import torch.nn as nn - import torch.nn.functional as F - ``` - -8. Define the CNN (Convolution Neural Networks) and relevant activation functions. - - ```py - class Net(nn.Module): - def __init__(self): - super().__init__() - self.conv1 = nn.Conv2d(3, 6, 5) - self.pool = nn.MaxPool2d(2, 2) - self.conv2 = nn.Conv2d(6, 16, 5) - self.pool = nn.MaxPool2d(2, 2) - self.conv3 = nn.Conv2d(3, 6, 5) - self.fc2 = nn.Linear(120, 84) - self.fc3 = nn.Linear(84, 10) - - def forward(self, x): - x = self.pool(F.relu(self.conv1(x))) - x = self.pool(F.relu(self.conv2(x))) - x = torch.flatten(x, 1) # flatten all dimensions except batch - x = F.relu(self.fc1(x)) - x = F.relu(self.fc2(x)) - x = self.fc3(x) - return x - net = Net() - print("created Net() ") - ``` - -9. Set the optimizer to Stochastic Gradient Descent. - - ```py - import torch.optim as optim - ``` - -10. Set the loss criteria. For this example, Cross Entropy Loss is used. - - ```py - criterion = nn.CrossEntropyLoss() - optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9) - ``` - -11. Iterate over epochs. 
Each epoch is a complete pass through the training data. - - ```py - for epoch in range(2): # loop over the dataset multiple times - - running_loss = 0.0 - for i, data in enumerate(train_loader, 0): - # get the inputs; data is a list of [inputs, labels] - inputs, labels = data - - # zero the parameter gradients - optimizer.zero_grad() - - # forward + backward + optimize - outputs = net(inputs) - loss = criterion(outputs, labels) - loss.backward() - optimizer.step() - - # print statistics - running_loss += loss.item() - if i % 2000 == 1999: # print every 2000 mini-batches - print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000)) - running_loss = 0.0 - print('Finished Training') - ``` - - ```py - PATH = './cifar_net.pth' - torch.save(net.state_dict(), PATH) - print("saved model to path :",PATH) - net = Net() - net.load_state_dict(torch.load(PATH)) - print("loding back saved model") - outputs = net(images) - _, predicted = torch.max(outputs, 1) - print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4))) - correct = 0 - total = 0 - ``` - - As this is not training, calculating the gradients for outputs is not required. - - ```py - # calculate outputs by running images through the network - with torch.no_grad(): - for data in test_loader: - images, labels = data - # calculate outputs by running images through the network - outputs = net(images) - # the class with the highest energy is what you can choose as prediction - _, predicted = torch.max(outputs.data, 1) - total += labels.size(0) - correct += (predicted == labels).sum().item() - print('Accuracy of the network on the 10000 test images: %d %%' % ( 100 * correct / total)) - # prepare to count predictions for each class - correct_pred = {classname: 0 for classname in classes} - total_pred = {classname: 0 for classname in classes} - ``` - - ```py - # again no gradients needed - with torch.no_grad(): - for data in test_loader: - images, labels = data - outputs = net(images) - _, predictions = torch.max(outputs, 1) - # collect the correct predictions for each class - for label, prediction in zip(labels, predictions): - if label == prediction: - correct_pred[classes[label]] += 1 - total_pred[classes[label]] += 1 - # print accuracy for each class - for classname, correct_count in correct_pred.items(): - accuracy = 100 * float(correct_count) / total_pred[classname] - print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy)) - ``` - -### Case study: TensorFlow with Fashion-MNIST - -Fashion-MNIST is a dataset that contains 70,000 grayscale images in 10 categories. - -Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts. - -The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion-MNIST dataset can be accessed via TensorFlow internal libraries. - -Access the source code from the following repository: - -[https://github.com/ROCm/tensorflow_fashionmnist/blob/main/fashion_mnist.py](https://github.com/ROCm/tensorflow_fashionmnist/blob/main/fashion_mnist.py) - -To understand the code step by step, follow these steps: - -1. Import libraries like TensorFlow, NumPy, and Matplotlib to train the neural network and calculate and plot graphs. - - ```py - import tensorflow as tf - import numpy as np - import matplotlib.pyplot as plt - ``` - -2. 
To verify that TensorFlow is installed, print the version of TensorFlow by using the below print statement: - - ```py - print(tf._version__) r - ``` - -3. Load the dataset from the available internal libraries to analyze and train a neural network upon the Fashion-MNIST dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn. - -4. The model is tested against the test set, test_images, and test_labels arrays. - - ```py - fashion_mnist = tf.keras.datasets.fashion_mnist - (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data() - ``` - - Since you have 10 types of images in the dataset, assign labels from zero to nine. Each image is assigned one label. The images are 28x28 NumPy arrays, with pixel values ranging from zero to 255. - -5. Each image is mapped to a single label. Since the class names are not included with the dataset, store them, and later use them when plotting the images: - - ```py - class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat','Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'] - ``` - -6. Use this code to explore the dataset by knowing its dimensions: - - ```py - train_images.shape - ``` - -7. Use this code to print the size of this training set: - - ```py - print(len(train_labels)) - ``` - -8. Use this code to print the labels of this training set: - - ```py - print(train_labels) - ``` - -9. Preprocess the data before training the network, and you can start inspecting the first image, as its pixels will fall in the range of zero to 255. - - ```py - plt.figure() - plt.imshow(train_images[0]) - plt.colorbar() - plt.grid(False) - plt.show() - ``` - - ![ ](../data/conceptual/mnist-1.png) - -10. From the above picture, you can see that values are from zero to 255. Before training this on the neural network, you must bring them in the range of zero to one. Hence, divide the values by 255. - - ```py - train_images = train_images / 255.0 - - test_images = test_images / 255.0 - ``` - -11. To ensure the data is in the correct format and ready to build and train the network, display the first 25 images from the training set and the class name below each image. - - ```py - plt.figure(figsize=(10,10)) - for i in range(25): - plt.subplot(5,5,i+1) - plt.xticks([]) - plt.yticks([]) - plt.grid(False) - plt.imshow(train_images[i], cmap=plt.cm.binary) - plt.xlabel(class_names[train_labels[i]]) - plt.show() - ``` - - ![ ](../data/conceptual/mnist-2.png) - - The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training. - - ```py - model = tf.keras.Sequential([ - tf.keras.layers.Flatten(input_shape=(28, 28)), - tf.keras.layers.Dense(128, activation='relu'), - tf.keras.layers.Dense(10) - ]) - ``` - - * The first layer in this network `tf.keras.layers.Flatten` transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data. - - * After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected or fully connected neural layers. 
The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes. - -12. You must add the Loss function, Metrics, and Optimizer at the time of model compilation. - - ```py - model.compile(optimizer='adam', - loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), - metrics=['accuracy']) - ``` - - * Loss function —This measures how accurate the model is during training when you are looking to minimize this function to "steer" the model in the right direction. - - * Optimizer —This is how the model is updated based on the data it sees and its loss function. - - * Metrics —This is used to monitor the training and testing steps. - - The following example uses accuracy, the fraction of the correctly classified images. - - To train the neural network model, follow these steps: - - 1. Feed the training data to the model. The training data is in the train_images and train_labels arrays in this example. The model learns to associate images and labels. - - 2. Ask the model to make predictions about a test set—in this example, the test_images array. - - 3. Verify that the predictions match the labels from the test_labels array. - - 4. To start training, call the model.fit method because it "fits" the model to the training data. - - ```py - model.fit(train_images, train_labels, epochs=10) - ``` - - 5. Compare how the model will perform on the test dataset. - - ```py - test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2) - - print('\nTest accuracy:', test_acc) - ``` - - 6. With the model trained, you can use it to make predictions about some images: the model's linear outputs and logits. Attach a softmax layer to convert the logits to probabilities, making it easier to interpret. - - ```py - probability_model = tf.keras.Sequential([model, - tf.keras.layers.Softmax()]) - - predictions = probability_model.predict(test_images) - ``` - - 7. The model has predicted the label for each image in the testing set. Look at the first prediction: - - ```py - predictions[0] - ``` - - A prediction is an array of 10 numbers. They represent the model's "confidence" that the image corresponds to each of the 10 different articles of clothing. You can see which label has the highest confidence value: - - ```py - np.argmax(predictions[0]) - ``` - - 8. Plot a graph to look at the complete set of 10 class predictions. - - ```py - def plot_image(i, predictions_array, true_label, img): - true_label, img = true_label[i], img[i] - plt.grid(False) - plt.xticks([]) - plt.yticks([]) - - plt.imshow(img, cmap=plt.cm.binary) - - predicted_label = np.argmax(predictions_array) - if predicted_label == true_label: - color = 'blue' - else: - color = 'red' - - plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label], - 100*np.max(predictions_array), - class_names[true_label]), - color=color) - - def plot_value_array(i, predictions_array, true_label): - true_label = true_label[i] - plt.grid(False) - plt.xticks(range(10)) - plt.yticks([]) - thisplot = plt.bar(range(10), predictions_array, color="#777777") - plt.ylim([0, 1]) - predicted_label = np.argmax(predictions_array) - - thisplot[predicted_label].set_color('red') - thisplot[true_label].set_color('blue') - ``` - - 9. With the model trained, you can use it to make predictions about some images. Review the 0th image predictions and the prediction array. 
Correct prediction labels are blue, and incorrect prediction labels are red. The number gives the percentage (out of 100) for the predicted label. - - ```py - i = 0 - plt.figure(figsize=(6,3)) - plt.subplot(1,2,1) - plot_image(i, predictions[i], test_labels, test_images) - plt.subplot(1,2,2) - plot_value_array(i, predictions[i], test_labels) - plt.show() - ``` - - ![ ](../data/conceptual/mnist-3.png) - - ```py - i = 12 - plt.figure(figsize=(6,3)) - plt.subplot(1,2,1) - plot_image(i, predictions[i], test_labels, test_images) - plt.subplot(1,2,2) - plot_value_array(i, predictions[i], test_labels) - plt.show() - ``` - - ![ ](../data/conceptual/mnist-4.png) - - 10. Use the trained model to predict a single image. - - ```py - # Grab an image from the test dataset. - img = test_images[1] - print(img.shape) - ``` - - 11. `tf.keras` models are optimized to make predictions on a batch, or collection, of examples at once. Accordingly, even though you are using a single image, you must add it to a list. - - ```py - # Add the image to a batch where it's the only member. - img = (np.expand_dims(img,0)) - - print(img.shape) - ``` - - 12. Predict the correct label for this image. - - ```py - predictions_single = probability_model.predict(img) - - print(predictions_single) - - plot_value_array(1, predictions_single[0], test_labels) - _ = plt.xticks(range(10), class_names, rotation=45) - plt.show() - ``` - - ![ ](../data/conceptual/mnist-5.png) - - 13. `tf.keras.Model.predict` returns a list of lists—one for each image in the batch of data. Grab the predictions for our (only) image in the batch. - - ```py - np.argmax(predictions_single[0]) - ``` - -### Case study: TensorFlow with text classification - -This procedure demonstrates text classification starting from plain text files stored on disk. You will train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try in which you will train a multi-class classifier to predict the tag for a programming question on Stack Overflow. - -Follow these steps: - -1. Import the necessary libraries. - - ```py - import matplotlib.pyplot as plt - import os - import re - import shutil - import string - import tensorflow as tf - - from tensorflow.keras import layers - from tensorflow.keras import losses - ``` - -2. Get the data for the text classification, and extract the database from the given link of IMDB. - - ```py - url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" - - dataset = tf.keras.utils.get_file("aclImdb_v1", url, - untar=True, cache_dir='.', - cache_subdir='') - ``` - - ```bash - Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz - 84131840/84125825 [==============================] – 1s 0us/step - 84149932/84125825 [==============================] – 1s 0us/step - ``` - -3. Fetch the data from the directory. - - ```py - dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb') - print(os.listdir(dataset_dir)) - ``` - -4. Load the data for training purposes. - - ```py - train_dir = os.path.join(dataset_dir, 'train') - os.listdir(train_dir) - ``` - - ```py - ['labeledBow.feat', - 'urls_pos.txt', - 'urls_unsup.txt', - 'unsup', - 'pos', - 'unsupBow.feat', - 'urls_neg.txt', - 'neg'] - ``` - -5. The directories contain many text files, each of which is a single movie review. 
To look at one of them, use the following: - - ```py - sample_file = os.path.join(train_dir, 'pos/1181_9.txt') - with open(sample_file) as f: - print(f.read()) - ``` - -6. As the IMDB dataset contains additional folders, remove them before using this utility. - - ```py - remove_dir = os.path.join(train_dir, 'unsup') - shutil.rmtree(remove_dir) - batch_size = 32 - seed = 42 - ``` - -7. The IMDB dataset has already been divided into train and test but lacks a validation set. Create a validation set using an 80:20 split of the training data by using the validation_split argument below: - - ```py - raw_train_ds=tf.keras.utils.text_dataset_from_directory('aclImdb/train',batch_size=batch_size, validation_split=0.2,subset='training', seed=seed) - ``` - -8. As you will see in a moment, you can train a model by passing a dataset directly to `model.fit`. If you are new to `tf.data`, you can also iterate over the dataset and print a few examples as follows: - - ```py - for text_batch, label_batch in raw_train_ds.take(1): - for i in range(3): - print("Review", text_batch.numpy()[i]) - print("Label", label_batch.numpy()[i]) - ``` - -9. The labels are zero or one. To see which of these correspond to positive and negative movie reviews, check the class_names property on the dataset. - - ```py - print("Label 0 corresponds to", raw_train_ds.class_names[0]) - print("Label 1 corresponds to", raw_train_ds.class_names[1]) - ``` - -10. Next, create validation and test the dataset. Use the remaining 5,000 reviews from the training set for validation into two classes of 2,500 reviews each. - - ```py - raw_val_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', - batch_size=batch_size,validation_split=0.2,subset='validation', seed=seed) - - raw_test_ds = - tf.keras.utils.text_dataset_from_directory( - 'aclImdb/test', - batch_size=batch_size) - ``` - -To prepare the data for training, follow these steps: - -1. Standardize, tokenize, and vectorize the data using the helpful `tf.keras.layers.TextVectorization` layer. - - ```py - def custom_standardization(input_data): - lowercase = tf.strings.lower(input_data) - stripped_html = tf.strings.regex_replace(lowercase, '
', ' ') - return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation),'') - ``` - -2. Create a `TextVectorization` layer. Use this layer to standardize, tokenize, and vectorize our data. Set the output_mode to int to create unique integer indices for each token. Note that we are using the default split function and the custom standardization function you defined above. You will also define some constants for the model, like an explicit maximum sequence_length, which will cause the layer to pad or truncate sequences to exactly sequence_length values. - - ```py - max_features = 10000 - sequence_length = 250 - vectorize_layer = layers.TextVectorization( - standardize=custom_standardization, - max_tokens=max_features, - output_mode='int', - output_sequence_length=sequence_length) - ``` - -3. Call adapt to fit the state of the preprocessing layer to the dataset. This causes the model to build an index of strings to integers. - - ```py - # Make a text-only dataset (without labels), then call adapt - train_text = raw_train_ds.map(lambda x, y: x) - vectorize_layer.adapt(train_text) - ``` - -4. Create a function to see the result of using this layer to preprocess some data. - - ```py - def vectorize_text(text, label): - text = tf.expand_dims(text, -1) - return vectorize_layer(text), label - - text_batch, label_batch = next(iter(raw_train_ds)) - first_review, first_label = text_batch[0], label_batch[0] - print("Review", first_review) - print("Label", raw_train_ds.class_names[first_label]) - print("Vectorized review", vectorize_text(first_review, first_label)) - ``` - - ![ ](../data/conceptual/TextClassification-3.png) - -5. As you can see above, each token has been replaced by an integer. Look up the token (string) that each integer corresponds to by calling get_vocabulary() on the layer. - - ```py - print("1287 ---> ",vectorize_layer.get_vocabulary()[1287]) - print(" 313 ---> ",vectorize_layer.get_vocabulary()[313]) - print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary()))) - ``` - -6. You are nearly ready to train your model. As a final preprocessing step, apply the `TextVectorization` layer we created earlier to train, validate, and test the dataset. - - ```py - train_ds = raw_train_ds.map(vectorize_text) - val_ds = raw_val_ds.map(vectorize_text) - test_ds = raw_test_ds.map(vectorize_text) - ``` - - The `cache()` function keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files. - - The `prefetch()` function overlaps data preprocessing and model execution while training. - - ```py - AUTOTUNE = tf.data.AUTOTUNE - - train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE) - val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE) - test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE) - ``` - -7. Create your neural network. - - ```py - embedding_dim = 16 - model = tf.keras.Sequential([layers.Embedding(max_features + 1, embedding_dim),layers.Dropout(0.2),layers.GlobalAveragePooling1D(), - layers.Dropout(0.2),layers.Dense(1)]) - model.summary() - ``` - - ![ ](../data/conceptual/TextClassification-4.png) - -8. A model needs a loss function and an optimizer for training. 
Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), use [`losses.BinaryCrossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) loss function. - - ```py - model.compile(loss=losses.BinaryCrossentropy(from_logits=True), - optimizer='adam',metrics=tf.metrics.BinaryAccuracy(threshold=0.0)) - ``` - -9. Train the model by passing the dataset object to the fit method. - - ```py - epochs = 10 - history = model.fit(train_ds,validation_data=val_ds,epochs=epochs) - ``` - - ![ ](../data/conceptual/TextClassification-5.png) - -10. See how the model performs. Two values are returned: loss (a number representing our error; lower values are better) and accuracy. - - ```py - loss, accuracy = model.evaluate(test_ds) - - print("Loss: ", loss) - print("Accuracy: ", accuracy) - ``` - - :::{note} - `model.fit()` returns a History object that contains a dictionary with everything that happened during - training. - ::: - - ```py - history_dict = history.history - history_dict.keys() - ``` - -11. Four entries are for each monitored metric during training and validation. Use these to plot the training and validation loss for comparison, as well as the training and validation accuracy: - - ```py - acc = history_dict['binary_accuracy'] - val_acc = history_dict['val_binary_accuracy'] - loss = history_dict['loss'] - val_loss = history_dict['val_loss'] - - epochs = range(1, len(acc) + 1) - - # "bo" is for "blue dot" - plt.plot(epochs, loss, 'bo', label='Training loss') - # b is for "solid blue line" - plt.plot(epochs, val_loss, 'b', label='Validation loss') - plt.title('Training and validation loss') - plt.xlabel('Epochs') - plt.ylabel('Loss') - plt.legend() - - plt.show() - ``` - - The following images illustrate the training and validation loss and the training and validation accuracy. - - ![Training and validation loss](../data/conceptual/TextClassification-6.png "Training and validation loss") - - ![Training and validation accuracy](../data/conceptual/TextClassification-7.png "Training and validation accuracy") - -12. Export the model. - - ```py - export_model = tf.keras.Sequential([ - vectorize_layer, - model, - layers.Activation('sigmoid') - ]) - - export_model.compile( - loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy'] - ) - - # Test it with `raw_test_ds`, which yields raw strings - loss, accuracy = export_model.evaluate(raw_test_ds) - print(accuracy) - ``` - -13. To get predictions for new examples, call model.predict(). - - ```py - examples = [ - "The movie was great!", - "The movie was okay.", - "The movie was terrible..." - ] - - export_model.predict(examples) - ``` diff --git a/docs/conceptual/cmake-packages.rst b/docs/conceptual/cmake-packages.rst deleted file mode 100644 index 4627bacc4..000000000 --- a/docs/conceptual/cmake-packages.rst +++ /dev/null @@ -1,407 +0,0 @@ -.. meta:: - :description: Using CMake - :keywords: CMake, dependencies, HIP, C++, AMD, ROCm - -********************************* -Using CMake -********************************* - -Most components in ROCm support CMake. Projects depending on header-only or -library components typically require CMake 3.5 or higher whereas those wanting -to make use of the CMake HIP language support will require CMake 3.21 or higher. - -Finding dependencies -==================== - -.. 
note:: - - For a complete - reference on how to deal with dependencies in CMake, refer to the CMake docs - on `find_package - `_ and the - `Using Dependencies Guide - `_ - to get an overview of CMake related facilities. - -In short, CMake supports finding dependencies in two ways: - -* In Module mode, it consults a file ``Find.cmake`` which tries to find the component - in typical install locations and layouts. CMake ships a few dozen such scripts, but users and projects - may ship them as well. - -* In Config mode, it locates a file named ``-config.cmake`` or - ``Config.cmake`` which describes the installed component in all regards needed to - consume it. - -ROCm predominantly relies on Config mode, one notable exception being the Module -driving the compilation of HIP programs on NVIDIA runtimes. As such, when -dependencies are not found in standard system locations, one either has to -instruct CMake to search for package config files in additional folders using -the ``CMAKE_PREFIX_PATH`` variable (a semi-colon separated list of file system -paths), or using ``_ROOT`` variable on a project-specific basis. - -There are nearly a dozen ways to set these variables. One may be more convenient -over the other depending on your workflow. Conceptually the simplest is adding -it to your CMake configuration command on the command line via -``-D CMAKE_PREFIX_PATH=....`` . AMD packaged ROCm installs can typically be -added to the config file search paths such as: - -* Windows: ``-D CMAKE_PREFIX_PATH=${env:HIP_PATH}`` - -* Linux: ``-D CMAKE_PREFIX_PATH=/opt/rocm`` - -ROCm provides the respective *config-file* packages, and this enables -``find_package`` to be used directly. ROCm does not require any Find module as -the *config-file* packages are shipped with the upstream projects, such as -rocPRIM and other ROCm libraries. - -For a complete guide on where and how ROCm may be installed on a system, refer -to the installation guides for -`Linux `_ -and -`Windows `_. - -Using HIP in CMake -================== - -ROCm components providing a C/C++ interface support consumption via any -C/C++ toolchain that CMake knows how to drive. ROCm also supports the CMake HIP -language features, allowing users to program using the HIP single-source -programming model. When a program (or translation-unit) uses the HIP API without -compiling any GPU device code, HIP can be treated in CMake as a simple C/C++ -library. - -Using the HIP single-source programming model ---------------------------------------------- - -Source code written in the HIP dialect of C++ typically uses the `.hip` -extension. When the HIP CMake language is enabled, it will automatically -associate such source files with the HIP toolchain being used. - -.. code-block:: cmake - - cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21 - cmake_policy(VERSION 3.21.3...3.27) - project(MyProj LANGUAGES HIP) - add_executable(MyApp Main.hip) - -Should you have existing CUDA code that is from the source compatible subset of -HIP, you can tell CMake that despite their `.cu` extension, they're HIP sources. -Do note that this mostly facilitates compiling kernel code-only source files, -as host-side CUDA API won't compile in this fashion. - -.. code-block:: cmake - - add_library(MyLib MyLib.cu) - set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP) - -CMake itself only hosts part of the HIP language support, such as defining -HIP-specific properties, etc. while the other half ships with the HIP -implementation, such as ROCm. 
CMake will search for a file -`hip-lang-config.cmake` describing how the the properties defined by CMake -translate to toolchain invocations. If one installs ROCm using non-standard -methods or layouts and CMake can't locate this file or detect parts of the SDK, -there's a catch-all, last resort variable consulted locating this file, -``-D CMAKE_HIP_COMPILER_ROCM_ROOT:PATH=`` which should be set the root of the -ROCm installation. - -.. note:: - Imported targets defined by `hip-lang-config.cmake` are for internal use - only. - -If the user doesn't provide a semi-colon delimited list of device architectures -via ``CMAKE_HIP_ARCHITECTURES``, CMake will select some sensible default. It is -advised though that if a user knows what devices they wish to target, then set -this variable explicitly. - -Consuming ROCm C/C++ libraries ------------------------------- - -Libraries such as rocBLAS, rocFFT, MIOpen, etc. behave as C/C++ libraries. -Illustrated in the example below is a C++ application using MIOpen from CMake. -It calls ``find_package(miopen)``, which provides the ``MIOpen`` imported -target. This can be linked with ``target_link_libraries`` - -.. code-block:: cmake - - cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5 - cmake_policy(VERSION 3.5...3.27) - project(MyProj LANGUAGES CXX) - find_package(miopen) - add_library(MyLib ...) - target_link_libraries(MyLib PUBLIC MIOpen) - -.. note:: - - Most libraries are designed as host-only API, so using a GPU device - compiler is not necessary for downstream projects unless they use GPU device - code. - -Consuming the HIP API in C++ code ---------------------------------- - -Consuming the HIP API without compiling single-source GPU device code can be -done using any C++ compiler. The ``find_package(hip)`` provides the -``hip::host`` imported target to use HIP in this scenario. - -.. code-block:: cmake - - cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5 - cmake_policy(VERSION 3.5...3.27) - project(MyProj LANGUAGES CXX) - find_package(hip REQUIRED) - add_executable(MyApp ...) - target_link_libraries(MyApp PRIVATE hip::host) - -When mixing such ``CXX`` sources with ``HIP`` sources holding device-code, link -only to `hip::host`. If HIP sources don't have `.hip` as their extension, use -`set_source_files_properties(... PROPERTIES LANGUAGE HIP)` on them. -Linking to `hip::host` will set all the necessary flags for the ``CXX`` sources -while ``HIP`` sources inherit all flags from the built-in language support. -Having HIP sources in a target will turn the |LINK_LANG|_ into ``HIP``. - -.. |LINK_LANG| replace:: ``LINKER_LANGUAGE`` -.. _LINK_LANG: https://cmake.org/cmake/help/latest/prop_tgt/LINKER_LANGUAGE.html - -Compiling device code in C++ language mode ------------------------------------------- - -.. attention:: - - The workflow detailed here is considered legacy and is shown for - understanding's sake. It pre-dates the existence of HIP language support in - CMake. If source code has HIP device code in it, it is a HIP source file - and should be compiled as such. Only resort to the method below if your - HIP-enabled CMake code path can't mandate CMake version 3.21. - -If code uses the HIP API and compiles GPU device code, it requires using a -device compiler. The compiler for CMake can be set using either the -``CMAKE_C_COMPILER`` and ``CMAKE_CXX_COMPILER`` variable or using the ``CC`` -and ``CXX`` environment variables. This can be set when configuring CMake or -put into a CMake toolchain file. 
The device compiler must be set to a -compiler that supports AMD GPU targets, which is usually Clang. - -The ``find_package(hip)`` provides the ``hip::device`` imported target to add -all the flags necessary for device compilation. - -.. code-block:: cmake - - cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8 - cmake_policy(VERSION 3.8...3.27) - project(MyProj LANGUAGES CXX) - find_package(hip REQUIRED) - add_library(MyLib ...) - target_link_libraries(MyLib PRIVATE hip::device) - target_compile_features(MyLib PRIVATE cxx_std_11) - -.. note:: - - Compiling for the GPU device requires at least C++11. - -This project can then be configured with the following CMake commands: - -* Windows: ``cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe`` -* Linux: ``cmake -D CMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++`` - -Which use the device compiler provided from the binary packages of -`ROCm HIP SDK `_ and -`repo.radeon.com `_ respectively. - -When using the ``CXX`` language support to compile HIP device code, selecting the -target GPU architectures is done via setting the ``GPU_TARGETS`` variable. -``CMAKE_HIP_ARCHITECTURES`` only exists when the HIP language is enabled. By -default, this is set to some subset of the currently supported architectures of -AMD ROCm. It can be set to the CMake option ``-D GPU_TARGETS="gfx1032;gfx1035"``. - -ROCm CMake packages -------------------- - -+-----------+----------+--------------------------------------------------------+ -| Component | Package | Targets | -+===========+==========+========================================================+ -| HIP | hip | ``hip::host``, ``hip::device`` | -+-----------+----------+--------------------------------------------------------+ -| rocPRIM | rocprim | ``roc::rocprim`` | -+-----------+----------+--------------------------------------------------------+ -| rocThrust | rocthrust| ``roc::rocthrust`` | -+-----------+----------+--------------------------------------------------------+ -| hipCUB | hipcub | ``hip::hipcub`` | -+-----------+----------+--------------------------------------------------------+ -| rocRAND | rocrand | ``roc::rocrand`` | -+-----------+----------+--------------------------------------------------------+ -| rocBLAS | rocblas | ``roc::rocblas`` | -+-----------+----------+--------------------------------------------------------+ -| rocSOLVER | rocsolver| ``roc::rocsolver`` | -+-----------+----------+--------------------------------------------------------+ -| hipBLAS | hipblas | ``roc::hipblas`` | -+-----------+----------+--------------------------------------------------------+ -| rocFFT | rocfft | ``roc::rocfft`` | -+-----------+----------+--------------------------------------------------------+ -| hipFFT | hipfft | ``hip::hipfft`` | -+-----------+----------+--------------------------------------------------------+ -| rocSPARSE | rocsparse| ``roc::rocsparse`` | -+-----------+----------+--------------------------------------------------------+ -| hipSPARSE | hipsparse| ``roc::hipsparse`` | -+-----------+----------+--------------------------------------------------------+ -| rocALUTION|rocalution| ``roc::rocalution`` | -+-----------+----------+--------------------------------------------------------+ -| RCCL | rccl | ``rccl`` | -+-----------+----------+--------------------------------------------------------+ -| MIOpen | miopen | ``MIOpen`` | -+-----------+----------+--------------------------------------------------------+ -| MIGraphX | migraphx | ``migraphx::migraphx``, 
``migraphx::migraphx_c``, | -| | | ``migraphx::migraphx_cpu``, ``migraphx::migraphx_gpu``,| -| | | ``migraphx::migraphx_onnx``, ``migraphx::migraphx_tf`` | -+-----------+----------+--------------------------------------------------------+ - -Using CMake presets -=================== - -CMake command lines depending on how specific users like to be when compiling -code can grow to unwieldy lengths. This is the primary reason why projects tend -to bake script snippets into their build definitions controlling compiler -warning levels, changing CMake defaults (``CMAKE_BUILD_TYPE`` or -``BUILD_SHARED_LIBS`` just to name a few) and all sorts anti-patterns, all in -the name of convenience. - -Load on the command-line interface (CLI) starts immediately by selecting a -toolchain, the set of utilities used to compile programs. To ease some of the -toolchain related pains, CMake does consult the ``CC`` and ``CXX`` environmental -variables when setting a default ``CMAKE_C[XX]_COMPILER`` respectively, but that -is just the tip of the iceberg. There's a fair number of variables related to -just the toolchain itself (typically supplied using -`toolchain files `_ -), and then we still haven't talked about user preference or project-specific -options. - -IDEs supporting CMake (Visual Studio, Visual Studio Code, CLion, etc.) all came -up with their own way to register command-line fragments of different purpose in -a setup-and-forget fashion for quick assembly using graphical front-ends. This is -all nice, but configurations aren't portable, nor can they be reused in -Continuous Integration (CI) pipelines. CMake has condensed existing practice -into a portable JSON format that works in all IDEs and can be invoked from any -command line. This is -`CMake Presets `_. - -There are two types of preset files: one supplied by the project, called -``CMakePresets.json`` which is meant to be committed to version control, -typically used to drive CI; and one meant for the user to provide, called -``CMakeUserPresets.json``, typically used to house user preference and adapting -the build to the user's environment. These JSON files are allowed to include -other JSON files and the user presets always implicitly includes the non-user -variant. - -Using HIP with presets ----------------------- - -Following is an example ``CMakeUserPresets.json`` file which actually compiles -the `amd/rocm-examples `_ suite of sample -applications on a typical ROCm installation: - -.. 
code-block:: json - - { - "version": 3, - "cmakeMinimumRequired": { - "major": 3, - "minor": 21, - "patch": 0 - }, - "configurePresets": [ - { - "name": "layout", - "hidden": true, - "binaryDir": "${sourceDir}/build/${presetName}", - "installDir": "${sourceDir}/install/${presetName}" - }, - { - "name": "generator-ninja-multi-config", - "hidden": true, - "generator": "Ninja Multi-Config" - }, - { - "name": "toolchain-makefiles-c/c++-amdclang", - "hidden": true, - "cacheVariables": { - "CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang", - "CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++", - "CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++" - } - }, - { - "name": "clang-strict-iso-high-warn", - "hidden": true, - "cacheVariables": { - "CMAKE_C_FLAGS": "-Wall -Wextra -pedantic", - "CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic", - "CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic" - } - }, - { - "name": "ninja-mc-rocm", - "displayName": "Ninja Multi-Config ROCm", - "inherits": [ - "layout", - "generator-ninja-multi-config", - "toolchain-makefiles-c/c++-amdclang", - "clang-strict-iso-high-warn" - ] - } - ], - "buildPresets": [ - { - "name": "ninja-mc-rocm-debug", - "displayName": "Debug", - "configuration": "Debug", - "configurePreset": "ninja-mc-rocm" - }, - { - "name": "ninja-mc-rocm-release", - "displayName": "Release", - "configuration": "Release", - "configurePreset": "ninja-mc-rocm" - }, - { - "name": "ninja-mc-rocm-debug-verbose", - "displayName": "Debug (verbose)", - "configuration": "Debug", - "configurePreset": "ninja-mc-rocm", - "verbose": true - }, - { - "name": "ninja-mc-rocm-release-verbose", - "displayName": "Release (verbose)", - "configuration": "Release", - "configurePreset": "ninja-mc-rocm", - "verbose": true - } - ], - "testPresets": [ - { - "name": "ninja-mc-rocm-debug", - "displayName": "Debug", - "configuration": "Debug", - "configurePreset": "ninja-mc-rocm", - "execution": { - "jobs": 0 - } - }, - { - "name": "ninja-mc-rocm-release", - "displayName": "Release", - "configuration": "Release", - "configurePreset": "ninja-mc-rocm", - "execution": { - "jobs": 0 - } - } - ] - } - -.. note:: - - Getting presets to work reliably on Windows requires some CMake improvements - and/or support from compiler vendors. (Refer to - `Add support to the Visual Studio generators `_ - and `Sourcing environment scripts `_ - .) 
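With the preset file above saved as ``CMakeUserPresets.json`` next to the project's
top-level ``CMakeLists.txt``, the configure, build, and test presets can be driven
entirely from the command line. The following is a minimal usage sketch, assuming
CMake 3.21 or newer on the ``PATH`` and a ROCm installation under ``/opt/rocm`` as
in the example:

.. code-block:: bash

   # List the configure presets visible from the current source directory
   cmake --list-presets

   # Configure using the "ninja-mc-rocm" configure preset defined above
   cmake --preset ninja-mc-rocm

   # Build and test the Release configuration via the matching presets
   cmake --build --preset ninja-mc-rocm-release
   ctest --preset ninja-mc-rocm-release

The same preset names are picked up unchanged by IDEs that understand
``CMakePresets.json``, which is the main benefit of keeping this information in the
JSON files rather than in ad hoc shell scripts.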
diff --git a/docs/conceptual/compiler-topics.md b/docs/conceptual/compiler-topics.md deleted file mode 100644 index e117e84d6..000000000 --- a/docs/conceptual/compiler-topics.md +++ /dev/null @@ -1,14 +0,0 @@ - - - - - - -# Using compiler features - -The following topics describe using specific features of the compilation tools: - -* [ROCm compiler infrastructure](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html) -* [Using AddressSanitizer](https://rocm.docs.amd.com/projects/llvm-project/en/latest/conceptual/using-gpu-sanitizer.html) -* [OpenMP support](https://rocm.docs.amd.com/projects/llvm-project/en/latest/conceptual/openmp.html) diff --git a/docs/conceptual/file-reorg.md b/docs/conceptual/file-reorg.md deleted file mode 100644 index bc0812011..000000000 --- a/docs/conceptual/file-reorg.md +++ /dev/null @@ -1,172 +0,0 @@ - - - - - - -# ROCm Linux Filesystem Hierarchy Standard reorganization - -## Introduction - -The ROCm Software has adopted the Linux Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) in order to to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm. - -## Adopting the FHS - -In order to standardize ROCm directory structure and directory content layout ROCm has adopted the [FHS](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html), adhering to open source conventions for Linux-based distribution. FHS ensures internal consistency within the ROCm stack, as well as external consistency with other systems and distributions. The ROCm proposed file structure is outlined below: - -```none -/opt/rocm- - | -- bin - | -- all public binaries - | -- lib - | -- lib.so->lib.so.major->lib.so.major.minor.patch - (public libaries to link with applications) - | -- - | -- architecture dependent libraries and binaries used internally by components - | -- cmake - | -- - | ---config.cmake - | -- libexec - | -- - | -- non ISA/architecture independent executables used internally by components - | -- include - | -- - | -- public header files - | -- share - | -- html - | -- - | -- html documentation - | -- info - | -- - | -- info files - | -- man - | -- - | -- man pages - | -- doc - | -- - | -- license files - | -- - | -- samples - | -- architecture independent misc files -``` - -## Changes from earlier ROCm versions - -The following table provides a brief overview of the new ROCm FHS layout, compared to the layout of earlier ROCm versions. Note that /opt/ is used to denote the default rocm-installation-path and should be replaced in case of a non-standard installation location of the ROCm distribution. - -```none - ______________________________________________________ -| New ROCm Layout | Previous ROCm Layout | -|_____________________________|________________________| -| /opt/rocm- | /opt/rocm- | -| | -- bin | | -- bin | -| | -- lib | | -- lib | -| | -- cmake | | -- include | -| | -- libexec | | -- | -| | -- include | | -- bin | -| | -- | | -- cmake | -| | -- share | | -- doc | -| | -- html | | -- lib | -| | -- info | | -- include | -| | -- man | | -- samples | -| | -- doc | | -- | -| | -- | | -- bin | -| | -- samples | | -- cmake | -| | -- .. | | -- doc | -| | -- | | -- lib | -| | -- samples | | -- include | -| | -- .. 
| | -- samples | -|______________________________________________________| -``` - -## ROCm FHS reorganization: backward compatibility - -The FHS file organization for ROCm was first introduced in the release of ROCm 5.2 . Backward compatibility was implemented to make sure users could still run their ROCm applications while transitioning to the new FHS. ROCm has moved header files and libraries to their new locations as indicated in the above structure, and included symbolic-links and wrapper header files in their old location for backward compatibility. The following sections detail ROCm backward compatibility implementation for wrapper header files, executable files, library files and CMake config files. - -### Wrapper header files - -Wrapper header files are placed in the old location ( -`/opt/rocm-//include`) with a warning message to include files -from the new location (`/opt/rocm-/include`) as shown in the example below. - -```cpp -#pragma message "This file is deprecated. Use file from include path /opt/rocm-ver/include/ and prefix with hip." -#include -``` - -* Starting at ROCm 5.2 release, the deprecation for backward compatibility wrapper header files is: `#pragma` message announcing `#warning`. -* Starting from ROCm 6.0 (tentatively) backward compatibility for wrapper header files will be removed, and the `#pragma` message will be announcing `#error`. - -### Executable files - -Executable files are available in the `/opt/rocm-/bin` folder. For backward -compatibility, the old library location (`/opt/rocm-//bin`) has a -soft link to the library at the new location. Soft links will be removed in a -future release, tentatively ROCm v6.0. - -```bash -$ ls -l /opt/rocm/hip/bin/ -lrwxrwxrwx 1 root root 24 Jan 1 23:32 hipcc -> ../../bin/hipcc -``` - -### Library files - -Library files are available in the `/opt/rocm-/lib` folder. For backward -compatibility, the old library location (`/opt/rocm-//lib`) has a -soft link to the library at the new location. Soft links will be removed in a -future release, tentatively ROCm v6.0. - -```shell -$ ls -l /opt/rocm/hip/lib/ -drwxr-xr-x 4 root root 4096 Jan 1 10:45 cmake -lrwxrwxrwx 1 root root 24 Jan 1 23:32 libamdhip64.so -> ../../lib/libamdhip64.so -``` - -### CMake config files - -All CMake configuration files are available in the -`/opt/rocm-/lib/cmake/` folder. For backward compatibility, the -old CMake locations (`/opt/rocm-//lib/cmake`) consist of a soft -link to the new CMake config. Soft links will be removed in a future release, -tentatively ROCm v6.0. - -```shell -$ ls -l /opt/rocm/hip/lib/cmake/hip/ -lrwxrwxrwx 1 root root 42 Jan 1 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake -``` - -## Changes required in applications using ROCm - -Applications using ROCm are advised to use the new file paths. As the old files -will be deprecated in a future release. Applications have to make sure to include -correct header file and use correct search paths. - -1. `#include` needs to be changed to - `#include ` - - For example: `#include ` needs to change - to `#include ` - -2. Any variable in CMake or Makefiles pointing to component folder needs to - changed. - - For example: `VAR1=/opt/rocm/hip` needs to be changed to `VAR1=/opt/rocm` - `VAR2=/opt/rocm/hsa` needs to be changed to `VAR2=/opt/rocm` - -3. Any reference to `/opt/rocm//bin` or `/opt/rocm//lib` - needs to be changed to `/opt/rocm/bin` and `/opt/rocm/lib/`, respectively. 
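Taken together, the changes above mostly amount to dropping the per-component
directory from every path. As an illustrative sketch only (the source file name
`my_app.cpp` and the choice of `g++` as the host compiler are assumptions, not
requirements of the reorganization), a host-only HIP application that previously
used the old per-component paths could now be compiled against the consolidated
layout as follows:

```bash
# New FHS layout: public headers live under /opt/rocm/include and public
# libraries under /opt/rocm/lib; the old per-component paths such as
# /opt/rocm/hip/include and /opt/rocm/hip/lib remain only as deprecated
# wrapper headers and symlinks.
g++ my_app.cpp -o my_app \
    -D__HIP_PLATFORM_AMD__ \
    -I/opt/rocm/include \
    -L/opt/rocm/lib -lamdhip64
```

When `hipcc` from `/opt/rocm/bin` is used instead of a plain host compiler, the
ROCm include and library paths are added automatically.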
- -## Changes in versioning specifications - -In order to better manage ROCm dependencies specification and allow smoother releases of ROCm while avoiding dependency conflicts, ROCm software shall adhere to the following scheme when numbering and incrementing ROCm files versions: - -rocm-\, where \ = \ - -x.y.z denote: MAJOR.MINOR.PATCH - -z: PATCH - increment z when implementing backward compatible bug fixes. - -y: MINOR - increment y when implementing minor changes that add functionality but are still backward compatible. - -x: MAJOR - increment x when implementing major changes that are not backward compatible. diff --git a/docs/conceptual/gpu-arch.md b/docs/conceptual/gpu-arch.md deleted file mode 100644 index f14a39421..000000000 --- a/docs/conceptual/gpu-arch.md +++ /dev/null @@ -1,73 +0,0 @@ - - - - - - -(gpu-arch-documentation)= - -# GPU architecture documentation - -:::::{grid} 1 1 2 2 -:gutter: 1 - -:::{grid-item-card} -**AMD Instinct MI300 series** - -Review hardware aspects of the AMD Instinct™ MI300 series of GPU accelerators and the CDNA™ 3 -architecture. - -* [AMD Instinct™ MI300 microarchitecture](./gpu-arch/mi300.md) -* [AMD Instinct MI300/CDNA3 ISA](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf) -* [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf) -* [MI300 performance counters](./gpu-arch/mi300-mi200-performance-counters.rst) -* [MI350 series performance counters](./gpu-arch/mi350-performance-counters.rst) -::: - -:::{grid-item-card} -**AMD Instinct MI200 series** - -Review hardware aspects of the AMD Instinct™ MI200 series of GPU accelerators and the CDNA™ 2 -architecture. - -* [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md) -* [AMD Instinct MI200/CDNA2 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) -* [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna2-white-paper.pdf) -* [Performance counters](./gpu-arch/mi300-mi200-performance-counters.rst) - -::: - -:::{grid-item-card} -**AMD Instinct MI100** - -Review hardware aspects of the AMD Instinct™ MI100 series of GPU accelerators and the CDNA™ 1 -architecture. 
- -* [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md) -* [AMD Instinct MI100/CDNA1 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf) -* [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna-white-paper.pdf) - -::: - -:::{grid-item-card} -**RDNA** - -* [AMD RDNA4 ISA](https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf) -* [AMD RDNA3 ISA](https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf) -* [AMD RDNA2 ISA](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf) -* [AMD RDNA ISA](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf) - -::: - -:::{grid-item-card} -**Older architectures** - -* [AMD Instinct MI50/Vega 7nm ISA](https://www.amd.com/system/files/TechDocs/vega-7nm-shader-instruction-set-architecture.pdf) -* [AMD Instinct MI25/Vega ISA](https://www.amd.com/system/files/TechDocs/vega-shader-instruction-set-architecture.pdf) -* [AMD GCN3 ISA](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf) -* [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf) - -::: - -::::: diff --git a/docs/conceptual/gpu-arch/mi100.md b/docs/conceptual/gpu-arch/mi100.md deleted file mode 100644 index f98d4c0db..000000000 --- a/docs/conceptual/gpu-arch/mi100.md +++ /dev/null @@ -1,95 +0,0 @@ ---- -myst: - html_meta: - "description lang=en": "Learn about the AMD Instinct MI100 series architecture." - "keywords": "Instinct, MI100, microarchitecture, AMD, ROCm" ---- - -# AMD Instinct™ MI100 microarchitecture - -The following image shows the node-level architecture of a system that -comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators. -The two EPYC processors are connected to each other with the AMD Infinity™ -fabric which provides a high-bandwidth (up to 18 GT/sec) and coherent links such -that each processor can access the available node memory as a single -shared-memory domain in a non-uniform memory architecture (NUMA) fashion. In a -2P, or dual-socket, configuration, three AMD Infinity™ fabric links are -available to connect the processors plus one PCIe Gen 4 x16 link per processor -can attach additional I/O devices such as the host adapters for the network -fabric. - -![Structure of a single GCD in the AMD Instinct MI100 accelerator](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.") - -In a typical node configuration, each processor can host up to four AMD -Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec, -which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive -of four accelerators can participate in a fully connected, coherent AMD -Instinct™ fabric that connects the four accelerators using 23 GT/sec AMD -Infinity fabric links that run at a higher frequency than the inter-processor -links. This inter-GPU link can be established in certified server systems if the -GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity -Fabric™ bridge for the AMD Instinct™ accelerators. 
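On a running system, the PCIe and Infinity Fabric connectivity described above can
be inspected with the ROCm SMI tool. This is only a quick inspection sketch; the
exact fields reported vary with the ROCm release and the server platform:

```bash
# Print the link type, hop count, and link weight between every pair of GPUs in
# the node. GPU pairs joined by the Infinity Fabric bridge typically report an
# XGMI link type, while pairs that only share the host report PCIE.
rocm-smi --showtopo
```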
- -## Microarchitecture - -The microarchitecture of the AMD Instinct accelerators is based on the AMD CDNA -architecture, which targets compute applications such as high-performance -computing (HPC) and AI & machine learning (ML) that run on everything from -individual servers to the world's largest exascale supercomputers. The overall -system architecture is designed for extreme scalability and compute performance. - -![Structure of the AMD Instinct accelerator (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct accelerator (MI100 generation)") - -The above image shows the AMD Instinct accelerator with its PCIe Gen 4 x16 -link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host -processor(s). It also shows the three AMD Infinity Fabric ports that provide -high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local -hive. - -On the left and right of the floor plan, the High Bandwidth Memory (HBM) -attaches via the GPU memory controller. The MI100 generation of the AMD -Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total -of 32GB with a 4,096bit-wide memory interface. The peak memory bandwidth of the -attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz. - -The execution units of the GPU are depicted in the above image as Compute -Units (CU). There are a total 120 compute units that are physically organized -into eight Shader Engines (SE) with fifteen compute units per shader engine. -Each compute unit is further sub-divided into four SIMD units that process SIMD -instructions of 16 data elements per instruction. This enables the CU to process -64 data elements (a so-called 'wavefront') at a peak clock frequency of 1.5 GHz. -Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS -(`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`). - -![Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture](../../data/conceptual/gpu-arch/image006.png "An MI100 compute unit with detailed SIMD view of the AMD CDNA architecture") - -The preceding image shows the block diagram of a single CU of an AMD Instinct™ -MI100 accelerator and summarizes how instructions flow through the execution -engines. The CU fetches the instructions via a 32KB instruction cache and moves -them forward to execution via a dispatcher. The CU can handle up to ten -wavefronts at a time and feed their instructions into the execution unit. The -execution unit contains 256 vector general-purpose registers (VGPR) and 800 -scalar general-purpose registers (SGPR). The VGPR and SGPR are dynamically -allocated to the executing wavefronts. A wavefront can access a maximum of 102 -scalar registers. Excess scalar-register usage will cause register spilling and -thus may affect execution performance. - -A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting -occupancy; that is, the number of concurrently active wavefronts in the CU. For -instance, with 119 VGPRs used, only two wavefronts can be active in the CU at -the same time. With the instruction latency of four cycles per SIMD instruction, -the occupancy should be as high as possible such that the compute unit can -improve execution efficiency by scheduling instructions from multiple -wavefronts. - -:::{table} Peak-performance capabilities of MI100 for different data types. 
-:name: mi100-perf -| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS | -| :------------------------ | :------------: | ----------: | -| Vector FP64 | 64 | 11.5 | -| Matrix FP32 | 256 | 46.1 | -| Vector FP32 | 128 | 23.1 | -| Matrix FP16 | 1024 | 184.6 | -| Matrix BF16 | 512 | 92.3 | - -::: diff --git a/docs/conceptual/gpu-arch/mi250.md b/docs/conceptual/gpu-arch/mi250.md deleted file mode 100644 index 47f7f8b1b..000000000 --- a/docs/conceptual/gpu-arch/mi250.md +++ /dev/null @@ -1,134 +0,0 @@ ---- -myst: - html_meta: - "description lang=en": "Learn about the AMD Instinct MI250 series architecture." - "keywords": "Instinct, MI250, microarchitecture, AMD, ROCm" ---- - -# AMD Instinct™ MI250 microarchitecture - -The microarchitecture of the AMD Instinct MI250 accelerators is based on the -AMD CDNA 2 architecture that targets compute applications such as HPC, -artificial intelligence (AI), and machine learning (ML) and that run on -everything from individual servers to the world’s largest exascale -supercomputers. The overall system architecture is designed for extreme -scalability and compute performance. - -The following image shows the components of a single Graphics Compute Die (GCD) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™ -interfaces and their physical links that are used to connect the GPU die to the -other system-level components of the node (see also Section 2.2). Both -interfaces can drive four AMD Infinity Fabric links. One of the AMD Infinity -Fabric links of the controller at the bottom can be configured as a PCIe link. -Each of the AMD Infinity Fabric links between GPUs can run at up to 25 GT/sec, -which correlates to a peak transfer bandwidth of 50 GB/sec for a 16-wide link ( -two bytes per transaction). Section 2.2 has more details on the number of AMD -Infinity Fabric links and the resulting transfer rates between the system-level -components. - -To the left and the right are memory controllers that attach the High Bandwidth -Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which offers -a peak memory bandwidth of 1.6 TB/sec per GCD. - -The execution units of the GPU are depicted in the following image as Compute -Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further -subdivided into four SIMD units that process SIMD instructions of 16 data -elements per instruction (for the FP64 data type). This enables the CU to -process 64 work items (a so-called “wavefront”) at a peak clock frequency of 1.7 -GHz. Therefore, the theoretical maximum FP64 peak performance per GCD is 22.6 -TFLOPS for vector instructions. This equates to 45.3 TFLOPS for vector instructions for both GCDs together. The MI250 compute units also provide specialized -execution units (also called matrix cores), which are geared toward executing -matrix operations like matrix-matrix multiplications. For FP64, the peak -performance of these units amounts to 90.5 TFLOPS. - -![Structure of a single GCD in the AMD Instinct MI250 accelerator.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 accelerator.") - -```{list-table} Peak-performance capabilities of the MI250 OAM for different data types. 
-:header-rows: 1 -:name: mi250-perf-table - -* - - Computation and Data Type - - FLOPS/CLOCK/CU - - Peak TFLOPS -* - - Matrix FP64 - - 256 - - 90.5 -* - - Vector FP64 - - 128 - - 45.3 -* - - Matrix FP32 - - 256 - - 90.5 -* - - Packed FP32 - - 256 - - 90.5 -* - - Vector FP32 - - 128 - - 45.3 -* - - Matrix FP16 - - 1024 - - 362.1 -* - - Matrix BF16 - - 1024 - - 362.1 -* - - Matrix INT8 - - 1024 - - 362.1 -``` - -The above table summarizes the aggregated peak performance of the AMD -Instinct MI250 OCP Open Accelerator Modules (OAM, OCP is short for Open Compute -Platform) and its two GCDs for different data types and execution units. The -middle column lists the peak performance (number of data elements processed in a -single instruction) of a single compute unit if a SIMD (or matrix) instruction -is being retired in each clock cycle. The third column lists the theoretical -peak performance of the OAM module. The theoretical aggregated peak memory -bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD). - -![Dual-GCD architecture of the AMD Instinct MI250 accelerators](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 accelerators") - -The following image shows the block diagram of an OAM package that consists -of two GCDs, each of which constitutes one GPU device in the system. The two -GCDs in the package are connected via four AMD Infinity Fabric links running at -a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer bandwidth -between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of -400 GB/sec for the same. - -## Node-level architecture - -The following image shows the node-level architecture of a system that is -based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host -system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe -x16 link to the host part of the system. Depending on the server platform, the -GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch -. Note that some platforms may offer an x8 interface to the GCDs, which reduces -the available host-to-GPU bandwidth. - -![Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor") - -The preceding image shows the node-level architecture of a system with AMD -EPYC processors in a dual-socket configuration and four AMD Instinct MI250 -accelerators. The MI250 OAMs attach to the host processors system via PCIe Gen 4 -x16 links (yellow lines). Depending on the system design, a PCIe switch may -exist to make more PCIe lanes available for additional components like network -interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to -the host part of the system or to the PCIe switch. Please note, some platforms -may offer an x8 interface to the GCDs, which will reduce the available -host-to-GPU bandwidth. - -Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network allows -for direct data exchange between the GPU dies via AMD Infinity Fabric links ( -black, green, and red lines). Each of these 16-wide links connects to one of the -two GPU dies in the MI250 OAM and operates at 25 GT/sec, which corresponds to a -theoretical peak transfer rate of 50 GB/sec per link (or 100 GB/sec -bidirectional peak transfer bandwidth). 
The GCD pairs 2 and 6 as well as GCDs 0 -and 4 connect via two XGMI links, which is indicated by the thicker red line in -the preceding image. diff --git a/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst b/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst deleted file mode 100644 index 8598d2c7c..000000000 --- a/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst +++ /dev/null @@ -1,757 +0,0 @@ -.. meta:: - :description: MI300 and MI200 series performance counters and metrics - :keywords: MI300, MI200, performance counters, command processor counters - -*************************************************************************************************** -MI300 and MI200 series performance counters and metrics -*************************************************************************************************** - -This document lists and describes the hardware performance counters and derived metrics available -for the AMD Instinct™ MI300 and MI200 GPU. You can also access this information using the -:doc:`ROCprofiler-SDK `. - -MI300 and MI200 series performance counters -=============================================================== - -Series performance counters include the following categories: - -* :ref:`command-processor-counters` -* :ref:`graphics-register-bus-manager-counters` -* :ref:`spi-counters` -* :ref:`compute-unit-counters` -* :ref:`l1i-and-sl1d-cache-counters` -* :ref:`vector-l1-cache-subsystem-counters` -* :ref:`l2-cache-access-counters` - -The following sections provide additional details for each category. - -.. note:: - - Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with - an asterisk (*) require further evaluation. - -.. _command-processor-counters: - -Command processor counters ---------------------------------------------------------------------------------------------------------------- - -Command processor counters are further classified into command processor-fetcher and command -processor-compute. - -Command processor-fetcher counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``CPF_CMP_UTCL1_STALL_ON_TRANSLATION``", "Cycles", "Number of cycles one of the compute unified translation caches (L1) is stalled waiting on translation" - "``CPF_CPF_STAT_BUSY``", "Cycles", "Number of cycles command processor-fetcher is busy" - "``CPF_CPF_STAT_IDLE``", "Cycles", "Number of cycles command processor-fetcher is idle" - "``CPF_CPF_STAT_STALL``", "Cycles", "Number of cycles command processor-fetcher is stalled" - "``CPF_CPF_TCIU_BUSY``", "Cycles", "Number of cycles command processor-fetcher texture cache interface unit interface is busy" - "``CPF_CPF_TCIU_IDLE``", "Cycles", "Number of cycles command processor-fetcher texture cache interface unit interface is idle" - "``CPF_CPF_TCIU_STALL``", "Cycles", "Number of cycles command processor-fetcher texture cache interface unit interface is stalled waiting on free tags" - -The texture cache interface unit is the interface between the command processor and the memory -system. - -Command processor-compute counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``CPC_ME1_BUSY_FOR_PACKET_DECODE``", "Cycles", "Number of cycles command processor-compute micro engine is busy decoding packets" - "``CPC_UTCL1_STALL_ON_TRANSLATION``", "Cycles", "Number of cycles one of the unified translation caches (L1) is stalled waiting on translation" - "``CPC_CPC_STAT_BUSY``", "Cycles", "Number of cycles command processor-compute is busy" - "``CPC_CPC_STAT_IDLE``", "Cycles", "Number of cycles command processor-compute is idle" - "``CPC_CPC_STAT_STALL``", "Cycles", "Number of cycles command processor-compute is stalled" - "``CPC_CPC_TCIU_BUSY``", "Cycles", "Number of cycles command processor-compute texture cache interface unit interface is busy" - "``CPC_CPC_TCIU_IDLE``", "Cycles", "Number of cycles command processor-compute texture cache interface unit interface is idle" - "``CPC_CPC_UTCL2IU_BUSY``", "Cycles", "Number of cycles command processor-compute unified translation cache (L2) interface is busy" - "``CPC_CPC_UTCL2IU_IDLE``", "Cycles", "Number of cycles command processor-compute unified translation cache (L2) interface is idle" - "``CPC_CPC_UTCL2IU_STALL``", "Cycles", "Number of cycles command processor-compute unified translation cache (L2) interface is stalled" - "``CPC_ME1_DC0_SPI_BUSY``", "Cycles", "Number of cycles command processor-compute micro engine processor is busy" - -The micro engine runs packet-processing firmware on the command processor-compute counter. - -.. _graphics-register-bus-manager-counters: - -Graphics register bus manager counters ---------------------------------------------------------------------------------------------------------------- - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``GRBM_COUNT``", "Cycles","Number of free-running GPU cycles" - "``GRBM_GUI_ACTIVE``", "Cycles", "Number of GPU active cycles" - "``GRBM_CP_BUSY``", "Cycles", "Number of cycles any of the command processor blocks are busy" - "``GRBM_SPI_BUSY``", "Cycles", "Number of cycles any of the shader processor input is busy in the shader engines" - "``GRBM_TA_BUSY``", "Cycles", "Number of cycles any of the texture addressing unit is busy in the shader engines" - "``GRBM_TC_BUSY``", "Cycles", "Number of cycles any of the texture cache blocks are busy" - "``GRBM_CPC_BUSY``", "Cycles", "Number of cycles the command processor-compute is busy" - "``GRBM_CPF_BUSY``", "Cycles", "Number of cycles the command processor-fetcher is busy" - "``GRBM_UTCL2_BUSY``", "Cycles", "Number of cycles the unified translation cache (Level 2 [L2]) block is busy" - "``GRBM_EA_BUSY``", "Cycles", "Number of cycles the efficiency arbiter block is busy" - -Texture cache blocks include: - -* Texture cache arbiter -* Texture cache per pipe, also known as vector Level 1 (L1) cache -* Texture cache per channel, also known as known as L2 cache -* Texture cache interface - -.. _spi-counters: - -Shader processor input counters ---------------------------------------------------------------------------------------------------------------- - -.. 
csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SPI_CSN_BUSY``", "Cycles", "Number of cycles with outstanding waves" - "``SPI_CSN_WINDOW_VALID``", "Cycles", "Number of cycles enabled by ``perfcounter_start`` event" - "``SPI_CSN_NUM_THREADGROUPS``", "Workgroups", "Number of dispatched workgroups" - "``SPI_CSN_WAVE``", "Wavefronts", "Number of dispatched wavefronts" - "``SPI_RA_REQ_NO_ALLOC``", "Cycles", "Number of arbiter cycles with requests but no allocation" - "``SPI_RA_REQ_NO_ALLOC_CSN``", "Cycles", "Number of arbiter cycles with compute shader (n\ :sup:`th` pipe) requests but no compute shader (n\ :sup:`th` pipe) allocation" - "``SPI_RA_RES_STALL_CSN``", "Cycles", "Number of arbiter stall cycles due to shortage of compute shader (n\ :sup:`th` pipe) pipeline slots" - "``SPI_RA_TMP_STALL_CSN``", "Cycles", "Number of stall cycles due to shortage of temp space" - "``SPI_RA_WAVE_SIMD_FULL_CSN``", "SIMD-cycles", "Accumulated number of single instruction, multiple data (SIMD) per cycle affected by shortage of wave slots for compute shader (n\ :sup:`th` pipe) wave dispatch" - "``SPI_RA_VGPR_SIMD_FULL_CSN``", "SIMD-cycles", "Accumulated number of SIMDs per cycle affected by shortage of vector general-purpose register (VGPR) slots for compute shader (n\ :sup:`th` pipe) wave dispatch" - "``SPI_RA_SGPR_SIMD_FULL_CSN``", "SIMD-cycles", "Accumulated number of SIMDs per cycle affected by shortage of scalar general-purpose register (SGPR) slots for compute shader (n\ :sup:`th` pipe) wave dispatch" - "``SPI_RA_LDS_CU_FULL_CSN``", "CU", "Number of compute units affected by shortage of local data share (LDS) space for compute shader (n\ :sup:`th` pipe) wave dispatch" - "``SPI_RA_BAR_CU_FULL_CSN``", "CU", "Number of compute units with compute shader (n\ :sup:`th` pipe) waves waiting at a BARRIER" - "``SPI_RA_BULKY_CU_FULL_CSN``", "CU", "Number of compute units with compute shader (n\ :sup:`th` pipe) waves waiting for BULKY resource" - "``SPI_RA_TGLIM_CU_FULL_CSN``", "Cycles", "Number of compute shader (n\ :sup:`th` pipe) wave stall cycles due to restriction of ``tg_limit`` for thread group size" - "``SPI_RA_WVLIM_STALL_CSN``", "Cycles", "Number of cycles compute shader (n\ :sup:`th` pipe) is stalled due to ``WAVE_LIMIT``" - "``SPI_VWC_CSC_WR``", "Qcycles", "Number of quad-cycles taken to initialize VGPRs when launching waves" - "``SPI_SWC_CSC_WR``", "Qcycles", "Number of quad-cycles taken to initialize SGPRs when launching waves" - -.. _compute-unit-counters: - -Compute unit counters ---------------------------------------------------------------------------------------------------------------- - -The compute unit counters are further classified into instruction mix, matrix fused multiply-add (FMA) -operation counters, level counters, wavefront counters, wavefront cycle counters, and LDS counters. - -Instruction mix -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_INSTS``", "Instr", "Number of instructions issued" - "``SQ_INSTS_VALU``", "Instr", "Number of vector arithmetic logic unit (VALU) instructions including matrix FMA issued" - "``SQ_INSTS_VALU_ADD_F16``", "Instr", "Number of VALU half-precision floating-point (F16) ``ADD`` or ``SUB`` instructions issued" - "``SQ_INSTS_VALU_MUL_F16``", "Instr", "Number of VALU F16 Multiply instructions issued" - "``SQ_INSTS_VALU_FMA_F16``", "Instr", "Number of VALU F16 FMA or multiply-add instructions issued" - "``SQ_INSTS_VALU_TRANS_F16``", "Instr", "Number of VALU F16 Transcendental instructions issued" - "``SQ_INSTS_VALU_ADD_F32``", "Instr", "Number of VALU full-precision floating-point (F32) ``ADD`` or ``SUB`` instructions issued" - "``SQ_INSTS_VALU_MUL_F32``", "Instr", "Number of VALU F32 Multiply instructions issued" - "``SQ_INSTS_VALU_FMA_F32``", "Instr", "Number of VALU F32 FMAor multiply-add instructions issued" - "``SQ_INSTS_VALU_TRANS_F32``", "Instr", "Number of VALU F32 Transcendental instructions issued" - "``SQ_INSTS_VALU_ADD_F64``", "Instr", "Number of VALU F64 ``ADD`` or ``SUB`` instructions issued" - "``SQ_INSTS_VALU_MUL_F64``", "Instr", "Number of VALU F64 Multiply instructions issued" - "``SQ_INSTS_VALU_FMA_F64``", "Instr", "Number of VALU F64 FMA or multiply-add instructions issued" - "``SQ_INSTS_VALU_TRANS_F64``", "Instr", "Number of VALU F64 Transcendental instructions issued" - "``SQ_INSTS_VALU_INT32``", "Instr", "Number of VALU 32-bit integer instructions (signed or unsigned) issued" - "``SQ_INSTS_VALU_INT64``", "Instr", "Number of VALU 64-bit integer instructions (signed or unsigned) issued" - "``SQ_INSTS_VALU_CVT``", "Instr", "Number of VALU Conversion instructions issued" - "``SQ_INSTS_VALU_MFMA_I8``", "Instr", "Number of 8-bit Integer matrix FMA instructions issued" - "``SQ_INSTS_VALU_MFMA_F16``", "Instr", "Number of F16 matrix FMA instructions issued" - "``SQ_INSTS_VALU_MFMA_F32``", "Instr", "Number of F32 matrix FMA instructions issued" - "``SQ_INSTS_VALU_MFMA_F64``", "Instr", "Number of F64 matrix FMA instructions issued" - "``SQ_INSTS_MFMA``", "Instr", "Number of matrix FMA instructions issued" - "``SQ_INSTS_VMEM_WR``", "Instr", "Number of vector memory write instructions (including flat) issued" - "``SQ_INSTS_VMEM_RD``", "Instr", "Number of vector memory read instructions (including flat) issued" - "``SQ_INSTS_VMEM``", "Instr", "Number of vector memory instructions issued, including both flat and buffer instructions" - "``SQ_INSTS_SALU``", "Instr", "Number of scalar arithmetic logic unit (SALU) instructions issued" - "``SQ_INSTS_SMEM``", "Instr", "Number of scalar memory instructions issued" - "``SQ_INSTS_SMEM_NORM``", "Instr", "Number of scalar memory instructions normalized to match ``smem_level`` issued" - "``SQ_INSTS_FLAT``", "Instr", "Number of flat instructions issued" - "``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled." 
- "``SQ_INSTS_LDS``", "Instr", "Number of LDS instructions issued **(MI200: includes flat; MI300: does not include flat)**" - "``SQ_INSTS_GDS``", "Instr", "Number of global data share instructions issued" - "``SQ_INSTS_EXP_GDS``", "Instr", "Number of EXP and global data share instructions excluding skipped export instructions issued" - "``SQ_INSTS_BRANCH``", "Instr", "Number of branch instructions issued" - "``SQ_INSTS_SENDMSG``", "Instr", "Number of ``SENDMSG`` instructions including ``s_endpgm`` issued" - "``SQ_INSTS_VSKIPPED``", "Instr", "Number of vector instructions skipped" - -Flat instructions allow read, write, and atomic access to a generic memory address pointer that can -resolve to any of the following physical memories: - -* Global Memory -* Scratch ("private") -* LDS ("shared") -* Invalid - ``MEM_VIOL`` TrapStatus - -Matrix fused multiply-add operation counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_INSTS_VALU_MFMA_MOPS_I8``", "IOP", "Number of 8-bit integer matrix FMA ops in the unit of 512" - "``SQ_INSTS_VALU_MFMA_MOPS_F16``", "FLOP", "Number of F16 floating matrix FMA ops in the unit of 512" - "``SQ_INSTS_VALU_MFMA_MOPS_BF16``", "FLOP", "Number of BF16 floating matrix FMA ops in the unit of 512" - "``SQ_INSTS_VALU_MFMA_MOPS_F32``", "FLOP", "Number of F32 floating matrix FMA ops in the unit of 512" - "``SQ_INSTS_VALU_MFMA_MOPS_F64``", "FLOP", "Number of F64 floating matrix FMA ops in the unit of 512" - -Level counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. note:: - - All level counters must be followed by ``SQ_ACCUM_PREV_HIRES`` counter to measure average latency. - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_ACCUM_PREV``", "Count", "Accumulated counter sample value where accumulation takes place once every four cycles" - "``SQ_ACCUM_PREV_HIRES``", "Count", "Accumulated counter sample value where accumulation takes place once every cycle" - "``SQ_LEVEL_WAVES``", "Waves", "Number of inflight waves" - "``SQ_INST_LEVEL_VMEM``", "Instr", "Number of inflight vector memory (including flat) instructions" - "``SQ_INST_LEVEL_SMEM``", "Instr", "Number of inflight scalar memory instructions" - "``SQ_INST_LEVEL_LDS``", "Instr", "Number of inflight LDS (including flat) instructions" - "``SQ_IFETCH_LEVEL``", "Instr", "Number of inflight instruction fetch requests from the cache" - -Use the following formulae to calculate latencies: - -* Vector memory latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_INSTS_VMEM`` -* Wave latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_WAVE`` -* LDS latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_INSTS_LDS`` -* Scalar memory latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_INSTS_SMEM_NORM`` -* Instruction fetch latency = ``SQ_ACCUM_PREV_HIRES`` divided by ``SQ_IFETCH`` - -Wavefront counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_WAVES``", "Waves", "Number of wavefronts dispatched to sequencers, including both new and restored wavefronts" - "``SQ_WAVES_SAVED``", "Waves", "Number of context-saved waves" - "``SQ_WAVES_RESTORED``", "Waves", "Number of context-restored waves sent to sequencers" - "``SQ_WAVES_EQ_64``", "Waves", "Number of wavefronts with exactly 64 active threads sent to sequencers" - "``SQ_WAVES_LT_64``", "Waves", "Number of wavefronts with less than 64 active threads sent to sequencers" - "``SQ_WAVES_LT_48``", "Waves", "Number of wavefronts with less than 48 active threads sent to sequencers" - "``SQ_WAVES_LT_32``", "Waves", "Number of wavefronts with less than 32 active threads sent to sequencers" - "``SQ_WAVES_LT_16``", "Waves", "Number of wavefronts with less than 16 active threads sent to sequencers" - -Wavefront cycle counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_CYCLES``", "Cycles", "Clock cycles" - "``SQ_BUSY_CYCLES``", "Cycles", "Number of cycles while sequencers reports it to be busy" - "``SQ_BUSY_CU_CYCLES``", "Qcycles", "Number of quad-cycles each compute unit is busy" - "``SQ_VALU_MFMA_BUSY_CYCLES``", "Cycles", "Number of cycles the matrix FMA arithmetic logic unit (ALU) is busy" - "``SQ_WAVE_CYCLES``", "Qcycles", "Number of quad-cycles spent by waves in the compute units" - "``SQ_WAIT_ANY``", "Qcycles", "Number of quad-cycles spent waiting for anything" - "``SQ_WAIT_INST_ANY``", "Qcycles", "Number of quad-cycles spent waiting for any instruction to be issued" - "``SQ_ACTIVE_INST_ANY``", "Qcycles", "Number of quad-cycles spent by each wave to work on an instruction" - "``SQ_ACTIVE_INST_VMEM``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a vector memory instruction" - "``SQ_ACTIVE_INST_LDS``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on an LDS instruction" - "``SQ_ACTIVE_INST_VALU``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a VALU instruction" - "``SQ_ACTIVE_INST_SCA``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a SALU or scalar memory instruction" - "``SQ_ACTIVE_INST_EXP_GDS``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on an ``EXPORT`` or ``GDS`` instruction" - "``SQ_ACTIVE_INST_MISC``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a ``BRANCH`` or ``SENDMSG`` instruction" - "``SQ_ACTIVE_INST_FLAT``", "Qcycles", "Number of quad-cycles spent by the sequencer instruction arbiter to work on a flat instruction" - "``SQ_INST_CYCLES_VMEM_WR``", "Qcycles", "Number of quad-cycles spent to send addr and cmd data for vector memory write instructions" - "``SQ_INST_CYCLES_VMEM_RD``", "Qcycles", "Number of quad-cycles spent to send addr and cmd data for vector memory read instructions" - "``SQ_INST_CYCLES_SMEM``", "Qcycles", "Number of quad-cycles spent to execute scalar memory reads" - "``SQ_INST_CYCLES_SALU``", "Qcycles", "Number of quad-cycles spent to execute non-memory read scalar operations" - "``SQ_THREAD_CYCLES_VALU``", "Qcycles", "Number of quad-cycles spent to execute VALU operations on active threads" - "``SQ_WAIT_INST_LDS``", "Qcycles", "Number of quad-cycles spent waiting for LDS instruction to be issued" - -``SQ_THREAD_CYCLES_VALU`` is similar to 
``INST_CYCLES_VALU``, but it's multiplied by the number of -active threads. - -LDS counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_LDS_ATOMIC_RETURN``", "Cycles", "Number of atomic return cycles in LDS" - "``SQ_LDS_BANK_CONFLICT``", "Cycles", "Number of cycles LDS is stalled by bank conflicts" - "``SQ_LDS_ADDR_CONFLICT``", "Cycles", "Number of cycles LDS is stalled by address conflicts" - "``SQ_LDS_UNALIGNED_STALL``", "Cycles", "Number of cycles LDS is stalled processing flat unaligned load or store operations" - "``SQ_LDS_MEM_VIOLATIONS``", "Count", "Number of threads that have a memory violation in the LDS" - "``SQ_LDS_IDX_ACTIVE``", "Cycles", "Number of cycles LDS is used for indexed operations" - -Miscellaneous counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQ_IFETCH``", "Count", "Number of instruction fetch requests from L1i, in 32-byte width" - "``SQ_ITEMS``", "Threads", "Number of valid items per wave" - -.. _l1i-and-sl1d-cache-counters: - -L1 instruction cache (L1i) and scalar L1 data cache (L1d) counters ---------------------------------------------------------------------------------------------------------------- - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition" - - "``SQC_ICACHE_REQ``", "Req", "Number of L1 instruction (L1i) cache requests" - "``SQC_ICACHE_HITS``", "Count", "Number of L1i cache hits" - "``SQC_ICACHE_MISSES``", "Count", "Number of non-duplicate L1i cache misses including uncached requests" - "``SQC_ICACHE_MISSES_DUPLICATE``", "Count", "Number of duplicate L1i cache misses whose previous lookup miss on the same cache line is not fulfilled yet" - "``SQC_DCACHE_REQ``", "Req", "Number of scalar L1d requests" - "``SQC_DCACHE_INPUT_VALID_READYB``", "Cycles", "Number of cycles while sequencer input is valid but scalar L1d is not ready" - "``SQC_DCACHE_HITS``", "Count", "Number of scalar L1d hits" - "``SQC_DCACHE_MISSES``", "Count", "Number of non-duplicate scalar L1d misses including uncached requests" - "``SQC_DCACHE_MISSES_DUPLICATE``", "Count", "Number of duplicate scalar L1d misses" - "``SQC_DCACHE_REQ_READ_1``", "Req", "Number of constant cache read requests in a single 32-bit data word" - "``SQC_DCACHE_REQ_READ_2``", "Req", "Number of constant cache read requests in two 32-bit data words" - "``SQC_DCACHE_REQ_READ_4``", "Req", "Number of constant cache read requests in four 32-bit data words" - "``SQC_DCACHE_REQ_READ_8``", "Req", "Number of constant cache read requests in eight 32-bit data words" - "``SQC_DCACHE_REQ_READ_16``", "Req", "Number of constant cache read requests in 16 32-bit data words" - "``SQC_DCACHE_ATOMIC``", "Req", "Number of atomic requests" - "``SQC_TC_REQ``", "Req", "Number of texture cache requests that were issued by instruction and constant caches" - "``SQC_TC_INST_REQ``", "Req", "Number of instruction requests to the L2 cache" - "``SQC_TC_DATA_READ_REQ``", "Req", "Number of data Read requests to the L2 cache" - "``SQC_TC_DATA_WRITE_REQ``", "Req", "Number of data write requests to the L2 cache" - "``SQC_TC_DATA_ATOMIC_REQ``", "Req", "Number of data atomic requests to the L2 cache" - "``SQC_TC_STALL``", "Cycles", "Number of cycles while the valid requests to the L2 cache are stalled" - -.. 
_vector-l1-cache-subsystem-counters: - -Vector L1 cache subsystem counters ---------------------------------------------------------------------------------------------------------------- - -The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data -unit, vector L1d or texture cache per pipe, and texture cache arbiter counters. - -Texture addressing unit counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``" - - "``TA_TA_BUSY[n]``", "Cycles", "Texture addressing unit busy cycles", "0-15" - "``TA_TOTAL_WAVEFRONTS[n]``", "Instr", "Number of wavefronts processed by texture addressing unit", "0-15" - "``TA_BUFFER_WAVEFRONTS[n]``", "Instr", "Number of buffer wavefronts processed by texture addressing unit", "0-15" - "``TA_BUFFER_READ_WAVEFRONTS[n]``", "Instr", "Number of buffer read wavefronts processed by texture addressing unit", "0-15" - "``TA_BUFFER_WRITE_WAVEFRONTS[n]``", "Instr", "Number of buffer write wavefronts processed by texture addressing unit", "0-15" - "``TA_BUFFER_ATOMIC_WAVEFRONTS[n]``", "Instr", "Number of buffer atomic wavefronts processed by texture addressing unit", "0-15" - "``TA_BUFFER_TOTAL_CYCLES[n]``", "Cycles", "Number of buffer cycles (including read and write) issued to texture cache", "0-15" - "``TA_BUFFER_COALESCED_READ_CYCLES[n]``", "Cycles", "Number of coalesced buffer read cycles issued to texture cache", "0-15" - "``TA_BUFFER_COALESCED_WRITE_CYCLES[n]``", "Cycles", "Number of coalesced buffer write cycles issued to texture cache", "0-15" - "``TA_ADDR_STALLED_BY_TC_CYCLES[n]``", "Cycles", "Number of cycles texture addressing unit address path is stalled by texture cache", "0-15" - "``TA_DATA_STALLED_BY_TC_CYCLES[n]``", "Cycles", "Number of cycles texture addressing unit data path is stalled by texture cache", "0-15" - "``TA_ADDR_STALLED_BY_TD_CYCLES[n]``", "Cycles", "Number of cycles texture addressing unit address path is stalled by texture data unit", "0-15" - "``TA_FLAT_WAVEFRONTS[n]``", "Instr", "Number of flat opcode wavefronts processed by texture addressing unit", "0-15" - "``TA_FLAT_READ_WAVEFRONTS[n]``", "Instr", "Number of flat opcode read wavefronts processed by texture addressing unit", "0-15" - "``TA_FLAT_WRITE_WAVEFRONTS[n]``", "Instr", "Number of flat opcode write wavefronts processed by texture addressing unit", "0-15" - "``TA_FLAT_ATOMIC_WAVEFRONTS[n]``", "Instr", "Number of flat opcode atomic wavefronts processed by texture addressing unit", "0-15" - -Texture data unit counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
csv-table:: - :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``" - - "``TD_TD_BUSY[n]``", "Cycle", "Texture data unit busy cycles while it is processing or waiting for data", "0-15" - "``TD_TC_STALL[n]``", "Cycle", "Number of cycles texture data unit is stalled waiting for texture cache data", "0-15" - "``TD_SPI_STALL[n]``", "Cycle", "Number of cycles texture data unit is stalled by shader processor input", "0-15" - "``TD_LOAD_WAVEFRONT[n]``", "Instr", "Number of wavefront instructions (read, write, atomic)", "0-15" - "``TD_STORE_WAVEFRONT[n]``", "Instr", "Number of write wavefront instructions", "0-15" - "``TD_ATOMIC_WAVEFRONT[n]``", "Instr", "Number of atomic wavefront instructions", "0-15" - "``TD_COALESCABLE_WAVEFRONT[n]``", "Instr", "Number of coalescable wavefronts according to texture addressing unit", "0-15" - -Texture cache per pipe counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``" - - "``TCP_GATE_EN1[n]``", "Cycles", "Number of cycles vector L1d interface clocks are turned on", "0-15" - "``TCP_GATE_EN2[n]``", "Cycles", "Number of cycles vector L1d core clocks are turned on", "0-15" - "``TCP_TD_TCP_STALL_CYCLES[n]``", "Cycles", "Number of cycles texture data unit stalls vector L1d", "0-15" - "``TCP_TCR_TCP_STALL_CYCLES[n]``", "Cycles", "Number of cycles texture cache router stalls vector L1d", "0-15" - "``TCP_READ_TAGCONFLICT_STALL_CYCLES[n]``", "Cycles", "Number of cycles tag RAM conflict stalls on a read", "0-15" - "``TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]``", "Cycles", "Number of cycles tag RAM conflict stalls on a write", "0-15" - "``TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]``", "Cycles", "Number of cycles tag RAM conflict stalls on an atomic", "0-15" - "``TCP_PENDING_STALL_CYCLES[n]``", "Cycles", "Number of cycles vector L1d is stalled due to data pending from L2 Cache", "0-15" - "``TCP_TCP_TA_DATA_STALL_CYCLES``", "Cycles", "Number of cycles texture cache per pipe stalls texture addressing unit data interface", "NA" - "``TCP_TA_TCP_STATE_READ[n]``", "Req", "Number of state reads", "0-15" - "``TCP_VOLATILE[n]``", "Req", "Number of L1 volatile pixels or buffers from texture addressing unit", "0-15" - "``TCP_TOTAL_ACCESSES[n]``", "Req", "Number of vector L1d accesses. 
Equals ``TCP_PERF_SEL_TOTAL_READ`+`TCP_PERF_SEL_TOTAL_NONREAD``", "0-15" - "``TCP_TOTAL_READ[n]``", "Req", "Number of vector L1d read accesses", "0-15" - "``TCP_TOTAL_WRITE[n]``", "Req", "Number of vector L1d write accesses", "0-15" - "``TCP_TOTAL_ATOMIC_WITH_RET[n]``", "Req", "Number of vector L1d atomic requests with return", "0-15" - "``TCP_TOTAL_ATOMIC_WITHOUT_RET[n]``", "Req", "Number of vector L1d atomic without return", "0-15" - "``TCP_TOTAL_WRITEBACK_INVALIDATES[n]``", "Count", "Total number of vector L1d writebacks and invalidates", "0-15" - "``TCP_UTCL1_REQUEST[n]``", "Req", "Number of address translation requests to unified translation cache (L1)", "0-15" - "``TCP_UTCL1_TRANSLATION_HIT[n]``", "Req", "Number of unified translation cache (L1) translation hits", "0-15" - "``TCP_UTCL1_TRANSLATION_MISS[n]``", "Req", "Number of unified translation cache (L1) translation misses", "0-15" - "``TCP_UTCL1_PERMISSION_MISS[n]``", "Req", "Number of unified translation cache (L1) permission misses", "0-15" - "``TCP_TOTAL_CACHE_ACCESSES[n]``", "Req", "Number of vector L1d cache accesses including hits and misses", "0-15" - "``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15" - "``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15" - "``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15" - "``TCP_TCC_READ_REQ[n]``", "Req", "Number of read requests to L2 cache", "0-15" - "``TCP_TCC_WRITE_REQ[n]``", "Req", "Number of write requests to L2 cache", "0-15" - "``TCP_TCC_ATOMIC_WITH_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache with return", "0-15" - "``TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache without return", "0-15" - "``TCP_TCC_NC_READ_REQ[n]``", "Req", "Number of non-coherently cached read requests to L2 cache", "0-15" - "``TCP_TCC_UC_READ_REQ[n]``", "Req", "Number of uncached read requests to L2 cache", "0-15" - "``TCP_TCC_CC_READ_REQ[n]``", "Req", "Number of coherently cached read requests to L2 cache", "0-15" - "``TCP_TCC_RW_READ_REQ[n]``", "Req", "Number of coherently cached with write read requests to L2 cache", "0-15" - "``TCP_TCC_NC_WRITE_REQ[n]``", "Req", "Number of non-coherently cached write requests to L2 cache", "0-15" - "``TCP_TCC_UC_WRITE_REQ[n]``", "Req", "Number of uncached write requests to L2 cache", "0-15" - "``TCP_TCC_CC_WRITE_REQ[n]``", "Req", "Number of coherently cached write requests to L2 cache", "0-15" - "``TCP_TCC_RW_WRITE_REQ[n]``", "Req", "Number of coherently cached with write write requests to L2 cache", "0-15" - "``TCP_TCC_NC_ATOMIC_REQ[n]``", "Req", "Number of non-coherently cached atomic requests to L2 cache", "0-15" - "``TCP_TCC_UC_ATOMIC_REQ[n]``", "Req", "Number of uncached atomic requests to L2 cache", "0-15" - "``TCP_TCC_CC_ATOMIC_REQ[n]``", "Req", "Number of coherently cached atomic requests to L2 cache", "0-15" - "``TCP_TCC_RW_ATOMIC_REQ[n]``", "Req", "Number of coherently cached with write atomic requests to L2 cache", "0-15" - -Note that: - -* ``TCP_TOTAL_READ[n]`` = ``TCP_PERF_SEL_TOTAL_HIT_LRU_READ`` + ``TCP_PERF_SEL_TOTAL_MISS_LRU_READ`` + ``TCP_PERF_SEL_TOTAL_MISS_EVICT_READ`` -* ``TCP_TOTAL_WRITE[n]`` = ``TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE``+ ``TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE`` -* 
``TCP_TOTAL_WRITEBACK_INVALIDATES[n]`` = ``TCP_PERF_SEL_TOTAL_WBINVL1``+ ``TCP_PERF_SEL_TOTAL_WBINVL1_VOL``+ ``TCP_PERF_SEL_CP_TCP_INVALIDATE``+ ``TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL`` - -Texture cache arbiter counters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``" - - "``TCA_CYCLE[n]``", "Cycles", "Number of texture cache arbiter cycles", "0-31" - "``TCA_BUSY[n]``", "Cycles", "Number of cycles texture cache arbiter has a pending request", "0-31" - -.. _l2-cache-access-counters: - -L2 cache access counters ---------------------------------------------------------------------------------------------------------------- - -L2 cache is also known as texture cache per channel. - -.. tab-set:: - - .. tab-item:: MI300 hardware counter - - .. csv-table:: - :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``" - - "``TCC_CYCLE[n]``", "Cycles", "Number of L2 cache free-running clocks", "0-31" - "``TCC_BUSY[n]``", "Cycles", "Number of L2 cache busy cycles", "0-31" - "``TCC_REQ[n]``", "Req", "Number of L2 cache requests of all types (measured at the tag block)", "0-31" - "``TCC_STREAMING_REQ[n]``", "Req", "Number of L2 cache streaming requests (measured at the tag block)", "0-31" - "``TCC_NC_REQ[n]``", "Req", "Number of non-coherently cached requests (measured at the tag block)", "0-31" - "``TCC_UC_REQ[n]``", "Req", "Number of uncached requests. This is measured at the tag block", "0-31" - "``TCC_CC_REQ[n]``", "Req", "Number of coherently cached requests. This is measured at the tag block", "0-31" - "``TCC_RW_REQ[n]``", "Req", "Number of coherently cached with write requests. This is measured at the tag block", "0-31" - "``TCC_PROBE[n]``", "Req", "Number of probe requests", "0-31" - "``TCC_PROBE_ALL[n]``", "Req", "Number of external probe requests with ``EA_TCC_preq_all == 1``", "0-31" - "``TCC_READ[n]``", "Req", "Number of L2 cache read requests (includes compressed reads but not metadata reads)", "0-31" - "``TCC_WRITE[n]``", "Req", "Number of L2 cache write requests", "0-31" - "``TCC_ATOMIC[n]``", "Req", "Number of L2 cache atomic requests of all types", "0-31" - "``TCC_HIT[n]``", "Req", "Number of L2 cache hits", "0-31" - "``TCC_MISS[n]``", "Req", "Number of L2 cache misses", "0-31" - "``TCC_WRITEBACK[n]``", "Req", "Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests", "0-31" - "``TCC_EA0_WRREQ[n]``", "Req", "Number of 32-byte and 64-byte transactions going over the ``TC_EA_wrreq`` interface (doesn't include probe commands)", "0-31" - "``TCC_EA0_WRREQ_64B[n]``", "Req", "Total number of 64-byte transactions (write or ``CMPSWAP``) going over the ``TC_EA_wrreq`` interface", "0-31" - "``TCC_EA0_WR_UNCACHED_32B[n]``", "Req", "Number of 32 or 64-byte write or atomic going over the ``TC_EA_wrreq`` interface due to uncached traffic", "0-31" - "``TCC_EA0_WRREQ_STALL[n]``", "Cycles", "Number of cycles a write request is stalled", "0-31" - "``TCC_EA0_WRREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits", "0-31" - "``TCC_EA0_WRREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits", "0-31" - "``TCC_EA0_WRREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is 
stalled due to the interface running out of DRAM credits", "0-31" - "``TCC_TOO_MANY_EA_WRREQS_STALL[n]``", "Cycles", "Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests", "0-31" - "``TCC_EA0_WRREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter write requests in flight", "0-31" - "``TCC_EA0_ATOMIC[n]``", "Req", "Number of 32-byte or 64-byte atomic requests going over the ``TC_EA_wrreq`` interface", "0-31" - "``TCC_EA0_ATOMIC_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter atomic requests in flight", "0-31" - "``TCC_EA0_RDREQ[n]``", "Req", "Number of 32-byte or 64-byte read requests to efficiency arbiter", "0-31" - "``TCC_EA0_RDREQ_32B[n]``", "Req", "Number of 32-byte read requests to efficiency arbiter", "0-31" - "``TCC_EA0_RD_UNCACHED_32B[n]``", "Req", "Number of 32-byte efficiency arbiter reads due to uncached traffic. A 64-byte request is counted as 2", "0-31" - "``TCC_EA0_RDREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of IO credits", "0-31" - "``TCC_EA0_RDREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of GMI credits", "0-31" - "``TCC_EA0_RDREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of DRAM credits", "0-31" - "``TCC_EA0_RDREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter read requests in flight", "0-31" - "``TCC_EA0_RDREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM)", "0-31" - "``TCC_EA0_WRREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter write requests to HBM", "0-31" - "``TCC_TAG_STALL[n]``", "Cycles", "Number of cycles the normal request pipeline in the tag is stalled for any reason", "0-31" - "``TCC_NORMAL_WRITEBACK[n]``", "Req", "Number of writebacks due to requests that are not writeback requests", "0-31" - "``TCC_ALL_TC_OP_WB_WRITEBACK[n]``", "Req", "Number of writebacks due to all ``TC_OP`` writeback requests", "0-31" - "``TCC_NORMAL_EVICT[n]``", "Req", "Number of evictions due to requests that are not invalidate or probe requests", "0-31" - "``TCC_ALL_TC_OP_INV_EVICT[n]``", "Req", "Number of evictions due to all ``TC_OP`` invalidate requests", "0-31" - - .. tab-item:: MI200 hardware counter - - .. csv-table:: - :header: "Hardware counter", "Unit", "Definition", "Value range for ``n``" - - "``TCC_CYCLE[n]``", "Cycles", "Number of L2 cache free-running clocks", "0-31" - "``TCC_BUSY[n]``", "Cycles", "Number of L2 cache busy cycles", "0-31" - "``TCC_REQ[n]``", "Req", "Number of L2 cache requests of all types (measured at the tag block)", "0-31" - "``TCC_STREAMING_REQ[n]``", "Req", "Number of L2 cache streaming requests (measured at the tag block)", "0-31" - "``TCC_NC_REQ[n]``", "Req", "Number of non-coherently cached requests (measured at the tag block)", "0-31" - "``TCC_UC_REQ[n]``", "Req", "Number of uncached requests. This is measured at the tag block", "0-31" - "``TCC_CC_REQ[n]``", "Req", "Number of coherently cached requests. This is measured at the tag block", "0-31" - "``TCC_RW_REQ[n]``", "Req", "Number of coherently cached with write requests. 
This is measured at the tag block", "0-31" - "``TCC_PROBE[n]``", "Req", "Number of probe requests", "0-31" - "``TCC_PROBE_ALL[n]``", "Req", "Number of external probe requests with ``EA_TCC_preq_all == 1``", "0-31" - "``TCC_READ[n]``", "Req", "Number of L2 cache read requests (includes compressed reads but not metadata reads)", "0-31" - "``TCC_WRITE[n]``", "Req", "Number of L2 cache write requests", "0-31" - "``TCC_ATOMIC[n]``", "Req", "Number of L2 cache atomic requests of all types", "0-31" - "``TCC_HIT[n]``", "Req", "Number of L2 cache hits", "0-31" - "``TCC_MISS[n]``", "Req", "Number of L2 cache misses", "0-31" - "``TCC_WRITEBACK[n]``", "Req", "Number of lines written back to the main memory, including writebacks of dirty lines and uncached write or atomic requests", "0-31" - "``TCC_EA_WRREQ[n]``", "Req", "Number of 32-byte and 64-byte transactions going over the ``TC_EA_wrreq`` interface (doesn't include probe commands)", "0-31" - "``TCC_EA_WRREQ_64B[n]``", "Req", "Total number of 64-byte transactions (write or ``CMPSWAP``) going over the ``TC_EA_wrreq`` interface", "0-31" - "``TCC_EA_WR_UNCACHED_32B[n]``", "Req", "Number of 32 write or atomic going over the ``TC_EA_wrreq`` interface due to uncached traffic. A 64-byte request will be counted as 2", "0-31" - "``TCC_EA_WRREQ_STALL[n]``", "Cycles", "Number of cycles a write request is stalled", "0-31" - "``TCC_EA_WRREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of input-output (IO) credits", "0-31" - "``TCC_EA_WRREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits", "0-31" - "``TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits", "0-31" - "``TCC_TOO_MANY_EA_WRREQS_STALL[n]``", "Cycles", "Number of cycles the L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests", "0-31" - "``TCC_EA_WRREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter write requests in flight", "0-31" - "``TCC_EA_ATOMIC[n]``", "Req", "Number of 32-byte or 64-byte atomic requests going over the ``TC_EA_wrreq`` interface", "0-31" - "``TCC_EA_ATOMIC_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter atomic requests in flight", "0-31" - "``TCC_EA_RDREQ[n]``", "Req", "Number of 32-byte or 64-byte read requests to efficiency arbiter", "0-31" - "``TCC_EA_RDREQ_32B[n]``", "Req", "Number of 32-byte read requests to efficiency arbiter", "0-31" - "``TCC_EA_RD_UNCACHED_32B[n]``", "Req", "Number of 32-byte efficiency arbiter reads due to uncached traffic. 
A 64-byte request is counted as 2", "0-31" - "``TCC_EA_RDREQ_IO_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of IO credits", "0-31" - "``TCC_EA_RDREQ_GMI_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of GMI credits", "0-31" - "``TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]``", "Cycles", "Number of cycles there is a stall due to the read request interface running out of DRAM credits", "0-31" - "``TCC_EA_RDREQ_LEVEL[n]``", "Req", "The accumulated number of efficiency arbiter read requests in flight", "0-31" - "``TCC_EA_RDREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter read requests to High Bandwidth Memory (HBM)", "0-31" - "``TCC_EA_WRREQ_DRAM[n]``", "Req", "Number of 32-byte or 64-byte efficiency arbiter write requests to HBM", "0-31" - "``TCC_TAG_STALL[n]``", "Cycles", "Number of cycles the normal request pipeline in the tag is stalled for any reason", "0-31" - "``TCC_NORMAL_WRITEBACK[n]``", "Req", "Number of writebacks due to requests that are not writeback requests", "0-31" - "``TCC_ALL_TC_OP_WB_WRITEBACK[n]``", "Req", "Number of writebacks due to all ``TC_OP`` writeback requests", "0-31" - "``TCC_NORMAL_EVICT[n]``", "Req", "Number of evictions due to requests that are not invalidate or probe requests", "0-31" - "``TCC_ALL_TC_OP_INV_EVICT[n]``", "Req", "Number of evictions due to all ``TC_OP`` invalidate requests", "0-31" - -Note the following: - -* ``TCC_REQ[n]`` may be higher than the number of requests arriving at the texture cache per channel, - but it's a good indication of the total amount of work that needs to be performed. -* For ``TCC_EA0_WRREQ[n]``, atomics may travel over the same interface and are generally classified as - write requests. -* Coherently cached (CC) mtypes can produce uncached requests, and those are included in - ``TCC_EA0_WR_UNCACHED_32B[n]``. -* ``TCC_EA0_WRREQ_LEVEL[n]`` is primarily intended to measure average efficiency arbiter write latency. - - * Average write latency = ``TCC_PERF_SEL_EA0_WRREQ_LEVEL`` divided by ``TCC_PERF_SEL_EA0_WRREQ`` - -* ``TCC_EA0_ATOMIC_LEVEL[n]`` is primarily intended to measure average efficiency arbiter atomic - latency. - - * Average atomic latency = ``TCC_PERF_SEL_EA0_WRREQ_ATOMIC_LEVEL`` divided by ``TCC_PERF_SEL_EA0_WRREQ_ATOMIC`` - -* ``TCC_EA0_RDREQ_LEVEL[n]`` is primarily intended to measure average efficiency arbiter read latency. - - * Average read latency = ``TCC_PERF_SEL_EA0_RDREQ_LEVEL`` divided by ``TCC_PERF_SEL_EA0_RDREQ`` - -* Stalls can occur regardless of whether a read needs to be performed. -* Normally, stalls are measured at exactly one point in the pipeline. However, in the case of - ``TCC_TAG_STALL[n]``, probes can stall the pipeline at a variety of places, so there is no single point that - can accurately measure the total stalls. - -MI300 and MI200 series derived metrics list -============================================================== - -.. 
csv-table:: - :header: "Hardware counter", "Definition" - - "``ALUStalledByLDS``", "Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready (value range: 0% (optimal) to 100%)" - "``FetchSize``", "Total kilobytes fetched from the video memory; measured with all extra fetches and any cache or memory effects taken into account" - "``FlatLDSInsts``", "Average number of flat instructions that read from or write to LDS, run per work item (affected by flow control)" - "``FlatVMemInsts``", "Average number of flat instructions that read from or write to the video memory, run per work item (affected by flow control). Includes flat instructions that read from or write to scratch" - "``GDSInsts``", "Average number of global data share read or write instructions run per work item (affected by flow control)" - "``GPUBusy``", "Percentage of time GPU is busy" - "``L2CacheHit``", "Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache (value range: 0% (no hit) to 100% (optimal))" - "``LDSBankConflict``", "Percentage of GPU time LDS is stalled by bank conflicts (value range: 0% (optimal) to 100%)" - "``LDSInsts``", "Average number of LDS read or write instructions run per work item (affected by flow control). Excludes flat instructions that read from or write to LDS." - "``MemUnitBusy``", "Percentage of GPU time the memory unit is active, which is measured with all extra fetches and writes and any cache or memory effects taken into account (value range: 0% to 100% (fetch-bound))" - "``MemUnitStalled``", "Percentage of GPU time the memory unit is stalled (value range: 0% (optimal) to 100%)" - "``MemWrites32B``", "Total number of effective 32B write transactions to the memory" - "``TCA_BUSY_sum``", "Total number of cycles texture cache arbiter has a pending request, over all texture cache arbiter instances" - "``TCA_CYCLE_sum``", "Total number of cycles over all texture cache arbiter instances" - "``SALUBusy``", "Percentage of GPU time scalar ALU instructions are processed (value range: 0% to 100% (optimal))" - "``SALUInsts``", "Average number of scalar ALU instructions run per work item (affected by flow control)" - "``SFetchInsts``", "Average number of scalar fetch instructions from the video memory run per work item (affected by flow control)" - "``VALUBusy``", "Percentage of GPU time vector ALU instructions are processed (value range: 0% to 100% (optimal))" - "``VALUInsts``", "Average number of vector ALU instructions run per work item (affected by flow control)" - "``VALUUtilization``", "Percentage of active vector ALU threads in a wave, where a lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64 (value range: 0%, 100% (optimal - no thread divergence))" - "``VFetchInsts``", "Average number of vector fetch instructions from the video memory run per work-item (affected by flow control); excludes flat instructions that fetch from video memory" - "``VWriteInsts``", "Average number of vector write instructions to the video memory run per work-item (affected by flow control); excludes flat instructions that write to video memory" - "``Wavefronts``", "Total wavefronts" - "``WRITE_REQ_32B``", "Total number of 32-byte effective memory writes" - "``WriteSize``", "Total kilobytes written to the video memory; measured with all extra fetches and any cache or memory effects taken into account" - "``WriteUnitStalled``", "Percentage of GPU time the write unit is 
stalled (value range: 0% (optimal) to 100%)" - -You can lower ``ALUStalledByLDS`` by reducing LDS bank conflicts or number of LDS accesses. -You can lower ``MemUnitStalled`` by reducing the number or size of fetches and writes. -``MemUnitBusy`` includes the stall time (``MemUnitStalled``). - -Hardware counters by and over all texture addressing unit instances ---------------------------------------------------------------------------------------------------------------- - -The following table shows the hardware counters *by* all texture addressing unit instances. - -.. csv-table:: - :header: "Hardware counter", "Definition" - - "``TA_BUFFER_WAVEFRONTS_sum``", "Total number of buffer wavefronts processed" - "``TA_BUFFER_READ_WAVEFRONTS_sum``", "Total number of buffer read wavefronts processed" - "``TA_BUFFER_WRITE_WAVEFRONTS_sum``", "Total number of buffer write wavefronts processed" - "``TA_BUFFER_ATOMIC_WAVEFRONTS_sum``", "Total number of buffer atomic wavefronts processed" - "``TA_BUFFER_TOTAL_CYCLES_sum``", "Total number of buffer cycles (including read and write) issued to texture cache" - "``TA_BUFFER_COALESCED_READ_CYCLES_sum``", "Total number of coalesced buffer read cycles issued to texture cache" - "``TA_BUFFER_COALESCED_WRITE_CYCLES_sum``", "Total number of coalesced buffer write cycles issued to texture cache" - "``TA_FLAT_READ_WAVEFRONTS_sum``", "Sum of flat opcode reads processed" - "``TA_FLAT_WRITE_WAVEFRONTS_sum``", "Sum of flat opcode writes processed" - "``TA_FLAT_WAVEFRONTS_sum``", "Total number of flat opcode wavefronts processed" - "``TA_FLAT_ATOMIC_WAVEFRONTS_sum``", "Total number of flat opcode atomic wavefronts processed" - "``TA_TOTAL_WAVEFRONTS_sum``", "Total number of wavefronts processed" - -The following table shows the hardware counters *over* all texture addressing unit instances. - -.. csv-table:: - :header: "Hardware counter", "Definition" - - "``TA_ADDR_STALLED_BY_TC_CYCLES_sum``", "Total number of cycles texture addressing unit address path is stalled by texture cache" - "``TA_ADDR_STALLED_BY_TD_CYCLES_sum``", "Total number of cycles texture addressing unit address path is stalled by texture data unit" - "``TA_BUSY_avr``", "Average number of busy cycles" - "``TA_BUSY_max``", "Maximum number of texture addressing unit busy cycles" - "``TA_BUSY_min``", "Minimum number of texture addressing unit busy cycles" - "``TA_DATA_STALLED_BY_TC_CYCLES_sum``", "Total number of cycles texture addressing unit data path is stalled by texture cache" - "``TA_TA_BUSY_sum``", "Total number of texture addressing unit busy cycles" - -Hardware counters over all texture cache per channel instances ---------------------------------------------------------------------------------------------------------------- - -.. csv-table:: - :header: "Hardware counter", "Definition" - - "``TCC_ALL_TC_OP_WB_WRITEBACK_sum``", "Total number of writebacks due to all ``TC_OP`` writeback requests." - "``TCC_ALL_TC_OP_INV_EVICT_sum``", "Total number of evictions due to all ``TC_OP`` invalidate requests." - "``TCC_ATOMIC_sum``", "Total number of L2 cache atomic requests of all types." - "``TCC_BUSY_avr``", "Average number of L2 cache busy cycles." - "``TCC_BUSY_sum``", "Total number of L2 cache busy cycles." - "``TCC_CC_REQ_sum``", "Total number of coherently cached requests." - "``TCC_CYCLE_sum``", "Total number of L2 cache free running clocks." - "``TCC_EA0_WRREQ_sum``", "Total number of 32-byte and 64-byte transactions going over the ``TC_EA0_wrreq`` interface. 
Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands." - "``TCC_EA0_WRREQ_64B_sum``", "Total number of 64-byte transactions (write or `CMPSWAP`) going over the ``TC_EA0_wrreq`` interface." - "``TCC_EA0_WR_UNCACHED_32B_sum``", "Total Number of 32-byte write or atomic going over the ``TC_EA0_wrreq`` interface due to uncached traffic. Note that coherently cached mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2." - "``TCC_EA0_WRREQ_STALL_sum``", "Total Number of cycles a write request is stalled, over all instances." - "``TCC_EA0_WRREQ_IO_CREDIT_STALL_sum``", "Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of IO credits, over all instances." - "``TCC_EA0_WRREQ_GMI_CREDIT_STALL_sum``", "Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of GMI credits, over all instances." - "``TCC_EA0_WRREQ_DRAM_CREDIT_STALL_sum``", "Total number of cycles an efficiency arbiter write request is stalled due to the interface running out of DRAM credits, over all instances." - "``TCC_EA0_WRREQ_LEVEL_sum``", "Total number of efficiency arbiter write requests in flight." - "``TCC_EA0_RDREQ_LEVEL_sum``", "Total number of efficiency arbiter read requests in flight." - "``TCC_EA0_ATOMIC_sum``", "Total Number of 32-byte or 64-byte atomic requests going over the ``TC_EA0_wrreq`` interface." - "``TCC_EA0_ATOMIC_LEVEL_sum``", "Total number of efficiency arbiter atomic requests in flight." - "``TCC_EA0_RDREQ_sum``", "Total number of 32-byte or 64-byte read requests to efficiency arbiter." - "``TCC_EA0_RDREQ_32B_sum``", "Total number of 32-byte read requests to efficiency arbiter." - "``TCC_EA0_RD_UNCACHED_32B_sum``", "Total number of 32-byte efficiency arbiter reads due to uncached traffic." - "``TCC_EA0_RDREQ_IO_CREDIT_STALL_sum``", "Total number of cycles there is a stall due to the read request interface running out of IO credits." - "``TCC_EA0_RDREQ_GMI_CREDIT_STALL_sum``", "Total number of cycles there is a stall due to the read request interface running out of GMI credits." - "``TCC_EA0_RDREQ_DRAM_CREDIT_STALL_sum``", "Total number of cycles there is a stall due to the read request interface running out of DRAM credits." - "``TCC_EA0_RDREQ_DRAM_sum``", "Total number of 32-byte or 64-byte efficiency arbiter read requests to HBM." - "``TCC_EA0_WRREQ_DRAM_sum``", "Total number of 32-byte or 64-byte efficiency arbiter write requests to HBM." - "``TCC_HIT_sum``", "Total number of L2 cache hits." - "``TCC_MISS_sum``", "Total number of L2 cache misses." - "``TCC_NC_REQ_sum``", "Total number of non-coherently cached requests." - "``TCC_NORMAL_WRITEBACK_sum``", "Total number of writebacks due to requests that are not writeback requests." - "``TCC_NORMAL_EVICT_sum``", "Total number of evictions due to requests that are not invalidate or probe requests." - "``TCC_PROBE_sum``", "Total number of probe requests." - "``TCC_PROBE_ALL_sum``", "Total number of external probe requests with ``EA0_TCC_preq_all == 1``." - "``TCC_READ_sum``", "Total number of L2 cache read requests (including compressed reads but not metadata reads)." - "``TCC_REQ_sum``", "Total number of all types of L2 cache requests." - "``TCC_RW_REQ_sum``", "Total number of coherently cached with write requests." - "``TCC_STREAMING_REQ_sum``", "Total number of L2 cache streaming requests." 
- "``TCC_TAG_STALL_sum``", "Total number of cycles the normal request pipeline in the tag is stalled for any reason." - "``TCC_TOO_MANY_EA0_WRREQS_STALL_sum``", "Total number of cycles L2 cache is unable to send an efficiency arbiter write request due to it reaching its maximum capacity of pending efficiency arbiter write requests." - "``TCC_UC_REQ_sum``", "Total number of uncached requests." - "``TCC_WRITE_sum``", "Total number of L2 cache write requests." - "``TCC_WRITEBACK_sum``", "Total number of lines written back to the main memory including writebacks of dirty lines and uncached write or atomic requests." - "``TCC_WRREQ_STALL_max``", "Maximum number of cycles a write request is stalled." - -Hardware counters by, for, or over all texture cache per pipe instances ----------------------------------------------------------------------------------------------------------------- - -The following table shows the hardware counters *by* all texture cache per pipe instances. - -.. csv-table:: - :header: "Hardware counter", "Definition" - - "``TCP_TA_TCP_STATE_READ_sum``", "Total number of state reads by ATCPPI" - "``TCP_TOTAL_CACHE_ACCESSES_sum``", "Total number of vector L1d accesses (including hits and misses)" - "``TCP_UTCL1_PERMISSION_MISS_sum``", "Total number of unified translation cache (L1) permission misses" - "``TCP_UTCL1_REQUEST_sum``", "Total number of address translation requests to unified translation cache (L1)" - "``TCP_UTCL1_TRANSLATION_MISS_sum``", "Total number of unified translation cache (L1) translation misses" - "``TCP_UTCL1_TRANSLATION_HIT_sum``", "Total number of unified translation cache (L1) translation hits" - -The following table shows the hardware counters *for* all texture cache per pipe instances. - -.. csv-table:: - :header: "Hardware counter", "Definition" - - "``TCP_TCC_READ_REQ_LATENCY_sum``", "Total vector L1d to L2 request latency over all wavefronts for reads and atomics with return" - "``TCP_TCC_WRITE_REQ_LATENCY_sum``", "Total vector L1d to L2 request latency over all wavefronts for writes and atomics without return" - "``TCP_TCP_LATENCY_sum``", "Total wave access latency to vector L1d over all wavefronts" - -The following table shows the hardware counters *over* all texture cache per pipe instances. - -.. 
csv-table:: - :header: "Hardware counter", "Definition" - - "``TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum``", "Total number of cycles tag RAM conflict stalls on an atomic" - "``TCP_GATE_EN1_sum``", "Total number of cycles vector L1d interface clocks are turned on" - "``TCP_GATE_EN2_sum``", "Total number of cycles vector L1d core clocks are turned on" - "``TCP_PENDING_STALL_CYCLES_sum``", "Total number of cycles vector L1d cache is stalled due to data pending from L2 Cache" - "``TCP_READ_TAGCONFLICT_STALL_CYCLES_sum``", "Total number of cycles tag RAM conflict stalls on a read" - "``TCP_TCC_ATOMIC_WITH_RET_REQ_sum``", "Total number of atomic requests to L2 cache with return" - "``TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum``", "Total number of atomic requests to L2 cache without return" - "``TCP_TCC_CC_READ_REQ_sum``", "Total number of coherently cached read requests to L2 cache" - "``TCP_TCC_CC_WRITE_REQ_sum``", "Total number of coherently cached write requests to L2 cache" - "``TCP_TCC_CC_ATOMIC_REQ_sum``", "Total number of coherently cached atomic requests to L2 cache" - "``TCP_TCC_NC_READ_REQ_sum``", "Total number of non-coherently cached read requests to L2 cache" - "``TCP_TCC_NC_WRITE_REQ_sum``", "Total number of non-coherently cached write requests to L2 cache" - "``TCP_TCC_NC_ATOMIC_REQ_sum``", "Total number of non-coherently cached atomic requests to L2 cache" - "``TCP_TCC_READ_REQ_sum``", "Total number of read requests to L2 cache" - "``TCP_TCC_RW_READ_REQ_sum``", "Total number of coherently cached with write read requests to L2 cache" - "``TCP_TCC_RW_WRITE_REQ_sum``", "Total number of coherently cached with write write requests to L2 cache" - "``TCP_TCC_RW_ATOMIC_REQ_sum``", "Total number of coherently cached with write atomic requests to L2 cache" - "``TCP_TCC_UC_READ_REQ_sum``", "Total number of uncached read requests to L2 cache" - "``TCP_TCC_UC_WRITE_REQ_sum``", "Total number of uncached write requests to L2 cache" - "``TCP_TCC_UC_ATOMIC_REQ_sum``", "Total number of uncached atomic requests to L2 cache" - "``TCP_TCC_WRITE_REQ_sum``", "Total number of write requests to L2 cache" - "``TCP_TCR_TCP_STALL_CYCLES_sum``", "Total number of cycles texture cache router stalls vector L1d" - "``TCP_TD_TCP_STALL_CYCLES_sum``", "Total number of cycles texture data unit stalls vector L1d" - "``TCP_TOTAL_ACCESSES_sum``", "Total number of vector L1d accesses" - "``TCP_TOTAL_READ_sum``", "Total number of vector L1d read accesses" - "``TCP_TOTAL_WRITE_sum``", "Total number of vector L1d write accesses" - "``TCP_TOTAL_ATOMIC_WITH_RET_sum``", "Total number of vector L1d atomic requests with return" - "``TCP_TOTAL_ATOMIC_WITHOUT_RET_sum``", "Total number of vector L1d atomic requests without return" - "``TCP_TOTAL_WRITEBACK_INVALIDATES_sum``", "Total number of vector L1d writebacks and invalidates" - "``TCP_VOLATILE_sum``", "Total number of L1 volatile pixels or buffers from texture addressing unit" - "``TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum``", "Total number of cycles tag RAM conflict stalls on a write" - -Hardware counter over all texture data unit instances --------------------------------------------------------- - -.. 
csv-table:: - :header: "Hardware counter", "Definition" - - "``TD_ATOMIC_WAVEFRONT_sum``", "Total number of atomic wavefront instructions" - "``TD_COALESCABLE_WAVEFRONT_sum``", "Total number of coalescable wavefronts according to texture addressing unit" - "``TD_LOAD_WAVEFRONT_sum``", "Total number of wavefront instructions (read, write, atomic)" - "``TD_SPI_STALL_sum``", "Total number of cycles texture data unit is stalled by shader processor input" - "``TD_STORE_WAVEFRONT_sum``", "Total number of write wavefront instructions" - "``TD_TC_STALL_sum``", "Total number of cycles texture data unit is stalled waiting for texture cache data" - "``TD_TD_BUSY_sum``", "Total number of texture data unit busy cycles while it is processing or waiting for data" diff --git a/docs/conceptual/gpu-arch/mi300.md b/docs/conceptual/gpu-arch/mi300.md deleted file mode 100644 index f5b66ceae..000000000 --- a/docs/conceptual/gpu-arch/mi300.md +++ /dev/null @@ -1,129 +0,0 @@ ---- -myst: - html_meta: - "description lang=en": "Learn about the AMD Instinct MI300 series architecture." - "keywords": "Instinct, MI300X, MI300A, microarchitecture, AMD, ROCm" ---- - -# AMD Instinct™ MI300 series microarchitecture - -The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3 -architecture which was designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine -learning (ML) workloads. The AMD Instinct MI300 series accelerators are well-suited for extreme scalability and compute performance, running -on everything from individual servers to the world’s largest exascale supercomputers. - -With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the -GPU computational elements of the processor along with the lower levels of the cache hierarchy. - -The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series. - -```{figure} ../../data/shared/xcd-sys-arch.png ---- -name: mi300-xcd -align: center ---- -XCD-level system architecture showing 40 Compute Units, each with 32 KB L1 cache, a Unified Compute System with 4 ACE Compute Accelerators, shared 4MB of L2 cache and an HWS Hardware Scheduler. -``` - -On the XCD, four Asynchronous Compute Engines (ACEs) send compute shader workgroups to the -Compute Units (CUs). The XCD has 40 CUs: 38 active CUs at the aggregate level and 2 disabled CUs for -yield management. The CUs all share a 4 MB L2 cache that serves to coalesce all memory traffic for the -die. With less than half of the CUs of the AMD Instinct MI200 Series compute die, the AMD CDNA™ 3 -XCD die is a smaller building block. However, it uses more advanced packaging and the processor -can include 6 or 8 XCDs for up to 304 CUs, roughly 40% more than MI250X. - -The MI300 Series integrate up to 8 vertically stacked XCDs, 8 stacks of -High-Bandwidth Memory 3 (HBM3) and 4 I/O dies (containing system -infrastructure) using the AMD Infinity Fabric™ technology as interconnect. - -The Matrix Cores inside the CDNA 3 CUs have significant improvements, emphasizing AI and machine -learning, enhancing throughput of existing data types while adding support for new data types. -CDNA 2 Matrix Cores support FP16 and BF16, while offering INT8 for inference. Compared to MI250X -accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a -performance gain of 6.8 times for INT8. FP8 has a performance gain of 16 times compared to FP32, -while TF32 has a gain of 4 times compared to FP32. 
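The peak throughput figures in the table below follow from the per-CU rate scaled by the number of CUs and the engine clock. As a rough sanity check, assuming the 304 CUs mentioned above and a peak engine clock of about 2.1 GHz (an assumed value, not stated on this page), the Matrix FP16 row works out to:

```{math}
2048\ \frac{\text{FLOPS}}{\text{clock} \cdot \text{CU}} \times 304\ \text{CU} \times 2.1\ \text{GHz} \approx 1307.4\ \text{TFLOPS}
```

The other rows scale the same way from their FLOPS/CLOCK/CU values.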
- -```{list-table} Peak-performance capabilities of the MI300X for different data types. -:header-rows: 1 -:name: mi300x-perf-table - -* - - Computation and Data Type - - FLOPS/CLOCK/CU - - Peak TFLOPS -* - - Matrix FP64 - - 256 - - 163.4 -* - - Vector FP64 - - 128 - - 81.7 -* - - Matrix FP32 - - 256 - - 163.4 -* - - Vector FP32 - - 256 - - 163.4 -* - - Vector TF32 - - 1024 - - 653.7 -* - - Matrix FP16 - - 2048 - - 1307.4 -* - - Matrix BF16 - - 2048 - - 1307.4 -* - - Matrix FP8 - - 4096 - - 2614.9 -* - - Matrix INT8 - - 4096 - - 2614.9 -``` - -The above table summarizes the aggregated peak performance of the AMD Instinct MI300X Open -Compute Platform (OCP) Open Accelerator Modules (OAMs) for different data types and command -processors. The middle column lists the peak performance (number of data elements processed in a -single instruction) of a single compute unit if a SIMD (or matrix) instruction is submitted in each clock -cycle. The third column lists the theoretical peak performance of the OAM. The theoretical aggregated -peak memory bandwidth of the GPU is 5.3 TB per second. - -The following image shows the block diagram of the APU (left) and the OAM package (right) both -connected via AMD Infinity Fabric™ network on-chip. - -```{figure} ../../data/conceptual/gpu-arch/image008.png ---- -name: mi300-arch -alt: -align: center ---- -MI300 series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs. -``` - -## Node-level architecture - -```{figure} ../../data/shared/mi300-node-level-arch.png ---- -name: mi300-node - -align: center ---- -MI300 series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIEe switches via retimers and HGX connectors. -``` - -The image above shows the node-level architecture of a system with AMD EPYC processors in a -dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the -host system via PCIe Gen 5 x16 links (yellow lines). The GPUs are using seven high-bandwidth, -low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system. - - diff --git a/docs/conceptual/gpu-arch/mi350-performance-counters.rst b/docs/conceptual/gpu-arch/mi350-performance-counters.rst deleted file mode 100644 index fe103c17c..000000000 --- a/docs/conceptual/gpu-arch/mi350-performance-counters.rst +++ /dev/null @@ -1,530 +0,0 @@ -.. meta:: - :description: MI355 series performance counters and metrics - :keywords: MI355, MI355X, MI3XX - -*********************************** -MI350 series performance counters -*********************************** - -This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 accelerators. These counters are available for profiling using `ROCprofiler-SDK `_ and `ROCm Compute Profiler `_. - -The following sections list the performance counters based on the IP blocks. - -Command processor packet processor counters (CPC) -================================================== - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - CPC_ALWAYS_COUNT - - Always count. - - * - CPC_ADC_VALID_CHUNK_NOT_AVAIL - - ADC valid chunk is not available when dispatch walking is in progress in the multi-xcc mode. - - * - CPC_ADC_DISPATCH_ALLOC_DONE - - ADC dispatch allocation is done. - - * - CPC_ADC_VALID_CHUNK_END - - ADC crawler's valid chunk end in the multi-xcc mode. 
- - * - CPC_SYNC_FIFO_FULL_LEVEL - - SYNC FIFO full last cycles. - - * - CPC_SYNC_FIFO_FULL - - SYNC FIFO full times. - - * - CPC_GD_BUSY - - ADC busy. - - * - CPC_TG_SEND - - ADC thread group send. - - * - CPC_WALK_NEXT_CHUNK - - ADC walking next valid chunk in the multi-xcc mode. - - * - CPC_STALLED_BY_SE0_SPI - - ADC CSDATA stalled by SE0SPI. - - * - CPC_STALLED_BY_SE1_SPI - - ADC CSDATA stalled by SE1SPI. - - * - CPC_STALLED_BY_SE2_SPI - - ADC CSDATA stalled by SE2SPI. - - * - CPC_STALLED_BY_SE3_SPI - - ADC CSDATA stalled by SE3SPI. - - * - CPC_LTE_ALL - - CPC sync counter LteAll. Only Master XCD manages LteAll. - - * - CPC_SYNC_WRREQ_FIFO_BUSY - - CPC sync counter request FIFO is not empty. - - * - CPC_CANE_BUSY - - CPC CANE bus is busy, which indicates the presence of inflight sync counter requests. - - * - CPC_CANE_STALL - - CPC sync counter sending is stalled by CANE. - -Shader pipe interpolators (SPI) counters -========================================= - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - SPI_CS0_WINDOW_VALID - - Clock count enabled by PIPE0 perfcounter_start event. - - * - SPI_CS0_BUSY - - Number of clocks with outstanding waves for PIPE0 (SPI or SH). - - * - SPI_CS0_NUM_THREADGROUPS - - Number of thread groups launched for PIPE0. - - * - SPI_CS0_CRAWLER_STALL - - Number of clocks when PIPE0 event or wave order FIFO is full. - - * - SPI_CS0_EVENT_WAVE - - Number of PIPE0 events and waves. - - * - SPI_CS0_WAVE - - Number of PIPE0 waves. - - * - SPI_CS1_WINDOW_VALID - - Clock count enabled by PIPE1 perfcounter_start event. - - * - SPI_CS1_BUSY - - Number of clocks with outstanding waves for PIPE1 (SPI or SH). - - * - SPI_CS1_NUM_THREADGROUPS - - Number of thread groups launched for PIPE1. - - * - SPI_CS1_CRAWLER_STALL - - Number of clocks when PIPE1 event or wave order FIFO is full. - - * - SPI_CS1_EVENT_WAVE - - Number of PIPE1 events and waves. - - * - SPI_CS1_WAVE - - Number of PIPE1 waves. - - * - SPI_CS2_WINDOW_VALID - - Clock count enabled by PIPE2 perfcounter_start event. - - * - SPI_CS2_BUSY - - Number of clocks with outstanding waves for PIPE2 (SPI or SH). - - * - SPI_CS2_NUM_THREADGROUPS - - Number of thread groups launched for PIPE2. - - * - SPI_CS2_CRAWLER_STALL - - Number of clocks when PIPE2 event or wave order FIFO is full. - - * - SPI_CS2_EVENT_WAVE - - Number of PIPE2 events and waves. - - * - SPI_CS2_WAVE - - Number of PIPE2 waves. - - * - SPI_CS3_WINDOW_VALID - - Clock count enabled by PIPE3 perfcounter_start event. - - * - SPI_CS3_BUSY - - Number of clocks with outstanding waves for PIPE3 (SPI or SH). - - * - SPI_CS3_NUM_THREADGROUPS - - Number of thread groups launched for PIPE3. - - * - SPI_CS3_CRAWLER_STALL - - Number of clocks when PIPE3 event or wave order FIFO is full. - - * - SPI_CS3_EVENT_WAVE - - Number of PIPE3 events and waves. - - * - SPI_CS3_WAVE - - Number of PIPE3 waves. - - * - SPI_CSQ_P0_Q0_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue0. - - * - SPI_CSQ_P0_Q1_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue1. - - * - SPI_CSQ_P0_Q2_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue2. - - * - SPI_CSQ_P0_Q3_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue3. - - * - SPI_CSQ_P0_Q4_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue4. - - * - SPI_CSQ_P0_Q5_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue5. - - * - SPI_CSQ_P0_Q6_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue6. - - * - SPI_CSQ_P0_Q7_OCCUPANCY - - Sum of occupancy info for PIPE0 Queue7. 
- - * - SPI_CSQ_P1_Q0_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue0. - - * - SPI_CSQ_P1_Q1_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue1. - - * - SPI_CSQ_P1_Q2_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue2. - - * - SPI_CSQ_P1_Q3_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue3. - - * - SPI_CSQ_P1_Q4_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue4. - - * - SPI_CSQ_P1_Q5_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue5. - - * - SPI_CSQ_P1_Q6_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue6. - - * - SPI_CSQ_P1_Q7_OCCUPANCY - - Sum of occupancy info for PIPE1 Queue7. - - * - SPI_CSQ_P2_Q0_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue0. - - * - SPI_CSQ_P2_Q1_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue1. - - * - SPI_CSQ_P2_Q2_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue2. - - * - SPI_CSQ_P2_Q3_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue3. - - * - SPI_CSQ_P2_Q4_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue4. - - * - SPI_CSQ_P2_Q5_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue5. - - * - SPI_CSQ_P2_Q6_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue6. - - * - SPI_CSQ_P2_Q7_OCCUPANCY - - Sum of occupancy info for PIPE2 Queue7. - - * - SPI_CSQ_P3_Q0_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue0. - - * - SPI_CSQ_P3_Q1_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue1. - - * - SPI_CSQ_P3_Q2_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue2. - - * - SPI_CSQ_P3_Q3_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue3. - - * - SPI_CSQ_P3_Q4_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue4. - - * - SPI_CSQ_P3_Q5_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue5. - - * - SPI_CSQ_P3_Q6_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue6. - - * - SPI_CSQ_P3_Q7_OCCUPANCY - - Sum of occupancy info for PIPE3 Queue7. - - * - SPI_CSQ_P0_OCCUPANCY - - Sum of occupancy info for all PIPE0 queues. - - * - SPI_CSQ_P1_OCCUPANCY - - Sum of occupancy info for all PIPE1 queues. - - * - SPI_CSQ_P2_OCCUPANCY - - Sum of occupancy info for all PIPE2 queues. - - * - SPI_CSQ_P3_OCCUPANCY - - Sum of occupancy info for all PIPE3 queues. - - * - SPI_VWC0_VDATA_VALID_WR - - Number of clocks VGPR bus_0 writes VGPRs. - - * - SPI_VWC1_VDATA_VALID_WR - - Number of clocks VGPR bus_1 writes VGPRs. - - * - SPI_CSC_WAVE_CNT_BUSY - - Number of cycles when there is any wave in the pipe. - -Compute unit (SQ) counters -=========================== - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - SQ_INSTS_VALU_MFMA_F6F4 - - Number of VALU V_MFMA_*_F6F4 instructions. - - * - SQ_INSTS_VALU_MFMA_MOPS_F6F4 - - Number of VALU matrix with the performed math operations (add or mul) divided by 512, assuming a full EXEC mask of F6 or F4 data type. - - * - SQ_ACTIVE_INST_VALU2 - - Number of quad-cycles when two VALU instructions are issued (per-simd, nondeterministic). - - * - SQ_INSTS_LDS_LOAD - - Number of LDS load instructions issued (per-simd, emulated). - - * - SQ_INSTS_LDS_STORE - - Number of LDS store instructions issued (per-simd, emulated). - - * - SQ_INSTS_LDS_ATOMIC - - Number of LDS atomic instructions issued (per-simd, emulated). - - * - SQ_INSTS_LDS_LOAD_BANDWIDTH - - Total number of 64-bytes loaded (instrSize * CountOnes(EXEC))/64 (per-simd, emulated). - - * - SQ_INSTS_LDS_STORE_BANDWIDTH - - Total number of 64-bytes written (instrSize * CountOnes(EXEC))/64 (per-simd, emulated). - - * - SQ_INSTS_LDS_ATOMIC_BANDWIDTH - - Total number of 64-bytes atomic (instrSize * CountOnes(EXEC))/64 (per-simd, emulated). 
- - * - SQ_INSTS_VALU_FLOPS_FP16 - - Counts FLOPS per instruction on float 16 excluding MFMA/SMFMA. - - * - SQ_INSTS_VALU_FLOPS_FP32 - - Counts FLOPS per instruction on float 32 excluding MFMA/SMFMA. - - * - SQ_INSTS_VALU_FLOPS_FP64 - - Counts FLOPS per instruction on float 64 excluding MFMA/SMFMA. - - * - SQ_INSTS_VALU_FLOPS_FP16_TRANS - - Counts FLOPS per instruction on float 16 trans excluding MFMA/SMFMA. - - * - SQ_INSTS_VALU_FLOPS_FP32_TRANS - - Counts FLOPS per instruction on float 32 trans excluding MFMA/SMFMA. - - * - SQ_INSTS_VALU_FLOPS_FP64_TRANS - - Counts FLOPS per instruction on float 64 trans excluding MFMA/SMFMA. - - * - SQ_INSTS_VALU_IOPS - - Counts OPS per instruction on integer or unsigned or bit data (per-simd, emulated). - - * - SQ_LDS_DATA_FIFO_FULL - - Number of cycles LDS data FIFO is full (nondeterministic, unwindowed). - - * - SQ_LDS_CMD_FIFO_FULL - - Number of cycles LDS command FIFO is full (nondeterministic, unwindowed). - - * - SQ_VMEM_TA_ADDR_FIFO_FULL - - Number of cycles texture requests are stalled due to full address FIFO in TA (nondeterministic, unwindowed). - - * - SQ_VMEM_TA_CMD_FIFO_FULL - - Number of cycles texture requests are stalled due to full cmd FIFO in TA (nondeterministic, unwindowed). - - * - SQ_VMEM_WR_TA_DATA_FIFO_FULL - - Number of cycles texture writes are stalled due to full data FIFO in TA (nondeterministic, unwindowed). - - * - SQC_ICACHE_MISSES_DUPLICATE - - Number of duplicate misses (access to a non-resident, miss pending CL) (per-SQ, per-Bank, nondeterministic). - - * - SQC_DCACHE_MISSES_DUPLICATE - - Number of duplicate misses (access to a non-resident, miss pending CL) (per-SQ, per-Bank, nondeterministic). - -Texture addressing (TA) unit counters -====================================== - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - TA_BUFFER_READ_LDS_WAVEFRONTS - - Number of buffer read wavefronts for LDS return processed by the TA. - - * - TA_FLAT_READ_LDS_WAVEFRONTS - - Number of flat opcode reads for LDS return processed by the TA. - -Texture data (TD) unit counters -================================ - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - TD_WRITE_ACKT_WAVEFRONT - - Number of write acknowledgments, sent to SQ and not to SP. - - * - TD_TD_SP_TRAFFIC - - Number of times this TD sends data to the SP. - -Texture cache per pipe (TCP) counters -====================================== - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - TCP_TCP_TA_ADDR_STALL_CYCLES - - TCP stalls TA addr interface. - - * - TCP_TCP_TA_DATA_STALL_CYCLES - - TCP stalls TA data interface. Now windowed. - - * - TCP_LFIFO_STALL_CYCLES - - Memory latency FIFOs full stall. - - * - TCP_RFIFO_STALL_CYCLES - - Memory Request FIFOs full stall. - - * - TCP_TCR_RDRET_STALL - - Write into cache stalled by read return from TCR. - - * - TCP_PENDING_STALL_CYCLES - - Stall due to data pending from L2. - - * - TCP_UTCL1_SERIALIZATION_STALL - - Total number of stalls caused due to serializing translation requests through the UTCL1. - - * - TCP_UTCL1_THRASHING_STALL - - Stall caused by thrashing feature in any probe. Lacks accuracy when the stall signal overlaps between probe0 and probe1, which is worse with MECO of thrashing deadlock. Some probe0 events could miss being counted in with MECO on. This perf count provides a rough thrashing estimate. - - * - TCP_UTCL1_TRANSLATION_MISS_UNDER_MISS - - Translation miss_under_miss. 
- - * - TCP_UTCL1_STALL_INFLIGHT_MAX - - Total UTCL1 stalls due to inflight counter saturation. - - * - TCP_UTCL1_STALL_LRU_INFLIGHT - - Total UTCL1 stalls due to LRU cache line with inflight traffic. - - * - TCP_UTCL1_STALL_MULTI_MISS - - Total UTCL1 stalls due to arbitrated multiple misses. - - * - TCP_UTCL1_LFIFO_FULL - - Total UTCL1 and UTCL2 latency, which hides FIFO full cycles. - - * - TCP_UTCL1_STALL_LFIFO_NOT_RES - - Total UTCL1 stalls due to UTCL2 latency, which hides FIFO output (not resident). - - * - TCP_UTCL1_STALL_UTCL2_REQ_OUT_OF_CREDITS - - Total UTCL1 stalls due to UTCL2_req being out of credits. - - * - TCP_CLIENT_UTCL1_INFLIGHT - - The sum of inflight client to UTCL1 requests per cycle. - - * - TCP_TAGRAM0_REQ - - Total L2 requests mapping to TagRAM 0 from this TCP to all TCCs. - - * - TCP_TAGRAM1_REQ - - Total L2 requests mapping to TagRAM 1 from this TCP to all TCCs. - - * - TCP_TAGRAM2_REQ - - Total L2 requests mapping to TagRAM 2 from this TCP to all TCCs. - - * - TCP_TAGRAM3_REQ - - Total L2 requests mapping to TagRAM 3 from this TCP to all TCCs. - - * - TCP_TCP_LATENCY - - Total TCP wave latency (from the first clock of wave entering to the first clock of wave leaving). Divide by TA_TCP_STATE_READ to find average wave latency. - - * - TCP_TCC_READ_REQ_LATENCY - - Total TCP to TCC request latency for reads and atomics with return. Not Windowed. - - * - TCP_TCC_WRITE_REQ_LATENCY - - Total TCP to TCC request latency for writes and atomics without return. Not Windowed. - - * - TCP_TCC_WRITE_REQ_HOLE_LATENCY - - Total TCP req to TCC hole latency for writes and atomics. Not Windowed. - -Texture cache per channel (TCC) counters -========================================= - -.. list-table:: - :header-rows: 1 - - * - Hardware counter - - Definition - - * - TCC_READ_SECTORS - - Total number of 32B data sectors in read requests. - - * - TCC_WRITE_SECTORS - - Total number of 32B data sectors in write requests. - - * - TCC_ATOMIC_SECTORS - - Total number of 32B data sectors in atomic requests. - - * - TCC_BYPASS_REQ - - Number of bypass requests. This is measured at the tag block. - - * - TCC_LATENCY_FIFO_FULL - - Number of cycles when the latency FIFO is full. - - * - TCC_SRC_FIFO_FULL - - Number of cycles when the SRC FIFO is assumed to be full as measured at the IB block. - - * - TCC_EA0_RDREQ_64B - - Number of 64-byte TCC/EA read requests. - - * - TCC_EA0_RDREQ_128B - - Number of 128-byte TCC/EA read requests. - - * - TCC_IB_REQ - - Number of requests through the IB. This measures the number of raw requests from graphics clients to this TCC. - - * - TCC_IB_STALL - - Number of cycles when the IB output is stalled. - - * - TCC_EA0_WRREQ_WRITE_DRAM - - Number of TCC/EA write requests (32-byte or 64-byte) destined for DRAM (MC). - - * - TCC_EA0_WRREQ_ATOMIC_DRAM - - Number of TCC/EA atomic requests (32-byte or 64-byte) destined for DRAM (MC). - - * - TCC_EA0_RDREQ_DRAM_32B - - Number of 32-byte TCC/EA read requests due to DRAM traffic. One 64-byte request is counted as two and one 128-byte as four. - - * - TCC_EA0_RDREQ_GMI_32B - - Number of 32-byte TCC/EA read requests due to GMI traffic. One 64-byte request is counted as two and one 128-byte as four. - - * - TCC_EA0_RDREQ_IO_32B - - Number of 32-byte TCC/EA read requests due to IO traffic. One 64-byte request is counted as two and one 128-byte as four. - - * - TCC_EA0_WRREQ_WRITE_DRAM_32B - - Number of 32-byte TCC/EA write requests due to DRAM traffic. One 64-byte request is counted as two. 
- - * - TCC_EA0_WRREQ_ATOMIC_DRAM_32B - - Number of 32-byte TCC/EA atomic requests due to DRAM traffic. One 64-byte request is counted as two. - - * - TCC_EA0_WRREQ_WRITE_GMI_32B - - Number of 32-byte TCC/EA write requests due to GMI traffic. One 64-byte request is counted as two. - - * - TCC_EA0_WRREQ_ATOMIC_GMI_32B - - Number of 32-byte TCC/EA atomic requests due to GMI traffic. One 64-byte request is counted as two. - - * - TCC_EA0_WRREQ_WRITE_IO_32B - - Number of 32-byte TCC/EA write requests due to IO traffic. One 64-byte request is counted as two. - - * - TCC_EA0_WRREQ_ATOMIC_IO_32B - - Number of 32-byte TCC/EA atomic requests due to IO traffic. One 64-byte request is counted as two. diff --git a/docs/conceptual/gpu-isolation.md b/docs/conceptual/gpu-isolation.md deleted file mode 100644 index 35a221779..000000000 --- a/docs/conceptual/gpu-isolation.md +++ /dev/null @@ -1,116 +0,0 @@ - - - - - - -# GPU isolation techniques - -Restricting the access of applications to a subset of GPUs, aka isolating -GPUs allows users to hide GPU resources from programs. The programs by default -will only use the "exposed" GPUs ignoring other (hidden) GPUs in the system. - -There are multiple ways to achieve isolation of GPUs in the ROCm software stack, -differing in which applications they apply to and the security they provide. -This page serves as an overview of the techniques. - -## Environment variables - -The runtimes in the ROCm software stack read these environment variables to -select the exposed or default device to present to applications using them. - -Environment variables shouldn't be used for isolating untrusted applications, -as an application can reset them before initializing the runtime. - -### `ROCR_VISIBLE_DEVICES` - -A list of device indices or {abbr}`UUID (universally unique identifier)`s -that will be exposed to applications. - -Runtime -: ROCm Software Runtime. Applies to all applications using the user mode ROCm - software stack. - -```{code-block} shell -:caption: Example to expose the 1. device and a device based on UUID. -export ROCR_VISIBLE_DEVICES="0,GPU-DEADBEEFDEADBEEF" -``` - -### `GPU_DEVICE_ORDINAL` - -Devices indices exposed to OpenCL and HIP applications. - -Runtime -: ROCm Compute Language Runtime (`ROCclr`). Applies to applications and runtimes - using the `ROCclr` abstraction layer including HIP and OpenCL applications. - -```{code-block} shell -:caption: Example to expose the 1. and 3. device in the system. -export GPU_DEVICE_ORDINAL="0,2" -``` - -(hip_visible_devices)= - -### `HIP_VISIBLE_DEVICES` - -Device indices exposed to HIP applications. - -Runtime: HIP runtime. Applies only to applications using HIP on the AMD platform. - -```{code-block} shell -:caption: Example to expose the 1. and 3. devices in the system. -export HIP_VISIBLE_DEVICES="0,2" -``` - -### `CUDA_VISIBLE_DEVICES` - -Provided for CUDA compatibility, has the same effect as `HIP_VISIBLE_DEVICES` -on the AMD platform. - -Runtime -: HIP or CUDA Runtime. Applies to HIP applications on the AMD or NVIDIA platform - and CUDA applications. - -### `OMP_DEFAULT_DEVICE` - -Default device used for OpenMP target offloading. - -Runtime -: OpenMP Runtime. Applies only to applications using OpenMP offloading. - -```{code-block} shell -:caption: Example on setting the default device to the third device. -export OMP_DEFAULT_DEVICE="2" -``` - -## Docker - -Docker uses Linux kernel namespaces to provide isolated environments for -applications. This isolation applies to most devices by default, including -GPUs. 
To access them in containers explicit access must be granted, please see -{ref}`docker-access-gpus-in-container` for details. -Specifically refer to {ref}`docker-restrict-gpus` on exposing just a subset -of all GPUs. - -Docker isolation is more secure than environment variables, and applies -to all programs that use the `amdgpu` kernel module interfaces. -Even programs that don't use the ROCm runtime, like graphics applications -using OpenGL or Vulkan, can only access the GPUs exposed to the container. - -## GPU passthrough to virtual machines - -Virtual machines achieve the highest level of isolation, because even the kernel -of the virtual machine is isolated from the host. Devices physically installed -in the host system can be passed to the virtual machine using PCIe passthrough. -This allows for using the GPU with a different operating systems like a Windows -guest from a Linux host. - -Setting up PCIe passthrough is specific to the hypervisor used. ROCm officially -supports [VMware ESXi](https://www.vmware.com/products/esxi-and-esx.html) -for select GPUs. - - diff --git a/docs/conf.py b/docs/conf.py index b1bcd8590..5c29f3e25 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -22,72 +22,72 @@ def copy_rtd_file(src_path: Path, dest_path: Path): print(f"📁 Copied {src_path} → {dest_path}") -compat_matrix_src = DOCS_DIR / "compatibility" / "compatibility-matrix-historical-6.0.csv" # fmt: skip -compat_matrix_dest = ROOT_DIR / "_readthedocs" / "html" / "downloads" / "compatibility-matrix-historical-6.0.csv" # fmt: skip -copy_rtd_file(compat_matrix_src, compat_matrix_dest) +# compat_matrix_src = DOCS_DIR / "compatibility" / "compatibility-matrix-historical-6.0.csv" # fmt: skip +# compat_matrix_dest = ROOT_DIR / "_readthedocs" / "html" / "downloads" / "compatibility-matrix-historical-6.0.csv" # fmt: skip +# copy_rtd_file(compat_matrix_src, compat_matrix_dest) gh_release_path = ROOT_DIR / "RELEASE.md" rtd_release_path = DOCS_DIR / "about" / "release-notes.md" copy_rtd_file(gh_release_path, rtd_release_path) -gh_changelog_path = ROOT_DIR / "CHANGELOG.md" -rtd_changelog_path = DOCS_DIR / "release" / "changelog.md" -copy_rtd_file(gh_changelog_path, rtd_changelog_path) +# gh_changelog_path = ROOT_DIR / "CHANGELOG.md" +# rtd_changelog_path = DOCS_DIR / "release" / "changelog.md" +# copy_rtd_file(gh_changelog_path, rtd_changelog_path) # Mark the consolidated changelog as orphan to prevent Sphinx from warning about missing toctree entries -with open(rtd_changelog_path, "r+", encoding="utf-8") as file: - content = file.read() - file.seek(0) - file.write(":orphan:\n" + content) +# with open(rtd_changelog_path, "r+", encoding="utf-8") as file: +# content = file.read() +# file.seek(0) +# file.write(":orphan:\n" + content) +# +# # Replace GitHub-style [!ADMONITION]s with Sphinx-compatible ```{admonition} blocks +# with open(rtd_changelog_path, "r", encoding="utf-8") as file: +# lines = file.readlines() +# +# modified_lines = [] +# in_admonition_section = False +# +# # Map for matching the specific admonition type to its corresponding Sphinx markdown syntax +# admonition_types = { +# "> [!NOTE]": "```{note}", +# "> [!TIP]": "```{tip}", +# "> [!IMPORTANT]": "```{important}", +# "> [!WARNING]": "```{warning}", +# "> [!CAUTION]": "```{caution}", +# } +# +# for line in lines: +# if any(line.startswith(k) for k in admonition_types): +# for key in admonition_types: +# if line.startswith(key): +# modified_lines.append(admonition_types[key] + "\n") +# break +# in_admonition_section = True +# elif 
in_admonition_section: +# if line.strip() == "": +# # If we encounter an empty line, close the admonition section +# modified_lines.append("```\n\n") # Close the admonition block +# in_admonition_section = False +# else: +# modified_lines.append(line.lstrip("> ")) +# else: +# modified_lines.append(line) +# +# # In case the file ended while still in a admonition section, close it +# if in_admonition_section: +# modified_lines.append("```") +# +# file.close() +# +# with open(rtd_changelog_path, "w", encoding="utf-8") as file: +# file.writelines(modified_lines) -# Replace GitHub-style [!ADMONITION]s with Sphinx-compatible ```{admonition} blocks -with open(rtd_changelog_path, "r", encoding="utf-8") as file: - lines = file.readlines() - - modified_lines = [] - in_admonition_section = False - - # Map for matching the specific admonition type to its corresponding Sphinx markdown syntax - admonition_types = { - "> [!NOTE]": "```{note}", - "> [!TIP]": "```{tip}", - "> [!IMPORTANT]": "```{important}", - "> [!WARNING]": "```{warning}", - "> [!CAUTION]": "```{caution}", - } - - for line in lines: - if any(line.startswith(k) for k in admonition_types): - for key in admonition_types: - if line.startswith(key): - modified_lines.append(admonition_types[key] + "\n") - break - in_admonition_section = True - elif in_admonition_section: - if line.strip() == "": - # If we encounter an empty line, close the admonition section - modified_lines.append("```\n\n") # Close the admonition block - in_admonition_section = False - else: - modified_lines.append(line.lstrip("> ")) - else: - modified_lines.append(line) - - # In case the file ended while still in a admonition section, close it - if in_admonition_section: - modified_lines.append("```") - - file.close() - - with open(rtd_changelog_path, "w", encoding="utf-8") as file: - file.writelines(modified_lines) - -matrix_path = os.path.join("compatibility", "compatibility-matrix-historical-6.0.csv") -rtd_path = os.path.join("..", "_readthedocs", "html", "downloads") -if not os.path.exists(rtd_path): - os.makedirs(rtd_path) -shutil.copy2(matrix_path, rtd_path) +# matrix_path = os.path.join("compatibility", "compatibility-matrix-historical-6.0.csv") +# rtd_path = os.path.join("..", "_readthedocs", "html", "downloads") +# if not os.path.exists(rtd_path): +# os.makedirs(rtd_path) +# shutil.copy2(matrix_path, rtd_path) latex_engine = "xelatex" latex_elements = { @@ -122,13 +122,13 @@ external_toc_path = "./sphinx/_toc.yml" # Register Sphinx extensions and static assets sys.path.append(str(DOCS_DIR / "extension")) -html_static_path = ["sphinx/static/css", "extension/how-to/rocm-for-ai/inference"] -html_css_files = [ - "rocm_custom.css", - "rocm_rn.css", +# html_static_path = ["sphinx/static/css", "extension/how-to/rocm-for-ai/inference"] +# html_css_files = [ + # "rocm_custom.css", + # "rocm_rn.css", # "dynamic_picker.css", # "vllm-benchmark.css", -] +# ] templates_path = ["extension/rocm_docs_custom/templates", "extension/templates"] extensions = [ @@ -136,16 +136,16 @@ extensions = [ "rocm_docs_custom.selector", "rocm_docs_custom.table", "rocm_docs_custom.icon", - "sphinx_reredirects", - "sphinx_sitemap", - "sphinxcontrib.datatemplates", - "version-ref", - "csv-to-list-table", + # "sphinx_reredirects", + # "sphinx_sitemap", + # "sphinxcontrib.datatemplates", + # "version-ref", + # "csv-to-list-table", ] -compatibility_matrix_file = str( - DOCS_DIR / "compatibility/compatibility-matrix-historical-6.0.csv" -) +# compatibility_matrix_file = str( +# DOCS_DIR / 
"compatibility/compatibility-matrix-historical-6.0.csv" +# ) external_projects_current_project = "rocm" html_theme = "rocm_docs_theme" @@ -163,30 +163,30 @@ html_title = f"AMD ROCm {ROCM_VERSION}" numfig = False suppress_warnings = ["autosectionlabel.*"] -html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "https://rocm-stg.amd.com/") -html_context = { - "project_path": {project_path}, - "gpu_type": [ - ("AMD Instinct accelerators", "intrinsic"), - ("AMD gfx families", "gfx"), - ("NVIDIA families", "nvidia"), - ], - "atomics_type": [("HW atomics", "hw-atomics"), ("CAS emulation", "cas-atomics")], - "pcie_type": [("No PCIe atomics", "nopcie"), ("PCIe atomics", "pcie")], - "memory_type": [ - ("Device DRAM", "device-dram"), - ("Migratable Host DRAM", "migratable-host-dram"), - ("Pinned Host DRAM", "pinned-host-dram"), - ], - "granularity_type": [ - ("Coarse-grained", "coarse-grained"), - ("Fine-grained", "fine-grained"), - ], - "scope_type": [("Device", "device"), ("System", "system")], -} +# html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "https://rocm-stg.amd.com/") +# html_context = { +# "project_path": {project_path}, +# "gpu_type": [ +# ("AMD Instinct accelerators", "intrinsic"), +# ("AMD gfx families", "gfx"), +# ("NVIDIA families", "nvidia"), +# ], +# "atomics_type": [("HW atomics", "hw-atomics"), ("CAS emulation", "cas-atomics")], +# "pcie_type": [("No PCIe atomics", "nopcie"), ("PCIe atomics", "pcie")], +# "memory_type": [ +# ("Device DRAM", "device-dram"), +# ("Migratable Host DRAM", "migratable-host-dram"), +# ("Pinned Host DRAM", "pinned-host-dram"), +# ], +# "granularity_type": [ +# ("Coarse-grained", "coarse-grained"), +# ("Fine-grained", "fine-grained"), +# ], +# "scope_type": [("Device", "device"), ("System", "system")], +# } if os.environ.get("READTHEDOCS", "") == "True": html_context["READTHEDOCS"] = True # temporary settings to speed up docs build for faster iteration -external_projects_remote_repository = "" -external_toc_exclude_missing = True +# external_projects_remote_repository = "" +# external_toc_exclude_missing = True diff --git a/docs/contribute/building.md b/docs/contribute/building.md deleted file mode 100644 index d4b88b071..000000000 --- a/docs/contribute/building.md +++ /dev/null @@ -1,168 +0,0 @@ - - - - - - -# Building documentation - -## GitHub - -If you open a pull request and scroll down to the summary panel, -there is a commit status section. Next to the line -`docs/readthedocs.com:advanced-micro-devices-demo`, there is a `Details` link. -If you click this, it takes you to the Read the Docs build for your pull request. - -![GitHub PR commit status](../data/contribute/commit-status.png) - -If you don't see this line, click `Show all checks` to get an itemized view. - -## Command line - -You can build our documentation via the command line using Python. - -See the `build.tools.python` setting in the [Read the Docs configuration file](https://github.com/ROCm/ROCm/blob/develop/.readthedocs.yaml) for the Python version used by Read the Docs to build documentation. - -See the [Python requirements file](https://github.com/ROCm/ROCm/blob/develop/docs/sphinx/requirements.txt) for Python packages needed to build the documentation. 
- -Use the Python Virtual Environment (`venv`) and run the following commands from the project root: - -::::{tab-set} -:::{tab-item} Linux and WSL -:sync: linux - -```sh -python3 -mvenv .venv - -.venv/bin/python -m pip install -r docs/sphinx/requirements.txt -.venv/bin/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html -``` - -::: -:::{tab-item} Windows -:sync: windows - -```powershell -python -mvenv .venv - -.venv\Scripts\python.exe -m pip install -r docs/sphinx/requirements.txt -.venv\Scripts\python.exe -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html -``` - -::: -:::: - -Navigate to `_build/html/index.html` and open this file in a web browser. - -## Visual Studio Code - -With the help of a few extensions, you can create a productive environment to author and test -documentation locally using Visual Studio (VS) Code. Follow these steps to configure VS Code: - -1. Install the required extensions: - - * Python: `(ms-python.python)` - * Live Server: `(ritwickdey.LiveServer)` - -2. Add the following entries to `.vscode/settings.json`. - - ```json - { - "liveServer.settings.root": "/.vscode/build/html", - "liveServer.settings.wait": 1000, - "python.terminal.activateEnvInCurrentTerminal": true - } - ``` - - * `liveServer.settings.root`: Sets the root of the output website for live previews. Must be changed - alongside the `tasks.json` command. - * `liveServer.settings.wait`: Tells the live server to wait with the update in order to give Sphinx time to - regenerate the site contents and not refresh before the build is complete. - * `python.terminal.activateEnvInCurrentTerminal`: Activates the automatic virtual environment, so you - can build the site from the integrated terminal. - -3. Add the following tasks to `.vscode/tasks.json`. - - ```json - { - "version": "2.0.0", - "tasks": [ - { - "label": "Build Docs", - "type": "process", - "windows": { - "command": "${workspaceFolder}/.venv/Scripts/python.exe" - }, - "command": "${workspaceFolder}/.venv/bin/python3", - "args": [ - "-m", - "sphinx", - "-j", - "auto", - "-T", - "-b", - "html", - "-d", - "${workspaceFolder}/.vscode/build/doctrees", - "-D", - "language=en", - "${workspaceFolder}/docs", - "${workspaceFolder}/.vscode/build/html" - ], - "problemMatcher": [ - { - "owner": "sphinx", - "fileLocation": "absolute", - "pattern": { - "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$", - "file": 1, - "line": 2, - "severity": 3, - "message": 4 - } - }, - { - "owner": "sphinx", - "fileLocation": "absolute", - "pattern": { - "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$", - "file": 1, - "severity": 2, - "message": 3 - } - } - ], - "group": { - "kind": "build", - "isDefault": true - } - } - ] - } - ``` - - > Implementation detail: two problem matchers were needed to be defined, - > because VS Code doesn't tolerate some problem information being potentially - > absent. While a single regex could match all types of errors, if a capture - > group remains empty (the line number doesn't show up in all warning/error - > messages) but the `pattern` references said empty capture group, VS Code - > discards the message completely. - -4. Configure the Python virtual environment (`venv`). - - From the Command Palette, run `Python: Create Environment`. Select `venv` environment and - `docs/sphinx/requirements.txt`. - -5. Build the docs. 
- - Launch the default build task using one of the following options: - - * A hotkey (the default is `Ctrl+Shift+B`) - * Issuing the `Tasks: Run Build Task` from the Command Palette - -6. Open the live preview. - - Navigate to the site output within VS Code: right-click on `.vscode/build/html/index.html` and - select `Open with Live Server`. The contents should update on every rebuild without having to - refresh the browser. diff --git a/docs/contribute/contributing.md b/docs/contribute/contributing.md deleted file mode 100644 index 4cdb86ea0..000000000 --- a/docs/contribute/contributing.md +++ /dev/null @@ -1,77 +0,0 @@ - - - - - - -# Contributing to the ROCm documentation - -The ROCm documentation, like all of ROCm, is open source and available on GitHub. You can contribute to the ROCm documentation by forking the appropriate repository, making your changes, and opening a pull request. - -To provide feedback on the ROCm documentation, including submitting an issue or suggesting a feature, see [Providing feedback about the ROCm documentation](./feedback.md). - -## The ROCm repositories - -The repositories for ROCm and all ROCm components are available on GitHub. - -| Module | Documentation location | -| --- | --- | -| ROCm framework | [https://github.com/ROCm/ROCm/tree/develop/docs](https://github.com/ROCm/ROCm/tree/develop/docs) | -| ROCm installation for Linux | [https://github.com/ROCm/rocm-install-on-linux/tree/develop/docs](https://github.com/ROCm/rocm-install-on-linux/tree/develop/docs) | -| ROCm HIP SDK installation for Windows | [https://github.com/ROCm/rocm-install-on-windows/tree/develop/docs](https://github.com/ROCm/rocm-install-on-windows/tree/develop/docs) | - -Individual components have their own repositories with their own documentation in their own `docs` folders. - -The sub-folders within the `docs` folders across ROCm are typically structured as follows: - -| Sub-folder name | Documentation type | -|-------|----------| -| `install` | Installation instructions, build instructions, and prerequisites | -| `conceptual` | Important concepts | -| `how-to` | How to implement specific use cases | -| `tutorials` | Tutorials | -| `reference` | API references and other reference resources | - -## Editing and adding to the documentation - -ROCm documentation follows the [Google developer documentation style guide](https://developers.google.com/style/highlights). - -Most topics in the ROCm documentation are written in [reStructuredText (rst)](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html), with some topics written in Markdown. Only use reStructuredText when adding new topics. Only use Markdown if the topic you are editing is already in Markdown. - -To edit or add to the documentation: - -1. Fork the repository you want to add to or edit. -2. Clone your fork locally. -3. Create a new local branch cut from the `develop` branch of the repository. -4. Make your changes to the documentation. - -5. Optionally, build the documentation locally before creating a pull request by running the following commands from within the `docs` folder: - - ```bash - pip3 install -r sphinx/requirements.txt # You only need to run this command once - python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html - ``` - - The output files will be located in the `docs/_build` folder. Open `docs/_build/html/index.html` to view the documentation. - - For more information on ROCm build tools, see [Documentation toolchain](toolchain.md). -6. Push your changes. 
A GitHub link will be returned in the output of the `git push` command. Open this link in a browser to create the pull request. - - The documentation is built as part of the checks on pull request, along with spell checking and linting. Scroll to the bottom of your pull request to view all the checks. - - Verify that the linting and spell checking have passed, and that the documentation was built successfully. New words or acronyms can be added to the [wordlist file](https://github.com/ROCm/rocm-docs-core/blob/develop/.wordlist.txt). The wordlist is subject to approval by the ROCm documentation team. - - The Read The Docs build of your pull request can be accessed by clicking on the Details link next to the Read The Docs build check. Verify that your changes are in the build and look as expected. - - ![The GitHub checks are collapsed by default and can be accessed by clicking on "Show All Checks".](../data/contribute/GitHubCheck-Highlight.png) - - ![The Read The Docs Build is accessed from the Details link in the Read The Docs check.](../data/contribute/GitHub-ReadThe-Docs-Highlight.png) - - Your pull request will be reviewed by a member of the ROCm documentation team. - -See the [GitHub documentation](https://docs.github.com/en) for information on how to fork and clone a repository, and how to create and push a local branch. - -```{important} -By creating a pull request (PR), you agree to allow your contribution to be licensed under the terms of the -LICENSE.txt file in the corresponding repository. Different repositories can use different licenses. -``` diff --git a/docs/contribute/feedback.md b/docs/contribute/feedback.md deleted file mode 100644 index 2adf8a78e..000000000 --- a/docs/contribute/feedback.md +++ /dev/null @@ -1,27 +0,0 @@ - - - - - - -# Providing feedback about the ROCm documentation - -Feedback about the ROCm documentation is welcome. You can provide feedback about the ROCm documentation either through GitHub Discussions or GitHub Issues. - -## Participating in discussions through GitHub Discussions - -You can ask questions, view announcements, suggest new features, and communicate with other members of the community through [GitHub Discussions](https://github.com/ROCm/ROCm/discussions). - -## Submitting issues through GitHub Issues - -You can submit issues through [GitHub Issues](https://github.com/ROCm/ROCm/issues). - -When creating a new issue, follow the following guidelines: - -1. Always do a search to see if the same issue already exists. If the issue already exists, upvote it, and comment or post to provide any additional details you might have. -2. If you find an issue that is similar to your issue, log your issue, then add a comment that includes a link to the similar issue, as well as its issue number. -3. Always provide as much information as possible. This helps reduce the time required to reproduce the issue. - -After creating your issue, make sure to check it regularly for any requests for additional information. - -For information about contributing content to the ROCm documentation, see [Contributing to the ROCm documentation](./contributing.md). diff --git a/docs/contribute/toolchain.md b/docs/contribute/toolchain.md deleted file mode 100644 index 33189ec24..000000000 --- a/docs/contribute/toolchain.md +++ /dev/null @@ -1,46 +0,0 @@ - - - - - - -# ROCm documentation toolchain - -The ROCm documentation relies on several open source toolchains and sites. 
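-
-Most of the tools described below are ordinary Python packages. As a rough sketch only (the authoritative, pinned versions live in each repository's `docs/sphinx/requirements.txt`, so prefer installing from that file), a local setup might look like this:
-
-```sh
-# Approximate package set behind the ROCm documentation toolchain;
-# rocm-docs-core is expected to pull in most of the others as dependencies.
-pip install rocm-docs-core sphinx sphinx-external-toc sphinx-book-theme sphinx-design breathe
-```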
- -## rocm-docs-core - -[rocm-docs-core](https://github.com/ROCm/rocm-docs-core) is an AMD-maintained -project that applies customizations for the ROCm documentation. This project is the tool most ROCm repositories use as part of their documentation build pipeline. It is available as a [pip package on PyPI](https://pypi.org/project/rocm-docs-core/). - -See the user and developer guides for rocm-docs-core at -{doc}`rocm-docs-core documentation`. - -## Sphinx - -[Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator originally used for Python. It is now widely used in the open source community. - -### Sphinx External ToC - -[Sphinx External ToC](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html) is a Sphinx extension used for ROCm documentation navigation. This tool generates a navigation menu on the left -based on a YAML file (`_toc.yml.in`) that contains the table of contents. - -### Sphinx-book-theme - -[Sphinx-book-theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme that defines the base appearance for ROCm documentation. ROCm documentation applies some customization, such as a custom header and footer, on top of the Sphinx Book Theme. - -### Sphinx Design - -[Sphinx design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that adds design functionality. ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs. - -## Doxygen - -[Doxygen](https://www.doxygen.nl/) is a documentation generator that extracts information from in-code comments. It is used for API documentation. - -## Breathe - -[Breathe](https://www.breathe-doc.org/) is a Sphinx plugin for integrating Doxygen content. - -## Read the Docs - -[Read the Docs](https://docs.readthedocs.io/en/stable/) is the service that builds and hosts the HTML version of the ROCm documentation. diff --git a/docs/how-to/Bar-Memory.rst b/docs/how-to/Bar-Memory.rst deleted file mode 100644 index ba9f47b4a..000000000 --- a/docs/how-to/Bar-Memory.rst +++ /dev/null @@ -1,99 +0,0 @@ -.. meta:: - :description: Learn about BAR configuration in AMD GPUs and ways to troubleshoot physical addressing limits - :keywords: BAR memory, MMIO, GPU memory, Physical Addressing Limit, AMD, ROCm - -************************************** -Troubleshoot BAR access limitation -************************************** -Direct Memory Access (DMA) to PCIe devices using Base Address Registers (BARs) can be restricted due to physical addressing limits. These restrictions can result in data access failures between the system components. Peer-to-peer (P2P) DMA is used to access resources such as registers and memory between devices. PCIe devices need memory-mapped input/output (MMIO) space for DMA, and these MMIO spaces are defined in the PCIe BARs. - -These BARs are a set of 32-bit or 64-bit registers that are used to define the resources that PCIe devices provide. The CPU and other system devices also use these to access the resources of the PCIe devices. P2P DMA only works when one device can directly access the local BAR memory of another. If the memory address of a BAR exceeds the physical addressing limit of a device, the device will not be able to access that BAR. This could be the device's own BAR or the BAR of another device in the system. - -Likewise, if a remote BAR's memory address exceeds the physical addressing limit of the device, the device will not be able to access that remote BAR.
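-
-As a quick inspection aid, you can list the memory regions (BARs) that the system assigned to your AMD GPUs with ``lspci``; ``1002`` is the AMD PCI vendor ID, and ``sudo`` may be needed to see full details. Addresses reported above a device's physical addressing limit point to BARs that peer devices might not be able to reach.
-
-.. code:: shell
-
-   # Show AMD devices (PCI vendor ID 1002); the "Memory at ..." lines are the BAR assignments.
-   sudo lspci -v -d 1002: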
- -To handle any BAR access issues that might occur, you need to be aware of the physical address limitations of the devices and understand the :ref:`BAR configuration of AMD GPUs `. This information is important when setting up additional MMIO apertures for PCIe devices in the system's physical address space. - -Handling physical address limitation -============================================= -When a system boots, the system BIOS allocates the physical address space for the components in the system, including system memory and MMIO apertures. On modern 64-bit platforms, there are generally two or more MMIO apertures: one located below 4 GB of physical address space for 32-bit compatibility, and one or more above 4 GB for devices needing more space. - -You can control the memory address of the high MMIO aperture from the system BIOS configuration options. This lets you configure the additional MMIO space to align with the physical addressing limit and allows P2P DMA between the devices. For example, if a PCIe device is limited to 44-bit of physical addressing, you should ensure that the MMIO aperture is set below 44-bit in the system physical address space. - -There are two ways to handle this: - -* Ensure that the high MMIO aperture is within the physical addressing limits of the devices in the system. For example, if the devices have a 44-bit physical addressing limit, set the ``MMIO High Base`` and ``MMIO High size`` options in the BIOS such that the aperture is within the 44-bit address range, and ensure that the ``Above 4G Decoding`` option is Enabled. - -* Enable the Input-Output Memory Management Unit (IOMMU). When the IOMMU is enabled in non-passthrough mode, it will create a virtual I/O address space for each device on the system. It also ensures that all virtual addresses created in that space are within the physical addressing limits of the device. For more information on IOMMU, see `Input-Output Memory Management Unit (IOMMU) `_. - -.. _bar-configuration: - -BAR configuration for AMD GPUs -================================================ - -The following table shows how the BARs are configured for AMD GPUs. - - -.. list-table:: - :widths: 25 25 50 - :header-rows: 1 - - * - BAR Type - - Value - - Description - * - BAR0-1 registers - - 64-bit, Prefetchable, GPU memory - - 8 GB or 16 GB depending on GPU. Set to less than 2^44 to support P2P access from other GPUs with a 44-bit physical address limit. Prefetchable memory enables faster read operation for high-performance computing (HPC) by fetching the contiguous data from the same data source even before requested as an anticipation of a future request. - * - BAR2-3 registers - - 64-bit, Prefetchable, Doorbell - - Set to less than 2^44 to support P2P access from other GPUs with a 44-bit physical address limit. As a Doorbell BAR, it indicates to the GPU that a new operation is in its queue to be processed. - * - BAR4 register - - Optional - - Not a boot device - * - BAR5 register - - 32-bit, Non-prefetchable, MMIO - - Is set to less than 4 GB. - -Example of BAR usage on AMD GPUs -------------------------------------- -Following is an example configuration of BARs set by the system BIOS on GFX8 GPUs with the 40-bit physical addressing limit: - -.. code:: shell - - 11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO - Series] (rev c1) - - Subsystem: Advanced Micro Devices, Inc. 
[AMD/ATI] Device 0b35 - - Flags: bus master, fast devsel, latency 0, IRQ 119 - - Memory at bf40000000 (64-bit, prefetchable) [size=256M] - - Memory at bf50000000 (64-bit, prefetchable) [size=2M] - - I/O ports at 3000 [size=256] - - Memory at c7400000 (32-bit, non-prefetchable) [size=256K] - - Expansion ROM at c7440000 [disabled] [size=128K] - -Details of the BARs configured in the example are: - -**GPU Frame Buffer BAR:** ``Memory at bf40000000 (64-bit, prefetchable) [size=256M]`` - -The size of the BAR in the example is 256 MB. Generally, it will be the size of the GPU memory (typically 4 GB+). Depending upon the physical address limit and generation of AMD GPUs, the BAR can be set below 2^40, 2^44, or 2^48. - -**Doorbell BAR:** ``Memory at bf50000000 (64-bit, prefetchable) [size=2M]`` - -The size of the BAR should typically be less than 10 MB for this generation of GPUs and has been set to 2 MB in the example. This BAR is placed below 2^40 to allow peer-to-peer access from other generations of AMD GPUs. - -**I/O BAR:** ``I/O ports at 3000 [size=256]`` - -This is for legacy VGA and boot device support. Because the GPUs used are not connected to a display (VGA devices), this is not a concern, even if it isn't set up in the system BIOS. - -**MMIO BAR:** ``Memory at c7400000 (32-bit, non-prefetchable) [size=256K]`` - -The AMD Driver requires this to access the configuration registers. Since the remainder of the BAR available is only 1 DWORD (32-bit), this is set to less than 4 GB. In the example, it is fixed at 256 KB. - -**Expansion ROM:** ``Expansion ROM at c7440000 [disabled] [size=128K]`` - -This is required by the AMD Driver to access the GPU video-BIOS. In the example, it is fixed at 128 KB. \ No newline at end of file diff --git a/docs/how-to/build-rocm.rst b/docs/how-to/build-rocm.rst deleted file mode 100644 index 6067874ab..000000000 --- a/docs/how-to/build-rocm.rst +++ /dev/null @@ -1,24 +0,0 @@ -.. meta:: - :description: Build ROCm from source - :keywords: build ROCm, source, ROCm source, ROCm, repo, make, makefile - - -.. _building-rocm: - -************************************************************* -Build ROCm from source -************************************************************* - -ROCm is an open-source stack that you can build from source code. The source code is available from ``__. - - -The general steps to build ROCm are: - -#. Clone the ROCm source code -#. Prepare the build environment -#. Run the build command - -Because the ROCm stack is constantly evolving, the most current instructions are stored with the source code in GitHub. -For detailed build instructions, see `Getting and Building ROCm from Source `_. - - diff --git a/docs/how-to/deep-learning-rocm.rst b/docs/how-to/deep-learning-rocm.rst deleted file mode 100644 index fb21328f8..000000000 --- a/docs/how-to/deep-learning-rocm.rst +++ /dev/null @@ -1,159 +0,0 @@ -.. meta:: - :description: How to install deep learning frameworks for ROCm - :keywords: deep learning, frameworks, ROCm, install, PyTorch, TensorFlow, JAX, MAGMA, DeepSpeed, ML, AI - -********************************** -Deep learning frameworks for ROCm -********************************** - -Deep learning frameworks provide environments for machine learning, training, fine-tuning, inference, and performance optimization. - -ROCm offers a complete ecosystem for developing and running deep learning applications efficiently.
It also provides ROCm-compatible versions of popular frameworks and libraries, such as PyTorch, TensorFlow, JAX, and others. - -The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs and accelerators. - -The table below summarizes information about ROCm-enabled deep learning frameworks. It includes details on ROCm compatibility and third-party tool support, installation steps and options, and links to GitHub resources. For a complete list of supported framework versions on ROCm, see the :doc:`Compatibility matrix <../compatibility/compatibility-matrix>` topic. - -.. list-table:: - :header-rows: 1 - :widths: 5 3 6 3 - - * - Framework - - Installation - - Installation options - - GitHub - - * - `PyTorch `__ - - .. raw:: html - - - - - - `Docker image `__ - - `Wheels package `__ - - `ROCm Base Docker image `__ - - `Upstream Docker file `__ - - .. raw:: html - - - - * - `TensorFlow `__ - - .. raw:: html - - - - - - `Docker image `__ - - `Wheels package `__ - - - .. raw:: html - - - - * - `JAX `__ - - .. raw:: html - - - - - - `Docker image `__ - - .. raw:: html - - - - * - `verl `__ - - .. raw:: html - - - - - - `Docker image `__ - - .. raw:: html - - - - * - `Stanford Megatron-LM `__ - - .. raw:: html - - - - - - `Docker image `__ - - .. raw:: html - - - - * - `DGL `__ - - .. raw:: html - - - - - - `Docker image `__ - - .. raw:: html - - - - * - `Megablocks `__ - - .. raw:: html - - - - - - `Docker image `__ - - .. raw:: html - - - - * - `Taichi `__ - - .. raw:: html - - - - - - `Docker image `__ - - `Wheels package `__ - - - .. raw:: html - - - - * - `Ray `__ - - .. raw:: html - - - - - - `Docker image `__ - - `Wheels package `__ - - `ROCm Base Docker image `__ - - .. raw:: html - - - - * - `llama.cpp `__ - - .. raw:: html - - - - - - `Docker image `__ - - `ROCm Base Docker image `__ - - .. raw:: html - - - - * - `FlashInfer `__ - - .. raw:: html - - - - - - `Docker image `__ - - `ROCm Base Docker image `__ - - .. raw:: html - - - -Learn how to use your ROCm deep learning environment for training, fine-tuning, inference, and performance optimization -through the following guides. - -* :doc:`rocm-for-ai/index` - -* :doc:`Use ROCm for training ` - -* :doc:`Use ROCm for fine-tuning LLMs ` - -* :doc:`Use ROCm for AI inference ` - -* :doc:`Use ROCm for AI inference optimization ` - diff --git a/docs/how-to/gpu-performance/mi300x.rst b/docs/how-to/gpu-performance/mi300x.rst deleted file mode 100644 index 5c13ea177..000000000 --- a/docs/how-to/gpu-performance/mi300x.rst +++ /dev/null @@ -1,27 +0,0 @@ -.. meta:: - :description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance. - :keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning - -************************************** -AMD Instinct MI300X performance guides -************************************** - -The following performance guides provide essential guidance on the necessary -steps to properly `configure your system for AMD Instinct™ MI300X accelerators -`_. -They include detailed instructions on system settings and application -:doc:`workload tuning ` to -help you leverage the maximum capabilities of these accelerators and achieve -superior performance. 
- -* `AMD Instinct MI300X system optimization `__ - covers essential system settings and system management practices to configure - your AMD Instinct MI300X system for performance. - -* :doc:`/how-to/rocm-for-ai/inference-optimization/workload` covers steps to - optimize the performance of AMD Instinct MI300X series accelerators for HPC - and deep learning operations. - -* :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm` introduces a preconfigured - environment for LLM inference, designed to help you test performance with - popular models on AMD Instinct MI300X series accelerators. diff --git a/docs/how-to/programming_guide.rst b/docs/how-to/programming_guide.rst deleted file mode 100644 index 54ebc93e9..000000000 --- a/docs/how-to/programming_guide.rst +++ /dev/null @@ -1,39 +0,0 @@ -:orphan: - -.. meta:: - :description: Programming guide - :keywords: HIP, programming guide, heterogeneous programming, AMD GPU programming - -.. _hip-programming-guide: - -******************************************************************************** -Programming guide -******************************************************************************** - -ROCm provides a robust environment for heterogeneous programs running on CPUs -and AMD GPUs. ROCm supports various programming languages and frameworks to -help developers access the power of AMD GPUs. The natively supported programming -languages are HIP (Heterogeneous-Compute Interface for Portability) and -OpenCL, but HIP bindings are available for Python and Fortran. - -HIP is an API based on C++ that provides a runtime and kernel language for GPU -programming and is the essential ROCm programming language. HIP is also designed -to be a marshalling language, allowing code written for NVIDIA CUDA to be -easily ported to run on AMD GPUs. Developers can use HIP to write kernels that -execute on AMD GPUs while maintaining compatibility with CUDA-based systems. - -OpenCL (Open Computing Language) is an open standard for cross-platform, -parallel programming of diverse processors. ROCm supports OpenCL for developers -who want to use standard frameworks across different hardware platforms, -including CPUs, GPUs, and other accelerators. For more information, see -`OpenCL `_. - -Python bindings can be found at https://github.com/ROCm/hip-python. -Python is popular in AI and machine learning applications due to available -frameworks like TensorFlow and PyTorch. - -Fortran bindings can be found at https://github.com/ROCm/hipfort. -It enables scientific, academic, and legacy applications, particularly those in -high-performance computing, to run on AMD GPUs via HIP. - -For a complete description of the HIP programming language, see the :doc:`HIP programming guide`. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst b/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst deleted file mode 100644 index 3542706db..000000000 --- a/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. meta:: - :description: How to fine-tune models with ROCm - :keywords: ROCm, LLM, fine-tuning, inference, usage, tutorial, deep learning, PyTorch, TensorFlow, JAX - -************************* -Fine-tuning and inference -************************* - -Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries ` and -:doc:`tools ` to optimize and train deep learning models. 
ROCm provides a comprehensive -ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and -ROCm-aware versions of :doc:`deep learning frameworks <../../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX. - -Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for -smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately -sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`. - -Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are -typically used in LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of -massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/index.rst b/docs/how-to/rocm-for-ai/fine-tuning/index.rst deleted file mode 100644 index fc73f9240..000000000 --- a/docs/how-to/rocm-for-ai/fine-tuning/index.rst +++ /dev/null @@ -1,26 +0,0 @@ -.. meta:: - :description: How to fine-tune LLMs with ROCm - :keywords: ROCm, LLM, fine-tuning, usage, tutorial, GPUs, Llama, accelerators - -******************************************* -Use ROCm for fine-tuning LLMs -******************************************* - -Fine-tuning is an essential technique in machine learning, where a pre-trained model, typically trained on a large-scale dataset, is further refined to achieve better performance and adapt to a particular task or dataset of interest. - -With AMD GPUs, the fine-tuning process benefits from the parallel processing capabilities and efficient resource management, ultimately leading to improved performance and faster model adaptation to the target domain. - -The ROCm™ software platform helps you optimize this fine-tuning process by supporting various optimization techniques tailored for AMD GPUs. It empowers the fine-tuning of large language models, making them accessible and efficient for specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks, models, and tools. - -Throughout the following topics, this guide discusses the goals and :ref:`challenges of fine-tuning a large language -model ` like Llama 2. In the -sections that follow, you'll find practical guides on libraries and tools to accelerate your fine-tuning. - -The AI Developer Hub contains `AMD ROCm tutorials `_ for -training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs. - -- :doc:`Conceptual overview of fine-tuning LLMs ` - -- :doc:`Fine-tuning and inference ` using a - :doc:`single-accelerator ` or - :doc:`multi-accelerator ` system. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst b/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst deleted file mode 100644 index 0d6ba2b9c..000000000 --- a/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst +++ /dev/null @@ -1,236 +0,0 @@ -.. 
meta:: - :description: Model fine-tuning and inference on a multi-GPU system - :keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference, accelerators, PyTorch, HuggingFace, torchtune - -***************************************************** -Fine-tuning and inference using multiple accelerators -***************************************************** - -This section explains how to fine-tune a model on a multi-accelerator system. See -:doc:`Single-accelerator fine-tuning ` for a single accelerator or GPU setup. - -.. _fine-tuning-llms-multi-gpu-env: - -Environment setup -================= - -This section was tested using the following hardware and software environment. - -.. list-table:: - :stub-columns: 1 - - * - Hardware - - 4 AMD Instinct MI300X accelerators - - * - Software - - ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10 - - * - Libraries - - ``transformers`` ``datasets`` ``accelerate`` ``huggingface-hub`` ``peft`` ``trl`` ``scipy`` - - * - Base model - - ``meta-llama/Llama-2-7b-chat-hf`` - -.. _fine-tuning-llms-multi-gpu-env-setup: - -Setting up the base implementation environment ----------------------------------------------- - -#. Install PyTorch for ROCm. Refer to the - :doc:`PyTorch installation guide `. For consistent - installation, it’s recommended to use official ROCm prebuilt Docker images with the framework pre-installed. - -#. In the Docker container, check the availability of ROCM-capable accelerators using the following command. - - .. code-block:: shell - - rocm-smi --showproductname - -#. Check that your accelerators are available to PyTorch. - - .. code-block:: python - - import torch - print("Is a ROCm-GPU detected? ", torch.cuda.is_available()) - print("How many ROCm-GPUs are detected? ", torch.cuda.device_count()) - - If successful, your output should look like this: - - .. code-block:: shell - - >>> print("Is a ROCm-GPU detected? ", torch.cuda.is_available()) - Is a ROCm-GPU detected? True - >>> print("How many ROCm-GPUs are detected? ", torch.cuda.device_count()) - How many ROCm-GPUs are detected? 4 - -.. tip:: - - During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal. - This tool helps you see shows which accelerators or GPUs are involved. - - -.. _fine-tuning-llms-multi-gpu-hugging-face-accelerate: - -Hugging Face Accelerate for fine-tuning and inference -=========================================================== - -`Hugging Face Accelerate `_ is a library that simplifies turning raw -PyTorch code for a single accelerator into code for multiple accelerators for LLM fine-tuning and inference. It is -integrated with `Transformers `_ allowing you to scale your PyTorch -code while maintaining performance and flexibility. - -As a brief example of model fine-tuning and inference using multiple GPUs, let's use Transformers and load in the Llama -2 7B model. - -Here, let's reuse the code in :ref:`Single-accelerator fine-tuning ` -to load the base model and tokenizer. - -Now, it's important to adjust how you load the model. Add the ``device_map`` parameter to your base model configuration. - -.. code-block:: python - - ... - base_model_name = "meta-llama/Llama-2-7b-chat-hf" - - # Load base model to GPU memory - base_model = AutoModelForCausalLM.from_pretrained( - base_model_name, - device_map = "auto", - trust_remote_code = True) - ... - # Run training - sft_trainer.train() - -.. 
note:: - - You can let Accelerate handle the device map computation by setting ``device_map`` to one of the supported options - (``"auto"``, ``"balanced"``, ``"balanced_low_0"``, ``"sequential"``). - - It's recommended to set the ``device_map`` parameter to ``“auto”`` to allow Accelerate to automatically and - efficiently allocate the model given the available resources (4 accelerators in this case). - - When you have more GPU memory available than the model size, here is the difference between each ``device_map`` - option: - - * ``"auto"`` and ``"balanced"`` evenly split the model on all available GPUs, making it possible for you to use a - batch size greater than 1. - - * ``"balanced_low_0"`` evenly splits the model on all GPUs except the first - one, and only puts on GPU 0 what does not fit on the others. This - option is great when you need to use GPU 0 for some processing of the - outputs, like when using the generate function for Transformers - models. - - * ``"sequential"`` will fit what it can on GPU 0, then move on GPU 1 and so forth. Not all GPUs might be used. - -After loading the model in this way, the model is fully ready to use the resources available to it. - -.. _fine-tuning-llms-multi-gpu-torchtune: - -torchtune for fine-tuning and inference -============================================= - -`torchtune `_ is a PyTorch-native library for easy single and multi-accelerator or -GPU model fine-tuning and inference with LLMs. - -#. Install torchtune using pip. - - .. code-block:: shell - - # Install torchtune with PyTorch release 2.2.2+ - pip install torchtune - - # To confirm that the package is installed correctly - tune --help - - The output should look like this: - - .. code-block:: shell - - usage: tune [-h] {download,ls,cp,run,validate} ... - - Welcome to the TorchTune CLI! - - options: - -h, --help show this help message and exit - - subcommands: - {download,ls,cp,run,validate} - -#. torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction - getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes. - - .. code-block:: shell - - RECIPE CONFIG - full_finetune_single_device llama2/7B_full_low_memory - llama3/8B_full_single_device - mistral/7B_full_low_memory - full_finetune_distributed llama2/7B_full - llama2/13B_full - llama3/8B_full - mistral/7B_full - gemma/2B_full - lora_finetune_single_device llama2/7B_lora_single_device - llama2/7B_qlora_single_device - llama3/8B_lora_single_device - llama3/8B_qlora_single_device - llama2/13B_qlora_single_device - mistral/7B_lora_single_device - - The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning - techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training, - evaluation, quantization, or inference recipes. - - The snippet shows the architecture of a model's YAML configuration file: - - .. code-block:: yaml - - # Model arguments - model: - _component_: torchtune.models.llama2.lora_llama2_7b - lora_attn_modules: ['q_proj', 'v_proj'] - apply_lora_to_mlp: False - apply_lora_to_output: False - lora_rank: 8 - lora_alpha: 16 - - tokenizer: - _component_: torchtune.models.llama2.llama2_tokenizer - path: /tmp/Llama-2-7b-hf/tokenizer.model - - # Dataset and sampler - dataset: - _component_: torchtune.datasets.alpaca_cleaned_dataset - train_on_input: True - -#. 
This configuration file defines the fine-tuning base model path, data set, hyper-parameters for optimizer and scheduler, - and training data type. To download the base model for fine-tuning, run the following command: - - .. code-block:: shell - - tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token - - The output directory argument for ``--output-dir`` should match the model path specified in the YAML config file. - -#. To launch ``lora_finetune_distributed`` on four devices, run the following - command: - - .. code-block:: shell - - tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora - - If successful, you should see something like the following output: - - .. code-block:: shell - - INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ... - INFO:torchtune.utils.logging:Model instantiation took 7.32 secs - INFO:torchtune.utils.logging:Memory Stats after model init: - {'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448} - INFO:torchtune.utils.logging:Optimizer and loss are initialized. - INFO:torchtune.utils.logging:Dataset and Sampler are initialized. - INFO:torchtune.utils.logging:Learning rate scheduler is initialized. - 1|111|Loss: 1.5790324211120605: 7%|█ | 114/1618 - -Read more about inference frameworks in :doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/overview.rst b/docs/how-to/rocm-for-ai/fine-tuning/overview.rst deleted file mode 100644 index f5dea82a4..000000000 --- a/docs/how-to/rocm-for-ai/fine-tuning/overview.rst +++ /dev/null @@ -1,104 +0,0 @@ -.. meta:: - :description: Conceptual overview of fine-tuning LLMs - :keywords: ROCm, LLM, Llama, fine-tuning, usage, tutorial, optimization, LoRA, walkthrough, PEFT, Reinforcement - -*************************************** -Conceptual overview of fine-tuning LLMs -*************************************** - -Large language models (LLMs) are trained on massive amounts of text data to generate coherent and fluent text. The -underlying *transformer* architecture is the fundamental building block of all LLMs. Transformers -enable LLMs to understand and generate text by capturing contextual relationships and long-range dependencies. To better -understand the philosophy of the transformer architecture, review the foundational -`Attention is all you need `_ paper. - -By further training pre-trained LLMs, the fine-tuned model can gain knowledge related to specific fields or tasks, -thereby significantly improving its performance in that field or task. The core idea of fine-tuning is to use the -parameters of the pre-trained model as the starting point for new tasks and shape it through a small amount of -specific domain or task data, expanding the original model's capability to new tasks or datasets. - -Fine-tuning can effectively improve the performance of existing pre-trained models in specific application scenarios. -Continuous training and adjustment of the parameters of the base model in the target domain or task can better capture -the semantic characteristics and patterns in specific scenarios, thereby significantly improving the key indicators of -the model in that domain or task. For example, by fine-tuning the Llama 2 model, its performance in certain applications -can be improved over the base model. - -..
_fine-tuning-llms-concept-challenge: - -The challenge of fine-tuning models -=================================== - -However, the computational cost of fine-tuning is still high, especially for complex models and large datasets, which -poses distinct challenges related to substantial computational and memory requirements. This might be a barrier for -accelerators or GPUs with low computing power or limited device memory resources. - -For example, suppose we have a language model with 7 billion (7B) parameters, represented by a weight matrix :math:`W`. -During backpropagation, the model needs to learn a :math:`ΔW` matrix, which updates the original weights to minimize the -value of the loss function. - -The weight update is as follows: :math:`W_{updated} = W + ΔW`. - -If the weight matrix :math:`W` contains 7B parameters, then the weight update matrix :math:`ΔW` should also -contain 7B parameters. Therefore, the :math:`ΔW` calculation is computationally and memory intensive. - -.. figure:: ../../../data/how-to/llm-fine-tuning-optimization/weight-update.png - :alt: Weight update diagram - - (a) Weight update in regular fine-tuning. (b) Weight update in LoRA where the product of matrix A (:math:`M\times K`) - and matrix B (:math:`K\times N`) is :math:`ΔW(M\times N)`; dimension K is a hyperparameter. By representing - :math:`ΔW` as the product of two smaller matrices (A and B) with a lower rank K, the number of trainable parameters - is significantly reduced. - -.. _fine-tuning-llms-concept-optimizations: - -Optimizations for model fine-tuning -=================================== - -Low-Rank Adaptation (LoRA) is a technique allowing fast and cost-effective fine-tuning of state-of-the-art LLMs that can -overcome this issue of high memory consumption. - -LoRA accelerates the adjustment process and reduces related memory costs. To be precise, LoRA decomposes the portion of -weight changes :math:`ΔW` into high-precision low-rank representations, which do not require the calculations of all -:math:`ΔW`. It learns the decomposition representation of :math:`ΔW` during training, as shown in -the :ref:`weight update diagram `. This is how LoRA saves on -computing resources. - -LoRA is integrated into the `Hugging Face Parameter-Efficient Fine-Tuning (PEFT) -`_ library, as well as other computation and memory efficiency optimization -variants for model fine-tuning such as `AdaLoRA `_. This -library efficiently adapts large pre-trained models to various downstream applications without fine-tuning all model -parameters. PEFT methods only fine-tune a few model parameters, significantly decreasing computational and storage -costs while yielding performance comparable to a fully fine-tuned model. PEFT is integrated with the `Hugging Face -Transformers `_ library, providing a faster and easier way to load, -train, and use large models for inference. - -To simplify running a fine-tuning implementation, the `Transformer Reinforcement Learning (TRL) -`_ library provides a set of tools to train transformer language models with -reinforcement learning, from the Supervised Fine-Tuning step (SFT), Reward Modeling step (RM), to the Proximal Policy -Optimization (PPO) step. The ``SFTTrainer`` API in TRL encapsulates these PEFT optimizations so you can easily import -their custom training configuration and run the training process. - -.. 
_fine-tuning-llms-walkthrough-desc: - -Walkthrough -=========== - -To demonstrate the benefits of LoRA and the ideal compute compatibility of using PEFT and TRL libraries on AMD -ROCm-compatible accelerators and GPUs, let's step through a comprehensive implementation of the fine-tuning process -using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X accelerators. - -Before starting, review and understand the key components of this walkthrough: - -- `Llama 2 `_: a family of large language models developed and publicly released by - Meta. Its variants range in scale from 7 billion to 70 billion parameters. - -- Fine-tuning: a critical process that refines LLMs for specialized tasks and optimizes performance. - -- LoRA: a memory-efficient implementation of LLM fine-tuning that significantly reduces the number of trainable - parameters. - -- `SFTTrainer `_: an optimized - trainer with a simple interface to easily fine-tune pre-trained models with PEFT adapters, for example, LoRA, for - memory efficiency purposes on a custom dataset. - -Continue the walkthrough in :doc:`Fine-tuning and inference `. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst b/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst deleted file mode 100644 index dc66d9a75..000000000 --- a/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst +++ /dev/null @@ -1,510 +0,0 @@ -.. meta:: - :description: Model fine-tuning and inference on a single-GPU system - :keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference, SFTTrainer - -**************************************************** -Fine-tuning and inference using a single accelerator -**************************************************** - -This section explains model fine-tuning and inference techniques on a single-accelerator system. See -:doc:`Multi-accelerator fine-tuning ` for a setup with multiple accelerators or -GPUs. - -.. _fine-tuning-llms-single-gpu-env: - -Environment setup -================= - -This section was tested using the following hardware and software environment. - -.. list-table:: - :stub-columns: 1 - - * - Hardware - - AMD Instinct MI300X accelerator - - * - Software - - ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10 - - * - Libraries - - ``transformers`` ``datasets`` ``huggingface-hub`` ``peft`` ``trl`` ``scipy`` - - * - Base model - - ``meta-llama/Llama-2-7b-chat-hf`` - -.. _fine-tuning-llms-single-gpu-env-setup: - -Setting up the base implementation environment ----------------------------------------------- - -#. Install PyTorch for ROCm. Refer to the - :doc:`PyTorch installation guide `. For a consistent - installation, it’s recommended to use official ROCm prebuilt Docker images with the framework pre-installed. - -#. In the Docker container, check the availability of ROCm-capable accelerators using the following command. - - .. code-block:: shell - - rocm-smi --showproductname - - Your output should look like this: - - .. code-block:: shell - - ============================ ROCm System Management Interface ============================ - ====================================== Product Info ====================================== - GPU[0] : Card series: AMD Instinct MI300X OAM - GPU[0] : Card model: 0x74a1 - GPU[0] : Card vendor: Advanced Micro Devices, Inc. 
[AMD/ATI] - GPU[0] : Card SKU: MI3SRIOV - ========================================================================================== - ================================== End of ROCm SMI Log =================================== - -#. Check that your accelerators are available to PyTorch. - - .. code-block:: python - - import torch - print("Is a ROCm-GPU detected? ", torch.cuda.is_available()) - print("How many ROCm-GPUs are detected? ", torch.cuda.device_count()) - - If successful, your output should look like this: - - .. code-block:: shell - - >>> print("Is a ROCm-GPU detected? ", torch.cuda.is_available()) - Is a ROCm-GPU detected? True - >>> print("How many ROCm-GPUs are detected? ", torch.cuda.device_count()) - How many ROCm-GPUs are detected? 4 - -#. Install the required dependencies. - - bitsandbytes is a library that facilitates quantization to improve the efficiency of deep learning models. Learn more - about its use in :doc:`../inference-optimization/model-quantization`. - - See the :ref:`Optimizations for model fine-tuning ` for a brief discussion on - PEFT and TRL. - - .. code-block:: shell - - # Install `bitsandbytes` for ROCm 6.0+. - # Use -DBNB_ROCM_ARCH to target a specific GPU architecture. - git clone --recurse https://github.com/ROCm/bitsandbytes.git - cd bitsandbytes - git checkout rocm_enabled_multi_backend - pip install -r requirements-dev.txt - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S . - python setup.py install - - # To leverage the SFTTrainer in TRL for model fine-tuning. - pip install trl - - # To leverage PEFT for efficiently adapting pre-trained language models. - pip install peft - - # Install the other dependencies. - pip install transformers datasets huggingface-hub scipy - -#. Check that the required packages can be imported. - - .. code-block:: python - - import torch - from datasets import load_dataset - from transformers import ( - AutoModelForCausalLM, - AutoTokenizer, - TrainingArguments - ) - from peft import LoraConfig - from trl import SFTTrainer - -.. _fine-tuning-llms-single-gpu-download-model-dataset: - -Download the base model and fine-tuning dataset ----------------------------------------------- - -#. Request access to download `Meta's official Llama model `_ from Hugging - Face. After permission is granted, log in with the following command using your personal access token: - - .. code-block:: shell - - huggingface-cli login - - .. note:: - - You can also use the `NousResearch Llama-2-7b-chat-hf `_ - as a substitute. It has the same model weights as the original. - -#. Run the following code to load the base model and tokenizer. - - .. code-block:: python - - # Base model and tokenizer names. - base_model_name = "meta-llama/Llama-2-7b-chat-hf" - - # Load base model to GPU memory. - device = "cuda:0" - base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code = True).to(device) - - # Load tokenizer. - tokenizer = AutoTokenizer.from_pretrained( - base_model_name, - trust_remote_code = True) - tokenizer.pad_token = tokenizer.eos_token - tokenizer.padding_side = "right" - -#. Now, let's fine-tune the base model for a question-and-answer task using a small dataset called - `mlabonne/guanaco-llama2-1k `_, which is a 1,000-sample - subset of the `timdettmers/openassistant-guanaco `_ dataset. - - .. code-block:: python - - # Dataset for fine-tuning. - training_dataset_name = "mlabonne/guanaco-llama2-1k" - training_dataset = load_dataset(training_dataset_name, split = "train") - - # Check the data. 
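-       # Each sample stores a single "text" field (the prompt and response already formatted for Llama 2 chat); SFTTrainer later reads this column through its dataset_text_field argument.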
- print(training_dataset) - - # Dataset 11 is a QA sample in English. - print(training_dataset[11]) - -#. With the base model and the dataset, let's start fine-tuning! - -.. _fine-tuning-llms-single-gpu-configure-params: - -Configure fine-tuning parameters --------------------------------- - -To set up ``SFTTrainer`` parameters, you can use the following code as reference. - -.. code-block:: python - - # Training parameters for SFTTrainer. - training_arguments = TrainingArguments( - output_dir = "./results", - num_train_epochs = 1, - per_device_train_batch_size = 4, - gradient_accumulation_steps = 1, - optim = "paged_adamw_32bit", - save_steps = 50, - logging_steps = 50, - learning_rate = 4e-5, - weight_decay = 0.001, - fp16=False, - bf16=False, - max_grad_norm = 0.3, - max_steps = -1, - warmup_ratio = 0.03, - group_by_length = True, - lr_scheduler_type = "constant", - report_to = "tensorboard" - ) - -.. _fine-tuning-llms-single-gpu-start: - -Fine-tuning -=========== - -In this section, you'll see two ways of training: with the LoRA technique and without. See :ref:`Optimizations for model -fine-tuning ` for an introduction to LoRA. Training with LoRA uses the -``SFTTrainer`` API with its PEFT integration. Training without LoRA forgoes these benefits. - -Compare the number of trainable parameters and training time under the two different methodologies. - -.. tab-set:: - - .. tab-item:: Fine-tuning with LoRA and PEFT - :sync: with - - 1. Configure LoRA using the following code snippet. - - .. code-block:: python - - peft_config = LoraConfig( - lora_alpha = 16, - lora_dropout = 0.1, - r = 64, - bias = "none", - task_type = "CAUSAL_LM" - ) - # View the number of trainable parameters. - from peft import get_peft_model - peft_model = get_peft_model(base_model, peft_config) - peft_model.print_trainable_parameters() - - The output should look like this. Compare the number of trainable parameters to that when fine-tuning without - LoRA and PEFT. - - .. code-block:: shell - - trainable params: 33,554,432 || all params: 6,771,970,048 || trainable%: 0.49548996469513035 - - 2. Initialize ``SFTTrainer`` with a PEFT LoRA configuration and run the trainer. - - .. code-block:: python - - # Initialize an SFT trainer. - sft_trainer = SFTTrainer( - model = base_model, - train_dataset = training_dataset, - peft_config = peft_config, - dataset_text_field = "text", - tokenizer = tokenizer, - args = training_arguments - ) - - # Run the trainer. - sft_trainer.train() - - The output should look like this: - - .. code-block:: shell - - {'loss': 1.5973, 'grad_norm': 0.25271978974342346, 'learning_rate': 4e-05, 'epoch': 0.16} - {'loss': 2.0519, 'grad_norm': 0.21817368268966675, 'learning_rate': 4e-05, 'epoch': 0.32} - {'loss': 1.6147, 'grad_norm': 0.3046981394290924, 'learning_rate': 4e-05, 'epoch': 0.48} - {'loss': 1.4124, 'grad_norm': 0.11534837633371353, 'learning_rate': 4e-05, 'epoch': 0.64} - {'loss': 1.5627, 'grad_norm': 0.09108350425958633, 'learning_rate': 4e-05, 'epoch': 0.8} - {'loss': 1.417, 'grad_norm': 0.2536439299583435, 'learning_rate': 4e-05, 'epoch': 0.96} - {'train_runtime': 197.4947, 'train_samples_per_second': 5.063, 'train_steps_per_second': 0.633, 'train_loss': 1.6194254455566406, 'epoch': 1.0} - 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [03:17<00:00, 1.58s/it] - - .. tab-item:: Fine-tuning without LoRA and PEFT - :sync: without - - 1. Use the following code to get started. - - .. 
code-block:: python - - def print_trainable_parameters(model): - # Prints the number of trainable parameters in the model. - trainable_params = 0 - all_param = 0 - for _, param in model.named_parameters(): - all_param += param.numel() - if param.requires_grad: - trainable_params += param.numel() - print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}") - - sft_trainer.peft_config = None - print_trainable_parameters(sft_trainer.model) - - The output should look like this. Compare the number of trainable parameters to that when fine-tuning with LoRA - and PEFT. - - .. code-block:: shell - - trainable params: 6,738,415,616 || all params: 6,738,415,616 || trainable%: 100.00 - - - 2. Run the trainer. - - .. code-block:: python - - # Trainer without LoRA config. - trainer_full = SFTTrainer( - model = base_model, - train_dataset = training_dataset, - dataset_text_field = "text", - tokenizer = tokenizer, - args = training_arguments - ) - - # Training. - trainer_full.train() - - The output should look like this: - - .. code-block:: shell - - {'loss': 1.5975, 'grad_norm': 0.25113457441329956, 'learning_rate': 4e-05, 'epoch': 0.16} - {'loss': 2.0524, 'grad_norm': 0.2180655151605606, 'learning_rate': 4e-05, 'epoch': 0.32} - {'loss': 1.6145, 'grad_norm': 0.2949850261211395, 'learning_rate': 4e-05, 'epoch': 0.48} - {'loss': 1.4118, 'grad_norm': 0.11036080121994019, 'learning_rate': 4e-05, 'epoch': 0.64} - {'loss': 1.5595, 'grad_norm': 0.08962831646203995, 'learning_rate': 4e-05, 'epoch': 0.8} - {'loss': 1.4119, 'grad_norm': 0.25422757863998413, 'learning_rate': 4e-05, 'epoch': 0.96} - {'train_runtime': 419.5154, 'train_samples_per_second': 2.384, 'train_steps_per_second': 0.298, 'train_loss': 1.6171623611450194, 'epoch': 1.0} - 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [06:59<00:00, 3.36s/it] - -.. _fine-tuning-llms-single-gpu-saving: - -Saving adapters or fully fine-tuned models ------------------------------------------- - -PEFT methods freeze the pre-trained model parameters during fine-tuning and add a smaller number of trainable -parameters, namely the adapters, on top of it. The adapters are trained to learn specific task information. The adapters -trained with PEFT are usually an order of magnitude smaller than the full base model, making them convenient to share, -store, and load. - -.. tab-set:: - - .. tab-item:: Saving a PEFT adapter - :sync: with - - If you're using LoRA and PEFT, use the following code to save a PEFT adapter to your system once the fine-tuning - is completed. - - .. code-block:: python - - # PEFT adapter name. - adapter_name = "llama-2-7b-enhanced-adapter" - - # Save PEFT adapter. - sft_trainer.model.save_pretrained(adapter_name) - - The saved PEFT adapter should look like this on your system: - - .. code-block:: shell - - # Access adapter directory. - cd llama-2-7b-enhanced-adapter - - # List all adapter files. - README.md adapter_config.json adapter_model.safetensors - - .. tab-item:: Saving a fully fine-tuned model - :sync: without - - If you're not using LoRA and PEFT so there is no PEFT LoRA configuration used for training, use the following code - to save your fine-tuned model to your system. - - .. code-block:: python - - # Fully fine-tuned model name. - new_model_name = "llama-2-7b-enhanced" - - # Save the fully fine-tuned model. 
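-            # Note: this writes the full set of model weights (roughly 2 bytes per parameter in fp16, or about 13 GB for a 7B model), sharded across several safetensors files.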
- trainer_full.model.save_pretrained(new_model_name) - - The saved new full model should look like this on your system: - - .. code-block:: shell - - # Access new model directory. - cd llama-2-7b-enhanced - - # List all model files. - config.json model-00002-of-00006.safetensors model-00005-of-00006.safetensors - generation_config.json model-00003-of-00006.safetensors model-00006-of-00006.safetensors - model-00001-of-00006.safetensors model-00004-of-00006.safetensors model.safetensors.index.json - -.. note:: - - PEFT adapters can't be loaded by ``AutoModelForCausalLM`` from the Transformers library as they do not contain - full model parameters and model configurations, for example, ``config.json``. To use an adapter as a normal transformer - model, you need to merge it into the base model. - -Basic model inference -===================== - -A trained model can be classified into one of three types: - -* A PEFT adapter - -* A pre-trained language model in Hugging Face - -* A fully fine-tuned model not using PEFT - -Let's look at achieving model inference using these types of models. - -.. tab-set:: - - .. tab-item:: Inference using PEFT adapters - - To use PEFT adapters like a normal transformer model, you can run the generation by loading a base model along with the PEFT - adapter as follows. - - .. code-block:: python - - from peft import PeftModel - from transformers import AutoModelForCausalLM - - # Set the path of the model or its name on the Hugging Face Hub - base_model_name = "meta-llama/Llama-2-7b-chat-hf" - - # Set the path of the adapter - adapter_name = "llama-2-7b-enhanced-adapter" - - # Load base model - base_model = AutoModelForCausalLM.from_pretrained(base_model_name) - - # Adapt the base model with the adapter - new_model = PeftModel.from_pretrained(base_model, adapter_name) - - # Then, run generation the same way as with a normal model (see the next tab) - - The PEFT library provides a ``merge_and_unload`` method, which merges the adapter layers into the base model. This is - needed if you want to save the adapted model to local storage and use it as a normal standalone model. - - .. code-block:: python - - # Load base model - base_model = AutoModelForCausalLM.from_pretrained(base_model_name) - - # Adapt the base model with the adapter - new_model = PeftModel.from_pretrained(base_model, adapter_name) - - # Merge the adapter into the base model - model = new_model.merge_and_unload() - - # Save the merged model to local storage - model.save_pretrained("merged_adapters") - - .. tab-item:: Inference using pre-trained or fully fine-tuned models - - If you have a fully fine-tuned model not using PEFT, you can load it like any other pre-trained language model on the - `Hugging Face Hub `_ using the `Transformers - `_ library. - - .. code-block:: python - - # Import relevant classes for loading the model and tokenizer - from transformers import AutoTokenizer, AutoModelForCausalLM - - # Set the pre-trained model name on the Hugging Face Hub - model_name = "meta-llama/Llama-2-7b-chat-hf" - - # Set device type - device = "cuda:0" - - # Load model and tokenizer - model = AutoModelForCausalLM.from_pretrained(model_name).to(device) - tokenizer = AutoTokenizer.from_pretrained(model_name) - - # Input prompt encoding - query = "What is a large language model?" 
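-             # Tip: you can pass max_new_tokens to model.generate() below to control the length of the generated response.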
- inputs = tokenizer.encode(query, return_tensors="pt").to(device) - - # Token generation - outputs = model.generate(inputs) - - # Outputs decoding - print(tokenizer.decode(outputs[0])) - - In addition, pipelines from Transformers offer simple APIs to use pre-trained models for different tasks, including - sentiment analysis, feature extraction, question answering and so on. You can use the pipeline abstraction to achieve - model inference easily. - - .. code-block:: python - - # Import relevant class for loading model and tokenizer - from transformers import pipeline - - # Set the path of your model or the name on Hugging face hub - model_name_or_path = "meta-llama/Llama-2-7b-chat-hf" - - # Set pipeline - # A positive device value will run the model on associated CUDA device id - pipe = pipeline("text-generation", model=model_name_or_path, device=0) - - # Token generation - print(pipe("What is a large language model?")[0]["generated_text"]) - -If using multiple accelerators, see -:ref:`Multi-accelerator fine-tuning and inference ` to explore -popular libraries that simplify fine-tuning and inference in a multi-accelerator system. - -Read more about inference frameworks like vLLM and Hugging Face TGI in -:doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`. diff --git a/docs/how-to/rocm-for-ai/index.rst b/docs/how-to/rocm-for-ai/index.rst deleted file mode 100644 index 3a97d75d6..000000000 --- a/docs/how-to/rocm-for-ai/index.rst +++ /dev/null @@ -1,30 +0,0 @@ -.. meta:: - :description: Learn how to use ROCm for AI. - :keywords: ROCm, AI, machine learning, LLM, usage, tutorial - -************************** -Use ROCm for AI -************************** - -ROCm is an open-source software platform that enables high-performance computing and machine learning applications. It features the ability to accelerate training, fine-tuning, and inference for AI application development. With ROCm, you can access the full power of AMD GPUs, which can significantly improve the performance and efficiency of AI workloads. - -You can use ROCm to perform distributed training, which enables you to train models across multiple GPUs or nodes simultaneously. Additionally, ROCm supports mixed-precision training, which can help reduce the memory and compute requirements of training workloads. For fine-tuning, ROCm provides access to various algorithms and optimization techniques. In terms of inference, ROCm provides several techniques that can help you optimize your models for deployment, such as quantization, GEMM tuning, and optimization with composable kernel. - -Overall, ROCm can be used to improve the performance and efficiency of your AI applications. With its training, fine-tuning, and inference support, ROCm provides a complete solution for optimizing AI workflows and achieving the optimum results possible on AMD GPUs. - -The AI Developer Hub contains `AMD ROCm tutorials `_ for -training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs. - -In this guide, you'll learn how to use ROCm for AI: - -- :doc:`Training ` - -- :doc:`Fine-tuning LLMs ` - -- :doc:`Inference ` - -- :doc:`Inference optimization ` - - -To learn about ROCm for HPC applications and scientific computing, see -:doc:`../rocm-for-hpc/index`. 
diff --git a/docs/how-to/rocm-for-ai/inference-optimization/index.rst b/docs/how-to/rocm-for-ai/inference-optimization/index.rst deleted file mode 100644 index 80f932a6c..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/index.rst +++ /dev/null @@ -1,36 +0,0 @@ -.. meta:: - :description: How to Use ROCm for AI inference optimization - :keywords: ROCm, LLM, AI inference, Optimization, GPUs, usage, tutorial - -******************************************* -Use ROCm for AI inference optimization -******************************************* - -AI inference optimization is the process of improving the performance of machine learning models and speeding up the inference process. It includes: - -- **Quantization**: This involves reducing the precision of model weights and activations while maintaining acceptable accuracy levels. Reduced precision improves inference efficiency because lower precision data requires less storage and better utilizes the hardware's computation power. - -- **Kernel optimization**: This technique involves optimizing computation kernels to exploit the underlying hardware capabilities. For example, the kernels can be optimized to use multiple GPU cores or utilize specialized hardware like tensor cores to accelerate the computations. - -- **Libraries**: Libraries such as Flash Attention, xFormers, and PyTorch TunableOp are used to accelerate deep learning models and improve the performance of inference workloads. - -- **Hardware acceleration**: Hardware acceleration techniques, like GPUs for AI inference, can significantly improve performance due to their parallel processing capabilities. - -- **Pruning**: This involves removing unnecessary connections, layers, or weights from a pre-trained model while maintaining acceptable accuracy levels, resulting in a smaller model that requires fewer computational resources to run inference. - -Utilizing these optimization techniques with the ROCm™ software platform can significantly reduce inference time, improve performance, and reduce the cost of your AI applications. - -Throughout the following topics, this guide discusses optimization techniques for inference workloads. - -- :doc:`Model quantization ` - -- :doc:`Model acceleration libraries ` - -- :doc:`Optimizing with Composable Kernel ` - -- :doc:`Optimizing Triton kernels ` - -- :doc:`Profiling and debugging ` - -- :doc:`Workload tuning ` - diff --git a/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst b/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst deleted file mode 100644 index 2136efa84..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst +++ /dev/null @@ -1,537 +0,0 @@ -.. meta:: - :description: How to use model acceleration techniques and libraries to improve memory efficiency and performance. - :keywords: ROCm, LLM, fine-tuning, usage, tutorial, Flash Attention, Hugging Face, xFormers, vLLM, PyTorch - -**************************** -Model acceleration libraries -**************************** - -This section discusses model acceleration techniques and libraries to improve memory efficiency and performance. - -.. _acceleration-flash-attention: - -Flash Attention 2 -================= - -Flash Attention is a technique designed to reduce memory movements between GPU SRAM and high-bandwidth memory (HBM). 
By -using a tiling approach, Flash Attention 2 improves memory locality in the nested loops of query, key, and value -computations within the Attention modules of LLMs. These modules include Multi-Head Attention (MHA), Group-Query -Attention (GQA), and Multi-Query Attention (MQA). This reduction in memory movements significantly decreases the -time-to-first-token (TTFT) latency for large batch sizes and long prompt sequences, thereby enhancing overall -performance. - -.. image:: ../../../data/how-to/llm-fine-tuning-optimization/attention-module.png - :alt: Attention module of a large language model utilizing tiling - :align: center - -Installing Flash Attention 2 ---------------------------- - -ROCm provides two different implementations of Flash Attention 2 modules. They can be deployed interchangeably: - -* ROCm `Composable Kernel `_ - (CK) Flash Attention 2 - -* `OpenAI Triton `_ Flash Attention 2 - -.. tab-set:: - - .. tab-item:: CK Flash Attention 2 - - To install CK Flash Attention 2, use the following commands. - - .. code-block:: shell - - # Install from source - git clone https://github.com/ROCm/flash-attention.git - cd flash-attention/ - GPU_ARCHS=gfx942 python setup.py install #MI300 series - - Hugging Face Transformers can easily deploy the CK Flash Attention 2 module by passing the argument - ``attn_implementation="flash_attention_2"`` to the ``from_pretrained`` method. - - .. code-block:: python - - import torch - from transformers import AutoModelForCausalLM, AutoTokenizer - device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") - model_name = "NousResearch/Meta-Llama-3-8B" - - tokenizer = AutoTokenizer.from_pretrained(model_name, torch_dtype=torch.float16, use_fast=False) - inputs = tokenizer('Today is', return_tensors='pt').to(device) - - model_eager = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, attn_implementation="eager").cuda(device) - model_ckFAv2 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda(device) - - print("eager GQA: ", tokenizer.decode(model_eager.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True)) - print("ckFAv2 GQA: ", tokenizer.decode(model_ckFAv2.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True)) - - # eager GQA: Today is the day of the Lord, and we are the - # ckFAv2 GQA: Today is the day of the Lord, and we are the - - .. tab-item:: Triton Flash Attention 2 - - The Triton Flash Attention 2 module is implemented in Python and uses OpenAI's JIT compiler. This module has been - upstreamed into the vLLM serving toolkit, discussed in :doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`. - - 1. To install Triton Flash Attention 2 and run the benchmark, use the following commands. - - .. code-block:: shell - - # Install from source - pip uninstall pytorch-triton-rocm triton -y - git clone https://github.com/ROCm/triton.git - cd triton/python - GPU_ARCHS=gfx942 python setup.py install #MI300 series - pip install matplotlib pandas - - 2. To test, run the Triton Flash Attention 2 performance benchmark. - - .. code-block:: shell - - # Run the Triton FA v2 kernel benchmark from the triton/python directory of the cloned repository - python perf-kernels/flash-attention.py - # Example results: 
- fused-attention-fwd-d128: - BATCH HQ HK N_CTX_Q N_CTX_K TFLOPS - 0 16.0 16.0 16.0 1024.0 1024.0 287.528411 - 1 8.0 16.0 16.0 2048.0 2048.0 287.490806 - 2 4.0 16.0 16.0 4096.0 4096.0 345.966031 - 3 2.0 16.0 16.0 8192.0 8192.0 361.369510 - 4 1.0 16.0 16.0 16384.0 16384.0 356.873720 - 5 2.0 48.0 48.0 1024.0 1024.0 216.916235 - 6 2.0 48.0 48.0 2048.0 1024.0 271.027578 - 7 2.0 48.0 48.0 4096.0 8192.0 337.367372 - 8 2.0 48.0 48.0 8192.0 4096.0 363.481649 - 9 2.0 48.0 48.0 16384.0 8192.0 375.013622 - 10 8.0 16.0 16.0 1989.0 15344.0 321.791333 - 11 4.0 16.0 16.0 4097.0 163.0 122.104888 - 12 2.0 16.0 16.0 8122.0 2159.0 337.060283 - 13 1.0 16.0 16.0 16281.0 7.0 5.234012 - 14 2.0 48.0 48.0 1021.0 1020.0 214.657425 - 15 2.0 48.0 48.0 2001.0 2048.0 314.429118 - 16 2.0 48.0 48.0 3996.0 9639.0 330.411368 - 17 2.0 48.0 48.0 8181.0 1021.0 324.614980 - -xFormers -======== - -xFormers also improves the performance of attention modules. Although xFormers attention performs very -similarly to Flash Attention 2 due to its tiling behavior of query, key, and value, it’s widely used for LLMs and -Stable Diffusion models with the Hugging Face Diffusers library. - -Installing CK xFormers ----------------------- - -Use the following commands to install CK xFormers. - -.. code-block:: shell - - # Install from source - git clone https://github.com/ROCm/xformers.git - cd xformers/ - git submodule update --init --recursive - PYTORCH_ROCM_ARCH=gfx942 python setup.py install #Instinct MI300-series - -PyTorch built-in acceleration -============================= - -`PyTorch compilation -mode `__ -synthesizes the model into a graph and then lowers it to prime -operators. These operators are compiled using TorchInductor, which uses -OpenAI Triton as a building block for GPU acceleration. One advantage of -PyTorch compilation mode is that its GPU kernels are written in Python, -making modifying and extending them easier. PyTorch compilation mode -often delivers higher performance, as model operations are fused before -runtime, which allows for easy deployment of high-performance kernels. - -PyTorch compilation -------------------- - -To utilize the PyTorch compilation mode, specific layers of the model -must be explicitly assigned as compilation targets. In the case of LLM, -where autoregressive token decoding generates dynamically changing -key/value sizes, limiting the key/value size to a static dimension, -``max_cache_length``, is necessary to utilize the performance benefits -of the PyTorch compilation. - -.. 
code-block:: python - - # Sample script to run LLM with the static key-value cache and PyTorch compilation - from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache - import torch - from typing import Optional - import os - device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") - os.environ["TOKENIZERS_PARALLELISM"] = "false" - model_name = "NousResearch/Meta-Llama-3-8B" - prompts = [] - - for b in range(1): - prompts.append("New york city is where " - ) - - tokenizer = AutoTokenizer.from_pretrained(model_name) - model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device).eval() - inputs = tokenizer(prompts, return_tensors="pt").to(model.device) - - def decode_one_tokens(model, cur_token, input_pos, cache_position): - logits = model(cur_token, position_ids=input_pos, cache_position=cache_position, return_dict=False, use_cache=True)[0] - new_token = torch.argmax(logits[:, -1], dim=-1)[:, None] - return new_token - - batch_size, seq_length = inputs["input_ids"].shape - - # Static key-value cache - max_cache_length = 1024 - max_new_tokens = 10 - model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length) - cache_position = torch.arange(seq_length, device=device) - generated_ids = torch.zeros(batch_size, seq_length + max_new_tokens + 1, dtype=torch.int, device=device) - generated_ids[:, cache_position] = inputs["input_ids"].to(device).to(torch.int) - - logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0] - next_token = torch.argmax(logits[:, -1], dim=-1)[:, None] - - # torch compilation - decode_one_tokens = torch.compile(decode_one_tokens, mode="max-autotune-no-cudagraphs",fullgraph=True) - - generated_ids[:, seq_length] = next_token[:, 0] - cache_position = torch.tensor([seq_length + 1], device=device) - - with torch.no_grad(): - for _ in range(1, max_new_tokens): - with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True): - next_token = decode_one_tokens(model, next_token.clone(), None, cache_position) - generated_ids[:, cache_position] = next_token.int() - cache_position += 1 - -.. _fine-tuning-llms-pytorch-tunableop: - -PyTorch TunableOp ------------------- - -ROCm PyTorch (2.2.0 and later) allows users to use high-performance ROCm -GEMM kernel libraries through PyTorch's built-in TunableOp options. -This enables users to automatically pick up the best-performing GEMM -kernels from :doc:`rocBLAS ` and :doc:`hipBLASLt ` libraries during runtime. - -During warm-up runs or offline profiling steps, users can create a GEMM Table -that enumerates the kernel information. During the model's run, the best-performing kernel substitutes -``torch.nn.functional.linear(input, weight, bias=None)`` with the kernel specified in the GEMM table. The -`Tunable GitHub `_ -page describes the options. - -.. 
code-block:: python - - # To turn on TunableOp, simply set this environment variable - export PYTORCH_TUNABLEOP_ENABLED=1 - - # Python - import torch - import torch.nn as nn - import torch.nn.functional as F - A = torch.rand(100, 20, device="cuda") - W = torch.rand(200, 20, device="cuda") - Out = F.linear(A, W) - print(Out.size()) - - # tunableop_results0.csv - Validator,PT_VERSION,2.4.0 - Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c - Validator,HIPBLASLT_VERSION,0.7.0-1549b021 - Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack- - Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty - GemmTunableOp_float_TN,tn_200_100_20,Gemm_Rocblas_32323,0.00669595 - -.. image:: ../../../data/how-to/llm-fine-tuning-optimization/tunableop.png - :alt: GEMM and TunableOp - :align: center - -Learn more about optimizing kernels with TunableOp in -:ref:`Optimizing Triton kernels `. - - -FBGEMM and FBGEMM_GPU -===================== - -FBGEMM (Facebook General Matrix Multiplication) is a low-precision, high-performance CPU kernel library -for matrix-matrix multiplications and convolutions. It is used for server-side inference -and as a back end for PyTorch quantized operators. FBGEMM offers optimized on-CPU performance for reduced precision calculations, -strong performance on native tensor formats, and the ability to generate -high-performance shape- and size-specific kernels at runtime. - -FBGEMM_GPU collects several high-performance PyTorch GPU operator libraries -for use in training and inference. It provides efficient table-batched embedding functionality, -data layout transformation, and quantization support. - -For more information about FBGEMM and FBGEMM_GPU, see the `PyTorch FBGEMM GitHub `_ -and the `PyTorch FBGEMM documentation `_. -The `Meta blog post about FBGEMM `_ provides -additional background about the library. - -Installing FBGEMM_GPU ----------------------- - -Installing FBGEMM_GPU consists of the following steps: - -* Set up an isolated Miniconda environment -* Install ROCm using Docker or the :doc:`package manager ` -* Install the nightly `PyTorch `_ build -* Complete the pre-build and build tasks - -.. note:: - - FBGEMM_GPU doesn't require the installation of FBGEMM. To optionally install - FBGEMM, see the `FBGEMM install instructions `_. - -Set up the Miniconda environment -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -To install Miniconda, use the following commands. - -#. Install a `Miniconda environment `_ for reproducible builds. - All subsequent commands run inside this environment. - - .. code-block:: shell - - export PLATFORM_NAME="$(uname -s)-$(uname -m)" - - # Set the Miniconda prefix directory - miniconda_prefix=$HOME/miniconda - - # Download the Miniconda installer - wget -q "https://repo.anaconda.com/miniconda/Miniconda3-latest-${PLATFORM_NAME}.sh" -O miniconda.sh - - # Run the installer - bash miniconda.sh -b -p "$miniconda_prefix" -u - - # Load the shortcuts - . ~/.bashrc - - # Run updates - conda update -n base -c defaults -y conda - -#. Create a Miniconda environment with Python 3.12: - - .. code-block:: shell - - env_name= - python_version=3.12 - - # Create the environment - conda create -y --name ${env_name} python="${python_version}" - - # Upgrade PIP and pyOpenSSL package - conda run -n ${env_name} pip install --upgrade pip - conda run -n ${env_name} python -m pip install pyOpenSSL>22.1.0 - -#. Install additional build tools: - - .. 
code-block:: shell - - conda install -n ${env_name} -y \ - click \ - cmake \ - hypothesis \ - jinja2 \ - make \ - ncurses \ - ninja \ - numpy \ - scikit-build \ - wheel - -Install the ROCm components -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -FBGEMM_GPU can run in a ROCm Docker container or in conjunction with the full ROCm installation. -The Docker method is recommended because it requires fewer steps and provides a stable environment. - -To run FBGEMM_GPU in the Docker container, pull the `Minimal Docker image for ROCm `_. -This image includes all preinstalled ROCm packages required to integrate FBGEMM. To pull -and run the ROCm Docker image, use this command: - -.. code-block:: shell - - # Run for ROCm 6.2.0 - docker run -it --network=host --shm-size 16G --device=/dev/kfd --device=/dev/dri --group-add video \ - --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host rocm/rocm-terminal:6.2 /bin/bash - -.. note:: - - The `Full Docker image for ROCm `_, which includes all - ROCm packages, can also be used. However, it results in a very large container, so the minimal - Docker image is recommended. - -You can also install ROCm using the package manager. FBGEMM_GPU requires the installation of the full ROCm package. -For more information, see :doc:`the ROCm installation guide `. -The ROCm package also requires the :doc:`MIOpen ` component as a dependency. -To install MIOpen, use the ``apt install`` command. - -.. code-block:: shell - - apt install hipify-clang miopen-hip miopen-hip-dev - -Install PyTorch -^^^^^^^^^^^^^^^^^^^^^^^ - -Install `PyTorch `_ using ``pip`` for the most reliable and consistent results. - -#. Install the nightly PyTorch build using ``pip``. - - .. code-block:: shell - - # Install the latest nightly, ROCm variant - conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2/ - -#. Ensure PyTorch loads correctly. Verify the version and variant of the installation using an ``import`` test. - - .. code-block:: shell - - # Ensure that the package loads properly - conda run -n ${env_name} python -c "import torch.distributed" - - # Verify the version and variant of the installation - conda run -n ${env_name} python -c "import torch; print(torch.__version__)" - -Perform the pre-build and build -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -#. Clone the FBGEMM repository and the relevant submodules. Use ``pip`` to install the - components in ``requirements.txt``. Run the following commands inside the Miniconda environment. - - .. code-block:: shell - - # Select a version tag - FBGEMM_VERSION=v0.8.0 - - # Clone the repo along with its submodules - git clone https://github.com/pytorch/FBGEMM.git --branch=${FBGEMM_VERSION} --recursive fbgemm_${FBGEMM_VERSION} - - # Install additional required packages for building and testing - cd fbgemm_${FBGEMM_VERSION}/fbgemm_gpu - pip install -r requirements.txt - -#. Clear the build cache to remove stale build information. - - .. code-block:: shell - - # !! Run in fbgemm_gpu/ directory inside the Conda environment !! - - python setup.py clean - -#. Set the wheel build variables, including the package name, Python version tag, and Python platform name. - - .. code-block:: shell - - # Set the package name depending on the build variant - export package_name=fbgemm_gpu_rocm - - # Set the Python version tag. 
It should follow the convention `py`, - # for example, Python 3.12 --> py312 - export python_tag=py312 - - # Determine the processor architecture - export ARCH=$(uname -m) - - # Set the Python platform name for the Linux case - export python_plat_name="manylinux2014_${ARCH}" - -#. Build FBGEMM_GPU for the ROCm platform. Set ``ROCM_PATH`` to the path to your ROCm installation. - Run these commands from the ``fbgemm_gpu/`` directory inside the Miniconda environment. - - .. code-block:: shell - - # !! Run in the fbgemm_gpu/ directory inside the Conda environment !! - - export ROCM_PATH= - - # Build for the target architecture of the ROCm device installed on the machine (for example, 'gfx942;gfx90a') - # See :doc:`The Linux system requirements <../../reference/system-requirements>` for a list of supported GPUs. - export PYTORCH_ROCM_ARCH=$(${ROCM_PATH}/bin/rocminfo | grep -o -m 1 'gfx.*') - - # Build the wheel artifact only - python setup.py bdist_wheel \ - --package_variant=rocm \ - --python-tag="${python_tag}" \ - --plat-name="${python_plat_name}" \ - -DHIP_ROOT_DIR="${ROCM_PATH}" \ - -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \ - -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA" - - # Build and install the library into the Conda environment - python setup.py install \ - --package_variant=rocm \ - -DHIP_ROOT_DIR="${ROCM_PATH}" \ - -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \ - -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA" - -Post-build validation ----------------------- - -After building FBGEMM_GPU, run some verification checks to ensure the build is correct. Continue -to run all commands inside the ``fbgemm_gpu/`` directory inside the Miniconda environment. - -#. The build process generates many build artifacts and C++ templates, so - it is important to confirm no undefined symbols remain. - - .. code-block:: shell - - # !! Run in fbgemm_gpu/ directory inside the Conda environment !! - - # Locate the built .SO file - fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so) - - # Check that the undefined symbols don't include fbgemm_gpu-defined functions - nm -gDCu "${fbgemm_gpu_lib_path}" | sort - -#. Verify the referenced version number of ``GLIBCXX`` and the presence of certain function symbols: - - .. code-block:: shell - - # !! Run in fbgemm_gpu/ directory inside the Conda environment !! - - # Locate the built .SO file - fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so) - - # Note the versions of GLIBCXX referenced by the .SO - # The libstdc++.so.6 available on the install target must support these versions - objdump -TC "${fbgemm_gpu_lib_path}" | grep GLIBCXX | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat - - # Test for the existence of a given function symbol in the .SO - nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::merge_pooled_embeddings(" - nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::jagged_2d_to_dense(" - -Testing FBGEMM ----------------------- - -FBGEMM includes tests and benchmarks to validate performance. To run these tests, -you must use ROCm 5.7 or a more recent version on the host and container. To run FBGEMM tests, -follow these instructions: - -.. code-block:: shell - - # !! Run inside the Conda environment !! - - # From the /fbgemm_gpu/ directory - cd test - - export FBGEMM_TEST_WITH_ROCM=1 - # Enable for debugging failed kernel executions - export HIP_LAUNCH_BLOCKING=1 - - # Run the test - python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning split_table_batched_embeddings_test.py - -To run the FBGEMM_GPU ``uvm`` test, use these commands. 
These tests only support the AMD MI210 and -more recent accelerators. - -.. code-block:: shell - - # Run this inside the Conda environment from the /fbgemm_gpu/ directory - export HSA_XNACK=1 - cd test - - python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning ./uvm/uvm_test.py diff --git a/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst b/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst deleted file mode 100644 index 8a14f42cf..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst +++ /dev/null @@ -1,424 +0,0 @@ -.. meta:: - :description: How to use model quantization techniques to speed up inference. - :keywords: ROCm, LLM, fine-tuning, usage, tutorial, quantization, Quark, GPTQ, transformers, bitsandbytes - -***************************** -Model quantization techniques -***************************** - -Quantization reduces the model size compared to its native full-precision version, making it easier to fit large models -onto accelerators or GPUs with limited memory usage. This section explains how to perform LLM quantization using AMD Quark, GPTQ -and bitsandbytes on AMD Instinct hardware. - -.. _quantize-llms-quark: - -AMD Quark -========= - -`AMD Quark `_ offers the leading efficient and scalable quantization solution tailored to AMD Instinct GPUs. It supports ``FP8`` and ``INT8`` quantization for activations, weights, and KV cache, -including ``FP8`` attention. For very large models, it employs a two-level ``INT4-FP8`` scheme—storing weights in ``INT4`` while computing with ``FP8``—for nearly 4× compression without sacrificing accuracy. -Quark scales efficiently across multiple GPUs, efficiently handling ultra-large models like Llama-3.1-405B. Quantized ``FP8`` models like Llama, Mixtral, and Grok-1 are available under the `AMD organization on Hugging Face `_, and can be deployed directly via `vLLM `_. - -Installing Quark -------------------- - -The latest release of Quark can be installed with pip - -.. code-block:: shell - - pip install amd-quark - -For detailed installation instructions, refer to the `Quark documentation `_. - - -Using Quark for quantization ------------------------------ - -#. First, load the pre-trained model and its corresponding tokenizer using the Hugging Face ``transformers`` library. - - .. code-block:: python - - from transformers import AutoTokenizer, AutoModelForCausalLM - - MODEL_ID = "meta-llama/Llama-2-70b-chat-hf" - MAX_SEQ_LEN = 512 - - model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, device_map="auto", torch_dtype="auto", - ) - model.eval() - - tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN) - tokenizer.pad_token = tokenizer.eos_token - -#. Prepare the calibration DataLoader (static quantization requires calibration data). - - .. code-block:: python - - from datasets import load_dataset - from torch.utils.data import DataLoader - - BATCH_SIZE = 1 - NUM_CALIBRATION_DATA = 512 - - dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation") - text_data = dataset["text"][:NUM_CALIBRATION_DATA] - - tokenized_outputs = tokenizer( - text_data, return_tensors="pt", padding=True, truncation=True, max_length=MAX_SEQ_LEN - ) - calib_dataloader = DataLoader( - tokenized_outputs['input_ids'], batch_size=BATCH_SIZE, drop_last=True - ) - -#. Define the quantization configuration. See the comments in the following code snippet for descriptions of each configuration option. - - .. 
code-block:: python - - from quark.torch.quantization import (Config, QuantizationConfig, - FP8E4M3PerTensorSpec) - - # Define the FP8/per-tensor/static spec. - FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max", - is_dynamic=False).to_quantization_spec() - - # Define the global quantization config; input tensors and weights use FP8_PER_TENSOR_SPEC. - global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC, - weight=FP8_PER_TENSOR_SPEC) - - # Define the quantization config for KV cache layers; output tensors use FP8_PER_TENSOR_SPEC. - KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC - kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"] - kv_cache_quant_config = {name : - QuantizationConfig(input_tensors=global_quant_config.input_tensors, - weight=global_quant_config.weight, - output_tensors=KV_CACHE_SPEC) - for name in kv_cache_layer_names_for_llama} - layer_quant_config = kv_cache_quant_config.copy() - - EXCLUDE_LAYERS = ["lm_head"] - quant_config = Config( - global_quant_config=global_quant_config, - layer_quant_config=layer_quant_config, - kv_cache_quant_config=kv_cache_quant_config, - exclude=EXCLUDE_LAYERS) - -#. Quantize the model and export it. - - .. code-block:: python - - import torch - from quark.torch import ModelQuantizer, ModelExporter - from quark.torch.export import ExporterConfig, JsonExporterConfig - - # Apply quantization. - quantizer = ModelQuantizer(quant_config) - quant_model = quantizer.quantize_model(model, calib_dataloader) - - # Freeze the quantized model to export it. - freezed_model = quantizer.freeze(model) - - # Define the export config. - LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"] - export_config = ExporterConfig(json_export_config=JsonExporterConfig()) - export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP - - EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor" - exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR) - with torch.no_grad(): - exporter.export_safetensors_model(freezed_model, - quant_config=quant_config, tokenizer=tokenizer) - -Evaluating the quantized model with vLLM ---------------------------------------- - -The exported Quark-quantized model can be loaded directly by vLLM for inference. You need to specify the model path and inform vLLM about the quantization method (``quantization='quark'``) and the KV cache data type (``kv_cache_dtype='fp8'``). -Use the ``LLM`` interface to load the model: - -.. code-block:: python - - from vllm import LLM, SamplingParams - - # Sample prompts. - prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", - ] - # Create a sampling params object. - sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - - # Create an LLM. - llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor", - kv_cache_dtype='fp8', quantization='quark') - # Generate texts from the prompts. The output is a list of RequestOutput objects - # that contain the prompt, generated text, and other information. - outputs = llm.generate(prompts, sampling_params) - # Print the outputs. - print("\nGenerated Outputs:\n" + "-" * 60) - for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}") - print(f"Output: {generated_text!r}") - print("-" * 60) - -You can also evaluate the quantized model's accuracy on standard benchmarks using the `lm-evaluation-harness `_. 
Pass the necessary vLLM arguments to ``lm_eval`` via ``--model_args``. - -.. code-block:: shell - - lm_eval --model vllm \ - --model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor,kv_cache_dtype='fp8',quantization='quark' \ - --tasks gsm8k - -This provides a standardized way to measure the performance impact of quantization. -.. _fine-tune-llms-gptq: - -GPTQ -==== - -GPTQ is a post-training quantization technique where each row of the weight matrix is quantized independently to find a -version of the weights that minimizes error. These weights are quantized to ``int4`` but are restored to ``fp16`` on the -fly during inference. This can save your memory usage by a factor of four. A speedup in inference is expected because -inference of GPTQ models uses a lower bit width, which takes less time to communicate. - -Before setting up the GPTQ configuration in Transformers, ensure the `AutoGPTQ `_ library -is installed. - -Installing AutoGPTQ -------------------- - -The AutoGPTQ library implements the GPTQ algorithm. - -#. Use the following command to install the latest stable release of AutoGPTQ from pip. - - .. code-block:: shell - - # This will install pre-built wheel for a specific ROCm version. - - pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ - - Or, install AutoGPTQ from source for the appropriate ROCm version (for example, ROCm 6.1). - - .. code-block:: shell - - # Clone the source code. - git clone https://github.com/AutoGPTQ/AutoGPTQ.git - cd AutoGPTQ - - # Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device. - PYTORCH_ROCM_ARCH=gfx942 ROCM_VERSION=6.1 pip install . - - # Show the package after the installation - -#. Run ``pip show auto-gptq`` to print information for the installed ``auto-gptq`` package. Its output should look like - this: - - .. code-block:: shell - - Name: auto-gptq - Version: 0.8.0.dev0+rocm6.1 - ... - -Using GPTQ with AutoGPTQ ------------------------- - -#. Run the following code snippet. - - .. code-block:: python - - from transformers import AutoTokenizer, TextGenerationPipeline - from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig - base_model_name = "NousResearch/Llama-2-7b-hf" - quantized_model_name = "llama-2-7b-hf-gptq" - tokenizer = AutoTokenizer.from_pretrained(base_model_name, use_fast=True) - examples = [ - tokenizer( - "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm." - ) - ] - print(examples) - - The resulting examples should be a list of dictionaries whose keys are ``input_ids`` and ``attention_mask``. - -#. Set up the quantization configuration using the following snippet. - - .. code-block:: python - - quantize_config = BaseQuantizeConfig( - bits=4, # quantize model to 4-bit - group_size=128, # it is recommended to set the value to 128 - desc_act=False, - ) - -#. Load the non-quantized model using the AutoGPTQ class and run the quantization. - - .. code-block:: python - - # Import auto_gptq class. - from auto_gptq import AutoGPTQForCausalLM - - # Load non-quantized model. - base_model = AutoGPTQForCausalLM.from_pretrained(base_model_name, quantize_config, device_map = "auto") - base_model.quantize(examples) - - # Save quantized model. - base_model.save_quantized(quantized_model_name) - -Using GPTQ with Hugging Face Transformers ------------------------------------------- - -#. 
To perform a GPTQ quantization using Hugging Face Transformers, you need to create a ``GPTQConfig`` instance and set the - number of bits to quantize to and a dataset to calibrate the weights. - - .. code-block:: python - - from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig - - base_model_name = "NousResearch/Llama-2-7b-hf" - tokenizer = AutoTokenizer.from_pretrained(base_model_name) - gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer) - -#. Load a model to quantize using ``AutoModelForCausalLM`` and pass the - ``gptq_config`` to its ``from_pretrained`` method. Set ``device_map="auto"`` to - automatically offload the model to available GPU resources. - - .. code-block:: python - - quantized_model = AutoModelForCausalLM.from_pretrained( - base_model_name, - device_map="auto", - quantization_config=gptq_config) - -#. Once the model is quantized, you can push the model and tokenizer to the Hugging Face Hub for easy sharing and access. - - .. code-block:: python - - quantized_model.push_to_hub("llama-2-7b-hf-gptq") - tokenizer.push_to_hub("llama-2-7b-hf-gptq") - - Or, you can save the model locally using the following snippet. - - .. code-block:: python - - quantized_model.save_pretrained("llama-2-7b-gptq") - tokenizer.save_pretrained("llama-2-7b-gptq") - -ExLlama-v2 support ------------------- - -ExLlama is a Python/C++/CUDA implementation of the Llama model that is -designed for faster inference with 4-bit GPTQ weights. The ExLlama -kernel is activated by default when users create a ``GPTQConfig`` object. To -boost inference speed even further on Instinct accelerators, use the ExLlama-v2 -kernels by configuring the ``exllama_config`` parameter as follows. - -.. code-block:: python - - from transformers import AutoModelForCausalLM, GPTQConfig - #pretrained_model_dir = "meta-llama/Llama-2-7b" - base_model_name = "NousResearch/Llama-2-7b-hf" - gptq_config = GPTQConfig(bits=4, dataset="c4", exllama_config={"version":2}) - quantized_model = AutoModelForCausalLM.from_pretrained( - base_model_name, - device_map="auto", - quantization_config=gptq_config) - -bitsandbytes -============ - -The `ROCm-aware bitsandbytes `_ library is -a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication, and -8-bit and 4-bit quantization functions. The library includes quantization primitives for 8-bit and 4-bit operations -through ``bitsandbytes.nn.Linear8bitLt`` and ``bitsandbytes.nn.Linear4bit`` and 8-bit optimizers through the -``bitsandbytes.optim`` module. These modules are supported on AMD Instinct accelerators. - -Installing bitsandbytes ------------------------ - -#. To install bitsandbytes for ROCm 6.0 (and later), use the following commands. - - .. code-block:: shell - - # Clone the GitHub repo - git clone --recurse https://github.com/ROCm/bitsandbytes.git - cd bitsandbytes - git checkout rocm_enabled_multi_backend - - # Install dependencies - pip install -r requirements-dev.txt - - # Use -DBNB_ROCM_ARCH to specify the target GPU arch - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S . - - # Compile the project - make - - # Install - python setup.py install - -#. Run ``pip show bitsandbytes`` to show information about the installed bitsandbytes package. Its output should - look like the following. - - .. code-block:: shell - - Name: bitsandbytes - Version: 0.44.0.dev0 - ... 
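- - As a quick sanity check (optional, and assuming the build above completed successfully), you can confirm that the package imports and reports its version from Python: - - .. code-block:: python - -    # Verify that the ROCm build of bitsandbytes is importable. -    import bitsandbytes as bnb - -    print(bnb.__version__)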
- -Using bitsandbytes primitives ----------------------------- - -To get started with bitsandbytes primitives, use the following code as a reference. - -.. code-block:: python - - import bitsandbytes as bnb - - # Use Int8 matrix multiplication - bnb.matmul(..., threshold=6.0) - - # Use bitsandbytes 8-bit optimizers - adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995)) - -Using bitsandbytes with Hugging Face Transformers -------------------------------------------------- - -To load a Transformers model in 4-bit, set ``load_in_4bit=True`` in ``BitsAndBytesConfig``. - -.. code-block:: python - - from transformers import AutoModelForCausalLM, BitsAndBytesConfig - - base_model_name = "NousResearch/Llama-2-7b-hf" - quantization_config = BitsAndBytesConfig(load_in_4bit=True) - bnb_model_4bit = AutoModelForCausalLM.from_pretrained( - base_model_name, - device_map="auto", - quantization_config=quantization_config) - - # Check the memory footprint with the get_memory_footprint method - print(bnb_model_4bit.get_memory_footprint()) - -To load a model in 8-bit for inference, use the ``load_in_8bit`` option. - -.. code-block:: python - - from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - - base_model_name = "NousResearch/Llama-2-7b-hf" - - tokenizer = AutoTokenizer.from_pretrained(base_model_name) - quantization_config = BitsAndBytesConfig(load_in_8bit=True) - bnb_model_8bit = AutoModelForCausalLM.from_pretrained( - base_model_name, - device_map="auto", - quantization_config=quantization_config) - - prompt = "What is a large language model?" - inputs = tokenizer(prompt, return_tensors="pt").to("cuda") - generated_ids = bnb_model_8bit.generate(**inputs) - outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) - diff --git a/docs/how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel.rst b/docs/how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel.rst deleted file mode 100644 index 44633164f..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel.rst +++ /dev/null @@ -1,29 +0,0 @@ -.. meta:: - :description: How to optimize Triton kernels for ROCm. - :keywords: ROCm, LLM, fine-tuning, usage, MI300X, tutorial, Triton, kernel, performance, optimization - -************************* -Optimizing Triton kernels -************************* - -This section introduces the general steps for -`Triton `_ kernel optimization. Broadly, -Triton kernel optimization is similar to :doc:`HIP ` -and CUDA kernel optimization. - -Refer to the -:ref:`Triton kernel performance optimization ` -section of the :doc:`workload` guide -for detailed information. - -Triton kernel performance optimization includes the following topics. - -* :ref:`mi300x-autotunable-kernel-config` - -* :ref:`mi300x-mlir-analysis` - -* :ref:`mi300x-assembly-analysis` - -* :ref:`mi300x-torchinductor-tuning` - -* :ref:`mi300x-compute-kernel-occ` diff --git a/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md b/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md deleted file mode 100644 index cc68823f3..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md +++ /dev/null @@ -1,485 +0,0 @@ ---- -myst: - html_meta: - "description": "How to optimize machine learning workloads with Composable Kernel (CK)." 
- "keywords": "mixed, precision, kernel, inference, linear, algebra, ck, GEMM" ---- - -# Optimizing with Composable Kernel - -The AMD ROCm Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions. - -This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X accelerators using CK. - -## High-level overview: a CK GEMM instance - -GEMM is a fundamental block in linear algebra, machine learning, and deep neural networks. It is defined as the operation: -{math}`E = α \times (A \times B) + β \times (D)`, with A and B as matrix inputs, α and β as scalar inputs, and D as a pre-existing matrix. -Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct accelerators for the computation of GEMM. The implementation of the instance contains two phases: - -- [Template parameter definition](#template-parameter-definition) -- [Instantiating and running the templated kernel](#instantiating-and-running-the-templated-kernel) - -### Template parameter definition - -The template parameters of the instance are grouped into four parameter types: - -- [Parameters for determining matrix data precision](matrix-data-precision) -- [Parameters for determining matrix data layout](matrix-data-layout) -- [Parameters for determining extra operations on matrix elements](matrix-element-operation) -- [Performance-oriented tunable parameters](tunable-parameters) - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-template_parameters.jpg -The template parameters of the selected GEMM kernel are classified into four groups. These template parameter groups should be defined properly before running the instance. -``` - -(matrix-data-precision)= - -#### Matrix data precision - -A, B, D, and E are defined as half-precision floating-point datatypes. The multiply-add results of matrix A and B are added with a pre-existing matrix D (half-precision), and the final GEMM results are also half-precision floating-points. - -```c++ -using ADataType = F16; -using BDataType = F16; -using AccDataType = F32; -using CShuffleDataType = F16; -using DDataType = F16; -using EDataType = F16; -``` - -`ADataType` and `BDataType` denote the data precision of the A and B input matrices. `AccDataType` determines the data precision used for representing the multiply-add results of A and B elements. These results are stored in a `CShuffle` module in local data share (LDS), a low-latency and high-bandwidth explicitly-addressed memory used for synchronization within a workgroup LDS for later use. - -`CShuffleDataType` denotes the data precision of `CShuffle` in LDS. - -`DDataType` denotes the data precision of the pre-existing D matrix stored in GPU global memory, while `EDatatype` denotes the data precision of the final output. 
The CK kernel supports a fusion strategy so that `CShuffle` can be added with a single pre-existing matrix in the same GPU kernel for better performance. - -(matrix-data-layout)= - -#### Matrix data layout - -```c++ -using ALayout = Row; -using BLayout = Col; -using DLayout = Row; -using ELayout = Row; -``` - -Following the convention of various linear algebra libraries, CK assumes that the input matrix A is an M x K matrix, meaning the matrix has M rows and K columns. Similarly, matrix B is assumed to be K x N, meaning it has K rows and N columns. In computing, row-major order and column-major order are commonly used ways to store matrices in linear storage. After understanding the matrix storage pattern, the underlying optimized memory access manner can be applied to achieve better performance depending on the storage ordering of these matrices. - -(matrix-element-operation)= - -#### Matrix element operation - -```c++ -using AElementOp = PassThrough; -using BElementOp = PassThrough; -using CDEElementOp = AddRelu; -``` - -CK supports the pre-processing of the matrix before calculating GEMM, that is, `C = AElementOp(A) * BElementOp(B)`. It similarly supports the post-processing of GEMM results the same way, that is, `E = CDEElementOp(C, D)`. - -`AElementOp` and `BElementOp` determine the operation applied to matrix A and B separately before GEMM, which is achieved by binding the operation with a C++ struct function. - -The above `PassThrough` denotes no operations are performed on the target matrix. `CDEELementOp` determines the operations applied to `CShuffle` output and matrix D. The following binding struct `AddRelu` shows an example of adding the `CShuffle` output and matrix D, and ReLU (Rectified Linear Unit) operations to the addition result. It then passes the results to matrix E. - -```c++ -struct AddRelu -{ - __host__ __device__ void operator()(ck::half_t& e, const ck::half_t& c, const ck::half_t& d) const - { - const ck::half_t x = c + d; - e = x > 0 ? x : 0; - } -}; -``` - -(tunable-parameters)= - -#### Tunable parameters - -The CK instance includes a series of tunable template parameters to control the parallel granularity of the workload to achieve load balancing on different hardware platforms. - -These parameters include Block Size, M/N/K Per Block, M/N per XDL, AK1, BK1, etc. - -- Block Size determines the number of threads in the thread block. -- M/N/K Per Block determines the size of tile that each thread block is responsible for calculating. -- M/N Per XDL refers to M/N size for Instinct accelerator Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis. -- A/B K1 is related to the data type. It can be any value ranging from 1 to K Per Block. To achieve the optimal load/store performance, 128bit per load is suggested. In addition, the A/B loading parameters must be changed accordingly to match the A/B K1 value; otherwise, it will result in compilation errors. - -Conditions for achieving computational load balancing on different hardware platforms can vary. - -### Instantiating and running the templated kernel - -After determining the template parameters, we instantiate the kernel with actual arguments. Do one of the following: - -- Use `GetDeviceBuffer` from CK’s custom struct `DeviceMem` to pass the element values of the matrices that need to be calculated. -- Allocate device buffer via `hipMalloc`. Ensure the device buffer size can fit the matrix size. 
-- Pass matrix elements through the `data_ptr` method in the `Tensor` object if the matrix to be calculated is of `Tensor` type. - -The row and column, and stride information of input matrices are also passed to the instance. For batched GEMM, you must pass in additional batch count and batch stride values. The extra operations for pre and post-processing are also passed with an actual argument; for example, α and β for GEMM scaling operations. Afterward, the instantiated kernel is launched by the invoker, as illustrated in Figure 3. - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-kernel_launch.jpg -Templated kernel launching consists of kernel instantiation, making arguments by passing in actual application parameters, creating an invoker, and running the instance through the invoker. -``` - -## Developing fused INT8 kernels for SmoothQuant models - -[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct accelerators are shown in the following table. - -:::{table} Functionalities used to implement SmoothQuant model inference. - -| Functionality descriptions | Corresponding wrappers | -|:-------------------------------------|-----------------------------------------| -| {math}`E = α \times (A \times B) + β \times (D)`, where A, B, D, E are INT8 2-D tensors; | E = Linear_ABDE_I8(A, B, D, {math}`\alpha`, {math}`\beta`) | -| {math}`E = RELU (α \times (A \times B) + β \times (D))`, where A, B, D, E are INT8 2-D tensors; | E = Linear_ReLU_ABDE_I8(A, B, D, {math}`\alpha`, {math}`\beta`) | -| {math}`E = α \times (A \times B) + β \times (D)`, where A, B are INT8 2-D tensors, D and E are FP32 2-D tensors; | E = Linear_AB_I8_DE_F32(A, B, D, {math}`\alpha`, {math}`\beta`) | -| {math}`E = α \times (A \times B)`, where A, B, E are INT8 3-D tensors; | E = BMM_ABE_I8(A, B, {math}`\alpha`) | -| {math}`E = α \times (A \times B)`, where A, B are INT8 3-D tensors, E is FP32 3-D tensor; | E = BMM_AB_I8_E_F32(A, B, {math}`\alpha`) | -::: - -### Operation flow analysis - -The following section discusses the analysis of the operation flow of `Linear_ReLU_ABDE_I8`. The rest of the wrappers in Table 1 can be analyzed similarly. - -The first operation in the process is to perform the multiplication of input matrices A and B. The resulting matrix C is then scaled with α to obtain T1. At the same time, the process performs a scaling operation on D elements to obtain T2. Afterward, the process performs matrix addition between T1 and T2, element activation calculation using ReLU, and element rounding sequentially. The operations to generate E1, E2, and E are encapsulated and completed by a user-defined template function in CK (given in the next sub-section). This template function is integrated into the fundamental instance directly during the compilation phase so that all these steps can be fused in a single GPU kernel. - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-operation_flow.jpg -Operation flow. -``` - -The CK library contains many fundamental instances that implement different functions. Familiarize yourself with the names of various CK instances and determine whether they meet the target functional requirements. - -Second, consider whether the format of input data meets your actual calculation needs. 
For SQ models, the 8-bit integer data format (INT8) is applied for matrix calculations. - -Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct accelerators after being compiled and cannot run on Radeon-series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances. - -Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_kernel/tree/develop/example/24_batched_gemm) as the fundamental instance to implement the functionalities in the previous table. - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-root_instance.jpg -Use the ‘DeviceBatchedGemmMultiD_Xdl’ instance as a root. -``` - -The `DeviceBatchedGemmMultiD_Xdl` instance realizes the batched GEMM `BMM_ABE_I8` and `BMM_AB_I8_E_F32` kernels directly by using the proper input and output data precision types. - -Based on the two batched GEMM kernels, GEMM kernel `Linear_ABDE_I8` and `Linear_AB_I8_DE_F32` can be implemented by expanding their input 2-D tensors to 3-D tensors. Then, the 3-D output tensors produced by the root instance are squeezed back to 2-D output tensors before returning back. - -For example, unsqueeze A (M, K) to A (1, M, K) before assigning it into the root instance and squeeze E (1, M, N) to (M, N) after the calculations of the root instance return back. `Linear_ReLU_ABDE_I8` is implemented by adding a ReLU operation on the result output of `Linear_ABDE_I8`. - -### Developing the complete function - -The inference of SQ quantized models relies on using PyTorch and Transformer libraries, and a tensor type is used to represent matrices and vectors in `torch`, the C++ data types in CK need to be replaced with the `torch::tensor` type. The data types of the input and output matrices should be a `tensor` type. - -In GEMM, the A and B inputs are two-dimensional matrices, and the required input matrices of the selected fundamental CK instance are three-dimensional matrices. Therefore, we must convert the input 2-D tensors to 3-D tensors, by using `tensor`'s `unsqueeze()` method before passing these matrices to the instance. For batched GEMM in the preceding table, ignore this step. - -```c++ -// Function input and output -torch::Tensor linear_relu_abde_i8( - torch::Tensor A_, - torch::Tensor B_, - torch::Tensor D_, - float alpha, - float beta) -{ - // Convert torch::Tensor A_ (M, K) to torch::Tensor A (1, M, K) - auto A = A_.unsqueeze(0); - - // Convert torch::Tensor B_ (K, N) to torch::Tensor A (1, K, N) - auto B = B_.unsqueeze(0); -... -``` - -As shown in the following code block, we obtain M, N, and K values using input tensor size values. This stride size information is used to reshape the input vector D and allocate the storage space of tensor E. Stride reflects the exact size of continuous elements in memory, which are passed as important parameters to the fundamental instance for GPU kernel use. 
- -```c++ - // Return the batch count from the size of dimension 0 - int batch_count = A.size(0); - - // Return the M, N, K from the size of dimension 1 & 2 - int M = A.size(1); - int N = B.size(1); - int K = A.size(2); - - // Initialize the stride size for A, B, D and E - int stride_A = K; - int stride_B = K; - int stride_D0 = N; - int stride_E = N; - - // Initialize the stride size for batched A, B, D and E - long long int batch_stride_A = M * K; - long long int batch_stride_B = K * N; - long long int batch_stride_D0 = M * N; - long long int batch_stride_E = M * N; - - // Convert the tensor of 2-D to 3-D - auto D = D_.view({1,-1}).repeat({M, 1}); - - // Allocate memory for E - auto E = torch::empty({batch_count, M, N}, - torch::dtype(torch::kInt8).device(A.device())); -``` - -In the following code block, `ADataType`, `BDataType` and `D0DataType` are used to denote the data precision of the input tensors A, B and D, respectively. `EDataType` is used to denote the data precision of output tensor E. These parameters are specified to `I8` data format (8-bit integer data format) to meet the kernel's design requirements. - -`AccDataType` determines the data precision used to represent the multiply-add results of A and B elements. Generally, a larger range data type is applied to store the multiply-add results of A and B to avoid result overflow; `I32` is applied in this case. The `CShuffleDataType I32` data type indicates that the multiply-add results continue to be stored in LDS as an `I32` data format. All of this is implemented through the following code block. - -```c++ - // Data precision - using ADataType = I8; - using BDataType = I8; - using AccDataType = I32; - using CShuffleDataType = I32; - using D0DataType = I8; - using DsDataType = ck::Tuple; - using EDataType = I8; -``` - -Following the convention of various linear algebra libraries, row-major and column-major orders are used to denote the ways of storing matrices in linear storage. The advantage of specifying matrix B as column major is that all the relevant matrix elements are stored continuously in GPU global memory when a row in A is multiplied by a column in B, which can help GPU achieve data consistency access to improve access performance. - -```c++ - // Specify tensor order - using ALayout = RowMajor; - using BLayout = ColumnMajor; - using D0Layout = RowMajor; - using DsLayout = ck::Tuple; - using ELayout = RowMajor; -``` - -In CK, `PassThrough` is a struct denoting if an operation is applied to the tensor it binds to. To fuse the operations between E1, E2, and E introduced in section [Operation flow analysis](#operation-flow-analysis), we define a custom C++ struct, `ScaleScaleAddRelu`, and bind it to `CDEELementOp`. It determines the operations that will be applied to `CShuffle` (A×B results), tensor D, α, and β. - -```c++ - // No operations bound to the elements of A and B - using AElementOp = PassThrough; - using BElementOp = PassThrough; - - // Operations bound to the elements of C, D and E - using CDEElementOp = ScaleScaleAddRelu; -``` - -In the binding struct, `operator()` performs an addition operation between `CShuffle` and matrix D, a ReLU operation on the addition results, and a rounding operation on the output elements. It then returns the results to E. 
- -```c++ -struct ScaleScaleAddRelu { - - template <> - __host__ __device__ constexpr void - operator()(I8& e, const I32& c, const I8& d) const - { - // Scale AxB result with alpha - const F32 c_scale = ck::type_convert(c) * alpha; - - // Scale D with beta - const F32 d_scale = ck::type_convert(d) * beta; - - // Perform addition operation - F32 temp = c_scale + d_scale; - - // Perform RELU operation - temp = temp > 0 ? temp : 0; - - // Perform rounding operation - temp = temp > 127 ? 127 : temp; - - // Return to E - e = ck::type_convert(temp); - } - - F32 alpha; - F32 beta; -}; -``` - -The original input tensors need to be padded to meet GPU tile-based parallelism. - -```c++ -static constexpr auto GemmDefault = ck::tensor_operation::device::GemmSpecialization::MNKPadding; -``` - -The template parameters of the target fundamental instance are initialized with the above parameters and includes default tunable parameters. For specific tuning methods, see [Tunable parameters](#tunable-parameters). - -```c++ -using DeviceOpInstance = ck::tensor_operation::device::DeviceBatchedGemmMultiD_Xdl< - // Tensor layout - ALayout, BLayout, DsLayout, ELayout, - // Tensor data type - ADataType, BDataType, AccDataType, CShuffleDataType, DsDataType, EDataType, - // Tensor operation - AElementOp, BElementOp, CDEElementOp, - // Padding strategy - GemmDefault, - // Tunable parameters - tunable parameters>; -``` - -Return the address of the first element of tensors: - -```c++ - auto A_ref = A.data_ptr(); - auto B_ref = B.data_ptr(); - auto D0_ref = D.data_ptr(); - auto E_ref = E.data_ptr(); -``` - -The fundamental instance is then initialized and run with actual arguments: - -```c++ - auto device_op = DeviceOpInstance{}; - auto invoker = device_op.MakeInvoker(); - auto argument = device_op.MakeArgument( - A_ref, B_ref, {D0_ref}, E_ref, - M, N, K, - batch_count, - stride_A, stride_B, {stride_D0}, stride_E, - batch_stride_A, batch_stride_B, {batch_stride_D0}, batch_stride_E, - AElementOp{}, BElementOp{}, CDEElementOp{alpha, beta}); - -invoker.Run(argument, StreamConfig{nullptr, 0}); -``` - -The output of the fundamental instance is a calculated batched matrix E (batch, M, N). Before the return, it needs to be converted to a 2-D matrix if a normal GEMM result is required. - -```c++ -// Convert (1, M, N) to (M, N) -return E.squeeze(0); -``` - -### Binding to Python - -Since these functions are written in C++ and `torch::Tensor`, you can use `pybind11` to bind the functions and import them as Python modules. For the example, the necessary binding code for exposing the functions in the table spans but a few lines. - -```c++ -#include - -PYBIND11_MODULE(TORCH_EXTENSION_NAME, m){ - m.def("linear_ab_i8_de_f32", &linear_ab_i8_de_f32); - m.def("linear_relu_abde_i8", &linear_relu_abde_i8); - m.def("linear_abde_i8", &linear_abde_i8); - m.def("bmm_abe_i8", &bmm_abe_i8); - m.def("bmm_ab_i8_e_f32", &bmm_ab_i8_e_f32); -} -``` - -Build the C++ extension by writing a `setup.py` script that uses `setuptools` to compile the C++ code. A reference implementation of the `setup.py` script is as follows. 
- -```python -import os -from setuptools import setup, find_packages -from torch.utils import cpp_extension -from torch.utils.cpp_extension import BuildExtension - -os.environ["CC"] = "hipcc" -os.environ["CXX"] = "hipcc" - -sources = [ - 'torch_int/kernels/linear.cpp', - 'torch_int/kernels/bmm.cpp', - 'torch_int/kernels/pybind.cpp', -] - -include_dirs = ['torch_int/kernels/include'] -extra_link_args = ['libutility.a'] -extra_compile_args = ['-O3','-DNDEBUG', '-std=c++17', '--offload-arch=gfx942', '-DCK_ENABLE_INT8', '-D__HIP_PLATFORM_AMD__=1'] - -setup( - name='torch_int', - ext_modules=[ - cpp_extension.CUDAExtension( - name='torch_int.rocm', - sources=sources, - include_dirs=include_dirs, - extra_link_args=extra_link_args, - extra_compile_args=extra_compile_args - ), - ], - cmdclass={ - 'build_ext': BuildExtension.with_options(use_ninja=False) - }, - packages=find_packages( - exclude=['notebook', 'scripts', 'tests']), -) -``` - -Run `python setup.py install` to build and install the extension. It should look something like Figure 6: - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-compilation.jpg -Compilation and installation of the INT8 kernels. -``` - -### INT8 model inference and performance - -The implementation architecture of running SmoothQuant models on MI300X GPUs is illustrated in Figure 7, where (a) shows the decoder layer composition components of the target model, (b) shows the major implementation class for the decoder layer components, and \(c\) denotes the underlying GPU kernels implemented by CK instance. - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg -The implementation architecture of running SmoothQuant models on AMD MI300X accelerators. -``` - -For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-smoothquant), each decoder layer contains three major components: attention calculation, layer normalization, and linear transformation in fully connected layers. The corresponding implementation classes for these components are: - -- `Int8OPTAttention` -- `W8A8B8O8LinearReLU` -- `W8A8BF32OF32Linear` - - These classes' underlying implementation logits will harness the functions in previous table. Note that for the example, the `LayerNormQ` module is implemented by the torch native module. - -Testing environment: -The hardware platform used for testing equips with 256 AMD EPYC 9534 64-Core Processor, 8 AMD Instinct MI300X accelerators and 1.5T memory. The testing was done in a publicly available Docker image from Docker Hub: -[`rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`](https://hub.docker.com/layers/rocm/pytorch/rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2/images/sha256-f6ea7cee8aae299c7f6368187df7beed29928850c3929c81e6f24b34271d652b) - -The tested models are OPT-1.3B, 2.7B, 6.7B and 13B FP16 models and the corresponding SmoothQuant INT8 OPT models were obtained from Hugging Face. - -Note that since the default values were used for the tunable parameters of the fundamental instance, the performance of the INT8 kernel is suboptimal. - -Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). 
Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size. - - -```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg -Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. -``` - -For accuracy comparisons between the original FP16 and INT8 models, the evaluation is done by using the first 1,000 samples from the LAMBADA dataset's validation set. We employ the same Last Token Prediction Accuracy method introduced in [SmoothQuant Real-INT8 Inference for PyTorch](https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_opt_real_int8_demo.ipynb) as our evaluation metric. The comparison results are shown in Table 2. - -:::{table} The inference accuracy comparisons of SmoothQuant quantized models on Instinct MI300X. - -| Models | Hugging Face FP16 model accuracy | SmoothQuant quantized INT8 model accuracy | -|:-----------------|----------------------------------------|---------------------------------------------| -| opt-1.3B | 0.72 | 0.70 | -| opt-2.7B | 0.76 | 0.75 | -| opt-6.7B | 0.80 | 0.79 | -| opt-13B | 0.79 | 0.77 | -::: - -## Conclusion - -CK provides a rich set of template parameters for generating flexible accelerated computing kernels for difference application scenarios. - -CK supports multiple instruction sets of AMD Instinct GPUs, operator fusion and different data precisions. Its composability helps users quickly construct operator performance verification. - -With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD accelerator platforms. diff --git a/docs/how-to/rocm-for-ai/inference-optimization/profiling-and-debugging.rst b/docs/how-to/rocm-for-ai/inference-optimization/profiling-and-debugging.rst deleted file mode 100644 index 6b9b29341..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/profiling-and-debugging.rst +++ /dev/null @@ -1,29 +0,0 @@ -.. meta:: - :description: How to use ROCm profiling and debugging tools. - :keywords: ROCm, LLM, fine-tuning, usage, MI300X, tutorial, profiling, debugging, performance, Triton - -*********************** -Profiling and debugging -*********************** - -This section provides an index for further documentation on profiling and -debugging tools and their common usage patterns. - -See :ref:`AMD Instinct MI300X™ workload optimization ` -for a conceptual summary of the workload profiling workflow for ROCm applications -on AMD hardware -- including fine-tuning LLMs. - -There, you'll find information on higher-level and kernel-level profiling tools -as well as other profiling and debugging suggestions. - -* :ref:`PyTorch Profiler ` - -* :ref:`ROCm profiling tools ` - - * :ref:`ROCProfiler ` - - * :ref:`ROCm Compute Profiler ` - - * :ref:`ROCm Systems Profiler ` - -* :ref:`ROCr Debug Agent ` diff --git a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst deleted file mode 100644 index bc9463f58..000000000 --- a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst +++ /dev/null @@ -1,2083 +0,0 @@ -.. meta:: - :description: Learn about workload tuning on AMD Instinct MI300X accelerators for optimal performance. 
- :keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm, - environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL, - MIOpen, accelerator, GPU, resource utilization - -***************************************** -AMD Instinct MI300X workload optimization -***************************************** - -This document provides guidelines for optimizing the performance of AMD -Instinct™ MI300X accelerators, with a particular focus on GPU kernel -programming, high-performance computing (HPC), and deep learning operations -using PyTorch. It delves into specific workloads such as -:ref:`model inference `, offering strategies to -enhance efficiency. - -The following topics highlight :ref:`auto-tunable configurations ` -that streamline optimization as well as advanced techniques like -:ref:`Triton kernel optimization ` for -meticulous tuning. - -Workload tuning strategy -======================== - -By following a structured approach, you can systematically address -performance issues and enhance the efficiency of your workloads on AMD Instinct -MI300X accelerators. - -Measure the current workload ----------------------------- - -Begin by evaluating the performance of your workload in its current state. This -involves running benchmarks and collecting performance data to establish a -baseline. Understanding how your workload behaves under different conditions -provides critical insights into where improvements are needed. - -.. _mi300x-profiling-start: - -Identify tuning requirements ----------------------------- - -Analyze the collected performance data to identify areas where tuning is -required. This could involve detecting bottlenecks in CPU, GPU, memory, or data -transfer. Understanding these requirements will help direct your optimization -efforts more effectively. - -Profiling is a fundamental step in workload tuning. It allows you to gather -detailed information about how your workload utilizes system resources, and -where potential inefficiencies lie. Profiling tools can provide insights into -both high-level and granular performance metrics. See :ref:`mi300x-profiling-tools`. - -High-level profiling tools -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -For a broad overview, use tools like the -:ref:`PyTorch Profiler `, which helps in -understanding how PyTorch operations are executed and where time is spent. This -is particularly useful for developers new to workload tuning, as it provides a -comprehensive view without requiring in-depth knowledge of lower-level -operations. - -Kernel-level profiling tools -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -When profiling indicates that GPUs are a performance bottleneck, delve deeper -into kernel-level profiling. Tools such as the -:ref:`ROCr Debug Agent `, -:ref:`ROCProfiler `, and -:ref:`ROCm Compute Profiler ` offer detailed insights -into GPU kernel execution. These tools can help isolate problematic GPU -operations and provide data needed for targeted optimizations. - -Analyze and tune ----------------- - -Based on the insights gained from profiling, focus your tuning efforts on the -identified bottlenecks. This might involve optimizing specific kernel -operations, adjusting memory access patterns, or modifying computational -algorithms. - -The following subsections discuss optimization ranging from high-level and more -automated strategies to more involved, hands-on optimization. 
- -Optimize model inference with vLLM -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -vLLM provides tools and techniques specifically designed for efficient model -inference on AMD Instinct MI300X accelerators. See :ref:`fine-tuning-llms-vllm` -for installation guidance. Optimizing performance with vLLM -involves configuring tensor parallelism, leveraging advanced features, and -ensuring efficient execution. Here’s how to optimize vLLM performance: - -* Tensor parallelism: Configure the - :ref:`tensor-parallel-size parameter ` to distribute - tensor computations across multiple GPUs. Adjust parameters such as - ``batch-size``, ``input-len``, and ``output-len`` based on your workload. - -* Configuration for vLLM: Set :ref:`parameters ` - according to workload requirements. Benchmark performance to understand - characteristics and identify bottlenecks. - -* Benchmarking and performance metrics: Measure latency and throughput to - evaluate performance. - -.. _mi300x-auto-tune: - -Auto-tunable configurations -^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Auto-tunable configurations can significantly streamline performance -optimization by automatically adjusting parameters based on workload -characteristics. For example: - -* PyTorch: Utilize :ref:`PyTorch’s built-in auto-tuning features `, - such as the :ref:`TunableOp ` module, which helps in - optimizing operation performance by exploring different configurations. - -* MIOpen: Leverage :ref:`MIOpen’s auto-tuning capabilities ` - for convolutional operations and other primitives to find optimal settings for - your specific hardware. - -* Triton: Use :ref:`Triton’s auto-tuning features ` - to explore various kernel configurations and automatically select the - best-performing ones. - -Manual tuning -^^^^^^^^^^^^^ - -Advanced developers can manually adjust parameters and configurations to -optimize performance. Both Triton and HIP involve manual tuning aspects. - -* ROCm libraries: Optimize GPU performance by adjusting various parameters and - configurations within :ref:`ROCm libraries `. This - approach involves hands-on optimization to maximize efficiency for specific - workloads. - -* Triton: Tune Triton kernels by adjusting parameters tailored to - your workload to - :ref:`optimize GPU resource utilization ` and - better :ref:`leverage specific hardware features `. - -* HIP: Profile and :ref:`optimize HIP kernels ` by - optimizing parallel execution, memory access patterns, and other aspects. - -Iterate and validate --------------------- - -Optimization is an iterative process. After applying tuning changes, re-profile -the workload to validate improvements and ensure that the changes have had the -desired effect. Continuous iteration helps refine the performance gains and -address any new bottlenecks that may emerge. - -ROCm provides a prebuilt optimized Docker image that has everything required to implement -the LLM inference tips in this section. It includes ROCm, PyTorch, and vLLM. -For more information, see :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. - -.. _mi300x-profiling-tools: - -Profiling tools -=============== - -AMD profiling tools provide valuable insights into how efficiently your -application utilizes hardware and help diagnose potential bottlenecks that -contribute to poor performance. Developers targeting AMD GPUs have multiple -tools available depending on their specific profiling needs. - -* ROCProfiler tool collects kernel execution performance - metrics. For more information, see the - :doc:`ROCProfiler ` - documentation. 
- -* ROCm Compute Profiler builds upon ROCProfiler but provides more guided analysis. - For more information, see - :doc:`ROCm Compute Profiler documentation `. - -Refer to :doc:`profiling-and-debugging` -to explore commonly used profiling tools and their usage patterns. - -Once performance bottlenecks are identified, you can implement an informed workload -tuning strategy. If kernels are the bottleneck, consider: - -* :ref:`Auto-tuning in PyTorch with TunableOp ` - -* :ref:`Auto-tuning in MIOpen ` - -* :ref:`Triton auto-tunable kernel configurations ` - -If auto-tuning does not meet your requirements, consider -:ref:`mi300x-triton-kernel-performance-optimization`. - -If the issue is multi-GPU scale-out, try -:ref:`RCCL tuning and configuration `. - -This section discusses profiling and debugging tools and some of their common usage patterns with ROCm applications. - -.. _mi300x-pytorch-profiler: - -PyTorch Profiler ----------------- - -`PyTorch Profiler `_ can be invoked inside Python scripts, letting you -collect CPU and GPU performance metrics while the script is running. See the `PyTorch Profiler tutorial -`_ for more information. - -You can then visualize and view these metrics using an open-source profile visualization tool like -`Perfetto UI `_. - -#. Use the following snippet to invoke PyTorch Profiler in your code. - - .. code-block:: python - - import torch - import torchvision.models as models - from torch.profiler import profile, record_function, ProfilerActivity - model = models.resnet18().cuda() - inputs = torch.randn(2000, 3, 224, 224).cuda() - - with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof: - with record_function("model_inference"): - model(inputs) - prof.export_chrome_trace("resnet18_profile.json") - -#. Profile results in ``resnet18_profile.json`` can be viewed by the Perfetto visualization tool. Go to - ``__ and import the file. In your Perfetto visualization, you'll see that the upper section - shows transactions denoting the CPU activities that launch GPU kernels while the lower section shows the actual GPU - activities where it processes the ``resnet18`` inferences layer by layer. - - .. figure:: ../../../data/how-to/tuning-guides/perfetto-trace.svg - :width: 800 - - Perfetto trace visualization example. - -ROCm profiling tools --------------------- - -Heterogenous systems, where programs run on both CPUs and GPUs, introduce additional complexities. Understanding the -critical path and kernel execution is all the more important. So, performance tuning is a necessary component in the -benchmarking process. - -With AMD's profiling tools, developers are able to gain important insight into how efficiently their application is -using hardware resources and effectively diagnose potential bottlenecks contributing to poor performance. Developers -working with AMD Instinct accelerators have multiple tools depending on their specific profiling needs; these include: - -* :ref:`ROCProfiler ` - -* :ref:`ROCm Compute Profiler ` - -* :ref:`ROCm Systems Profiler ` - -.. _mi300x-rocprof: - -ROCProfiler -^^^^^^^^^^^ - -:doc:`ROCProfiler ` is primarily a low-level API for accessing and extracting GPU hardware performance -metrics, commonly called *performance counters*. These counters quantify the performance of the underlying architecture -showcasing which pieces of the computational pipeline and memory hierarchy are being utilized. 
- -Your ROCm installation contains a script or executable command called ``rocprof`` which provides the ability to list all -available hardware counters for your specific accelerator or GPU, and run applications while collecting counters during -their execution. - -This ``rocprof`` utility also depends on the :doc:`ROCTracer and ROC-TX libraries `, giving it the -ability to collect timeline traces of the accelerator software stack as well as user-annotated code regions. - -.. note:: - - ``rocprof`` is a CLI-only utility where inputs and outputs take the form of text and CSV files. These - formats provide a raw view of the data and puts the onus on the user to parse and analyze. ``rocprof`` - gives the user full access and control of raw performance profiling data, but requires extra effort to analyze the - collected data. - -.. _mi300x-rocprof-compute: - -ROCm Compute Profiler -^^^^^^^^^^^^^^^^^^^^^ - -:doc:`ROCm Compute Profiler ` is a system performance profiler for high-performance computing (HPC) and -machine learning (ML) workloads using Instinct accelerators. Under the hood, ROCm Compute Profiler uses -:ref:`ROCProfiler ` to collect hardware performance counters. The ROCm Compute Profiler tool performs -system profiling based on all approved hardware counters for Instinct -accelerator architectures. It provides high level performance analysis features including System Speed-of-Light, IP -block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more. - -ROCm Compute Profiler takes the guesswork out of profiling by removing the need to provide text input files with lists of counters -to collect and analyze raw CSV output files as is the case with ROCProfiler. Instead, ROCm Compute Profiler automates the collection -of all available hardware counters in one command and provides graphical interfaces to help users understand and -analyze bottlenecks and stressors for their computational workloads on AMD Instinct accelerators. - -.. note:: - - ROCm Compute Profiler collects hardware counters in multiple passes, and will therefore re-run the application during each pass - to collect different sets of metrics. - -.. figure:: ../../../data/how-to/tuning-guides/rocprof-compute-analysis.png - :width: 800 - - ROCm Compute Profiler memory chart analysis panel. - -In brief, ROCm Compute Profiler provides details about hardware activity for a particular GPU kernel. It also supports both -a web-based GUI or command-line analyzer, depending on your preference. - -.. _mi300x-rocprof-systems: - -ROCm Systems Profiler -^^^^^^^^^^^^^^^^^^^^^ - -:doc:`ROCm Systems Profiler ` is a comprehensive profiling and tracing tool for parallel applications, -including HPC and ML packages, written in C, C++, Fortran, HIP, OpenCL, and Python which execute on the CPU or CPU and -GPU. It is capable of gathering the performance information of functions through any combination of binary -instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks. - -ROCm Systems Profiler supports interactive visualization of comprehensive traces in the web browser in addition to high-level -summary profiles with ``mean/min/max/stddev`` statistics. Beyond runtime -information, ROCm Systems Profiler supports the collection of system-level metrics such as CPU frequency, GPU temperature, and GPU -utilization. Process and thread level metrics such as memory usage, page faults, context switches, and numerous other -hardware counters are also included. - -.. 
tip:: - - When analyzing the performance of an application, it is best not to assume you know where the performance - bottlenecks are and why they are happening. ROCm Systems Profiler is the ideal tool for characterizing where optimization would - have the greatest impact on the end-to-end execution of the application and to discover what else is happening on the - system during a performance bottleneck. - -.. figure:: ../../../data/how-to/tuning-guides/rocprof-systems-timeline.png - :width: 800 - - ROCm Systems Profiler timeline trace example. - -.. _mi300x-vllm-optimization: - -vLLM performance optimization -============================= - -vLLM is a high-throughput and memory efficient inference and serving engine for large language models that has gained traction in the AI community for -its performance and ease of use. See :ref:`fine-tuning-llms-vllm` for a primer on vLLM with ROCm. - -Performance environment variables ---------------------------------- - -The following performance tips are not *specific* to vLLM -- they are general -but relevant in this context. You can tune the following vLLM parameters to -achieve optimal request latency and throughput performance. - -* As described in `Environment variables (MI300X) - `_, - the environment variable ``HIP_FORCE_DEV_KERNARG`` can improve vLLM - performance. Set it to ``export HIP_FORCE_DEV_KERNARG=1``. - -* Set the :ref:`RCCL environment variable ` ``NCCL_MIN_NCHANNELS`` - to ``112`` to increase the number of channels on MI300X to potentially improve - performance. - -* Set the environment variable ``TORCH_BLAS_PREFER_HIPBLASLT=1`` to use hipBLASLt to improve performance. - -Auto-tuning using PyTorch TunableOp ------------------------------------- - -Since vLLM is based on the PyTorch framework, PyTorch TunableOp can be used for auto-tuning. -You can run auto-tuning with TunableOp in two simple steps without modifying your code: - -* Enable TunableOp and tuning. Optionally, enable verbose mode: - - .. code-block:: shell - - PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 your_vllm_script.sh - -* Enable TunableOp and disable tuning and measure. - - .. code-block:: shell - - PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_vllm_script.sh - -Learn more about TunableOp in the :ref:`PyTorch TunableOp ` section. - -Performance tuning based on vLLM engine configurations -------------------------------------------------------- - -The following subsections describe vLLM-specific configurations for performance tuning. -You can tune the following vLLM parameters to achieve optimal performance. - -* ``tensor_parallel_size`` - -* ``gpu_memory_utilization`` - -* ``dtype`` - -* ``enforce_eager`` - -* ``kv_cache_dtype`` - -* ``input_len`` - -* ``output_len`` - -* ``max_num_seqs`` - -* ``num_scheduler_steps`` - -* ``max_model_len`` - -* ``enable_chunked_prefill`` - -* ``distributed_executor_backend`` - -* ``max_seq_len_to_capture`` - -Refer to `vLLM documentation `_ -for additional performance tips. :ref:`fine-tuning-llms-vllm` describes vLLM -usage with ROCm. - -ROCm provides a prebuilt optimized Docker image for validating the performance -of LLM inference with vLLM on MI300X series accelerators. The Docker image includes -ROCm, vLLM, and PyTorch. For more information, see -:doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. - -.. 
_mi300x-vllm-throughput-measurement: - -Evaluating performance by throughput measurement -------------------------------------------------- - -This tuning guide evaluates the performance of LLM inference workloads by measuring throughput in tokens per second (TPS). Throughput can be assessed using both real-world and synthetic data, depending on your evaluation goals. - -Refer to the benchmarking script located at ``benchmarks/benchmark_throughput.py`` in the `vLLM repository `_. -Use this script to measure throughput effectively. You can assess throughput using real-world and synthetic data, depending on your evaluation goals. - -* For realistic performance evaluation, you can use datasets like Hugging Face's - ``ShareGPT_V3_unfiltered_cleaned_split.json``. This dataset includes real-world conversational - data, making it a good representation of typical use cases for language models. Download it using - the following command: - - .. code-block:: shell - - wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json - -* For standardized benchmarking, you can set fixed input and output token - lengths. Synthetic prompts provide consistent benchmarking runs, making it - easier to compare performance across different models or configurations. - Additionally, a controlled environment simplifies analysis. - -By balancing real-world data and synthetic data approaches, you can get a well-rounded understanding of model performance in varied scenarios. - -.. _mi300x-vllm-single-node: - -Maximizing vLLM instances on a single node ------------------------------------------- - -The general guideline is to maximize per-node throughput by running as many vLLM instances as possible. -However, running too many instances might lead to insufficient memory for the KV-cache, which can affect performance. - -The Instinct MI300X accelerator is equipped with 192GB of HBM3 memory capacity and bandwidth. -For models that fit in one GPU -- to maximize the accumulated throughput -- you can run as many as eight vLLM instances -simultaneously on one MI300X node (with eight GPUs). To do so, use the GPU isolation environment -variable ``CUDA_VISIBLE_DEVICES``. - -For example, this script runs eight instances of vLLM for throughput benchmarking at the same time -with a model that can fit in one GPU: - -.. code-block:: shell - - for i in $(seq 0 7); - do - CUDA_VISIBLE_DEVICES="$i" python3 /app/vllm/benchmarks/benchmark_throughput.py -tp 1 --dataset "/path/to/dataset/ShareGPT_V3_unfiltered_cleaned_split.json" --model /path/to/model & - done - -The total throughput achieved by running ``N`` instances of vLLM is generally much higher than running a -single vLLM instance across ``N`` GPUs simultaneously (that is, configuring ``tensor_parallel_size`` as N or -using the ``-tp`` N option, where ``1 < N ≤ 8``). - -vLLM on MI300X accelerators can run a variety of model weights, including Llama 2 (7b, 13b, 70b), Llama 3 (8b, 70b), Qwen2 (7b, 72b), Mixtral-8x7b, Mixtral-8x22b, and so on. -Notable configurations include Llama2-70b and Llama3-70b models on a single MI300X GPU, and the Llama3.1 405b model can fit on one single node with 8 MI300X GPUs. - -.. _mi300x-vllm-gpu-memory-utilization: - -Configure the gpu_memory_utilization parameter ----------------------------------------------- - -There are two ways to increase throughput by configuring ``gpu-memory-utilization`` parameter. - -1. 
Increase ``gpu-memory-utilization`` to improve the throughput for a single instance as long as - it does not incur HIP or CUDA Out Of Memory. The default ``gpu-memory-utilization`` is 0.9. - You can set it to ``>0.9`` and ``<1``. - - For example, below benchmarking command set the ``gpu-memory-utilization`` as 0.98, or 98%. - - .. code-block:: shell - - /vllm-workspace/benchmarks/benchmark_throughput.py --gpu-memory-utilization 0.98 --input-len 1024 --output-len 128 --model /path/to/model - -2. Decrease ``gpu-memory-utilization`` to maximize the number of vLLM instances on the same GPU. - - Specify GPU memory utilization to run as many instances of vLLM as possible on a single - GPU. However, too many instances can result in no memory for KV-cache. For small models, run - multiple instances of vLLM on the same GPU by specifying a smaller ``gpu-memory-utilization`` -- as - long as it would not cause HIP Out Of Memory. - - For example, run two instances of the Llama3-8b model at the same time on a single GPU by specifying - ``--gpu-memory-utilization`` to 0.4 (40%) as follows (on GPU ``0``): - - .. code-block:: shell - - CUDA_VISIBLE_DEVICES=0 python3 /vllm-workspace/benchmarks/benchmark_throughput.py --gpu-memory-utilization 0.4 - --dataset "/path/to/dataset/ShareGPT_V3_unfiltered_cleaned_split.json" --model /path/to/model & - - CUDA_VISIBLE_DEVICES=0 python3 /vllm-workspace/benchmarks/benchmark_throughput.py --gpu-memory-utilization 0.4 - --dataset "/path/to/dataset/ShareGPT_V3_unfiltered_cleaned_split.json" --model /path/to/model & - -See :ref:`vllm-engine-args` for other performance suggestions. - -.. _mi300x-vllm-multiple-gpus: - -Run vLLM on multiple GPUs -------------------------- - -The two main reasons to use multiple GPUs are: - -* The model size is too big to run vLLM using one GPU as it results HIP Out of Memory. - -* To achieve better latency when using a single GPU is not desirable. - -To run one vLLM instance on multiple GPUs, use the ``-tp`` or ``--tensor-parallel-size`` option to -specify multiple GPUs. Optionally, use the ``CUDA_VISIBLE_DEVICES`` environment variable to specify -the GPUs. - -For example, you can use two GPUs to start an API server on port 8000: - -.. code-block:: shell - - python -m vllm.entrypoints.api_server --model /path/to/model --dtype - float16 -tp 2 --port 8000 & - -To achieve both latency and throughput performance for serving, you can run multiple API servers on -different GPUs by specifying different ports for each server and use ``CUDA_VISIBLE_DEVICES`` to -specify the GPUs for each server, for example: - -.. code-block:: shell - - CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model - /path/to/model --dtype float16 -tp 2 --port 8000 & - - CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --model - /path/to/model --dtype float16 -tp 2 --port 8001 & - -Choose an attention backend ---------------------------- - -vLLM on ROCm supports two attention backends, each suitable for different use cases and performance -requirements: - -- **Triton Flash Attention** - For benchmarking, run vLLM scripts at - least once as a warm-up step so Triton can perform auto-tuning before - collecting benchmarking numbers. This is the default setting. - -- **Composable Kernel (CK) Flash Attention** - To use CK Flash Attention, specify - the environment variable as ``export VLLM_USE_TRITON_FLASH_ATTN=0``. - - -Refer to :ref:`Model acceleration libraries ` -to learn more about Flash Attention with Triton or CK backends. - -.. 
_vllm-engine-args: - -vLLM engine arguments ---------------------- - -The following are configuration suggestions to potentially improve performance with vLLM. See -`vLLM's engine arguments documentation `_ -for a full list of configurable engine arguments. - -Configure the max-num-seqs parameter -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Increase the ``max-num-seqs`` parameter from the default ``256`` to ``512`` (``--max-num-seqs -512``). This increases the maximum number of sequences per iteration and can improve throughput. - -Use the float16 dtype -^^^^^^^^^^^^^^^^^^^^^ - -The default data type (``dtype``) is specified in the model’s configuration file. For instance, some models use ``torch.bfloat16`` as their default ``dtype``. -Use float16 (``--dtype float16``) for better performance. - -Multi-step scheduling -^^^^^^^^^^^^^^^^^^^^^ - -Setting ``num-scheduler-steps`` for multi-step scheduling can increase performance. Set it between 10 to 15 (``--num-scheduler-steps 10``). - -Distributed executor backend -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The vLLM supports two modes of distributed executor backend: ``ray`` and ``mp``. When using the ``__ fork, using the ``mp`` -backend (``--distributed_executor_backend mp``) is recommended. - -Graph mode max-seq-len-to-capture -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Maximum sequence length covered by CUDA graphs. In the default mode (where ``enforce_eager`` is ``False``), when a sequence has context length -larger than this, vLLM engine falls back to eager mode. The default is 8192. - -When working with models that support long context lengths, set the parameter ``--max-seq-len-to-capture`` to 16384. -See this `vLLM blog `__ for details. - -An example of long context length model is Qwen2-7b. - -Whether to enable chunked prefill -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Another vLLM performance tip is to enable chunked prefill to improve -throughput. Chunked prefill allows large prefills to be chunked into -smaller chunks and batched together with decode requests. - -You can enable the feature by specifying ``--enable-chunked-prefill`` in the -command line or setting ``enable_chunked_prefill=True`` in the LLM -constructor.  - -As stated in `vLLM's documentation, `__, -you can tune the performance by changing ``max_num_batched_tokens``. By -default, it is set to 512 and optimized for ITL (inter-token latency). -Smaller ``max_num_batched_tokens`` achieves better ITL because there are -fewer prefills interrupting decodes. -Higher ``max_num_batched_tokens`` achieves better TTFT (time to the first -token) as you can put more prefill to the batch. - -You might experience noticeable throughput improvements when -benchmarking on a single GPU or 8 GPUs using the vLLM throughput -benchmarking script along with the ShareGPT dataset as input. - -In the case of fixed ``input-len``/``output-len``, for some configurations, -enabling chunked prefill increases the throughput. For some other -configurations, the throughput may be worse and elicit a need to tune -parameter ``max_num_batched_tokens`` (for example, increasing ``max_num_batched_tokens`` value to 4096 or larger). - -.. note:: - - Chunked prefill is no longer recommended. See the vLLM blog: `Serving LLMs on AMD MI300X: Best Practices `_ (October 2024). - -Quantization support ---------------------- - -Quantization reduces the precision of the model’s weights and activations, which significantly decreases the memory footprint. -``fp8(w8a8)`` and ``AWQ`` quantization are supported for ROCm. 
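The subsections below use the vLLM benchmarking scripts, but the same options are exposed through vLLM's engine arguments in the offline Python API. The following is a minimal sketch rather than an official recipe: the model path is a placeholder for an FP8 (or AWQ) quantized checkpoint, and the exact argument names should be checked against the vLLM version in your container.

.. code-block:: python

   from vllm import LLM, SamplingParams

   llm = LLM(
       model="/path/to/quantized-model",   # placeholder path
       quantization="fp8",                 # or "awq" for AWQ checkpoints
       kv_cache_dtype="fp8",               # optional; see the fp8 kv-cache-dtype section below
       tensor_parallel_size=1,
   )

   sampling = SamplingParams(max_tokens=32)
   outputs = llm.generate(["What is a large language model?"], sampling)
   print(outputs[0].outputs[0].text)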
- -FP8 quantization -^^^^^^^^^^^^^^^^^ - -``__ supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on the Instinct MI300X. -Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy. - -AMD publishes Quark Quantized OCP FP8 models on Hugging Face. For example: - -* `Llama-3.1-8B-Instruct-FP8-KV `__ -* `Llama-3.1-70B-Instruct-FP8-KV `__ -* `Llama-3.1-405B-Instruct-FP8-KV `__ -* `Mixtral-8x7B-Instruct-v0.1-FP8-KV `__ -* `Mixtral-8x22B-Instruct-v0.1-FP8-KV `__ - -To enable vLLM benchmarking to run on fp8 quantized models, use the ``--quantization`` parameter with value ``fp8`` (``--quantization fp8``). - -AWQ quantization -^^^^^^^^^^^^^^^^ - -You can quantize your own models by installing AutoAWQ or picking one of the 400+ models on Hugging Face. Be aware that -that AWQ support in vLLM is currently underoptimized. - -To enable vLLM to run on ``awq`` quantized models, using ``--quantization`` parameter with ``awq`` (``--quantization awq``). - -You can find more specifics in the `vLLM AutoAWQ documentation `_. - -fp8 kv-cached-dtype -^^^^^^^^^^^^^^^^^^^^^^^ - -Using ``fp8 kv-cache dtype`` can improve performance as it reduces the size -of ``kv-cache``. As a result, it reduces the cost required for reading and -writing the ``kv-cache``. - -To use this feature, specify ``--kv-cache-dtype`` as ``fp8``. - -To specify the quantization scaling config, use the -``--quantization-param-path`` parameter. If the parameter is not specified, -the default scaling factor of ``1`` is used, which can lead to less accurate -results. To generate ``kv-cache`` scaling JSON file, see `FP8 KV -Cache `__ -in the vLLM GitHub repository. - -Two sample Llama scaling configuration files are in vLLM for ``llama2-70b`` and -``llama2-7b``. - -If building the vLLM using -`Dockerfile.rocm `_ -for ``llama2-70b`` scale config, find the file at -``/vllm-workspace/tests/fp8_kv/llama2-70b-fp8-kv/kv_cache_scales.json`` at -runtime. - -Below is a sample command to run benchmarking with this feature enabled -for the ``llama2-70b`` model: - -.. code-block:: shell - - python3 /vllm-workspace/benchmarks/benchmark_throughput.py --model \ - /path/to/llama2-70b-model --kv-cache-dtype "fp8" \ - --quantization-param-path \ - "/vllm-workspace/tests/fp8_kv/llama2-70b-fp8-kv/kv_cache_scales.json" \ - --input-len 512 --output-len 256 --num-prompts 500 - - -.. _mi300x-tunableop: - -PyTorch TunableOp -================== - -`TunableOp `_ -is a feature used to obtain the optimal GPU kernel for a key PyTorch operations. At the moment, -TunableOp supports the tuning of dense matrix multiplies (GEMM, batched GEMM, GEMM and bias, and scaled GEMM). -This feature is useful for squeezing out the last bit of performance. -In short, it will try up to thousands of matrix multiply algorithms that are available in rocBLAS and hipBLASLt. -A caveat is that as the math libraries improve over time, there is a less benefit to using TunableOp, -and there is also no guarantee that the workload being tuned will be able to outperform the default GEMM algorithm in hipBLASLt. - -Some additional references for PyTorch TunableOp include `ROCm blog `__, -TunableOp `README `__, and -`llm tuning `__. - -The three most important environment variables for controlling TunableOp are: - -``PYTORCH_TUNABLEOP_ENABLED`` - The main on/off switch for all TunableOp implementations. Default is ``0`` (disabled). Set to ``1`` to enable. 
- -``PYTORCH_TUNABLEOP_TUNING`` - When enabled, if a tuned entry isn't found, runs the tuning step and records the entry. Default is ``1`` (enabled). Set to ``0`` to disable. - -``PYTORCH_TUNABLEOP_VERBOSE`` - Enables verbose output for debugging purposes -- it can be useful to see if TunableOp is being used at all. Default is ``0`` (disabled). Set to ``1`` to enable. - -For the complete list of environment variables, see the -TunableOp `README `__. -There are also Python APIs to set some of these environment variables, -but the preferred way to set the TunableOp tuning parameters is to use the environment variables. - -Workflow --------- - -Use these environment variables to enable TunableOp for any applications or libraries that use PyTorch (2.3 or later). - -The first step is the tuning pass: - -1. Enable TunableOp and tuning. Optionally enable verbose mode: - - .. code-block:: shell - - PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 your_script.sh - - This pass can be very slow. The output will be the ``tunableop_results.csv`` file containing a list of GEMMs encountered - and the optimal GPU kernel that was identified. - - - - Multi-GPU tuning is supported, producing a separate tunableop_results.csv file for each GPU. The - tuning algorithm executes independently on each GPU, with each tuning process sandboxed to its - respective GPU. There is no inter-GPU communication during tuning. - - For data-parallel algorithms, where GEMM configurations across GPUs are typically identical, this - approach can result in redundant work. In such cases, running the workload on a single GPU might - suffice. However, for algorithms involving multiple levels of parallelism (as in data parallelism - combined with ML model parallelism), different GPUs might require distinct GEMM parameters. In - these scenarios, a multi-GPU configuration is recommended. - -In the second step, we re-run the workload with optimal configuration using the ``tunableop_results.csv`` file obtained in step 1. - -2. Enable TunableOp, disable tuning, and measure: - - .. code-block:: shell - - PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_script.sh - -Compare the wall-clock time from this second step to your reference wall-clock time with TunableOp completely disabled (``PYTORCH_TUNABLEOP_ENABLED=0``). - -Offline tuning --------------- - -A new feature of TunableOp, offline tuning, is available in upstream PyTorch and supported in PyTorch 2.6 or later. - -Traditionally, tuning is performed in-place during workload execution. While convenient for one-off -tuning, this approach can become cumbersome if frequent re-tuning is required -- such as when a new -version of a math library is released. In these cases, re-running the workload and performing tuning -repeatedly can be inefficient. - -Offline tuning addresses this challenge by decoupling the tuning process from workload execution. It -enables the collection of GEMMs from a workload during a collection pass, followed by tuning these -GEMMs in a separate tuning pass, without re-running the original workload. This approach -significantly reduces compute resource requirements, particularly for time-intensive workloads. - -For workflow instructions, refer to the `Offline Tuning documentation `_. - -.. 
_mi300x-torchinductor-tuning: - -PyTorch inductor max-autotune tuning knobs -========================================== - -The following are suggestions for optimizing matrix multiplication (GEMM) and -convolution (``conv``) operations in PyTorch using ``inductor``, a part of the -PyTorch compilation framework. - -Learn more about TorchInductor environment variables and usage in the -`PyTorch documentation `_. - -.. note:: - - Triton is not used if regular :doc:`MIOpen ` or - :doc:`rocBLAS ` performs faster for a specific operation. - -.. note:: - - Experimental: TunableOp (see the :ref:`PyTorch TunableOp ` section) can also be used in combination - with ``TorchInductor`` ``max-autotune`` mode to boost ATen GEMM performance but will further increase tuning time. - The environment variable ``TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1`` can be useful in single GPU workloads to distribute Triton GEMM tuning. - -Triton backend --------------- - -The goal is to leverage Triton to achieve better performance. To tune Triton kernels with ``gemm`` and convolution ops (``conv``), use the -``torch.compile`` function with the ``max-autotune`` mode. This benchmarks a -predefined list of Triton configurations and selects the fastest one for each -shape. See the configurations in PyTorch source code: - -* `conv configurations for "max-autotune" `_ - -* `matmul configurations for "max-autotune" `_ - -This tuning will select the best Triton ``gemm`` configurations according to tile-size -``(BLOCK_M, BLOCK_N, BLOCK_K), num_stages, num_warps`` and ``mfma`` instruction size ( ``matrix_instr_nonkdim`` ) -(see "Triton kernel optimization" section for more details). - -* Set ``torch._inductor.config.max_autotune = True`` or ``TORCHINDUCTOR_MAX_AUTOTUNE=1``. - -* Or, for more fine-grained control: - - ``torch._inductor.config.max_autotune_gemm = True`` - To enable tuning or lowering of ``mm``/``conv``\s. - - ``torch._inductor.config.max_autotune.pointwise = True`` - To enable tuning for ``pointwise``/``reduction`` ops. - - ``torch._inductor.max_autotune_gemm_backends`` or ``TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS`` - Selects the candidate backends for ``mm`` auto-tuning. Defaults to - ``TRITON,ATEN``. - Limiting this to ``TRITON`` might improve performance by - enabling more fused ``mm`` kernels instead of going to rocBLAS. - -* Inference can see large improvements on AMD GPUs by utilizing - ``torch._inductor.config.freezing=True`` or the ``TORCHINDUCTOR_FREEZING=1`` variable, which - in-lines weights as constants and enables constant folding optimizations. - -* Enabling ``inductor``’s cpp_wrapper might improve overhead. This generates - C++ code which launches Triton binaries directly with - ``hipModuleLaunchKernel`` and relies on `hipification`. - - ``torch._inductor.config.cpp_wrapper=True`` or ``TORCHINDUCTOR_CPP_WRAPPER=1`` - -* Convolution workloads might see a performance benefit by specifying - ``torch._inductor.config.layout_optimization=True`` or ``TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1``. - This can help performance by enforcing ``channel_last`` memory format on the - convolution in TorchInductor, avoiding any unnecessary transpose operations. - Note that ``PYTORCH_MIOPEN_SUGGEST_NHWC=1`` is recommended if using this. - -* To extract the Triton kernels generated by ``inductor``, set the environment variable - ``TORCH_COMPILE_DEBUG=1``, which will create a ``torch_compile_debug/`` directory - in the current path. 
The wrapper codes generated by ``inductor`` are in one or more - ``output_code.py`` files corresponding to the FX graphs associated with the model. - The Triton kernels are defined in these generated codes. - - -Composable Kernel backend --------------------------- - -You can enable the Composable Kernel (``CK``) backend by appending ``CK`` to the comma-separated list of backends. This allows the -auto-tuning process to use kernels from the Composable Kernel library. - -``torch._inductor.max_autotune_gemm_backends`` or ``TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS``. - -Install the Composable Kernel library's Python wrapper via pip using the following command: - -.. code-block:: shell - - pip install git+https://github.com/rocm/composable_kernel@develop - -This wrapper library is responsible for constructing a list of kernel instances available in the Composable Kernel library, -as well as storing the kernel instance C++ includes in a known location (so clang can look into these paths when compiling the ``gemm`` auto-tune candidates). - - * ``matmul`` (with ``float16`` and ``bfloat16`` inputs, row-major X, row-major or column-major W) - * ``addmm`` (with ``float16`` or ``bfloat16`` X, W and Bias; row-major X, row-major or column-major W; Bias can be broadcast either along row-major or column-major dimension) - * ``scaled_mm`` (``float8_e4m3fnuz`` inputs, ``bfloat16`` output) - * ``conv2d`` (with ``float32``, ``float16`` or ``bfloat16`` inputs, channels-last weight layout) - -* For working examples, see `test/inductor/test_ck_backend.py `_. - -* Compiling or build time can be configured by modifying ``torch._inductor.config`` to reduce the build time to avoid time-out. - - * ``compile_threads``: Number of threads used for compilation. Set it to the number of available CPU cores. - * ``rocm.n_max_profiling_configs``: Limiting the number of kernels to speed up compilation. - -* Setting environment variable ``PYTORCH_MIOPEN_SUGGEST_NHWC=1`` to optimize convolution operations. - -Debugging and troubleshooting performance: - -* Generate a standalone executable runner to debug or assess kernels' performance by setting environment variable - ``INDUCTOR_CK_BACKEND_GENERATE_TEST_RUNNER_CODE=1`` to facilitate debugging and profiling. By default, - the CK backend will not build a standalone executable runner. -* Enable debug by passing compilation flags (e.g., ``is_debug``) to clang when compiling the kernels in ``torch._inductor.config.rocm`` class. -* The generated source files and other products of clang compilation are located in the torch inductor root directory (default: ``/tmp/torchinductor_root``) - -.. _mi300x-rocm-library-tuning: - -ROCm library tuning -=================== - -ROCm library tuning involves optimizing the performance of routine computational -operations (such as ``GEMM``) provided by ROCm libraries like -:ref:`hipBLASLt `, :ref:`Composable Kernel `, -:ref:`MIOpen `, and :ref:`RCCL `. This tuning aims -to maximize efficiency and throughput on Instinct MI300X accelerators to gain -improved application performance. - -.. _mi300x-library-gemm: - -GEMM (general matrix multiplication) ------------------------------------- - -GEMMs (General Matrix Multiplications) are a fundamental building block for many operations in neural networks. -GEMM is defined as ``C = αAB + βC`` where A is an ``MxK`` matrix input and B is ``KxN`` matrix input, -and C is ``MxN`` matrix input and is overwritten by the output. α and β are scalar inputs. 
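To make the definition concrete, the following minimal PyTorch sketch evaluates the
same expression (shapes and values are arbitrary; on ROCm builds of PyTorch, calls
like this are typically dispatched to rocBLAS or hipBLASLt):

.. code-block:: python

   import torch

   # GEMM definition: C = alpha * (A @ B) + beta * C
   M, N, K = 1024, 1024, 1024
   alpha, beta = 1.0, 1.0

   A = torch.randn(M, K, dtype=torch.float16, device="cuda")  # MxK input
   B = torch.randn(K, N, dtype=torch.float16, device="cuda")  # KxN input
   C = torch.randn(M, N, dtype=torch.float16, device="cuda")  # MxN input, overwritten by the output

   # torch.addmm computes beta * C + alpha * (A @ B) in a single GEMM call
   C = torch.addmm(C, A, B, beta=beta, alpha=alpha)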
-hipBLASLt is a library that provides general matrix-matrix operations with a flexible API -and extends functionalities beyond a traditional BLAS library. - -.. _mi300x-hipblaslt: - -hipBLASLt benchmarking -^^^^^^^^^^^^^^^^^^^^^^ - -The GEMM library -`hipBLASLt `_ -provides a benchmark tool for its supported operations. Refer to the -`documentation `_ -for details. - -* Example 1: Benchmark mix fp8 GEMM - - .. code-block:: shell - - HIP_FORCE_DEV_KERNARG=1  hipblaslt-bench --alpha 1 --beta 0 -r f16_r \ - --a_type f16_r --b_type f8_r --compute_type f32_f16_r \ - --initialization trig_float  --cold_iters 100 --iters 1000 --rotating 256 - -* Example 2: Benchmark forward epilogues and backward epilogues - - * ``HIPBLASLT_EPILOGUE_RELU: "--activation_type relu";`` - - * ``HIPBLASLT_EPILOGUE_BIAS: "--bias_vector";`` - - * ``HIPBLASLT_EPILOGUE_RELU_BIAS: "--activation_type relu --bias_vector";`` - - * ``HIPBLASLT_EPILOGUE_GELU: "--activation_type gelu";`` - - * ``HIPBLASLT_EPILOGUE_DGELU": --activation_type gelu --gradient";`` - - * ``HIPBLASLT_EPILOGUE_GELU_BIAS: "--activation_type gelu --bias_vector";`` - - * ``HIPBLASLT_EPILOGUE_GELU_AUX: "--activation_type gelu --use_e";`` - - * ``HIPBLASLT_EPILOGUE_GELU_AUX_BIAS: "--activation_type gelu --bias_vector --use_e";`` - - * ``HIPBLASLT_EPILOGUE_DGELU_BGRAD: "--activation_type gelu --bias_vector --gradient";`` - - * ``HIPBLASLT_EPILOGUE_BGRADA: "--bias_vector --gradient --bias_source a";`` - - * ``HIPBLASLT_EPILOGUE_BGRADB:  "--bias_vector --gradient --bias_source b";`` - - -hipBLASLt auto-tuning using hipblaslt-bench -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use the auto-tuning tool in hipBLASLt to get the best solution for a given problem size. - -Prerequisite -'''''''''''' - -Build hipBLASLt. -See the `hipBLASLt repository `_ to see detailed build instructions. - -Quick start -''''''''''' - -Create a working folder for the auto-tuning tool, for example, ``tuning/``. - -1. Set the ``ProblemType``, ``TestConfig``, and ``TuningParameters`` in the YAML file. You can modify the template YAML file in ``hipblaslt/utilities``. - -.. figure:: ../../../data/how-to/tuning-guides/hipblaslt_yaml_template.png - :align: center - :alt: HipBLASLt auto-tuning yaml file template - -2. Run the following command to start tuning. - - .. code-block:: shell - - # python3 hipblaslt/utilities/find_exact.py - # Assume we're in folder tuning, the default root of the build folder of hipblaslt is hipblaslt/build/release - python3 ../hipblaslt/utilities/find_exact.py tuning.yaml hipblaslt/build/release ./ - - -Output -'''''' - -The tool will create two output folders. The first one is the benchmark results, -the second one is the generated equality kernels. If ``SplitK`` is used, the solution's ``GlobalSplitU`` will -also change if the winner is using a different ``SplitK`` from the solution. The YAML files generated inside the -folder ``1_LogicYaml`` are logic ones. These YAML files are just like those generated from TensileLite. - -.. figure:: ../../../data/how-to/tuning-guides/hipblaslt_auto_tuning_output_files.png - :align: center - :alt: HipBLASLt auto-tuning output folder - - -A quick view of the config YAML -''''''''''''''''''''''''''''''' - -The tuning tool is a two-step tool. It first runs the benchmark, then it creates the equality YAML for the user. Note that this config YAML file is different from the config YAML used in TensileLite. 
- -* **Benchmarking** - - The first step is to run the benchmark, ``find_exact.py`` will run the benchmark with ``hipblaslt-bench``. - For the default configurations, see the Python file. - - .. code-block:: python - - defaultBenchOptions = {"ProblemType": { -     "TransposeA": 0, -     "TransposeB": 0, -     "ComputeInputDataType": "s", -     "ComputeDataType": "s", -     "DataTypeC": "s", -     "DataTypeD": "s", -     "UseBias": False - }, "TestConfig": { -     "ColdIter": 20, -     "Iter": 100, -     "AlgoMethod": "all", -     "RequestedSolutions": 2, # Only works in AlgoMethod heuristic -     "SolutionIndex": None, # Only works in AlgoMethod index -     "ApiMethod": "cpp", -     "RotatingBuffer": 0, - }, "TuningParameters": { -     "SplitK": [0] - }, "ProblemSizes": []} - defaultCreateLogicOptions = {}  # Currently unused - -* ``TestConfig`` - 1. ``ColdIter``: This is number the warm-up iterations before starting the kernel benchmark. - 2. ``Iter``: This is the number of iterations in kernel benchmarking - 3. ``AlgoMethod``: We recommended to keep this unchanged because method "all" returns all the available solutions for the problem type. - 4. ``ApiMethod``: We have c, mix, and cpp. Doesn't affect the result much. - 5. ``RotatingBuffer``: This is a size in the unit of MB. Recommended to set the value equal to the size of the cache of the card to avoid the kernel fetching data from the cache. - -* ``TuningParameters`` - ``SplitK``: Divide ``K`` into ``N`` portions. Not every solution supports ``SplitK``. - The solution will be skipped if not supported. - -* ``CreateLogic`` - Currently no control parameters. - -hipBLASLt backend assembly generator tuning -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -:doc:`hipBLASLt ` has a backend assembly generator in -`hipBLASLt's GitHub repository `_, -named TensileLite. TensileLite enables performance optimization by tuning the backend assembly generator. -The following section explains how to use TensileLite to tune hipBLASLt for better performance. - -.. code-block:: shell - - cd /hipBLASLt/tensilelite - ./Tensile/bin/Tensile config.yaml output_path - -config.yaml -''''''''''' - -This file contains the parameters and settings for the tuning process. Here’s -a breakdown of the important sections: - -``GlobalParameters`` - The set of parameters which provides context for the entire tuning exercise. - - Using ``0`` for ``NumElementsToValidate`` is suggested for performance tuning to avoid validation overhead. - - .. code-block:: python - - globalParameters["NumElementsToValidate"] = 0 - -``BenchmarkProblems`` - Defines the set of kernel specifications as well as the size definitions - for the tuning exercise. - - * ``ProblemType`` (``OperationType``, ``DataType``, ``TransposeA``, ``TransposeB``) - * ``BenchmarkCommonParameters`` (the same parameters for all solutions) - * ``ForkParameters`` - * ``BenchmarkFinalParameters`` (``ProblemSizes``) - -``LibraryLogic`` - Specifies the target environment and platform. - - * ``ScheduleName`` - - * ``aldebaran`` is MI200 - - * ``aquavanjaram`` is MI300 - - .. code-block:: shell - - $ ls - aldebaran aquavanjaram navi31 navi32 - - .. code-block:: yaml - - LibraryLogic: - ScheduleName: "aldebaran" - DeviceNames: [Device 0050, Device 0052, Device 0054, Device 0062, Device 7400] - ArchitectureName: "gfx90a" - -``LibraryClient`` - If defined, this will enable step 4 of the tuning process, which means the final - library will be created. - - .. 
code-block:: shell

   $ ls
   aldebaran_Cijk_Ailk_Bjlk_S.yaml

TensileLite tuning flow
------------------------

The TensileLite tuning flow consists of seven steps. In the first six steps,
the programmable benchmarking protocol generates fast kernel candidates. In the
final step (:ref:`step 7 `), these candidates are benchmarked against a predefined set
of problem sizes.

.. _tensilelite-tuning-flow-fig:

.. figure:: ../../../data/how-to/tuning-guides/tensilelite-tuning-flow.png
   :align: center
   :alt: TensileLite tuning flow

.. _tensilelite-tuning-step-1:

Step 1: Initial solution parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before Tensile can benchmark a kernel parameter in Step 2 of the :ref:`preceding figure `,
such as ``PrefetchGlobalRead={False, True}``, all other kernel parameters not being measured must be specified.
Therefore, the first step initializes a list of default kernel parameters; each subsequent benchmarking step
then overrides a parameter from this default list with the value determined from benchmarking.
Tensile is preloaded with default values for any parameters left unspecified during tuning.

Step 2: Benchmark common parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Benchmarking common parameters determines the parameters that are universally preferable to their alternatives,
regardless of other parameters. To benchmark common parameters:

* The user specifies the parameters and values to benchmark.

* Tensile benchmarks all parameter combinations for a user-specified problem size.

* Tensile selects the fastest parameter combination, which is now labeled determined and is used in subsequent steps.

In practice, this step is usually skipped, since globally preferred parameters are already set as defaults in Tensile and do not need to be re-measured.

Step 3: Fork parameters
^^^^^^^^^^^^^^^^^^^^^^^

Rather than continuing to determine globally fastest parameters, which eventually leads
to a single fastest kernel, forking creates many different kernels,
all of which will be considered for use. All forked
parameters are considered determined, that is, they aren't measured to determine
which is fastest. The :ref:`preceding figure ` shows 7 kernels being forked in Step 3.

Step 4: Benchmark fork parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, tuning continues its refinement by determining the fastest parameters for
each forked permutation, in the same way as Step 2.

Step 5: Join parameters
^^^^^^^^^^^^^^^^^^^^^^^

After tuning the forked kernels, joining reduces the list of kernels so that fewer kernels
are considered for final use. Each kernel in the resulting list must have different values
for the listed ``JoinParameters``. For example, employing ``JoinParameters`` = ``MacroTile`` results in only a
few final kernels, each with a different ``MacroTile``. If there are multiple kernels with the same ``MacroTile``,
only the fastest is kept. In the preceding figure, the 7 forked kernels have been reduced to 3 joined kernels.

Step 6: Benchmark join parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can further tune the parameters of the joined kernels. This step is the same as Step 4, except
that it tunes after joining so that there are fewer kernels to tune. In practice,
this step is not used; using Step 4 is preferred so that all parameters are measured before joining.

..
_tensilelite-tuning-step-7: - -Step 7: Benchmark final parameters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -At the conclusion of Step 6, all parameters of all kernels have been determined and the -final set of kernels for consideration has been established. Now all final kernels will be -measured against all problem sizes specified by the user. Problem sizes can be specified -as Range sizes and Exact sizes. Range sizes cause benchmarking of a broad range of sizes, -and Tensile will be able to interpolate which kernel is best even between the specifically -measured sizes. Exact sizes cause a single problem size to be measured, and the final -library is guaranteed to choose the fastest kernel for that size. This final benchmarking -generates the data that is subsequently analyzed for creating the mapping of problem size -to optimal kernel. - -Update logic YAML files ------------------------- - -The logic YAML files in hipBLASLt are located in -``library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/``. - -To merge the YAML files from the tuned results in TensileLite, use the -``merge.py`` located in ``tensilelite/Tensile/Utilities`` with the following -command: - -.. code-block:: shell - - merge.py original_dir new_tuned_yaml_dir output_dir  - -The following table describes the logic YAML files. - -+----------------+------------------------------------------------------+ -| Logic YAML | Description | -+================+======================================================+ -| ``Equality`` | Update the equality file when your tuned YAML is | -| | an exact tuning. | -+----------------+------------------------------------------------------+ -| ``GridBased`` | Update the gridbased file when your tuned YAML is | -| | a grid-based tuning. | -+----------------+------------------------------------------------------+ -| ``FreeSize`` | Update the freesize file when your tuned YAML | -| | contains confidential sizes, or others. Note that | -| | freesize YAML files do not require any problem size. | -+----------------+------------------------------------------------------+ - -Tensile optimization and performance tuning tips -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -MI16x16 versus MI32x32 - MI16x16 outperforms MI32x32 due to its superior power efficiency. The MI16x16 - format refers to the ``v_mfma`` instruction (such as - ``v_mfma_f32_16x16x16f16``). See - ``__. - -Clock differences among XCDs - There can be a clock speed variation of 3% to 10% among different XCDs. - Typically, XCD0 has the highest clock speed, while XCD7 has the lowest on - MI300X. For optimal efficiency calculations on MI300X, use the XCD with the - lowest average clock speed. If the average clock speed of XCD0 is used, - target efficiencies (such as, 95% for DGEMM HPL cases with K=512) may not be - achievable. - -`WorkGroupMapping` - To maximize L2 cache efficiency, use multiples of the XCD number. For MI300X, - this means using multiples of 8 (such as, 24, 32, 40). - -GEMM stride issues - On MI300, if the matrix stride in GEMM is a multiple of 512 bytes, it can lead to - Tagram channel hotspotting issues, causing a significant performance drop, especially for TN - transpose cases. This can increase the latency of VMEM instructions and cause - a notable performance drop. To avoid this, use stride padding to ensure the - stride is not a multiple of 512 bytes (for instance, for TN F16 GEMM, set - ``lda = ldb = K + 128`` when ``K % 256 == 0``). - -.. 
_mi300x-ck: - -Optimizing Composable Kernel GEMM kernels -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The performance of a GEMM kernel is significantly influenced by the input -values. The performance hierarchy based on input value types, from highest to -lowest, is as follows: - -* Case 1: [all 0] - -* Case 2: [all identical integers] - -* Case 3: [random integers] - -* Case 4: [random floats] - -There can be more than a 20 percent performance drop between Case 1 and Case 4, -and a 10 percent drop between random integers and random floats. - -Additionally, ``bf16`` matrix core execution is noticeably faster than ``f16``. - -Distributing workgroups with data sharing on the same XCD can enhance -performance (reduce latency) and improve benchmarking stability. - -CK provides a rich set of template parameters for generating flexible accelerated -computing kernels for difference application scenarios. - -See :doc:`optimizing-with-composable-kernel` -for an overview of Composable Kernel GEMM kernels, information on tunable -parameters, and examples. - -.. _mi300x-miopen: - -MIOpen ------- - -MIOpen is AMD's open-source, deep learning primitives library for GPUs. It -implements fusion to optimize for memory bandwidth and GPU launch overheads, -providing an auto-tuning infrastructure to overcome the large design space of -problem configurations. - -Convolution -^^^^^^^^^^^ - -Many of MIOpen kernels have parameters which affect -their performance. Setting these kernel parameters to optimal values -for a given convolution problem, allows reaching the best possible -throughput. The optimal values of these kernel parameters are saved -in PerfDb (Performance database). PerfDb is populated through -tuning. To manipulate the tuning level, use the environment variable -``MIOPEN_FIND_ENFORCE`` (1-6). Optimal values of kernel parameters are -used to benchmark all applicable convolution kernels for the given -convolution problem. These values reside in the FindDb. To manipulate -how to find the best performing kernel for a given convolution -problem, use the environment variable ``MIOPEN_FIND_MODE`` (1-5). - -.. _mi300x-miopen-tuning: - -Tuning in MIOpen -^^^^^^^^^^^^^^^^ - -``MIOPEN_FIND_ENFORCE=DB_UPDATE``, ``2`` - Performs auto-tuning and update to the PerfDb. - -``MIOPEN_FIND_ENFORCE=SEARCH``, ``3`` - Only perform auto-tuning if PerfDb does not contain optimized value for a - given convolution problem - -What does :doc:`PerfDb ` look like? - -.. code-block:: - - [ - 2x128x56xNHWCxF, [ - ConvAsm1x1U : 1,8,2,64,2,4,1,8 ; // optimum kernel params for convolution problem 2x128x56xNHWCxF - ConvOclDirectFwd1x1 : 1,128,1,1,0,2,32,4,0; // optimum kernel params for convolution problem 2x128x56xNHWCxF - ], - 2x992x516xNHWCxF, [ - ConvAsm1x1U : 64,18,2,64,2,4,41,6 ; // optimum kernel params for convolution problem 2x992x516xNHWCxF - ConvOclDirectFwd1x1 : 54,128,21,21,1,23,32,4,0 // optimum kernel params for convolution problem 2x992x516xNHWCxF - ] - ... - ] - -See :doc:`miopen:conceptual/perfdb` for more information. - -Finding the fastest kernel -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -``MIOPEN_FIND_MODE=NORMAL``, ``1`` - Benchmark all the solvers and return a list (front element is the fastest kernel). - -``MIOPEN_FIND_MODE=FAST``, ``2`` - Check FindDb (Find database) if convolution problem is found return - else - immediate fallback mode (predict the performing kernel parameters based on - mathematical and AI models). 
- -``MIOPEN_FIND_MODE=HYBRID``, ``3`` - Check FindDb if convolution problem is found return - else benchmark that - problem. - -What does :doc:`FindDb ` look like? - -.. code-block:: - - [ - - 2x128x56xNHWCxF, [ - ConvAsm1x1U : 0.045 (time), 12312 (workspace), algo_type; - ConvOclDirectFwd1x1 : 1.145 (time), 0 (workspace), algo_type; - ], - - 2x992x516xNHWCxF, [ - ConvAsm1x1U : 2.045 (time), 12312 (workspace), algo_type; - ConvOclDirectFwd1x1 : 1.145 (time), 0 (workspace), algo_type; - ] - ... - ] - -See :doc:`miopen:how-to/find-and-immediate` for more information. - -For example: - -.. code-block:: shell - - MIOPEN_FIND_ENFORCE=3 MIOPEN_FIND_MODE=1 ./bin/MIOpenDriver convbfp16 -n 1 -c 1024 -H 14 -W 14 -k 256 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 - -.. _mi300x-rccl: - -RCCL ----- - -:doc:`RCCL ` is a stand-alone library of standard collective -communication routines for GPUs, implementing all-reduce, all-gather, reduce, -broadcast, reduce-scatter, gather, scatter, and all-to-all. RCCL supports an -arbitrary number of GPUs installed in a single node or multiple nodes -and can be used in either single- or multi-process (such as MPI) -applications. - -The following subtopics include information on RCCL features and optimization -strategies: - -* :ref:`Use all eight GPUs ` - -* :ref:`Disable NUMA auto-balancing ` - -* :ref:`Disable ACS for multi-node RCCL ` - -* :ref:`Run RCCL-Unittests ` - -* :ref:`NPKit profiler ` - -* :ref:`RCCL-tests ` - -* :ref:`Use one-process-per-GPU mode ` - -* :ref:`RCCL in E2E workloads ` - -.. _mi300x-rccl-8-gpu: - -Use all eight GPUs -^^^^^^^^^^^^^^^^^^ - -In an :ref:`MI300X architecture `, there are -dedicated links between each pair of GPUs in a fully connected topology. -Therefore, for collective operations, the best performance is achieved -when all 8 GPUs and, hence, all the links between them are used. In the -case of 2- or 4-GPU collective operations (generally less than 8 GPUs), -you can only use a fraction of the potential bandwidth on the node. - -The following figure shows an -:doc:`MI300X node-level architecture ` of a -system with AMD EPYC processors in a dual-socket configuration and eight -AMD Instinct MI300X accelerators. The MI300X OAMs attach to the host system via -PCIe Gen 5 x16 links (yellow lines). The GPUs use seven high-bandwidth, -low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected -8-GPU system. - -.. _mi300x-node-level-arch-fig: - -.. figure:: ../../../data/shared/mi300-node-level-arch.png - - MI300 series node-level architecture showing 8 fully interconnected MI300X - OAM modules connected to (optional) PCIe switches via re-timers and HGX - connectors. - -.. _mi300x-rccl-disable-numa: - -Disable NUMA auto-balancing -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In order to reduce performance variability and also achieve better -performance, you need to make sure that NUMA auto-balancing is disabled -on the node. - -Check whether NUMA auto-balancing is disabled, by running the -following command: ``cat /proc/sys/kernel/numa_balancing`` and -checking whether the output is ``0``. - -If the output is ``1``, you can disable NUMA auto-balancing by running the -following command: ``sudo sysctl kernel.numa_balancing=0``. For more details, -see `AMD Instinct MI300X system optimization -`_. - -.. _mi300x-rccl-disable-acs: - -Disable ACS for multi-node RCCL -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Check if ACS is disabled with ``sudo lspci -vvv \| grep -i "acsctl"``. -This will print many lines. 
Check if there are any that show ``SrcValid+``.

If there are any ``SrcValid+`` entries, use the following ``disable_acs.sh`` script
to disable ACS (requires ``sudo``).

.. code-block:: shell

   #!/bin/bash
   #
   # Disable ACS on every device that supports it
   #
   PLATFORM=$(dmidecode --string system-product-name)
   logger "PLATFORM=${PLATFORM}"
   # Enforce platform check here.
   #case "${PLATFORM}" in
   #"OAM"*)
   #logger "INFO: Disabling ACS is no longer necessary for ${PLATFORM}"
   #exit 0
   #;;
   #*)
   #;;
   #esac
   # must be root to access extended PCI config space
   if [ "$EUID" -ne 0 ]; then
      echo "ERROR: $0 must be run as root"
      exit 1
   fi
   for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
      # skip if it doesn't support ACS
      setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
      if [ $? -ne 0 ]; then
         #echo "${BDF} does not support ACS, skipping"
         continue
      fi
      logger "Disabling ACS on $(lspci -s ${BDF})"
      setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
      if [ $? -ne 0 ]; then
         logger "Error enabling directTrans ACS on ${BDF}"
         continue
      fi
      NEW_VAL=$(setpci -v -s ${BDF} ECAP_ACS+0x6.w | awk '{print $NF}')
      if [ "${NEW_VAL}" != "0000" ]; then
         logger "Failed to enable directTrans ACS on ${BDF}"
         continue
      fi
   done
   exit 0

.. _mi300x-rccl-unittests:

Run RCCL-Unittests
^^^^^^^^^^^^^^^^^^

To verify the RCCL installation and test whether all parts and units of
RCCL work as expected, you can run the RCCL-Unittests, which are
explained in ``__.

.. _mi300x-rccl-npkit:

NPKit profiler
^^^^^^^^^^^^^^

To collect fine-grained trace events in RCCL components, especially in
giant collective GPU kernels, you can use the NPKit profiler explained
in ``__.

.. _mi300x-rccl-tests:

RCCL-tests
^^^^^^^^^^

RCCL-tests are performance and error-checking tests for RCCL
maintained in ``__.

These tests are one of the best ways to check the performance of the
different collectives provided by RCCL. You can select the collectives,
message sizes, datatypes, operations, and number of iterations for
your test, and rccl-tests then delivers performance metrics such as
latency, algorithm bandwidth, and bus bandwidth for each case.

.. _mi300x-rccl-one-process-per-gpu:

Use one-process-per-GPU mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

RCCL delivers the best collective performance when it is configured in
one-process-per-GPU mode. In a one-process-per-multiple-GPUs configuration,
you can run into kernel launch latency issues, because ROCm serializes kernel
launches to multiple GPUs from a single process, which hurts performance.

.. _mi300x-rccl-e2e:

RCCL in E2E workloads
^^^^^^^^^^^^^^^^^^^^^

When using RCCL in end-to-end workloads, set the following environment
variable to increase the number of channels used by RCCL, which can
improve performance:

.. code-block:: text

   export NCCL_MIN_NCHANNELS=112

.. _mi300x-triton-kernel-performance-optimization:

Triton kernel performance optimization
======================================

Triton kernel optimization encompasses a variety of strategies aimed at
maximizing the efficiency and performance of GPU computations. These strategies
include
:ref:`optimizing overall GPU resource utilization `,
:ref:`tuning kernel configurations `, and
:ref:`leveraging specific hardware features ` to
achieve higher throughput and lower latency.

..
_mi300x-autotunable-kernel-config: - -Auto-tunable kernel configurations ----------------------------------- - -Auto-tunable kernel configuration involves adjusting memory access and computational -resources assigned to each compute unit. It encompasses the usage of -:ref:`LDS `, register, and task scheduling on a compute unit. - -The accelerator or GPU contains global memory, local data share (LDS), and -registers. Global memory has high access latency, but is large. LDS access has -much lower latency, but is smaller. It is a fast on-CU software-managed memory -that can be used to efficiently share data between all work items in a block. -Register access is the fastest yet smallest among the three. - -.. _mi300x-cu-fig: - -.. figure:: ../../../data/shared/compute-unit.png - - Schematic representation of a CU in the CDNA2 or CDNA3 architecture. - -The following is a list of kernel arguments used for tuning performance and -resource allocation on AMD accelerators, which helps in optimizing the -efficiency and throughput of various computational kernels. - -``num_stages=n`` - Adjusts the number of pipeline stages for different types of kernels. On AMD accelerators, set ``num_stages`` - according to the following rules: - - * For kernels with a single GEMM, set to ``2``. - - * For kernels with two GEMMs fused (Flash Attention, or any other kernel - that fuses 2 GEMMs), set to ``1``. - - * For kernels that fuse a single GEMM with another non-GEMM operator - (for example ReLU activation), set to ``2``. - - * For kernels that have no GEMMs, set to ``1``. - -``waves_per_eu=n`` - Helps to manage Vector General Purpose Registers (VGPR) usage to achieve - desired occupancy levels. This argument hints to the compiler to reduce VGPR - to achieve ``n`` occupancy where ``n`` is a number. The goal is to achieve a - certain occupancy level for each Execution Unit (EU, also called - :ref:`SIMD Unit `) to achieve better latency or throughput. - For more information on how to compute occupancy, see - :ref:`mi300x-compute-kernel-occ`. - - This argument is useful if: - - * The occupancy of the kernel is limited by VGPR usage, and - - * The current VGPR usage is only a few above a boundary in - :ref:`Occupancy related to VGPR usage in an Instinct MI300X accelerator `. - -.. _mi300x-occupancy-vgpr-table: - -.. figure:: ../../../data/shared/occupancy-vgpr.png - :alt: Occupancy related to VGPR usage in an Instinct MI300X accelerator. - :align: center - - Occupancy related to VGPRs usage on an Instinct MI300X accelerator - -For example, according to the table, each Execution Unit (EU) has 512 available -VGPRs, which are allocated in blocks of 16. If the current VGPR usage is 170, -it will be rounded up to 176 due to the allocation granularity. In this case, -the occupancy is limited to 2 waves per EU because :math:`176 \times 3 > 512`. -So, if you set ``waves_per_eu`` to 3, the LLVM backend will attempt to reduce -VGPR usage so that it might fit 3 waves per EU. - -``BLOCK_M``, ``BLOCK_N``, ``BLOCK_K`` - Tile sizes to be tuned to balance the memory-to-computation ratio. The goal - is to minimize the memory transfer from global to shared and reuse memory - across different threads. This needs to be tuned. The tile sizes should be - large enough to maximize the efficiency of the memory-to-computation - ratio but small enough to parallelize the greatest number of workgroups at - the grid level. 
- -``matrix_instr_nonkdim`` - Experimental feature for Flash Attention-like kernels that determines the size of the Matrix Fused Multiply-Add - (MFMA) instruction used. - - - ``matrix_instr_nonkdim = 16``: ``mfma_16x16`` is used. - - - ``matrix_instr_nonkdim = 32``: ``mfma_32x32`` is used. - - For GEMM kernels on an MI300X accelerator, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large - tile/GEMM sizes. - - -.. _mi300x-triton-gpu-utilization: - -Overall GPU resource utilization --------------------------------- - -As depicted in the following figure, each XCD in -:doc:`MI300X ` contains 40 compute units (CUs), -with 38 active. Each MI300X contains eight vertical XCDs, and a total of 304 -active compute units capable of parallel computation. The first consideration is -the number of CUs a kernel can distribute its task across. - -.. figure:: ../../../data/shared/xcd-sys-arch.png - - XCD-level system architecture showing 40 compute units, - each with 32 KB L1 cache, a unified compute system with 4 ACE compute - accelerators, shared 4MB of L2 cache, and a hardware scheduler (HWS). - -You can query hardware resources with the command ``rocminfo`` in the -``/opt/rocm/bin`` directory. For instance, query the number of CUs, number of -SIMD, and wavefront size using the following commands. - -.. code-block:: shell - - rocminfo | grep "Compute Unit" - - rocminfo | grep "SIMD" - - rocminfo | grep "Wavefront Size" - -For the MI300X, the goal is to have a minimum of 1024 thread -blocks or workgroups in the grid (kernel), with a preference for -more. - -Identifying additional parallelism within the algorithm is necessary to -enhance GPU utilization. For more information and examples, see -`Accelerating A Triton Fused Kernel For W4a16 Quantized Inference With -SplitK Work Decomposition `__. - -.. _mi300x-mlir-analysis: - -MLIR analysis -------------- - -Triton includes the following layouts: **blocked**, **shared**, **sliced**, and **MFMA**. - -Use the Triton GPU Intermediate Representation (IR) to identify the memory in -which each computation takes place. - -Use the environment variable ``MLIR_ENABLE_DUMP`` to dump MLIR: - -.. code-block:: shell - - export MLIR_ENABLE_DUMP=1 - -The following is a snippet of IR from the Flash Attention decode ``int4`` KV program. It is to -de-quantize the ``int4`` key-value from the ``int4`` data type to ``fp16``. - -.. 
code-block:: text - - %190 = tt.load %189 {cache = 1 : i32, evict = 1 : i32, isVolatile = - false} : tensor<1x64xi32, #blocked6> loc(#loc159) - - %266 = arith.andi %190, %cst_28 : tensor<1x64xi32, #blocked6> - loc(#loc250) - - %267 = arith.trunci %266 : tensor<1x64xi32, #blocked6> to - tensor<1x64xi16, #blocked6> loc(#loc251) - - %268 = tt.bitcast %267 : tensor<1x64xi16, #blocked6> -> tensor<1x64xf16, - #blocked6> loc(#loc252) - - %269 = triton_gpu.convert_layout %268 : (tensor<1x64xf16, #blocked6>) -> - tensor<1x64xf16, #shared1> loc(#loc252) - - %270 = tt.trans %269 : (tensor<1x64xf16, #shared1>) -> tensor<64x1xf16, - #shared2> loc(#loc194) - - %276 = triton_gpu.convert_layout %270 : (tensor<64x1xf16, #shared2>) -> - tensor<64x1xf16, #blocked5> loc(#loc254) - - %293 = arith.mulf %276, %cst_30 : tensor<64x1xf16, #blocked5> - loc(#loc254) - - %295 = arith.mulf %292, %294 : tensor<64x32xf16, #blocked5> loc(#loc264) - - %297 = arith.addf %295, %296 : tensor<64x32xf16, #blocked5> loc(#loc255) - - %298 = triton_gpu.convert_layout %297 : (tensor<64x32xf16, #blocked5>) - -> tensor<64x32xf16, #shared1> loc(#loc255) - - %299 = tt.trans %298 : (tensor<64x32xf16, #shared1>) -> - tensor<32x64xf16, #shared2> loc(#loc196) - - %300 = triton_gpu.convert_layout %299 : (tensor<32x64xf16, #shared2>) -> - tensor<32x64xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mfma, kWidth - = 4}>> loc(#loc197) - -From the IR snippet, you can see ``i32`` data is loaded from global memory to -registers (``%190``). With a few element-wise operations in registers, it is -stored in shared memory (``%269``) for the transpose operation (``%270``), which -needs data movement across different threads. With the transpose done, it is -loaded from LDS to register again (``%276``), and with a few more -element-wise operations, it is stored to LDS again (``%298``). The last step -loads from LDS to registers and converts to the dot-operand layout -(``%300``). - -The IR snippet uses the LDS twice. The first is for the transpose, and -the second is to convert a blocked layout to a dot operand layout. -There’s an opportunity to optimize performance by using LDS once. - -.. _mi300x-assembly-analysis: - -ISA assembly analysis ---------------------- - -To generate ISA, ``export AMDGCN_ENABLE_DUMP=1`` when running the Triton -program. The generated ISA will be printed as standard output. You can -dump it to a file for analysis. - -* Ensure ``global_load_dwordx4`` is used in the ISA, especially when the - global memory load happens in the loop. - -* In most cases, the LDS load and store should use ``_b128`` to - minimize the number of LDS access instructions. - -* The AMD ISA has ``s_waitcnt`` instruction to synchronize the dependency - of memory access and computations. The ``s_waitcnt`` instructions can - typically have two signals in the Triton context: - - * ``lgkmcnt(n)``: ``lgkm`` stands for LDS, GDS - (Global Data Share), Constant, and Message. It is often related to - LDS access. The ``n`` indicates the number of data accesses can still - be ongoing before moving on to the next step. For example, if ``n`` is - ``0``, wait for all ``lgkm`` access to finish before continuing. If ``n`` - is ``1``, move on even if ``1`` ``lgkm`` access is still running - asynchronously. - - * ``vmcnt(n)``: ``vm`` represents vector memory. This happens when - vector memory is accessed, for example, when global load moves - from global memory to vector memory. The variable ``n`` is the same as - the previous setting. 
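For example, assuming the Triton program lives in a file named ``kernel.py`` (a
placeholder name), the ISA can be captured and scanned for these patterns as
follows:

.. code-block:: shell

   # Capture the ISA dump to a file for analysis
   AMDGCN_ENABLE_DUMP=1 python kernel.py > kernel.isa 2>&1

   # Count vectorized global and LDS accesses
   grep -c "global_load_dwordx4" kernel.isa
   grep -cE "ds_read_b128|ds_write_b128" kernel.isa

   # Inspect the synchronization points
   grep "s_waitcnt" kernel.isa | head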
- -Generally recommended guidelines are as follows. - -* Vectorize memory access as much as possible. - -* Ensure synchronization is done efficiently. - -* Overlap of instructions to hide latency, but it requires thoughtful - analysis of the algorithms. - -* If you find inefficiencies, you can trace it back to LLVM IR, TTGIR - and even TTIR to see where the problem comes from. If you find it - during compiler optimization, activate the MLIR dump - (``export MLIR_ENABLE_DUMP=1``) and check which optimization pass caused the - problem. - -.. _mi300x-hip-optimization: - -HIP performance optimization -============================ - -This section summarizes the best practices described in the -:doc:`Performance guidelines ` section of the -HIP documentation. - -Optimization areas of concern include: - -* Parallel execution - -* Memory usage optimization - -* Optimization for maximum throughput - -* Minimizing memory thrashing - -Parallel execution and GPU hardware utilization ------------------------------------------------ - -The application should reveal and efficiently imply as much parallelism as -possible for optimal use to keep all system components active. - -Memory usage optimization -------------------------- - -To optimize memory throughput, minimize low-bandwidth data transfers, -particularly between the host and device. Maximize on-chip memory, including -shared memory and caches, to reduce data transfers between global memory and the -device. - -In a GPU, global memory has high latency but a large size, while local data -share (LDS) has lower latency but a smaller size, and registers have the fastest -but smallest access. Aim to limit load/store operations in global memory. If -multiple threads in a block need the same data, transfer it from global memory -to LDS for efficient access. - -See :doc:`HIP's performance guidelines ` for -greater detail. - -Diagnostic and performance analysis -=================================== - -.. _mi300x-rocr-debug-agent: - -Debug memory access faults --------------------------- - -Identifying a faulting kernel is often enough to triage a memory access -fault. The ROCr Debug Agent can trap a memory access fault and provide a -dump of all active wavefronts that caused the error, as well as the name -of the kernel. For more information, see -:doc:`ROCr Debug Agent documentation `. - -To summarize, the key points include: - -1. Compiling with ``-ggdb -O0`` is recommended but not required. - -2. ``HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2 HSA_ENABLE_DEBUG=1 ./my_program`` - -When the debug agent traps the fault, it produces verbose output of all -wavefront registers and memory content. Importantly, it also prints -something similar to the following: - -.. code-block:: text - - Disassembly for function vector_add_assert_trap(int*, int*, int*): - - code object: - file:////rocm-debug-agent/build/test/rocm-debug-agent-test#offset=14309&size=31336 - - loaded at: [0x7fd4f100c000-0x7fd4f100e070] - -The kernel name and the code object file should be listed. In the -example above, the kernel name is vector_add_assert_trap, but this might -also look like: - -.. code-block:: text - - Disassembly for function memory:///path/to/codeobject#offset=1234&size=567: - -In this case, it's an in-memory kernel that was generated at runtime. -Using the environment variable ``ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects"`` -will have the debug agent save all code objects to the current directory. 
Use ``--save-code-objects=[DIR]`` to save them in another location.

The code objects will be renamed from the URI format with special
characters replaced by ``_``. Use ``llvm-objdump`` to disassemble the
indicated in-memory code object that has been saved to disk. The name of
the kernel is often found in the disassembled code object.

.. code-block:: shell

   llvm-objdump --disassemble-all path/to/code-object.co

Disabling memory caching strategies within the ROCm stack and PyTorch is
recommended, where possible. This gives the debug agent the best chance
of finding the memory fault where it originates. Otherwise, it could be
masked by writing past the end of a cached block within a larger
allocation.

.. code-block:: text

   PYTORCH_NO_HIP_MEMORY_CACHING=1

   HSA_DISABLE_FRAGMENT_ALLOCATOR=1

.. _mi300x-compute-kernel-occ:

Compute the occupancy of a kernel
---------------------------------

1. Get the VGPR count by searching for ``.vgpr_count`` in the ISA (call this
   value ``N``).

2. Get the allocated LDS using the following steps (call this value ``L``).

   a. ``export MLIR_ENABLE_DUMP=1``

   b. ``rm -rf ~/.triton/cache``

   c. ``python kernel.py 2>&1 | grep "triton_gpu.shared = " | tail -n 1``

   d. You should see something like ``triton_gpu.shared = 65536``, indicating
      65536 bytes of LDS are allocated for the kernel.

3. Get the number of waves per workgroup using the following steps (call this
   value ``nW``).

   a. ``export MLIR_ENABLE_DUMP=1``

   b. ``rm -rf ~/.triton/cache``

   c. ``python kernel.py 2>&1 | grep "triton_gpu.num-warps" | tail -n 1``

   d. You should see something like ``"triton_gpu.num-warps" = 8``, indicating 8
      waves per workgroup.

4. Compute the occupancy limited by VGPR usage based on ``N`` according to the
   :ref:`preceding table `. Call the resulting waves per
   EU ``occ_vgpr``.

5. Compute the occupancy limited by LDS based on ``L``: ``occ_lds = floor(65536 / L)``.

6. The occupancy is then ``occ = min(floor(occ_vgpr * 4 / nW), occ_lds) * nW / 4``.

   a. ``occ_vgpr * 4`` gives the total number of waves on all 4 execution units (SIMDs)
      per CU.

   b. ``floor(occ_vgpr * 4 / nW)`` gives the occupancy of workgroups per CU
      regarding VGPR usage.

   c. The true ``occ`` is the minimum of the two.

Find the full ``occ.sh`` at
``__.

Special considerations
======================

Multi-GPU communications
------------------------

Because of the characteristics of MI300X inter-GPU communication and the
limited bandwidth between and among groups of 2 or 4 GPUs, avoid running
workloads that use 2- or 4-GPU collectives. It's optimal to either use a
single GPU (where no collective is required) or employ 8-GPU collectives.

Multi-node FSDP and RCCL settings
---------------------------------

When using PyTorch's FSDP (Fully Sharded Data Parallel) feature, the HIP
streams used by RCCL and the HIP streams used for compute kernels do not
always overlap well. As a workaround, it's recommended to use
high-priority HIP streams with RCCL.

To configure high-priority streams:

- Set the environment variable ``TORCH_NCCL_HIGH_PRIORITY=1`` to force all RCCL
  streams to be high-priority.

- Set the environment variable ``GPU_MAX_HW_QUEUES=2`` via the HIP runtime
  library.

Hardware efficiency is maximized with 4 or fewer HIP streams. These environment
variables limit the configuration to two compute streams and two RCCL streams,
aligning with this best practice.
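For reference, both settings are plain environment variables that can be exported
before launching the FSDP workload:

.. code-block:: shell

   # Force RCCL streams to high priority and cap the number of HIP hardware queues
   export TORCH_NCCL_HIGH_PRIORITY=1
   export GPU_MAX_HW_QUEUES=2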
-Additionally, RCCL is often pre-optimized for MI300 systems in production by querying the node -topology during startup, reducing the need for extensive manual tuning. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history.rst deleted file mode 100644 index f2c993377..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history.rst +++ /dev/null @@ -1,25 +0,0 @@ -:orphan: - -**************************************************** -SGLang inference performance testing version history -**************************************************** - -This table lists previous versions of the ROCm SGLang inference performance -testing environment. For detailed information about available models for -benchmarking, see the version-specific documentation. - -.. list-table:: - :header-rows: 1 - - * - Docker image tag - - Components - - Resources - - * - ``lmsysorg/sglang:v0.4.5-rocm630`` - - - * ROCm 6.3.0 - * SGLang 0.4.5 - * PyTorch 2.6.0 - - - * :doc:`Documentation <../sglang>` - * `Docker Hub `__ diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst deleted file mode 100644 index 68d7f66e7..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst +++ /dev/null @@ -1,445 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker-812: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.10.0_20250812-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - * - `ROCm `__ - - {{ unified_docker.rocm_version }} - - * - `vLLM `__ - - {{ unified_docker.vllm_version }} - - * - `PyTorch `__ - - {{ unified_docker.pytorch_version }} - - * - `hipBLASLt `__ - - {{ unified_docker.hipblaslt_version }} - -With this Docker image, you can quickly test the :ref:`expected -inference performance numbers ` for -MI300X series accelerators. - -What's new -========== - -The following is summary of notable changes since the :doc:`previous ROCm/vLLM Docker release `. - -* Upgraded to vLLM v0.10. - -* FP8 KV cache support via AITER. - -* Full graph capture support via AITER. - -Supported models -================ - -.. 
datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.10.0_20250812-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - .. _vllm-benchmark-available-models-812: - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
-
-
Model group
-
- {% for model_group in model_groups %} -
{{ model_group.group }}
- {% endfor %} -
-
- -
-
Model
-
- {% for model_group in model_groups %} - {% set models = model_group.models %} - {% for model in models %} - {% if models|length % 3 == 0 %} -
{{ model.model }}
- {% else %} -
{{ model.model }}
- {% endif %} - {% endfor %} - {% endfor %} -
-
-
- - .. _vllm-benchmark-vllm-812: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - -.. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - -.. _vllm-benchmark-performance-measurements-812: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and serving measurements for inferencing popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.10.0_20250812-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad-812: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The throughput and serving reports of the - model are collected in the following paths: ``{{ model.mad_tag }}_throughput.csv`` - and ``{{ model.mad_tag }}_serving.csv``. - - Although the :ref:`available models - ` are preconfigured to collect - offline throughput and online serving performance data, you can - also change the benchmarking parameters. See the standalone - benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled (see - ``__). To enable it, include - the ``--tunableop on`` argument in your run. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the - performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required scripts - - 1. Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - docker run -it \ - --device=/dev/kfd \ - --device=/dev/dri \ - --group-add video \ - --shm-size 16G \ - --security-opt seccomp=unconfined \ - --security-opt apparmor=unconfined \ - --cap-add=SYS_PTRACE \ - -v $(pwd):/workspace \ - --env HUGGINGFACE_HUB_CACHE=/workspace \ - --name test \ - {{ unified_docker.pull_tag }} - - 2. In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - 3. To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./run.sh \ - --config $CONFIG_CSV \ - --model_repo {{ model.model_repo }} \ - - - .. dropdown:: Benchmark options - :open: - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``--config`` - - ``configs/default.csv`` - - Run configs from the CSV for the chosen model repo and benchmark. - - * - - - ``configs/extended.csv`` - - - - * - - - ``configs/performance.csv`` - - - - * - ``--benchmark`` - - ``throughput`` - - Measure offline end-to-end throughput. - - * - - - ``serving`` - - Measure online serving performance. - - * - - - ``all`` - - Measure both throughput and serving. - - * - `` - - See `run.sh `__ for more info. - - Additional overrides to the config CSV. - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - For best performance, it's recommended to run with ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``. - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - .. 
rubric:: Benchmarking examples - - Here are some examples of running the benchmark with various options: - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - export MAD_MODEL_NAME={{ model.mad_tag }} - ./run.sh \ - --config configs/default.csv \ - --model_repo {{model.model_repo}} \ - --benchmark throughput - - Find the throughput benchmark report at ``./{{ model.mad_tag }}_throughput.csv``. - - * Serving benchmark - - Use this command to benchmark the serving performance of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: - - export MAD_MODEL_NAME={{ model.mad_tag }} - ./run.sh \ - --config configs/default.csv \ - --model_repo {{model.model_repo}} \ - --benchmark serving - - Find the serving benchmark report at ``./{{ model.mad_tag }}_serving.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Advanced usage -============== - -For information on experimental features and known issues related to ROCm optimization efforts on vLLM, -see the developer's guide at ``__. - -Reproducing the Docker image ----------------------------- - -To reproduce this ROCm/vLLM Docker image release, follow these steps: - -1. Clone the `vLLM repository `__. - - .. code-block:: shell - - git clone https://github.com/ROCm/vllm.git - -2. Checkout the specific release commit. - - .. code-block:: shell - - cd vllm - git checkout 340ea86dfe5955d6f9a9e767d6abab5aacf2c978 - -3. Build the Docker image. Replace ``vllm-rocm`` with your desired image tag. - - .. code-block:: shell - - docker build -f docker/Dockerfile.rocm -t vllm-rocm . - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst deleted file mode 100644 index a68618338..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst +++ /dev/null @@ -1,448 +0,0 @@ -:orphan: - -.. 
meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker-909: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.10.1_20250909-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - - The `ROCm vLLM Docker <{{ docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -With this Docker image, you can quickly test the :ref:`expected -inference performance numbers ` for -MI300X series accelerators. - -What's new -========== - -The following is summary of notable changes since the :doc:`previous ROCm/vLLM Docker release `. - -* Upgraded to vLLM v0.10.1. - -* Set ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1`` by default for better performance. - -* Set ``VLLM_ROCM_USE_AITER_RMSNORM=0`` by default to avoid various issues with torch compile. - -.. _vllm-benchmark-supported-models-909: - -Supported models -================ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.10.1_20250909-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - .. _vllm-benchmark-available-models-909: - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
- [raw HTML model selector: "Model" options for each {{ model_group.group }} and "Variant" options for each {{ model.model }}; markup not preserved]
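
   The selector above, like the rest of this page, is rendered by the ``datatemplate:yaml`` directive from the ``vllm_0.10.1_20250909-benchmark-models.yaml`` data file. As a rough sketch only -- the exact schema lives in that file, and the path below is an assumed location relative to the docs source tree -- the entries can be inspected with PyYAML using the field names the template references (``dockers``, ``model_groups``, ``mad_tag``, ``model_repo``, ``precision``, ``config``):

   .. code-block:: python

      # Rough sketch (not part of the docs build): inspect the YAML data file
      # rendered by the datatemplate directive above. The path is an assumed
      # location relative to the documentation source tree, and the field names
      # are the ones the Jinja template references (data.dockers,
      # data.model_groups, model.mad_tag, model.model_repo, model.config.*).
      import yaml  # PyYAML

      DATA_FILE = ("docs/data/how-to/rocm-for-ai/inference/previous-versions/"
                   "vllm_0.10.1_20250909-benchmark-models.yaml")

      with open(DATA_FILE) as f:
          data = yaml.safe_load(f)

      print(data["dockers"][0]["pull_tag"])
      for group in data["model_groups"]:
          for model in group["models"]:
              cfg = model.get("config", {})
              print(f'{model["mad_tag"]}: {model["model_repo"]} '
                    f'(tp={cfg.get("tp")}, dtype={cfg.get("dtype")})')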
- - .. _vllm-benchmark-vllm-909: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - {% if model.precision == "float8" and model.model_repo.startswith("amd") %} - This model uses FP8 quantization via `AMD Quark `__ for efficient inference on AMD accelerators. - {% endif %} - - {% endfor %} - {% endfor %} - -.. _vllm-benchmark-performance-measurements-909: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and serving measurements for inferencing popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.10.1_20250909-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad-909: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - The following run command is tailored to {{ model.model }}. - See :ref:`vllm-benchmark-supported-models-909` to switch to another available model. - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The throughput and serving reports of the - model are collected in the following paths: ``{{ model.mad_tag }}_throughput.csv`` - and ``{{ model.mad_tag }}_serving.csv``. - - Although the :ref:`available models - ` are preconfigured to collect - offline throughput and online serving performance data, you can - also change the benchmarking parameters. See the standalone - benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled (see - ``__). To enable it, include - the ``--tunableop on`` argument in your run. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the - performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - The following commands are optimized for {{ model.model }}. - See :ref:`vllm-benchmark-supported-models-909` to switch to another available model. - - .. seealso:: - - For more information on configuration, see the `config files - `__ - in the MAD repository. Refer to the `vLLM engine `__ - for descriptions of available configuration options - and `Benchmarking vLLM `__ for - additional benchmarking information. - - .. rubric:: Launch the container - - You can run the vLLM benchmark tool independently by starting the - `Docker container <{{ docker.docker_hub_url }}>`_ as shown - in the following snippet. - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - docker run -it \ - --device=/dev/kfd \ - --device=/dev/dri \ - --group-add video \ - --shm-size 16G \ - --security-opt seccomp=unconfined \ - --security-opt apparmor=unconfined \ - --cap-add=SYS_PTRACE \ - -v $(pwd):/workspace \ - --env HUGGINGFACE_HUB_CACHE=/workspace \ - --name test \ - {{ docker.pull_tag }} - - .. rubric:: Throughput command - - Use the following command to start the throughput benchmark. - - .. code-block:: shell - - model={{ model.model_repo }} - tp={{ model.config.tp }} - num_prompts=1024 - in=128 - out=128 - dtype={{ model.config.dtype }} - kv_cache_dtype={{ model.config.kv_cache_dtype }} - max_num_seqs=1024 - max_seq_len_to_capture={{ model.config.max_seq_len_to_capture }} - max_num_batched_tokens={{ model.config.max_num_batched_tokens }} - max_model_len={{ model.config.max_model_len }} - - vllm bench throughput --model $model \ - -tp $tp \ - --num-prompts $num_prompts \ - --input-len $in \ - --output-len $out \ - --dtype $dtype \ - --kv-cache-dtype $kv_cache_dtype \ - --max-num-seqs $max_num_seqs \ - --max-seq-len-to-capture $max_seq_len_to_capture \ - --max-num-batched-tokens $max_num_batched_tokens \ - --max-model-len $max_model_len \ - --trust-remote-code \ - --output-json ${model}_throughput.json \ - --gpu-memory-utilization 0.9 - - .. rubric:: Serving command - - 1. Start the server using the following command: - - .. 
code-block:: shell - - model={{ model.model_repo }} - tp={{ model.config.tp }} - dtype={{ model.config.dtype }} - kv_cache_dtype={{ model.config.kv_cache_dtype }} - max_num_seqs=256 - max_seq_len_to_capture={{ model.config.max_seq_len_to_capture }} - max_num_batched_tokens={{ model.config.max_num_batched_tokens }} - max_model_len={{ model.config.max_model_len }} - - vllm serve $model \ - -tp $tp \ - --dtype $dtype \ - --kv-cache-dtype $kv_cache_dtype \ - --max-num-seqs $max_num_seqs \ - --max-seq-len-to-capture $max_seq_len_to_capture \ - --max-num-batched-tokens $max_num_batched_tokens \ - --max-model-len $max_model_len \ - --no-enable-prefix-caching \ - --swap-space 16 \ - --disable-log-requests \ - --trust-remote-code \ - --gpu-memory-utilization 0.9 - - Wait until the model has loaded and the server is ready to accept requests. - - 2. On another terminal on the same machine, run the benchmark: - - .. code-block:: shell - - # Connect to the container - docker exec -it test bash - - # Wait for the server to start - until curl -s http://localhost:8000/v1/models; do sleep 30; done - - # Run the benchmark - model={{ model.model_repo }} - max_concurrency=1 - num_prompts=10 - in=128 - out=128 - vllm bench serve --model $model \ - --percentile-metrics "ttft,tpot,itl,e2el" \ - --dataset-name random \ - --ignore-eos \ - --max-concurrency $max_concurrency \ - --num-prompts $num_prompts \ - --random-input-len $in \ - --random-output-len $out \ - --trust-remote-code \ - --save-result \ - --result-filename ${model}_serving.json - - .. note:: - - For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, - try adding ``export VLLM_ROCM_USE_AITER=1`` to your commands. - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Advanced usage -============== - -For information on experimental features and known issues related to ROCm optimization efforts on vLLM, -see the developer's guide at ``__. - -Reproducing the Docker image ----------------------------- - -To reproduce this ROCm/vLLM Docker image release, follow these steps: - -1. Clone the `vLLM repository `__. - - .. code-block:: shell - - git clone https://github.com/ROCm/vllm.git - -2. Checkout the specific release commit. - - .. code-block:: shell - - cd vllm - git checkout 6663000a391911eba96d7864a26ac42b07f6ef29 - -3. Build the Docker image. Replace ``vllm-rocm`` with your desired image tag. - - .. code-block:: shell - - docker build -f docker/Dockerfile.rocm -t vllm-rocm . - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. 
- -- See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - a brief introduction to vLLM and optimization strategies. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst deleted file mode 100644 index 8a03f95d2..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst +++ /dev/null @@ -1,346 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified - ROCm Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -The `ROCm vLLM Docker `_ image offers -a prebuilt, optimized environment designed for validating large language model -(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This -ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the -MI300X accelerator and includes the following components: - -* `ROCm 6.2.0 `_ - -* `vLLM 0.4.3 `_ - -* `PyTorch 2.4.0 `_ - -* Tuning files (in CSV format) - -With this Docker image, you can quickly validate the expected inference -performance numbers on the MI300X accelerator. This topic also provides tips on -optimizing performance with popular AI models. - -.. _vllm-benchmark-vllm: - -.. note:: - - vLLM is a toolkit and library for LLM inference and - serving. It deploys the PagedAttention algorithm, which reduces memory - consumption and increases throughput by leveraging dynamic key and value - allocation in GPU memory. vLLM also incorporates many LLM acceleration - and quantization algorithms. In addition, AMD implements high-performance - custom kernels and modules in vLLM to enhance performance further. See - :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for more - information. - -Getting started -=============== - -Use the following procedures to reproduce the benchmark results on an -MI300X accelerator with the prebuilt vLLM Docker image. - -.. _vllm-benchmark-get-started: - -1. Disable NUMA auto-balancing. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - -2. Download the :ref:`ROCm vLLM Docker image `. - - Use the following command to pull the Docker image from Docker Hub. - - .. 
code-block:: shell - - docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 - -Once setup is complete, you can choose between two options to reproduce the -benchmark results: - -- :ref:`MAD-integrated benchmarking ` - -- :ref:`Standalone benchmarking ` - -.. _vllm-benchmark-mad-v043: - -MAD-integrated benchmarking -=========================== - -Clone the ROCm Model Automation and Dashboarding (``__) repository to a local -directory and install the required packages on the host machine. - -.. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - -Use this command to run a performance benchmark test of the Llama 3.1 8B model -on one GPU with ``float16`` data type in the host machine. - -.. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800 - -ROCm MAD launches a Docker container with the name -``container_ci-pyt_vllm_llama-3.1-8b``. The latency and throughput reports of the -model are collected in the following path: ``~/MAD/reports_float16/`` - -Although the following eight models are pre-configured to collect latency and -throughput performance data, users can also change the benchmarking parameters. -Refer to the :ref:`Standalone benchmarking ` section. - -Available models ----------------- - -.. hlist:: - :columns: 3 - - * ``pyt_vllm_llama-3.1-8b`` - - * ``pyt_vllm_llama-3.1-70b`` - - * ``pyt_vllm_llama-3.1-405b`` - - * ``pyt_vllm_llama-2-7b`` - - * ``pyt_vllm_mistral-7b`` - - * ``pyt_vllm_qwen2-7b`` - - * ``pyt_vllm_jais-13b`` - - * ``pyt_vllm_jais-30b`` - -.. _vllm-benchmark-standalone-v043: - -Standalone benchmarking -======================= - -You can run the vLLM benchmark tool independently by starting the -:ref:`Docker container ` as shown in the following -snippet. - -.. code-block:: - - docker pull rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name unified_docker_vllm rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50 - -In the Docker container, clone the ROCm MAD repository and navigate to the -benchmark scripts directory at ``~/MAD/scripts/vllm``. - -.. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - -Multiprocessing distributed executor --------------------------------------- - -To optimize vLLM performance, add the multiprocessing API server argument ``--distributed-executor-backend mp``. - -Command -^^^^^^^^^^^^^^^^^^^^^^^^^ - -To start the benchmark, use the following command with the appropriate options. -See :ref:`Options ` for the list of -options and their descriptions. - -.. code-block:: shell - - ./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype - -See the :ref:`examples ` for more information. - -.. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - -.. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: shell - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - -.. 
_vllm-benchmark-standalone-options-v043: - -Options -^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$model_repo`` - - ``meta-llama/Meta-Llama-3.1-8B-Instruct`` - - Llama 3.1 8B - - * - (``float16``) - - ``meta-llama/Meta-Llama-3.1-70B-Instruct`` - - Llama 3.1 70B - - * - - - ``meta-llama/Meta-Llama-3.1-405B-Instruct`` - - Llama 3.1 405B - - * - - - ``meta-llama/Llama-2-7b-chat-hf`` - - Llama 2 7B - - * - - - ``mistralai/Mixtral-8x7B-Instruct-v0.1`` - - Mixtral 8x7B - - * - - - ``mistralai/Mixtral-8x22B-Instruct-v0.1`` - - Mixtral 8x22B - - * - - - ``mistralai/Mistral-7B-Instruct-v0.3`` - - Mixtral 7B - - * - - - ``Qwen/Qwen2-7B-Instruct`` - - Qwen2 7B - - * - - - ``core42/jais-13b-chat`` - - JAIS 13B - - * - - - ``core42/jais-30b-chat-v3`` - - JAIS 30B - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` - - Data type - -.. _vllm-benchmark-run-benchmark-v043: - -Running the benchmark on the MI300X accelerator ------------------------------------------------ - -Here are some examples of running the benchmark with various options. -See :ref:`Options ` for the list of -options and their descriptions. - -Latency benchmark example -^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use this command to benchmark the latency of the Llama 3.1 8B model on one GPU with the ``float16`` data type. - -.. code-block:: - - ./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16 - -Find the latency report at: - -- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv`` - -Throughput benchmark example -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use this command to benchmark the throughput of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types. - -.. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16 - -Find the throughput reports at: - -- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv`` - -.. raw:: html - - - -.. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - -Further reading -=============== - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. 
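
The latency and throughput summaries referenced above are plain CSV files under ``./reports_<datatype>/summary/``. As a minimal, version-agnostic sketch for collecting them -- column names depend on the version of ``vllm_benchmark_report.sh``, so they are read from each file's header rather than assumed:

.. code-block:: python

   # Minimal sketch: gather the summary CSVs written by vllm_benchmark_report.sh.
   # Columns vary by script version, so they are taken from each file's header.
   import csv
   import glob

   for path in sorted(glob.glob("./reports_*/summary/*_report.csv")):
       with open(path, newline="") as f:
           rows = list(csv.DictReader(f))
       columns = list(rows[0].keys()) if rows else []
       print(f"{path}: {len(rows)} rows, columns={columns}")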
diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst deleted file mode 100644 index e3111c7fa..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst +++ /dev/null @@ -1,416 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified - ROCm Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -The `ROCm vLLM Docker `_ image offers -a prebuilt, optimized environment designed for validating large language model -(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This -ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the -MI300X accelerator and includes the following components: - -* `ROCm 6.2.1 `_ - -* `vLLM 0.6.4 `_ - -* `PyTorch 2.5.0 `_ - -* Tuning files (in CSV format) - -With this Docker image, you can quickly validate the expected inference -performance numbers on the MI300X accelerator. This topic also provides tips on -optimizing performance with popular AI models. - -.. hlist:: - :columns: 6 - - * Llama 3.1 8B - - * Llama 3.1 70B - - * Llama 3.1 405B - - * Llama 2 7B - - * Llama 2 70B - - * Mixtral 8x7B - - * Mixtral 8x22B - - * Mixtral 7B - - * Qwen2 7B - - * Qwen2 72B - - * JAIS 13B - - * JAIS 30B - -.. _vllm-benchmark-vllm: - -.. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - -Getting started -=============== - -Use the following procedures to reproduce the benchmark results on an -MI300X accelerator with the prebuilt vLLM Docker image. - -.. _vllm-benchmark-get-started: - -1. Disable NUMA auto-balancing. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - -2. Download the :ref:`ROCm vLLM Docker image `. - - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 - -Once setup is complete, you can choose between two options to reproduce the -benchmark results: - -- :ref:`MAD-integrated benchmarking ` - -- :ref:`Standalone benchmarking ` - -.. _vllm-benchmark-mad-v064: - -MAD-integrated benchmarking -=========================== - -Clone the ROCm Model Automation and Dashboarding (``__) repository to a local -directory and install the required packages on the host machine. - -.. 
code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - -Use this command to run a performance benchmark test of the Llama 3.1 8B model -on one GPU with ``float16`` data type in the host machine. - -.. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800 - -ROCm MAD launches a Docker container with the name -``container_ci-pyt_vllm_llama-3.1-8b``. The latency and throughput reports of the -model are collected in the following path: ``~/MAD/reports_float16/``. - -Although the following models are preconfigured to collect latency and -throughput performance data, you can also change the benchmarking parameters. -Refer to the :ref:`Standalone benchmarking ` section. - -Available models ----------------- - -.. hlist:: - :columns: 3 - - * ``pyt_vllm_llama-3.1-8b`` - - * ``pyt_vllm_llama-3.1-70b`` - - * ``pyt_vllm_llama-3.1-405b`` - - * ``pyt_vllm_llama-2-7b`` - - * ``pyt_vllm_llama-2-70b`` - - * ``pyt_vllm_mixtral-8x7b`` - - * ``pyt_vllm_mixtral-8x22b`` - - * ``pyt_vllm_mistral-7b`` - - * ``pyt_vllm_qwen2-7b`` - - * ``pyt_vllm_qwen2-72b`` - - * ``pyt_vllm_jais-13b`` - - * ``pyt_vllm_jais-30b`` - - * ``pyt_vllm_llama-3.1-8b_fp8`` - - * ``pyt_vllm_llama-3.1-70b_fp8`` - - * ``pyt_vllm_llama-3.1-405b_fp8`` - - * ``pyt_vllm_mixtral-8x7b_fp8`` - - * ``pyt_vllm_mixtral-8x22b_fp8`` - -.. _vllm-benchmark-standalone-v064: - -Standalone benchmarking -======================= - -You can run the vLLM benchmark tool independently by starting the -:ref:`Docker container ` as shown in the following -snippet. - -.. code-block:: - - docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.4 rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 - -In the Docker container, clone the ROCm MAD repository and navigate to the -benchmark scripts directory at ``~/MAD/scripts/vllm``. - -.. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - -Command -------- - -To start the benchmark, use the following command with the appropriate options. -See :ref:`Options ` for the list of -options and their descriptions. - -.. code-block:: shell - - ./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype - -See the :ref:`examples ` for more information. - -.. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - -.. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: shell - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - -.. _vllm-benchmark-standalone-v064-options: - -Options -------- - -.. 
list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$model_repo`` - - ``meta-llama/Meta-Llama-3.1-8B-Instruct`` - - Llama 3.1 8B - - * - (``float16``) - - ``meta-llama/Meta-Llama-3.1-70B-Instruct`` - - Llama 3.1 70B - - * - - - ``meta-llama/Meta-Llama-3.1-405B-Instruct`` - - Llama 3.1 405B - - * - - - ``meta-llama/Llama-2-7b-chat-hf`` - - Llama 2 7B - - * - - - ``meta-llama/Llama-2-70b-chat-hf`` - - Llama 2 70B - - * - - - ``mistralai/Mixtral-8x7B-Instruct-v0.1`` - - Mixtral 8x7B - - * - - - ``mistralai/Mixtral-8x22B-Instruct-v0.1`` - - Mixtral 8x22B - - * - - - ``mistralai/Mistral-7B-Instruct-v0.3`` - - Mixtral 7B - - * - - - ``Qwen/Qwen2-7B-Instruct`` - - Qwen2 7B - - * - - - ``Qwen/Qwen2-72B-Instruct`` - - Qwen2 72B - - * - - - ``core42/jais-13b-chat`` - - JAIS 13B - - * - - - ``core42/jais-30b-chat-v3`` - - JAIS 30B - - * - ``$model_repo`` - - ``amd/Meta-Llama-3.1-8B-Instruct-FP8-KV`` - - Llama 3.1 8B - - * - (``float8``) - - ``amd/Meta-Llama-3.1-70B-Instruct-FP8-KV`` - - Llama 3.1 70B - - * - - - ``amd/Meta-Llama-3.1-405B-Instruct-FP8-KV`` - - Llama 3.1 405B - - * - - - ``amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV`` - - Mixtral 8x7B - - * - - - ``amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV`` - - Mixtral 8x22B - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - -.. _vllm-benchmark-run-benchmark-v064: - -Running the benchmark on the MI300X accelerator ------------------------------------------------ - -Here are some examples of running the benchmark with various options. -See :ref:`Options ` for the list of -options and their descriptions. - -Example 1: latency benchmark -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use this command to benchmark the latency of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types. - -.. code-block:: - - ./vllm_benchmark_report.sh -s latency -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16 - ./vllm_benchmark_report.sh -s latency -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8 - -Find the latency reports at: - -- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_latency_report.csv`` - -- ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_latency_report.csv`` - -Example 2: throughput benchmark -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use this command to benchmark the throughput of the Llama 3.1 8B model on one GPU with the ``float16`` and ``float8`` data types. - -.. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m meta-llama/Meta-Llama-3.1-8B-Instruct -g 1 -d float16 - ./vllm_benchmark_report.sh -s throughput -m amd/Meta-Llama-3.1-8B-Instruct-FP8-KV -g 1 -d float8 - -Find the throughput reports at: - -- ``./reports_float16/summary/Meta-Llama-3.1-8B-Instruct_throughput_report.csv`` - -- ``./reports_float8/summary/Meta-Llama-3.1-8B-Instruct-FP8-KV_throughput_report.csv`` - -.. raw:: html - - - -.. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. 
math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - -Further reading -=============== - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst deleted file mode 100644 index 996a7ddb8..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst +++ /dev/null @@ -1,461 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -*********************************************************** -LLM inference performance validation on AMD Instinct MI300X -*********************************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -The `ROCm vLLM Docker `_ image offers -a prebuilt, optimized environment for validating large language model (LLM) -inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM -Docker image integrates vLLM and PyTorch tailored specifically for the MI300X -accelerator and includes the following components: - -* `ROCm 6.3.1 `_ - -* `vLLM 0.6.6 `_ - -* `PyTorch 2.7.0 (2.7.0a0+git3a58512) `_ - -With this Docker image, you can quickly validate the expected inference -performance numbers for the MI300X accelerator. This topic also provides tips on -optimizing performance with popular AI models. For more information, see the lists of -:ref:`available models for MAD-integrated benchmarking ` -and :ref:`standalone benchmarking `. - -.. _vllm-benchmark-vllm: - -.. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - -Getting started -=============== - -Use the following procedures to reproduce the benchmark results on an -MI300X accelerator with the prebuilt vLLM Docker image. - -.. _vllm-benchmark-get-started: - -1. Disable NUMA auto-balancing. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. 
code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - -2. Download the :ref:`ROCm vLLM Docker image `. - - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 - -Once the setup is complete, choose between two options to reproduce the -benchmark results: - -- :ref:`MAD-integrated benchmarking ` - -- :ref:`Standalone benchmarking ` - -.. _vllm-benchmark-mad-v066: - -MAD-integrated benchmarking -=========================== - -Clone the ROCm Model Automation and Dashboarding (``__) repository to a local -directory and install the required packages on the host machine. - -.. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - -Use this command to run a performance benchmark test of the Llama 3.1 8B model -on one GPU with ``float16`` data type in the host machine. - -.. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800 - -ROCm MAD launches a Docker container with the name -``container_ci-pyt_vllm_llama-3.1-8b``. The latency and throughput reports of the -model are collected in the following path: ``~/MAD/reports_float16/``. - -Although the following models are preconfigured to collect latency and -throughput performance data, you can also change the benchmarking parameters. -Refer to the :ref:`Standalone benchmarking ` section. - -.. _vllm-benchmark-mad-v066-models: - -Available models ----------------- - -.. list-table:: - :header-rows: 1 - :widths: 2, 3 - - * - Model name - - Tag - - * - `Llama 3.1 8B `_ - - ``pyt_vllm_llama-3.1-8b`` - - * - `Llama 3.1 70B `_ - - ``pyt_vllm_llama-3.1-70b`` - - * - `Llama 3.1 405B `_ - - ``pyt_vllm_llama-3.1-405b`` - - * - `Llama 3.2 11B Vision `_ - - ``pyt_vllm_llama-3.2-11b-vision-instruct`` - - * - `Llama 2 7B `__ - - ``pyt_vllm_llama-2-7b`` - - * - `Llama 2 70B `__ - - ``pyt_vllm_llama-2-70b`` - - * - `Mixtral MoE 8x7B `_ - - ``pyt_vllm_mixtral-8x7b`` - - * - `Mixtral MoE 8x22B `_ - - ``pyt_vllm_mixtral-8x22b`` - - * - `Mistral 7B `_ - - ``pyt_vllm_mistral-7b`` - - * - `Qwen2 7B `_ - - ``pyt_vllm_qwen2-7b`` - - * - `Qwen2 72B `_ - - ``pyt_vllm_qwen2-72b`` - - * - `JAIS 13B `_ - - ``pyt_vllm_jais-13b`` - - * - `JAIS 30B `_ - - ``pyt_vllm_jais-30b`` - - * - `DBRX Instruct `_ - - ``pyt_vllm_dbrx-instruct`` - - * - `Gemma 2 27B `_ - - ``pyt_vllm_gemma-2-27b`` - - * - `C4AI Command R+ 08-2024 `_ - - ``pyt_vllm_c4ai-command-r-plus-08-2024`` - - * - `DeepSeek MoE 16B `_ - - ``pyt_vllm_deepseek-moe-16b-chat`` - - * - `Llama 3.1 70B FP8 `_ - - ``pyt_vllm_llama-3.1-70b_fp8`` - - * - `Llama 3.1 405B FP8 `_ - - ``pyt_vllm_llama-3.1-405b_fp8`` - - * - `Mixtral MoE 8x7B FP8 `_ - - ``pyt_vllm_mixtral-8x7b_fp8`` - - * - `Mixtral MoE 8x22B FP8 `_ - - ``pyt_vllm_mixtral-8x22b_fp8`` - - * - `Mistral 7B FP8 `_ - - ``pyt_vllm_mistral-7b_fp8`` - - * - `DBRX Instruct FP8 `_ - - ``pyt_vllm_dbrx_fp8`` - - * - `C4AI Command R+ 08-2024 FP8 `_ - - ``pyt_vllm_command-r-plus_fp8`` - -.. _vllm-benchmark-standalone-v066: - -Standalone benchmarking -======================= - -You can run the vLLM benchmark tool independently by starting the -:ref:`Docker container ` as shown in the following -snippet. 
- -.. code-block:: - - docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.6 rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 - -In the Docker container, clone the ROCm MAD repository and navigate to the -benchmark scripts directory at ``~/MAD/scripts/vllm``. - -.. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - -Command -------- - -To start the benchmark, use the following command with the appropriate options. -See :ref:`Options ` for the list of -options and their descriptions. - -.. code-block:: shell - - ./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype - -See the :ref:`examples ` for more information. - -.. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - -.. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: shell - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - -.. _vllm-benchmark-standalone-v066-options: - -Options and available models ----------------------------- - -.. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$model_repo`` - - ``meta-llama/Llama-3.1-8B-Instruct`` - - `Llama 3.1 8B `_ - - * - (``float16``) - - ``meta-llama/Llama-3.1-70B-Instruct`` - - `Llama 3.1 70B `_ - - * - - - ``meta-llama/Llama-3.1-405B-Instruct`` - - `Llama 3.1 405B `_ - - * - - - ``meta-llama/Llama-3.2-11B-Vision-Instruct`` - - `Llama 3.2 11B Vision `_ - - * - - - ``meta-llama/Llama-2-7b-chat-hf`` - - `Llama 2 7B `__ - - * - - - ``meta-llama/Llama-2-70b-chat-hf`` - - `Llama 2 70B `__ - - * - - - ``mistralai/Mixtral-8x7B-Instruct-v0.1`` - - `Mixtral MoE 8x7B `_ - - * - - - ``mistralai/Mixtral-8x22B-Instruct-v0.1`` - - `Mixtral MoE 8x22B `_ - - * - - - ``mistralai/Mistral-7B-Instruct-v0.3`` - - `Mistral 7B `_ - - * - - - ``Qwen/Qwen2-7B-Instruct`` - - `Qwen2 7B `_ - - * - - - ``Qwen/Qwen2-72B-Instruct`` - - `Qwen2 72B `_ - - * - - - ``core42/jais-13b-chat`` - - `JAIS 13B `_ - - * - - - ``core42/jais-30b-chat-v3`` - - `JAIS 30B `_ - - * - - - ``databricks/dbrx-instruct`` - - `DBRX Instruct `_ - - * - - - ``google/gemma-2-27b`` - - `Gemma 2 27B `_ - - * - - - ``CohereForAI/c4ai-command-r-plus-08-2024`` - - `C4AI Command R+ 08-2024 `_ - - * - - - ``deepseek-ai/deepseek-moe-16b-chat`` - - `DeepSeek MoE 16B `_ - - * - ``$model_repo`` - - ``amd/Llama-3.1-70B-Instruct-FP8-KV`` - - `Llama 3.1 70B FP8 `_ - - * - (``float8``) - - ``amd/Llama-3.1-405B-Instruct-FP8-KV`` - - `Llama 3.1 405B FP8 `_ - - * - - - ``amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV`` - - `Mixtral MoE 8x7B FP8 `_ - - * - - - ``amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV`` - - `Mixtral MoE 8x22B FP8 `_ - - * - - - ``amd/Mistral-7B-v0.1-FP8-KV`` - - `Mistral 7B FP8 `_ - - * - - - ``amd/dbrx-instruct-FP8-KV`` - - `DBRX Instruct FP8 `_ - - * - - - ``amd/c4ai-command-r-plus-FP8-KV`` - - `C4AI Command R+ 08-2024 FP8 `_ 
- - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - -.. _vllm-benchmark-run-benchmark-v066: - -Running the benchmark on the MI300X accelerator ------------------------------------------------ - -Here are some examples of running the benchmark with various options. -See :ref:`Options ` for the list of -options and their descriptions. - -Example 1: latency benchmark -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use this command to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types. - -.. code-block:: - - ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16 - ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8 - -Find the latency reports at: - -- ``./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv`` - -- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv`` - -Example 2: throughput benchmark -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types. - -.. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16 - ./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8 - -Find the throughput reports at: - -- ``./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv`` - -- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv`` - -.. raw:: html - - - -.. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - -Further reading -=============== - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst deleted file mode 100644 index 2bd3910a0..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst +++ /dev/null @@ -1,329 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. 
- :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.7.3_20250325-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerator. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - * `ROCm {{ unified_docker.rocm_version }} `_ - - * `vLLM {{ unified_docker.vllm_version }} `_ - - * `PyTorch {{ unified_docker.pytorch_version }} `_ - - * `hipBLASLt {{ unified_docker.hipblaslt_version }} `_ - - With this Docker image, you can quickly test the :ref:`expected - inference performance numbers ` for - MI300X series accelerators. - - .. _vllm-benchmark-available-models-v073: - - Available models - ================ - - .. raw:: html - -
- [raw HTML model selector: "Model" options for each {{ model_group.group }} and "Model variant" options for each {{ model.model }}; markup not preserved]
- - .. _vllm-benchmark-vllm: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - .. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - - .. _vllm-benchmark-performance-measurements-v073: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for inferencing - popular AI models. - - .. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - - Advanced features and known issues - ================================== - - For information on experimental features and known issues related to ROCm optimization efforts on vLLM, - see the developer's guide at ``__. - - Getting started - =============== - - Use the following procedures to reproduce the benchmark results on an - MI300X accelerator with the prebuilt vLLM Docker image. - - .. _vllm-benchmark-get-started: - - 1. Disable NUMA auto-balancing. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - 2. Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad-v073: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - .. tab-item:: Standalone benchmarking - - Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: - - docker pull {{ unified_docker.pull_tag }} - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }} - - In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}} - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - .. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - Here are some examples of running the benchmark with various options. - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with the :literal:`{{model.precision}}` data type. - - .. code-block:: - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to throughput the latency of the {{model.model}} model on eight GPUs with the :literal:`{{model.precision}}` data type. - - .. code-block:: shell - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. 
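- 
-                          * Combined latency and throughput benchmark
- 
-                            Per the options table above, the script also accepts ``all`` as the ``-s`` test option to
-                            measure both decoding latency and generation throughput in one pass. A minimal sketch,
-                            assuming the same model repository and precision placeholders used in the examples above:
- 
-                            .. code-block:: shell
- 
-                               # measure both decoding latency and generation throughput for {{model.model}};
-                               # use -g 1 for a single GPU or -g 8 for the eight-GPU configuration
-                               ./vllm_benchmark_report.sh -s all -m {{model.model_repo}} -g 8 -d {{model.precision}}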
- - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst deleted file mode 100644 index e096f965d..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst +++ /dev/null @@ -1,345 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. _vllm-benchmark-unified-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.8.3_20250415-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - * `ROCm {{ unified_docker.rocm_version }} `_ - - * `vLLM {{ unified_docker.vllm_version }} `_ - - * `PyTorch {{ unified_docker.pytorch_version }} `_ - - * `hipBLASLt {{ unified_docker.hipblaslt_version }} `_ - - With this Docker image, you can quickly test the :ref:`expected - inference performance numbers ` for - MI300X series accelerators. - - .. _vllm-benchmark-available-models-v083: - - Supported models - ================ - - .. raw:: html - -
-         <!-- Model selector: "Model" dropdown (one option per model group) and "Model variant" dropdown (one option per model) -->
- - .. _vllm-benchmark-vllm: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - .. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - - .. _vllm-benchmark-performance-measurements-v083: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for inferencing - popular AI models. - - .. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - - Advanced features and known issues - ================================== - - For information on experimental features and known issues related to ROCm optimization efforts on vLLM, - see the developer's guide at ``__. - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled - (see - ``__). To - enable it, edit the default run behavior in the ``models.json`` - configuration before running inference -- update the model's run - ``args`` by changing ``--tunableop off`` to ``--tunableop on``. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: - - docker pull {{ unified_docker.pull_tag }} - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }} - - In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}} - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - .. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - Here are some examples of running the benchmark with various options. - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. 
code-block:: - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst deleted file mode 100644 index 79e6c12dd..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst +++ /dev/null @@ -1,354 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.8.5_20250513-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. 
This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - * `ROCm {{ unified_docker.rocm_version }} `_ - - * `vLLM {{ unified_docker.vllm_version }} `_ - - * `PyTorch {{ unified_docker.pytorch_version }} `_ - - * `hipBLASLt {{ unified_docker.hipblaslt_version }} `_ - - With this Docker image, you can quickly test the :ref:`expected - inference performance numbers ` for - MI300X series accelerators. - - .. _vllm-benchmark-available-models-v085-20250513: - - Supported models - ================ - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
-         <!-- Model selector: "Model group" dropdown (one option per model group) and "Model" dropdown (one option per model) -->
- - .. _vllm-benchmark-vllm: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - .. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - - .. _vllm-benchmark-performance-measurements-v085-20250513: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for inferencing - popular AI models. - - .. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - - Advanced features and known issues - ================================== - - For information on experimental features and known issues related to ROCm optimization efforts on vLLM, - see the developer's guide at ``__. - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled - (see - ``__). To - enable it, edit the default run behavior in the ``models.json`` - configuration before running inference -- update the model's run - ``args`` by changing ``--tunableop off`` to ``--tunableop on``. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: - - docker pull {{ unified_docker.pull_tag }} - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }} - - In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}} - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - .. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - Here are some examples of running the benchmark with various options. - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. 
code-block:: - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst deleted file mode 100644 index 0bcdf3ed4..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst +++ /dev/null @@ -1,355 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.8.5_20250521-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. 
This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - * `ROCm {{ unified_docker.rocm_version }} `_ - - * `vLLM {{ unified_docker.vllm_version }} `_ - - * `PyTorch {{ unified_docker.pytorch_version }} `_ - - * `hipBLASLt {{ unified_docker.hipblaslt_version }} `_ - - With this Docker image, you can quickly test the :ref:`expected - inference performance numbers ` for - MI300X series accelerators. - - .. _vllm-benchmark-available-models-v085-20250521: - - Supported models - ================ - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
-         <!-- Model selector: "Model group" dropdown (one option per model group) and "Model" dropdown (one option per model) -->
- - .. _vllm-benchmark-vllm: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - .. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - - .. _vllm-benchmark-performance-measurements-v085-20250521: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for inferencing - popular AI models. - - .. note:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. - - Advanced features and known issues - ================================== - - For information on experimental features and known issues related to ROCm optimization efforts on vLLM, - see the developer's guide at ``__. - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled - (see - ``__). To - enable it, edit the default run behavior in the ``models.json`` - configuration before running inference -- update the model's run - ``args`` by changing ``--tunableop off`` to ``--tunableop on``. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: - - docker pull {{ unified_docker.pull_tag }} - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }} - - In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}} - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - .. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - Here are some examples of running the benchmark with various options. - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. 
code-block:: - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. - diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst deleted file mode 100644 index 81b7edfec..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst +++ /dev/null @@ -1,353 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.9.0.1_20250605-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. 
This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - * `ROCm {{ unified_docker.rocm_version }} `_ - - * `vLLM {{ unified_docker.vllm_version }} `_ - - * `PyTorch {{ unified_docker.pytorch_version }} `_ - - * `hipBLASLt {{ unified_docker.hipblaslt_version }} `_ - - With this Docker image, you can quickly test the :ref:`expected - inference performance numbers ` for - MI300X series accelerators. - - .. _vllm-benchmark-available-models-v0901-20250605: - - Supported models - ================ - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
-         <!-- Model selector: "Model group" dropdown (one option per model group) and "Model" dropdown (one option per model) -->
- - .. _vllm-benchmark-vllm: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - .. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - - .. _vllm-benchmark-performance-measurements-v0901-20250605: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for inferencing popular AI models. - - .. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - - Advanced features and known issues - ================================== - - For information on experimental features and known issues related to ROCm optimization efforts on vLLM, - see the developer's guide at ``__. - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled - (see - ``__). To - enable it, edit the default run behavior in the ``models.json`` - configuration before running inference -- update the model's run - ``args`` by changing ``--tunableop off`` to ``--tunableop on``. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: - - docker pull {{ unified_docker.pull_tag }} - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }} - - In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}} - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - .. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - Here are some examples of running the benchmark with various options. - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. 
code-block:: - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X accelerators, see `AMD Instinct MI300X system optimization `_ - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst deleted file mode 100644 index a482c27c7..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst +++ /dev/null @@ -1,353 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker-702: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.9.1_20250702-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. 
This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - * `ROCm {{ unified_docker.rocm_version }} `_ - - * `vLLM {{ unified_docker.vllm_version }} `_ - - * `PyTorch {{ unified_docker.pytorch_version }} `_ - - * `hipBLASLt {{ unified_docker.hipblaslt_version }} `_ - - With this Docker image, you can quickly test the :ref:`expected - inference performance numbers ` for - MI300X series accelerators. - - .. _vllm-benchmark-available-models-20250702: - - Supported models - ================ - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
-         <!-- Model selector: "Model group" dropdown (one option per model group) and "Model" dropdown (one option per model) -->
- - .. _vllm-benchmark-vllm-702: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - .. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - - .. _vllm-benchmark-performance-measurements-20250702: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for inferencing popular AI models. - - .. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - - Advanced features and known issues - ================================== - - For information on experimental features and known issues related to ROCm optimization efforts on vLLM, - see the developer's guide at ``__. - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad-702: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled - (see - ``__). To - enable it, edit the default run behavior in the ``models.json`` - configuration before running inference -- update the model's run - ``args`` by changing ``--tunableop off`` to ``--tunableop on``. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: - - docker pull {{ unified_docker.pull_tag }} - docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }} - - In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - To start the benchmark, use the following command with the appropriate options. - - .. code-block:: - - ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}} - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - .. note:: - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - Here are some examples of running the benchmark with various options. - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal`{{model.precision}}` precision. - - .. 
code-block:: - - ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - ./vllm_benchmark_report.sh -s throughput -m {{model.model_repo}} -g 8 -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst deleted file mode 100644 index 9f6d001ad..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst +++ /dev/null @@ -1,450 +0,0 @@ -:orphan: - -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm vLLM Docker image. - :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - inference performance documentation. See :doc:`../vllm` for the latest version. - -.. _vllm-benchmark-unified-docker-715: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.9.1_20250715-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers - a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. 
This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - * - `ROCm `__ - - {{ unified_docker.rocm_version }} - - * - `vLLM `__ - - {{ unified_docker.vllm_version }} - - * - `PyTorch `__ - - {{ unified_docker.pytorch_version }} - - * - `hipBLASLt `__ - - {{ unified_docker.hipblaslt_version }} - -With this Docker image, you can quickly test the :ref:`expected -inference performance numbers ` for -MI300X series accelerators. - -What's new -========== - -The following is summary of notable changes since the :doc:`previous ROCm/vLLM Docker release `. - -* The ``--compilation-config-parameter`` is no longer required as its options are now enabled by default. - This parameter has been removed from the benchmarking script. - -* Resolved Llama 3.1 405 B custom all-reduce issue, eliminating the need for ``--disable-custom-all-reduce``. - This parameter has been removed from the benchmarking script. - -* Fixed a ``+rms_norm`` custom kernel issue. - -* Added quick reduce functionality. Set ``VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP`` to enable; supported modes are ``FP``, ``INT8``, ``INT6``, ``INT4``. - -* Implemented a workaround to potentially mitigate GPU crashes experienced with the Command R+ model, pending a driver fix. - -Supported models -================ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.9.1_20250715-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - .. _vllm-benchmark-available-models-715: - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
- <!-- Model group and Model selector dropdowns (options generated from model_groups / model.model) -->
- - .. _vllm-benchmark-vllm-715: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - -.. note:: - - vLLM is a toolkit and library for LLM inference and serving. AMD implements - high-performance custom kernels and modules in vLLM to enhance performance. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - more information. - -.. _vllm-benchmark-performance-measurements-715: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and latency measurements for inferencing popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.9.1_20250715-benchmark-models.yaml - - {% set unified_docker = data.vllm_benchmark.unified_docker.latest %} - {% set model_groups = data.vllm_benchmark.model_groups %} - - Pull the Docker image - ===================== - - Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad-715: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``. - - Although the :ref:`available models ` are preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled - (see - ``__). - To enable it, include the ``--tunableop on`` argument in your - run. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed - by the performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required scripts - - 1. Run the vLLM benchmark tool independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`_ - as shown in the following snippet. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - docker run -it \ - --device=/dev/kfd \ - --device=/dev/dri \ - --group-add video \ - --shm-size 16G \ - --security-opt seccomp=unconfined \ - --security-opt apparmor=unconfined \ - --cap-add=SYS_PTRACE \ - -v $(pwd):/workspace \ - --env HUGGINGFACE_HUB_CACHE=/workspace \ - --name test \ - {{ unified_docker.pull_tag }} - - 2. In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/vllm``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/vllm - - 3. To start the benchmark, use the following command with the appropriate options. - - .. dropdown:: Benchmark options - :open: - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 1 or 8 - - Number of GPUs - - * - ``$datatype`` - - ``float16`` or ``float8`` - - Data type - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. You don't need to specify them with this script. - - Command: - - .. code-block:: - - ./vllm_benchmark_report.sh \ - -s $test_option \ - -m {{model.model_repo}} \ - -g $num_gpu \ - -d {{model.precision}} - - .. note:: - - For best performance, it's recommend to run with ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``. - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - .. 
rubric:: Benchmarking examples - - Here are some examples of running the benchmark with various options: - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: - - ./vllm_benchmark_report.sh \ - -s latency \ - -m {{model.model_repo}} \ - -g 8 \ - -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision. - - .. code-block:: shell - - ./vllm_benchmark_report.sh \ - -s throughput \ - -m {{model.model_repo}} \ - -g 8 \ - -d {{model.precision}} - - Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Advanced usage -============== - -For information on experimental features and known issues related to ROCm optimization efforts on vLLM, -see the developer's guide at ``__. - -Reproducing the Docker image ----------------------------- - -To reproduce this ROCm/vLLM Docker image release, follow these steps: - -1. Clone the `vLLM repository `__. - - .. code-block:: shell - - git clone https://github.com/ROCm/vllm.git - -2. Checkout the specific release commit. - - .. code-block:: shell - - cd vllm - git checkout b432b7a285aa0dcb9677380936ffa74931bb6d6f - -3. Build the Docker image. Replace ``vllm-rocm`` with your desired image tag. - - .. code-block:: shell - - docker build -f docker/Dockerfile.rocm -t vllm-rocm . - -Known issues and workarounds -============================ - -AITER does not support FP8 KV cache yet. - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. 
diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-history.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-history.rst deleted file mode 100644 index 274492147..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-history.rst +++ /dev/null @@ -1,136 +0,0 @@ -:orphan: - -************************************************** -vLLM inference performance testing version history -************************************************** - -This table lists previous versions of the ROCm vLLM inference Docker image for -inference performance testing. For detailed information about available models -for benchmarking, see the version-specific documentation. You can find tagged -previous releases of the ``ROCm/vllm`` Docker image on `Docker Hub `__. - -.. list-table:: - :header-rows: 1 - - * - Docker image tag - - Components - - Resources - - * - ``rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006`` - (latest) - - - * ROCm 7.0.0 - * vLLM 0.10.2 - * PyTorch 2.9.0 - - - * :doc:`Documentation <../vllm>` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909`` - - - * ROCm 6.4.1 - * vLLM 0.10.1 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812`` - - - * ROCm 6.4.1 - * vLLM 0.10.0 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.4.1_vllm_0.9.1_20250715`` - - - * ROCm 6.4.1 - * vLLM 0.9.1 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.4.1_vllm_0.9.1_20250702`` - - - * ROCm 6.4.1 - * vLLM 0.9.1 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.4.1_vllm_0.9.0.1_20250605`` - - - * ROCm 6.4.1 - * vLLM 0.9.0.1 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.3.1_vllm_0.8.5_20250521`` - - - * ROCm 6.3.1 - * 0.8.5 vLLM (0.8.6.dev) - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513`` - - - * ROCm 6.3.1 - * vLLM 0.8.5 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415`` - - - * ROCm 6.3.1 - * vLLM 0.8.3 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250325`` - - - * ROCm 6.3.1 - * vLLM 0.7.3 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6`` - - - * ROCm 6.3.1 - * vLLM 0.6.6 - * PyTorch 2.7.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4`` - - - * ROCm 6.2.1 - * vLLM 0.6.4 - * PyTorch 2.5.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - ``rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50`` - - - * ROCm 6.2.0 - * vLLM 0.4.3 - * PyTorch 2.4.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ - diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst deleted file mode 100644 index 21ee1b647..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst +++ /dev/null @@ -1,190 +0,0 @@ -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the - ROCm PyTorch Docker image. 
- :keywords: model, MAD, automation, dashboarding, validate, pytorch - -************************************* -PyTorch inference performance testing -************************************* - -.. _pytorch-inference-benchmark-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/pytorch-inference-benchmark-models.yaml - - {% set unified_docker = data.pytorch_inference_benchmark.unified_docker.latest %} - {% set model_groups = data.pytorch_inference_benchmark.model_groups %} - - The `ROCm PyTorch Docker `_ image offers a prebuilt, - optimized environment for testing model inference performance on AMD Instinct™ MI300X series - GPUs. This guide demonstrates how to use the AMD Model Automation and Dashboarding (MAD) - tool with the ROCm PyTorch container to test inference performance on various models efficiently. - - .. _pytorch-inference-benchmark-available-models: - - Supported models - ================ - - The following models are supported for inference performance benchmarking - with PyTorch and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. - - .. raw:: html - -
- <!-- Model selector dropdown (options generated from model_groups) -->
- - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization before use via an external license agreement through a third party. - - {% endfor %} - {% endfor %} - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see the :ref:`system validation steps `. - - .. code-block:: shell - - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - Pull the Docker image - ===================== - - .. container:: model-doc pyt_chai1_inference - - Use the following command to pull the `ROCm PyTorch Docker image `__ from Docker Hub. - - .. code-block:: shell - - docker pull rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0_triton_llvm_reg_issue - - .. note:: - - The Chai-1 benchmark uses a specifically selected Docker image using ROCm 6.2.3 and PyTorch 2.3.0 to address an accuracy issue. - - .. container:: model-doc pyt_clip_inference pyt_mochi_video_inference pyt_wan2.1_inference pyt_janus_pro_inference pyt_hy_video - - Use the following command to pull the `ROCm PyTorch Docker image `__ from Docker Hub. - - .. code-block:: shell - - docker pull rocm/pytorch:latest - - .. _pytorch-benchmark-get-started: - - Benchmarking - ============ - - .. _pytorch-inference-benchmark-mad: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - To simplify performance testing, the ROCm Model Automation and Dashboarding - (``__) project provides ready-to-use scripts and configuration. - To start, clone the MAD repository to a local directory and install the required packages on the - host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the ``{{model.precision}}`` data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in ``perf_{{model.mad_tag}}.csv``. - - {% if model.mad_tag != "pyt_janus_pro_inference" %} - .. note:: - - For improved performance, consider enabling TunableOp. By default, - ``{{model.mad_tag}}`` runs with TunableOp disabled (see - ``__). To enable - it, include the ``--tunableop on`` argument in your run. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run. 
- Although this might increase the initial training time, it can result in a performance gain. - {% endif %} - - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`../../inference-optimization/workload`. - -- To learn how to run LLM models from Hugging Face or your model, see - :doc:`Running models from Hugging Face <../hugging-face-models>`. - -- To learn how to optimize inference on LLMs, see - :doc:`Inference optimization <../../inference-optimization/index>`. - -- To learn how to fine-tune LLMs, see - :doc:`Fine-tuning LLMs <../../fine-tuning/index>`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst deleted file mode 100644 index 17e5ea54b..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst +++ /dev/null @@ -1,257 +0,0 @@ -.. meta:: - :description: SGLang multi-node disaggregated distributed inference using Mooncake - :keywords: model, sglang, mooncake, disagg, disaggregated, distributed, multi-node, docker - -****************************************** -SGLang distributed inference with Mooncake -****************************************** - -As LLM inference increasingly demands handling massive models and dynamic workloads, efficient -distributed inference becomes essential. Traditional co-located architectures face bottlenecks due -to tightly coupled memory and compute resources, which limits scalability and flexibility. -Disaggregated inference refers to the process of splitting the inference of LLMs into distinct -phases. This architecture, facilitated by libraries like Mooncake, uses high-bandwidth -RDMA to transfer the Key-Value (KV) cache between prefill and decode nodes. -This allows for independent resource scaling and optimization, resulting in -improved efficiency and throughput. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - - `SGLang `__ is a high-performance inference and - serving engine for large language models (LLMs) and vision models. The - ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__ - bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series - GPUs. It includes the following software components: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -The following guides on setting up and running SGLang and Mooncake for disaggregated -distributed inference on a Slurm cluster using AMD Instinct MI300X series GPUs backed by -Mellanox CX-7 NICs. - -Prerequisites -============= - -Before starting, ensure you have: - -* A Slurm cluster with at least three nodes: one for the proxy, one for prefill (``xP``), and one for decode (``yD``). - - ``Nodes -> xP + yD + 1`` - -* A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See :ref:`sglang-disagg-inf-build-docker-image` for instructions. 
- -* A shared filesystem for storing models, scripts, and logs (cluster-specific). - -Supported models -================ - -The following models are supported for SGLang disaggregated prefill/decode -inference. Some instructions, commands, and recommendations in this -documentation might vary by selected model. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - .. raw:: html - -
- <!-- Model type and Model selector dropdowns (options generated from model_groups / model.model) -->
- - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.model_repo }} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`__ to learn more about this model. - Some models require access authorization prior to use through an external license agreement with a third party. - - {% endfor %} - {% endfor %} - -.. _sglang-disagg-inf-build-docker-image: - -Build the Docker image ----------------------- - -Get the Dockerfile located in -``__. -It uses `lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x -`__ -as the base Docker image and installs the necessary components for Mooncake, etcd, and Mellanox network -drivers. - -.. code-block:: shell - - git clone https://github.com/ROCm/MAD.git - cd MAD/docker - docker build \ - -t sglang_disagg_pd_image \ - -f sglang_disagg_inference.ubuntu.amd.Dockerfile . - -Benchmarking -============ - -The ``__ -repository contains scripts to launch SGLang inference with prefill/decode -disaggregation via Mooncake for supported models. - -* `scripts/sglang_dissag/run_xPyD_models.slurm `__ - -- the main Slurm batch script to launch Docker containers on all nodes using ``sbatch`` or ``salloc``. - -* `scripts/sglang_dissag/sglang_disagg_server.sh `__ - -- the entrypoint script that runs inside each container to start the correct service -- proxy, prefill, or decode. - -* `scripts/sglang_dissag/benchmark_xPyD.sh `__ - -- the benchmark script to run the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement. - -* `scripts/sglang_dissag/benchmark_parser.py `__ - -- the log parser script to be run on the concurrency benchmark log file to generate tabulated data. - -Launch the service ------------------- - -The service is deployed using a Slurm batch script that orchestrates the containers across the -allocated nodes. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.model_repo }} - - .. code-block:: shell - - # Clone the MAD repo if you haven't already and - # navigate to the scripts directory - git clone https://github.com/ROCm/MAD.git - cd MAD/scripts/sglang_disagg/ - - # Slurm sbatch run command - export DOCKER_IMAGE_NAME=sglang_disagg_pd_image - export xP= - export yD= - export MODEL_NAME={{ model.model_repo }} - # num_nodes = xP + yD + 1 - sbatch -N -n --nodelist= run_xPyD_models.slurm - - {% endfor %} - {% endfor %} - -Post-run logs and testing -------------------------- - -Logs are stored in your shared filesystem in the directory specified by the ``LOG_PATH`` variable in the Slurm script. -A new directory named after the Slurm job ID is created for each run. - -Inside that directory, you can access various logs: - -* ``pd_sglang_bench_serving.sh_NODE<...>.log`` -- the main log for each server node. - -* ``etcd_NODE<...>.log`` -- logs for etcd services. - -* ``prefill_NODE<...>.log`` -- logs for the prefill services. - -* ``decode_NODE<...>.log`` -- logs for the decode services. - -Use the benchmark parser script for concurrency logs to tabulate different data. - -.. code-block:: shell - - python3 benchmark_parser.py - -To verify the service is responsive, you can try sending a ``curl`` request to test the launched -server from the Docker container on the proxy node. For example: - -.. 
code-block:: shell - - curl -X POST http://127.0.0.1:30000/generate \ - -H "Content-Type: application/json" \ - -d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }' - -Known issues -============ - -When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at -higher concurrency levels (512+), the following error might occur: - -.. code-block:: shell-session - - `__. - -- To learn more about the options for latency and throughput benchmark scripts, - see ``__. - -- See the base upstream Docker image on `Docker Hub `__. - -- To learn more about system settings and management practices to configure your system for - MI300X series GPUs, see `AMD Instinct MI300X system optimization `__. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/sglang-history` to find documentation for previous releases -of SGLang inference performance testing. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst deleted file mode 100644 index 1722b2018..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst +++ /dev/null @@ -1,276 +0,0 @@ -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and SGLang - :keywords: model, MAD, automation, dashboarding, validate - -***************************************************************** -SGLang inference performance testing DeepSeek-R1-Distill-Qwen-32B -***************************************************************** - -.. _sglang-benchmark-unified-docker: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - - `SGLang `__ is a high-performance inference and - serving engine for large language models (LLMs) and vision models. The - ROCm-enabled `SGLang Docker image <{{ docker.docker_hub_url }}>`__ - bundles SGLang with PyTorch, optimized for AMD Instinct MI300X series - accelerators. It includes the following software components: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. 
datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - Pull the Docker image - ===================== - - Download the `SGLang Docker image <{{ unified_docker.docker_hub_url }}>`__. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Benchmarking - ============ - - Once the setup is complete, choose one of the following methods to benchmark inference performance with - `DeepSeek-R1-Distill-Qwen-32B `__. - - .. _sglang-benchmark-mad: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model - using one GPU with the ``{{model.precision}}`` data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/perf_DeepSeek-R1-Distill-Qwen-32B.csv``. - - Although the DeepSeek-R1-Distill-Qwen-32B is preconfigured - to collect latency and throughput performance data, you can also change the benchmarking - parameters. See the standalone benchmarking tab for more information. - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required scripts - - 1. Run the SGLang benchmark script independently by starting the - `Docker container <{{ unified_docker.docker_hub_url }}>`__ - as shown in the following snippet. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - docker run -it \ - --device=/dev/kfd \ - --device=/dev/dri \ - --group-add video \ - --shm-size 16G \ - --security-opt seccomp=unconfined \ - --security-opt apparmor=unconfined \ - --cap-add=SYS_PTRACE \ - -v $(pwd):/workspace \ - --env HUGGINGFACE_HUB_CACHE=/workspace \ - --name test \ - {{ unified_docker.pull_tag }} - - 2. In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``~/MAD/scripts/sglang``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/sglang - - 3. To start the benchmark, use the following command with the appropriate options. - - .. dropdown:: Benchmark options - :open: - - .. list-table:: - :header-rows: 1 - :align: center - - * - Name - - Options - - Description - - * - ``$test_option`` - - latency - - Measure decoding token latency - - * - - - throughput - - Measure token generation throughput - - * - - - all - - Measure both throughput and latency - - * - ``$num_gpu`` - - 8 - - Number of GPUs - - * - ``$datatype`` - - ``bfloat16`` - - Data type - - * - ``$dataset`` - - random - - Dataset - - The input sequence length, output sequence length, and tensor parallel (TP) are - already configured. 
You don't need to specify them with this script. - - Command: - - .. code-block:: shell - - ./sglang_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d $datatype [-a $dataset] - - .. note:: - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: shell-session - - OSError: You are trying to access a gated repo. - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - .. rubric:: Benchmarking examples - - Here are some examples of running the benchmark with various options: - - * Latency benchmark - - Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with ``{{model.precision}}`` precision. - - .. code-block:: shell - - ./sglang_benchmark_report.sh \ - -s latency \ - -m {{model.model_repo}} \ - -g 8 \ - -d {{model.precision}} - - Find the latency report at ``./reports_{{model.precision}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``. - - * Throughput benchmark - - Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with ``{{model.precision}}`` precision. - - .. code-block:: shell - - ./sglang_benchmark_report.sh \ - -s throughput \ - -m {{model.model_repo}} \ - -g 8 \ - -d {{model.precision}} \ - -a random - - Find the throughput report at ``./reports_{{model.precision}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``. - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``__. - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `__. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- To learn how to run community models from Hugging Face on AMD GPUs, see - :doc:`Running models from Hugging Face `. - -- To learn how to fine-tune LLMs and optimize inference, see - :doc:`Fine-tuning LLMs and inference optimization `. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/sglang-history` to find documentation for previous releases -of SGLang inference performance testing. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst deleted file mode 100644 index 66e7f6621..000000000 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst +++ /dev/null @@ -1,475 +0,0 @@ -.. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the ROCm vLLM Docker image. 
- :keywords: model, MAD, automation, dashboarding, validate - -********************************** -vLLM inference performance testing -********************************** - -.. _vllm-benchmark-unified-docker-930: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - - The `ROCm vLLM Docker <{{ docker.docker_hub_url }}>`_ image offers a - prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI355X, MI350X, MI325X and MI300X - GPUs. This ROCm vLLM Docker image integrates vLLM and PyTorch tailored - specifically for AMD data center GPUs and includes the following components: - - .. tab-set:: - - .. tab-item:: {{ docker.pull_tag }} - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -With this Docker image, you can quickly test the :ref:`expected -inference performance numbers ` for -AMD Instinct GPUs. - -What's new -========== - -The following is summary of notable changes since the :doc:`previous ROCm/vLLM Docker release `. - -* Added support for AMD Instinct MI355X and MI350X GPUs. - -* Added support and benchmarking instructions for the following models. See :ref:`vllm-benchmark-supported-models-930`. - - * Llama 4 Scout and Maverick - - * DeepSeek R1 0528 FP8 - - * MXFP4 models (MI355X and MI350X only): Llama 3.3 70B MXFP4 and Llama 3.1 405B MXFP4 - - * GPT OSS 20B and 120B - - * Qwen 3 32B, 30B-A3B, and 235B-A22B - -* Removed the deprecated ``--max-seq-len-to-capture`` flag. - -* ``--gpu-memory-utilization`` is now configurable via the `configuration files - `__ in the MAD - repository. - -.. _vllm-benchmark-supported-models-930: - -Supported models -================ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - .. _vllm-benchmark-available-models-930: - - The following models are supported for inference performance benchmarking - with vLLM and ROCm. Some instructions, commands, and recommendations in this - documentation might vary by model -- select one to get started. MXFP4 models - are only supported on MI355X and MI350X GPUs. - - .. raw:: html - -
- <!-- Model and Variant selector dropdowns (options generated from model_groups / model.model) -->
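   If you're not sure which Instinct GPUs are in your system, one quick check before
   selecting an MXFP4 variant is the marketing name reported by ``rocminfo`` (the exact
   output format can vary between ROCm releases):

   .. code-block:: shell

      # List the marketing names of the detected GPU agents
      rocminfo | grep -i "marketing name"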
- - .. _vllm-benchmark-vllm-930: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - - {% if model.precision == "float4" %} - .. important:: - - MXFP4 is supported only on MI355X and MI350X GPUs. - {% endif %} - - .. note:: - - See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model. - Some models require access authorization prior to use via an external license agreement through a third party. - {% if model.precision == "float8" and model.model_repo.startswith("amd") %} - This model uses FP8 quantization via `AMD Quark `__ for efficient inference on AMD GPUs. - {% endif %} - {% if model.precision == "float4" and model.model_repo.startswith("amd") %} - This model uses FP4 quantization via `AMD Quark `__ for efficient inference on AMD GPUs. - {% endif %} - - {% endfor %} - {% endfor %} - -.. _vllm-benchmark-performance-measurements-930: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and serving measurements for inferencing popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct GPUs or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -Pull the Docker image -===================== - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - - Download the `ROCm vLLM Docker image <{{ docker.docker_hub_url }}>`_. - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - -Benchmarking -============ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - Once the setup is complete, choose between two options to reproduce the - benchmark results: - - .. _vllm-benchmark-mad-930: - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - The following run command is tailored to {{ model.model }}. - See :ref:`vllm-benchmark-supported-models-930` to switch to another available model. - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. 
On the host machine, use this command to run the performance benchmark test on - the `{{model.model}} <{{ model.url }}>`_ model using one node with the - :literal:`{{model.precision}}` data type. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The throughput and serving reports of the - model are collected in the following paths: ``{{ model.mad_tag }}_throughput.csv`` - and ``{{ model.mad_tag }}_serving.csv``. - - Although the :ref:`available models - ` are preconfigured to collect - offline throughput and online serving performance data, you can - also change the benchmarking parameters. See the standalone - benchmarking tab for more information. - - {% if model.tunableop %} - - .. note:: - - For improved performance, consider enabling :ref:`PyTorch TunableOp `. - TunableOp automatically explores different implementations and configurations of certain PyTorch - operators to find the fastest one for your hardware. - - By default, ``{{model.mad_tag}}`` runs with TunableOp disabled (see - ``__). To enable it, include - the ``--tunableop on`` argument in your run. - - Enabling TunableOp triggers a two-pass run -- a warm-up followed by the - performance-collection run. - - {% endif %} - - .. tab-item:: Standalone benchmarking - - The following commands are optimized for {{ model.model }}. - See :ref:`vllm-benchmark-supported-models-930` to switch to another available model. - - .. seealso:: - - For more information on configuration, see the `config files - `__ - in the MAD repository. Refer to the `vLLM engine `__ - for descriptions of available configuration options - and `Benchmarking vLLM `__ for - additional benchmarking information. - - .. rubric:: Launch the container - - You can run the vLLM benchmark tool independently by starting the - `Docker container <{{ docker.docker_hub_url }}>`_ as shown - in the following snippet. - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - docker run -it \ - --device=/dev/kfd \ - --device=/dev/dri \ - --group-add video \ - --shm-size 16G \ - --security-opt seccomp=unconfined \ - --security-opt apparmor=unconfined \ - --cap-add=SYS_PTRACE \ - -v $(pwd):/workspace \ - --env HUGGINGFACE_HUB_CACHE=/workspace \ - --name test \ - {{ docker.pull_tag }} - - .. rubric:: Throughput command - - Use the following command to start the throughput benchmark. - - .. code-block:: shell - - model={{ model.model_repo }} - tp={{ model.config.tp }} - num_prompts={{ model.config.num_prompts | default(1024) }} - in={{ model.config.in | default(128) }} - out={{ model.config.in | default(128) }} - dtype={{ model.config.dtype | default("auto") }} - kv_cache_dtype={{ model.config.kv_cache_dtype }} - max_num_seqs={{ model.config.max_num_seqs | default(1024) }} - max_num_batched_tokens={{ model.config.max_num_batched_tokens }} - max_model_len={{ model.config.max_model_len }} - - vllm bench throughput --model $model \ - -tp $tp \ - --num-prompts $num_prompts \ - --input-len $in \ - --output-len $out \ - --dtype $dtype \ - --kv-cache-dtype $kv_cache_dtype \ - --max-num-seqs $max_num_seqs \ - --max-num-batched-tokens $max_num_batched_tokens \ - --max-model-len $max_model_len \ - --trust-remote-code \ - --output-json ${model}_throughput.json \ - --gpu-memory-utilization {{ model.config.gpu_memory_utilization | default(0.9) }} - - .. 
rubric:: Serving command - - 1. Start the server using the following command: - - .. code-block:: shell - - model={{ model.model_repo }} - tp={{ model.config.tp }} - dtype={{ model.config.dtype }} - kv_cache_dtype={{ model.config.kv_cache_dtype }} - max_num_seqs=256 - max_num_batched_tokens={{ model.config.max_num_batched_tokens }} - max_model_len={{ model.config.max_model_len }} - - vllm serve $model \ - -tp $tp \ - --dtype $dtype \ - --kv-cache-dtype $kv_cache_dtype \ - --max-num-seqs $max_num_seqs \ - --max-num-batched-tokens $max_num_batched_tokens \ - --max-model-len $max_model_len \ - --no-enable-prefix-caching \ - --swap-space 16 \ - --disable-log-requests \ - --trust-remote-code \ - --gpu-memory-utilization 0.9 - - Wait until the model has loaded and the server is ready to accept requests. - - 2. On another terminal on the same machine, run the benchmark: - - .. code-block:: shell - - # Connect to the container - docker exec -it test bash - - # Wait for the server to start - until curl -s http://localhost:8000/v1/models; do sleep 30; done - - # Run the benchmark - model={{ model.model_repo }} - max_concurrency=1 - num_prompts=10 - in=128 - out=128 - vllm bench serve --model $model \ - --percentile-metrics "ttft,tpot,itl,e2el" \ - --dataset-name random \ - --ignore-eos \ - --max-concurrency $max_concurrency \ - --num-prompts $num_prompts \ - --random-input-len $in \ - --random-output-len $out \ - --trust-remote-code \ - --save-result \ - --result-filename ${model}_serving.json - - .. note:: - - For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B, - try adding ``export VLLM_ROCM_USE_AITER=1`` to your commands. - - If you encounter the following error, pass your access-authorized Hugging - Face token to the gated models. - - .. code-block:: - - OSError: You are trying to access a gated repo. - - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - .. raw:: html - - - - .. note:: - - Throughput is calculated as: - - - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time - - - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time - {% endfor %} - {% endfor %} - -Advanced usage -============== - -For information on experimental features and known issues related to ROCm optimization efforts on vLLM, -see the developer's guide at ``__. - -Reproducing the Docker image ----------------------------- - -To reproduce this ROCm-enabled vLLM Docker image release, follow these steps: - -1. Clone the `vLLM repository `__. - - .. code-block:: shell - - git clone https://github.com/vllm-project/vllm.git - cd vllm - -2. Use the following command to build the image directly from the specified commit. - - .. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml - - {% set docker = data.dockers[0] %} - .. code-block:: shell - - docker build -f docker/Dockerfile.rocm \ - --build-arg REMOTE_VLLM=1 \ - --build-arg VLLM_REPO=https://github.com/ROCm/vllm \ - --build-arg VLLM_BRANCH="{{ docker.dockerfile.commit }}" \ - -t vllm-rocm . - - .. tip:: - - Replace ``vllm-rocm`` with your desired image tag. - -Further reading -=============== - -- To learn more about the options for latency and throughput benchmark scripts, - see ``_. - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. 
- -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. - -- See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for - a brief introduction to vLLM and optimization strategies. - -- For application performance optimization strategies for HPC and AI workloads, - including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/vllm-history` to find documentation for previous releases -of the ``ROCm/vllm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst b/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst deleted file mode 100644 index fc5bc7732..000000000 --- a/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst +++ /dev/null @@ -1,121 +0,0 @@ -.. meta:: - :description: How to deploy your model for AI inference using vLLM and Hugging Face TGI. - :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial - -******************** -Deploying your model -******************** - -ROCm enables inference and deployment for various classes of models including CNN, RNN, LSTM, MLP, and transformers. -This section focuses on deploying transformers-based LLM models. - -ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks. - -.. _rocm-for-ai-serve-vllm: - -Serving using vLLM -================== - -vLLM is a fast and easy-to-use library for LLM inference and serving. AMD is actively working with the vLLM team to improve performance and support the latest ROCm versions. - -See the `GitHub repository `_ and `official vLLM documentation -`_ for more information. - -For guidance on using vLLM with ROCm, refer to `Installation with ROCm -`_. - -vLLM installation ------------------ - -vLLM supports two ROCm-capable installation methods. Refer to the official documentation use the following links. - -- `Build from source with Docker - `_ (recommended) - -- `Build from source `_ - -vLLM walkthrough ----------------- - -Refer to this developer blog for guidance on serving with vLLM `Inferencing and serving with vLLM on AMD GPUs — ROCm -Blogs `_ - -Validating vLLM performance ---------------------------- - -ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM -on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV -format. For more information, see the guide to -`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X accelerator `_ -on the ROCm GitHub repository. - -.. _rocm-for-ai-serve-hugging-face-tgi: - -Serving using Hugging Face TGI -============================== - -The `Hugging Face Text Generation Inference `_ -(TGI) library is optimized for serving LLMs with low latency. Refer to the `Quick tour of TGI -`_ for more details. - -TGI installation ----------------- - -The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at -``__. - -TGI walkthrough ---------------- - -#. Set up the LLM server. - - Deploy the Llama2 7B model with TGI using the official Docker image. - - .. 
code-block:: shell - - model=TheBloke/Llama-2-7B-fp16 - volume=$PWD - docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data --name tgi_amd ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model - -#. Set up the client. - - a. Open another shell session and run the following command to access the server with the client URL. - - .. code-block:: shell - - curl 127.0.0.1:8080/generate \\ - -X POST \\ - -d '{"inputs":"What is Deep - Learning?","parameters":{"max_new_tokens":20}}' \\ - -H 'Content-Type: application/json' - - b. Access the server with request endpoints. - - .. code-block:: shell - - pip install request - PYTHONPATH=/usr/lib/python3/dist-packages python requests_model.py - - ``requests_model.py`` should look like: - - .. code-block:: python - - import requests - - headers = { - "Content-Type": "application/json", - } - - data = { - 'inputs': 'What is Deep Learning?', - 'parameters': { 'max_new_tokens': 20 }, - } - - response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data) - - print(response.json()) - -vLLM and Hugging Face TGI are robust solutions for anyone looking to deploy LLMs for applications that demand high -performance, low latency, and scalability. - -Visit the topics in :doc:`Using ROCm for AI <../index>` to learn about other ROCm-aware solutions for AI development. diff --git a/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst b/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst deleted file mode 100644 index fe84e33d9..000000000 --- a/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst +++ /dev/null @@ -1,210 +0,0 @@ -.. meta:: - :description: How to run models from Hugging Face on AMD GPUs. - :keywords: ROCm, AI, LLM, Hugging Face, Optimum, Flash Attention, GPTQ, ONNX, tutorial - -******************************** -Running models from Hugging Face -******************************** - -`Hugging Face `_ hosts the world’s largest AI model repository for developers to obtain -transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in -developing and deploying AI solutions. - -This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs. - -.. _rocm-for-ai-hugging-face-transformers: - -Using Hugging Face Transformers -------------------------------- - -First, `install the Hugging Face Transformers library `_, -which lets you easily import any of the transformer models into your Python application. - -.. code-block:: shell - - pip install transformers - -Here is an example of running `GPT2 `_: - -.. code-block:: python - - from transformers import GPT2Tokenizer, GPT2Model - - tokenizer = GPT2Tokenizer.from_pretrained('gpt2') - - model = GPT2Model.from_pretrained('gpt2') - - text = "Replace me with any text you'd like." - - encoded_input = tokenizer(text, return_tensors='pt') - - output = model(**encoded_input) - -Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core -models should also function correctly. - -Here are some mainstream models to get you started: - -- `BERT `_ - -- `BLOOM `_ - -- `Llama `_ - -- `OPT `_ - -- `T5 `_ - -.. 
_rocm-for-ai-hugging-face-optimum: - -Using Hugging Face with Optimum-AMD ------------------------------------ - -Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack. - -For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the -`Optimum-AMD `_ page on Hugging Face for guidance on -using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration. - -Hugging Face libraries natively support AMD Instinct accelerators. For other -:doc:`ROCm-capable hardware `, support is currently not -validated, but most features are expected to work without issues. - -.. _rocm-for-ai-install-optimum-amd: - -Installation -~~~~~~~~~~~~ - -Install Optimum-AMD using pip. - -.. code-block:: shell - - pip install --upgrade --upgrade-strategy eager optimum[amd] - -Or, install from source. - -.. code-block:: shell - - git clone https://github.com/huggingface/optimum-amd.git - cd optimum-amd - pip install -e . - -.. _rocm-for-ai-flash-attention: - -Flash Attention ---------------- - -#. Use `the Hugging Face team's example Dockerfile - `_ to use - Flash Attention with ROCm. - - .. code-block:: shell - - docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash . - volume=$PWD - docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd - transformers_pytorch_amd_gpu_flash:latest - -#. Use Flash Attention 2 with `Transformers - `_ by adding the - ``use_flash_attention_2`` parameter to ``from_pretrained()``: - - .. code-block:: python - - import torch - from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM - - tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b") - - with torch.device("cuda"): - model = AutoModelForCausalLM.from_pretrained( - "tiiuae/falcon-7b", - torch_dtype=torch.float16, - use_flash_attention_2=True, - ) - -.. _rocm-for-ai-gptq: - -GPTQ ----- - -To enable `GPTQ `_, hosted wheels are available for ROCm. - -#. First, :ref:`install Optimum-AMD `. - -#. Install AutoGPTQ using pip. Refer to `AutoGPTQ Installation `_ for - in-depth guidance. - - .. code-block:: shell - - pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ - - Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable. - - .. code-block:: shell - - ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e . - - -#. Load GPTQ-quantized models in Transformers using the backend `AutoGPTQ library - `_: - - .. code-block:: python - - import torch - from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM - - tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ") - - with torch.device("cuda"): - model = AutoModelForCausalLM.from_pretrained( - "TheBloke/Llama-2-7B-Chat-GPTQ", - torch_dtype=torch.float16, - ) - -.. _rocm-for-ai-onnx: - -ONNX ----- - -Hugging Face Optimum also supports the `ONNX Runtime `_ integration. For ONNX models, usage is -straightforward. - -#. Specify the provider argument in the ``ORTModel.from_pretrained()`` method: - - .. code-block:: python - - from optimum.onnxruntime import ORTModelForSequenceClassification - .. - ort_model = ORTModelForSequenceClassification.from_pretrained( - .. - provider="ROCMExecutionProvider" - ) - -#. Try running a `BERT text classification - `_ ONNX model with ROCm: - - .. 
code-block:: python - - from optimum.onnxruntime import ORTModelForSequenceClassification - from optimum.pipelines import pipeline - from transformers import AutoTokenizer - import onnxruntime as ort - - session_options = ort.SessionOptions() - - session_options.log_severity_level = 0 - - ort_model = ORTModelForSequenceClassification.from_pretrained( - "distilbert-base-uncased-finetuned-sst-2-english", - export=True, - provider="ROCMExecutionProvider", - session_options=session_options - ) - - tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english") - - pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0") - - result = pipe("Both the music and visual were astounding, not to mention the actors performance.") - diff --git a/docs/how-to/rocm-for-ai/inference/index.rst b/docs/how-to/rocm-for-ai/inference/index.rst deleted file mode 100644 index 3c211882b..000000000 --- a/docs/how-to/rocm-for-ai/inference/index.rst +++ /dev/null @@ -1,29 +0,0 @@ -.. meta:: - :description: How to use ROCm for AI inference workloads. - :keywords: ROCm, AI, machine learning, LLM, AI inference, NLP, GPUs, usage, tutorial - -**************************** -Use ROCm for AI inference -**************************** -AI inference is a process of deploying a trained machine learning model to make predictions or classifications on new data. This commonly involves using the model with real-time data and making quick decisions based on the predictions made by the model.  - -Understanding the ROCm™ software platform’s architecture and capabilities is vital for running AI inference. By leveraging the ROCm platform's capabilities, you can harness the power of high-performance computing and efficient resource management to run inference workloads, leading to faster predictions and classifications on real-time data. - -Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X accelerators and illustrate how to deploy trained models in production environments. - -The AI Developer Hub contains `AMD ROCm tutorials `_ for -training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs. - -- :doc:`Installing ROCm and machine learning frameworks <../install>` - -- :doc:`Running models from Hugging Face ` - -- :doc:`LLM inference frameworks ` - -- :doc:`vLLM inference performance testing ` - -- :doc:`PyTorch inference performance testing ` - -- :doc:`SGLang inference performance testing ` - -- :doc:`Deploying your model ` diff --git a/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst b/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst deleted file mode 100644 index d1e81d1a9..000000000 --- a/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst +++ /dev/null @@ -1,219 +0,0 @@ -.. meta:: - :description: How to implement the LLM inference frameworks with ROCm acceleration. 
- :keywords: ROCm, LLM, fine-tuning, usage, tutorial, inference, vLLM, TGI, text generation inference - -************************ -LLM inference frameworks -************************ - -This section discusses how to implement `vLLM `_ and `Hugging Face TGI -`_ using -:doc:`single-accelerator <../fine-tuning/single-gpu-fine-tuning-and-inference>` and -:doc:`multi-accelerator <../fine-tuning/multi-gpu-fine-tuning-and-inference>` systems. - -.. _fine-tuning-llms-vllm: - -vLLM inference -============== - -vLLM is renowned for its PagedAttention algorithm that can reduce memory consumption and increase throughput thanks to -its paging scheme. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token lengths of the -models, the paged attention of vLLM allocates GPU HBM dynamically for its actual decoding lengths. This paged attention -is also effective when multiple requests share the same key and value contents for a large value of beam search or -multiple parallel requests. - -vLLM also incorporates many modern LLM acceleration and quantization algorithms, such as Flash Attention, HIP and CUDA -graphs, tensor parallel multi-GPU, GPTQ, AWQ, and token speculation. - -Installing vLLM ---------------- - -.. _fine-tuning-llms-vllm-rocm-docker-image: - -1. Run the following commands to build a Docker image ``vllm-rocm``. - - .. code-block:: shell - - git clone https://github.com/vllm-project/vllm.git - cd vllm - docker build -f docker/Dockerfile.rocm -t vllm-rocm . - -.. tab-set:: - - .. tab-item:: vLLM on a single-accelerator system - :sync: single - - 2. To use vLLM as an API server to serve reference requests, first start a container using the :ref:`vllm-rocm - Docker image `. - - .. code-block:: shell - - docker run -it \ - --network=host \ - --group-add=video \ - --ipc=host \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --device /dev/kfd \ - --device /dev/dri \ - -v :/app/model \ - vllm-rocm \ - bash - - 3. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command. - - .. code-block:: shell - - python -m vllm.entrypoints.api_server --model /app/model --dtype float16 --port 8000 & - - The following log message is displayed in your command line indicates that the server is listening for requests. - - .. image:: ../../../data/how-to/llm-fine-tuning-optimization/vllm-single-gpu-log.png - :alt: vLLM API server log message - :align: center - - 4. To test, send it a curl request containing a prompt. - - .. code-block:: shell - - curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }' - - You should receive a response like the following. - - .. code-block:: text - - {"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]} - - .. tab-item:: vLLM on a multi-accelerator system - :sync: multi - - 2. To use vLLM as an API server to serve reference requests, first start a container using the :ref:`vllm-rocm - Docker image `. - - .. 
code-block:: shell - - docker run -it \ - --network=host \ - --group-add=video \ - --ipc=host \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --device /dev/kfd \ - --device /dev/dri \ - -v :/app/model \ - vllm-rocm \ - bash - - - 3. To run API server on multiple GPUs, use the ``-tp`` or ``--tensor-parallel-size`` parameter. For example, to use two - GPUs, start the API server using the following command. - - .. code-block:: shell - - python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 & - - 4. To run multiple instances of API Servers, specify different ports for each server, and use ``ROCR_VISIBLE_DEVICES`` to - isolate each instance to a different accelerator. - - For example, to run two API servers, one on port 8000 using GPU 0 and 1, one on port 8001 using GPU 2 and 3, use a - a command like the following. - - .. code-block:: shell - - ROCR_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 –tp 2 --port 8000 & - ROCR_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 –tp 2--port 8001 & - - 5. To test, send it a curl request containing a prompt. - - .. code-block:: shell - - curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }' - - You should receive a response like the following. - - .. code-block:: text - - {"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]} - -.. seealso:: - - See :ref:`mi300x-vllm-optimization` for performance optimization tips. - - ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM - on the MI300X accelerator. The Docker image includes ROCm, vLLM, and PyTorch. - For more information, see :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. - -.. _fine-tuning-llms-tgi: - -Hugging Face TGI -================ - -Text Generation Inference (TGI) is LLM serving framework from Hugging -Face, and it also supports the majority of high-performance LLM -acceleration algorithms such as Flash Attention, Paged Attention, -CUDA/HIP graph, tensor parallel multi-GPU, GPTQ, AWQ, and token -speculation. - -.. tip:: - - In addition to LLM serving capability, TGI also provides the `Text Generation Inference benchmarking tool - `_. - -Install TGI ------------ - -1. Launch the TGI Docker container in the host machine. - - .. code-block:: shell - - docker run --name tgi --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined - --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g - --net host -v $PWD:/data - --entrypoint "/bin/bash" - --env HUGGINGFACE_HUB_CACHE=/data - ghcr.io/huggingface/text-generation-inference:latest-rocm - -.. tab-set:: - - .. tab-item:: TGI on a single-accelerator system - :sync: single - - 2. Inside the container, launch a model using TGI server on a single accelerator. - - .. code-block:: shell - - export ROCM_USE_FLASH_ATTN_V2_TRITON=True - text-generation-launcher --model-id NousResearch/Meta-Llama-3-70B --dtype float16 --port 8000 & - - 3. 
To test, send it a curl request containing a prompt. - - .. code-block:: shell - - curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json' - - You should receive a response like the following. - - .. code-block:: shell - - data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2822266,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null} - - .. tab-item:: TGI on a multi-accelerator system - - 2. Inside the container, launch a model using TGI server on multiple accelerators (4 in this case). - - .. code-block:: shell - - export ROCM_USE_FLASH_ATTN_V2_TRITON=True - text-generation-launcher --model-id NousResearch/Meta-Llama-3-8B --dtype float16 --port 8000 --num-shard 4 & - - 3. To test, send it a curl request containing a prompt. - - .. code-block:: shell - - curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json' - - You should receive a response like the following. - - .. code-block:: shell - - data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2773438,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null} diff --git a/docs/how-to/rocm-for-ai/install.rst b/docs/how-to/rocm-for-ai/install.rst deleted file mode 100644 index ff7011e2d..000000000 --- a/docs/how-to/rocm-for-ai/install.rst +++ /dev/null @@ -1,60 +0,0 @@ -.. meta:: - :description: How to install ROCm and popular deep learning frameworks. - :keywords: ROCm, AI, LLM, train, fine-tune, FSDP, DeepSpeed, LLaMA, tutorial - -.. _rocm-for-ai-install: - -******************************************** -Installing ROCm and deep learning frameworks -******************************************** - -Before getting started, install ROCm and supported deep learning frameworks. - -.. grid:: 1 - - .. grid-item-card:: Pre-install - - Each release of ROCm supports specific hardware and software configurations. Before installing, consult the - :doc:`System requirements ` and - :doc:`Installation prerequisites ` guides. - -If you’re new to ROCm, refer to the :doc:`ROCm quick start install guide for Linux -`. - -If you’re using a Radeon GPU for graphics-accelerated applications, refer to the -`Radeon installation instructions `_. - -You can install ROCm on :doc:`compatible systems ` via your Linux -distribution's package manager. See the following documentation resources to get started: - -* :doc:`ROCm installation overview ` - -* :doc:`Using your Linux distribution's package manager ` - -* :ref:`Multi-version installation ` - -.. grid:: 1 - - .. grid-item-card:: Post-install - - Follow the :doc:`post-installation instructions ` to - configure your system linker, PATH, and verify the installation. - - If you encounter any issues during installation, refer to the - :doc:`Installation troubleshooting ` guide. - -Deep learning frameworks -======================== - -ROCm supports deep learning frameworks and libraries including `PyTorch -`_, `TensorFlow -`_, `JAX `_, and more. - -Review the :doc:`framework installation documentation <../deep-learning-rocm>`. For ease-of-use, it's recommended to use official ROCm prebuilt Docker -images with the framework pre-installed. 
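
As a quick sanity check after installing a framework, a short script like the following confirms that
the GPU is visible. This is a minimal sketch that assumes the ROCm build of PyTorch is installed; on
ROCm, AMD GPUs are exposed through the familiar ``torch.cuda`` interface.

.. code-block:: python

   import torch

   # ROCm builds of PyTorch report the HIP runtime version; CUDA or CPU-only builds report None.
   print("HIP version:", torch.version.hip)

   # Returns True when an AMD GPU is visible to the ROCm runtime.
   print("GPU available:", torch.cuda.is_available())

   if torch.cuda.is_available():
       # AMD GPUs are enumerated through the same device API as CUDA devices.
       print("Device count:", torch.cuda.device_count())
       print("Device name:", torch.cuda.get_device_name(0))

If this prints your AMD GPU's name, the framework installation is ready for the workloads described in
the following sections.
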
- -Next steps -========== - -After installing ROCm and your desired ML libraries -- and before running AI workloads -- conduct system health benchmarks -to test the optimal performance of your AMD hardware. See :doc:`system-setup/index` to get started. diff --git a/docs/how-to/rocm-for-ai/system-setup/index.rst b/docs/how-to/rocm-for-ai/system-setup/index.rst deleted file mode 100644 index 466c1ba2f..000000000 --- a/docs/how-to/rocm-for-ai/system-setup/index.rst +++ /dev/null @@ -1,40 +0,0 @@ -.. meta:: - :description: System setup and validation steps for AI training and inference on ROCm - :keywords: AMD Instinct, ROCm, GPU, AI, training, inference, benchmarking, performance, validation - -************************************* -System setup for AI workloads on ROCm -************************************* - -Before you begin training or inference on AMD Instinct™ GPUs, complete -the following system setup and validation steps to ensure optimal performance. - -Prerequisite system validation -============================== - -First, confirm that your system meets all software and hardware prerequisites. -See :doc:`prerequisite-system-validation`. - -Docker images for AMD Instinct GPUs -=================================== - -AMD provides prebuilt Docker images for AMD Instinct™ MI300X and MI325X -GPUs. These images include ROCm-enabled deep learning frameworks and -essential software components. They support single-node and multi-node configurations -and are ready for training and inference workloads out of the box. - -Multi-node training -------------------- - -For instructions on enabling multi-node training, see :doc:`multi-node-setup`. - -System optimization and validation -================================== - -Before running workloads, verify that the system is configured correctly and -operating at peak efficiency. Recommended steps include: - -- Disabling NUMA auto-balancing -- Running system benchmarks to validate hardware performance - -For details on running system health checks, see :doc:`system-health-check`. diff --git a/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst b/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst deleted file mode 100644 index 739a9c8e8..000000000 --- a/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst +++ /dev/null @@ -1,320 +0,0 @@ -.. meta:: - :description: Multi-node setup for AI training - :keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training - -.. _rocm-for-ai-multi-node-setup: - -********************************* -Multi-node setup for AI workloads -********************************* - -AMD provides ready-to-use Docker images for AMD Instinct™ MI300X and MI325X -GPUs containing ROCm-capable deep learning frameworks and essential -software components. These Docker images can run and leverage multiple nodes if -they are available. This page describes how to enable the multi-node training -of AI workloads on AMD Instinct GPUs. - -Prerequisites -============= - -Before starting, ensure your environment meets the following requirements: - -* Multi-node networking: your cluster should have a configured multi-node network. For setup - instructions, see the `Multi-node network configuration for AMD Instinct - accelerators - `__ - guide in the Instinct documentation. - -* ROCm Docker container to simplify environment setup for AI workloads. 
See the following resources to get started: - - * :doc:`Training a model with Megatron-LM and ROCm <../training/benchmark-docker/megatron-lm>` - - * :doc:`Training a model with PyTorch and ROCm <../training/benchmark-docker/pytorch-training>` - - * :doc:`Training a model with JAX MaxText and ROCm <../training/benchmark-docker/jax-maxtext>` - -* Slurm workload manager to run the :ref:`provided examples `. - -Install required packages -========================= - -To run multi-node workloads, ensure you have all the required packages installed based on your -network device. For example, on Ubuntu systems: - -.. code-block:: shell - - apt install -y iproute2 - - apt install -y linux-headers-"$(uname -r)" libelf-dev - - apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core strace libibmad5 libibnetdisc5 ibverbs-providers libibumad-dev libibumad3 libibverbs1 libnl-3-dev libnl-route-3-dev - -Compile and install the RoCE library ------------------------------------- - -If you're using Broadcom NICs, you need to compile and install the RoCE (RDMA -over Converged Ethernet) library. See `RoCE cluster network configuration guide -for AMD Instinct accelerators -`__ -for more information. - -See the `Ethernet networking guide for AMD -Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source -`_ for more details. - -.. important:: - - It is crucial to install the exact same version of the RoCE library that - is installed on your host system. Also, ensure that the path to these - libraries on the host is correctly mounted into your Docker container. - Failure to do so can lead to compatibility issues and communication - failures. - -1. Set ``BUILD_DIR`` to the path on the host system where the Broadcom drivers and ``bnxt_rocelib`` source are located. - Then, navigate to the ``bnxt_rocelib`` directory. - - .. code-block:: shell - - export BUILD_DIR=/path/to/your/broadcom_drivers_on_host - cd $BUILD_DIR/drivers_linux/bnxt_rocelib/ - -2. The ``bnxt_rocelib`` directory contains a version of ``libbnxt_re`` in a zipped ``.tar.gz`` file. - - .. code-block:: shell - - tar -xf libbnxt_re-a.b.c.d.tar.gz - cd libbnxt_re-a.b.c.d - -3. Compile and install the RoCE library. - - .. code-block:: shell - - sh autogen.sh - ./configure - make - find /usr/lib64/ /usr/lib -name "libbnxt_re-rdmav*.so" -exec mv {} {}.inbox \; - make install all - sh -c "echo /usr/local/lib >> /etc/ld.so.conf" - ldconfig - cp -f bnxt_re.driver /etc/libibverbs.d/ - find . -name "*.so" -exec md5sum {} \; - BUILT_MD5SUM=$(find . -name "libbnxt_re-rdmav*.so" -exec md5sum {} \; | cut -d " " -f 1) - -Environment setup -================= - -Before running multi-node workloads, set these essential environment variables: - -Master address --------------- - -By default, ``localhost`` is used for single-node configurations. Change -``localhost`` to the master node's resolvable hostname or IP address: - -.. code-block:: bash - - export MASTER_ADDR="${MASTER_ADDR:-localhost}" - -Number of nodes ---------------- - -Set the number of nodes you want to train on (for example, ``2``, ``4``, or ``8``): - -.. code-block:: bash - - export NNODES="${NNODES:-}" - -Node ranks ----------- - -Set the rank of each node (``0`` for master, ``1`` for the first worker node, and so on). -Node ranks should be unique across all nodes in the cluster. - -.. 
code-block:: bash - - export NODE_RANK="${NODE_RANK:-}" - -Network interface ------------------ - -Update the network interface in the script to match your system's network interface. To -find your network interface, run the following (outside of any Docker container): - -.. code-block:: bash - - ip a - -Look for an active interface (status "UP") with an IP address in the same subnet as -your other nodes. Then, update the following variable in the script, for -example: - -.. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - -This variable specifies which network interface to use for inter-node communication. -Setting this variable to the incorrect interface can result in communication failures -or significantly reduced performance. - -.. tip:: - - This command sets ``NCCL_SOCKET_IFNAME``'s value to the last RDMA interface. - - .. code-block:: bash - - export NCCL_SOCKET_IFNAME=$(rdma link show | awk '{print $NF}' | sort | tail -n1) - -RDMA/IB interface ------------------ - -Set the RDMA interfaces to be used for communication. NICs can come from different vendors and the names of the RDMA interface can be different. To get the list of all the RDMA/IB devices, run: - -.. code-block:: bash - - ibv_devices - -The command below gets the list of all RDMA/IB devices and puts them in a -comma-separated format. If -(``rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7``) are your RDMA -interfaces, then set: - -.. code-block:: bash - - # If using Broadcom NIC - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - # If using Mellanox NIC - # export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9 - -.. tip:: - - Alternatively, if you want to choose the RDMA interface automatically, you - can use the following. This command will sort the RDMA interfaces and then - select the first eight RDMA interfaces. - - .. code-block:: bash - - export NCCL_IB_HCA=$(ibv_devices | awk 'NR>2 {print $1}' | sort | head -n 8 | paste -sd,) - -Global ID index ---------------- - -Update the global ID index if you're using RoCE. - -.. code-block:: bash - - export NCCL_IB_GID_INDEX=3 - -.. _multi-node-setup-training-examples: - -Multi-node training examples -============================ - -The following examples use the Slurm workload manager to launch jobs on -multiple nodes. To run these scripts as-is, you must have a Slurm environment -configured. The scripts are designed to work with both Broadcom Thor 2 and -Mellanox NICs by automatically installing the required libraries and setting -the necessary environment variables. For systems with Broadcom NICs, the -scripts assume the host's RoCE library is located in the ``/opt`` directory. - -The following benchmarking examples demonstrate the training of a Llama 3 8B model -across multiple 8-GPU nodes, using FSDP for intra-node parallelism and DP for -inter-node parallelism. - -.. _rocm-for-ai-multi-node-setup-jax-train-example: - -JAX MaxText ------------ - -1. Download the desired multi-node benchmarking script from ``__. - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/MAD/refs/heads/develop/scripts/jax-maxtext/gpu-rocm/llama3_8b_multinode.sh - - Or clone the ``__ repository. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd scripts/jax-maxtext/gpu-rocm - -2. Run the benchmark for multi-node training. - - .. code-block:: shell - - sbatch -N llama3_8b_multinode.sh - -.. _rocm-for-ai-multi-node-setup-pyt-train-example: - -PyTorch training ----------------- - -.. 
note:: - - The ROCm PyTorch Training Docker image now focuses on :doc:`Training a model - with Primus and PyTorch <../training/benchmark-docker/primus-pytorch>`. The - following example refers to the legacy workflow :ref:`Training a - model with PyTorch `. - -1. Download the ``run_multinode_train.sh`` benchmarking script from ``__. - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/MAD/refs/heads/develop/scripts/pytorch_train/run_multinode_train.sh - - Or clone the ``__ repository. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd scripts/pytorch_train - -2. Run the benchmark for multi-node training. - - .. code-block:: shell - - sbatch -N run_multinode_train.sh - -.. seealso:: - - See :ref:`Training a model with PyTorch ` for more examples and information. - -Megatron-LM ------------ - -.. note:: - - The Megatron-LM Docker image now focuses on :ref:`Training a model with - Primus and Megatron `. The - following example refers to the legacy Megatron-LM :ref:`Training a model - with Megatron-LM ` and might have - limited support. - -1. Download the ``train_llama_slurm.sh`` benchmarking script from - ``__. - -2. Set the network interface parameters as per the above guidelines and run the script. - - .. code-block:: shell - - cd - export NETWORK_INTERFACE=$NCCL_SOCKET_IFNAME - export NCCL_IB_HCA=$NCCL_IB_HCA - export IMAGE=docker.io/rocm/megatron-lm:latest OR your preferred image - export DATA_CACHE_PATH=/nfs/mounted/repo - - sbatch –N examples/llama/train_llama_slurm.sh - -2. For example, to run a Llama 3 8B workload in BF16 precision, use the following command. - - .. code-block:: shell - - MODEL_NAME=llama3 sbatch –N 8 examples/llama/train_llama_slurm.sh 8 2 128 8192 0 0 - # Other parameters, such as TP, FP8 datatype, can be adjusted in the script. - -Further reading -=============== - -* `Multi-node network configuration for AMD Instinct accelerators `__ - -* `Ethernet networking guide for AMD Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source `__ diff --git a/docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst b/docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst deleted file mode 100644 index 60aedecfe..000000000 --- a/docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst +++ /dev/null @@ -1,131 +0,0 @@ -.. meta:: - :description: Prerequisite system validation before using ROCm for AI. - :keywords: ROCm, AI, LLM, train, megatron, Llama, tutorial, docker, torch, pytorch, jax - -.. _train-a-model-system-validation: -.. _rocm-for-ai-system-optimization: - -********************************************************** -Prerequisite system validation before running AI workloads -********************************************************** - -Complete the following system validation and optimization steps to set up your system before starting training and inference. - -Disable NUMA auto-balancing ---------------------------- - -Generally, application performance can benefit from disabling NUMA auto-balancing. However, -it might be detrimental to performance with certain types of workloads. - -Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform -Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or -the output is ``1``, run the following command to disable NUMA auto-balancing. - -.. 
code-block:: shell

   sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

See `Disable NUMA auto-balancing `_
in the Instinct documentation for more information.

Hardware verification with ROCm
-------------------------------

Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
You can restore this setting to its default value with the ``rocm-smi -r`` command.

Run the command:

.. code-block:: shell

   rocm-smi --setperfdeterminism 1900

See `Hardware verification for ROCm `_
in the Instinct documentation for more information.

RCCL Bandwidth Test for multi-node setups
------------------------------------------

ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
routines for GPUs. See the :doc:`RCCL documentation ` for more information. Before starting
pretraining, running an RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
for efficient distributed training.

Running the RCCL bandwidth test helps verify that:

- The GPUs can communicate across nodes or within a single node.

- The interconnect (such as InfiniBand, Ethernet, or Infinity Fabric) is functioning as expected and
  provides adequate bandwidth for communication.

- There are no hardware setup or cabling issues that could affect communication between GPUs.

Tuning and optimizing hyperparameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In distributed training, specific hyperparameters related to distributed communication can be tuned based on
the results of the RCCL bandwidth test. These variables are already set in the Docker image:

.. code-block:: shell

   # force all RCCL streams to be high priority
   export TORCH_NCCL_HIGH_PRIORITY=1

   # specify which RDMA interfaces to use for communication
   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7

   # define the Global ID index used in RoCE mode
   export NCCL_IB_GID_INDEX=3

   # avoid data corruption/mismatch issue that existed in past releases
   export RCCL_MSCCL_ENABLE=0

Running the RCCL Bandwidth Test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It's recommended that you run the RCCL bandwidth test before launching training to confirm that system
performance is sufficient to start training. The RCCL tests are not included in the AMD Megatron-LM Docker
image; follow the instructions in ``__ to get started.
See :ref:`mi300x-rccl` for more information.

Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:

.. code-block:: shell

   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8

.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
   :width: 800

Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node
setups is recommended. So, a run on 8 GPUs looks something like:

.. code-block:: shell

   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1

.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
   :width: 800

Running with one MPI process per GPU ensures a one-to-one mapping between CPUs and GPUs, which can be beneficial
for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
PyTorch and TensorFlow.
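
To confirm that RCCL also works end to end from a framework, you can complement the standalone bandwidth
test with a minimal PyTorch ``all_reduce`` check like the following. This is a sketch rather than part of
the official test suite; the file name ``allreduce_check.py`` is arbitrary, and on ROCm the ``nccl``
backend of ``torch.distributed`` is backed by RCCL.

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist


   def main():
       # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
       dist.init_process_group(backend="nccl")  # the "nccl" backend maps to RCCL on ROCm
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # Each rank contributes a tensor of ones; the all-reduce sums them across ranks.
       x = torch.ones(1024 * 1024, device="cuda")
       dist.all_reduce(x, op=dist.ReduceOp.SUM)
       torch.cuda.synchronize()

       # Every element should now equal the number of participating ranks.
       world_size = dist.get_world_size()
       assert torch.allclose(x, torch.full_like(x, float(world_size)))
       if dist.get_rank() == 0:
           print(f"all_reduce OK across {world_size} ranks")

       dist.destroy_process_group()


   if __name__ == "__main__":
       main()

Launch it on a single node with ``torchrun --nproc_per_node=8 allreduce_check.py``. If this check passes,
RCCL is functioning correctly from within PyTorch as well.
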
- -Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed. - -.. code-block:: - - /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8, tw015:8 \ - --mca pml ucx \ - --mca btl ^openib \ - -x NCCL_SOCKET_IFNAME=ens50f0np0 \ - -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \ - -x NCCL_IB_GID_INDEX=3 \ - -x NCCL_MIN_NCHANNELS=40 \ - -x NCCL_DEBUG=version \ - $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1 - -.. image:: ../../../data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png - :width: 800 diff --git a/docs/how-to/rocm-for-ai/system-setup/system-health-check.rst b/docs/how-to/rocm-for-ai/system-setup/system-health-check.rst deleted file mode 100644 index 79563b61f..000000000 --- a/docs/how-to/rocm-for-ai/system-setup/system-health-check.rst +++ /dev/null @@ -1,106 +0,0 @@ -:orphan: - -.. meta:: - :description: System health checks with RVS, RCCL tests, BabelStream, and TransferBench to validate AMD hardware performance running AI workloads. - :keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training, inference - -.. _rocm-for-ai-system-health-bench: - -***************************************** -System health benchmarks for AI workloads -***************************************** - -Before running AI workloads, it is important to validate that your AMD hardware is configured correctly and is performing optimally. This topic outlines several system health benchmarks you can use to test key aspects like GPU compute capabilities (FLOPS), memory bandwidth, and interconnect performance. Many of these tests are part of the ROCm Validation Suite (RVS). - -ROCm Validation Suite (RVS) tests -================================= - -RVS provides a collection of tests, benchmarks, and qualification tools, each -targeting a specific subsystem of the system under test. It includes tests for -GPU stress and memory bandwidth. - -.. _healthcheck-install-rvs: - -Install ROCm Validation Suite ------------------------------ - -To get started, install RVS. For example, on an Ubuntu system with ROCm already -installed, run the following command: - -.. code-block:: shell - - sudo apt update - sudo apt install rocm-validation-suite - -See the `ROCm Validation Suite installation instructions `_, -and `System validation tests `_ -in the Instinct documentation for more detailed instructions. - -Benchmark, stress, and qualification tests ------------------------------------------- - -The GPU stress test runs various GEMM computations as workloads to stress the GPU FLOPS performance and check whether it -meets the configured target GFLOPS. - -Run the benchmark, stress, and qualification tests included with RVS. See the `Benchmark, stress, qualification -`_ -section of the Instinct documentation for usage instructions. - -BabelStream test ----------------- - -BabelStream is a synthetic GPU benchmark based on the STREAM benchmark for -CPUs, measuring memory transfer rates to and from global device memory. -BabelStream tests are included with the RVS package as part of the `BABEL module -`_. - -For more information, see `Performance benchmarking -`_ -in the Instinct documentation. - -RCCL tests -========== - -The ROCm Communication Collectives Library (RCCL) enables efficient multi-GPU -communication. The ``__ suite benchmarks -the performance and verifies the correctness of these collective operations. 
-This helps ensure optimal scaling for multi-GPU tasks. - -1. To get started, build RCCL-tests using the official instructions in the README at - ``__ or use the - following commands: - - .. code-block:: shell - - git clone https://github.com/ROCm/rccl-tests.git - cd rccl-tests - make - -2. Run the suggested RCCL tests -- see `RCCL benchmarking - `_ - in the AMD Instinct customer acceptance guide. - -TransferBench test -================== - -TransferBench is a standalone utility for benchmarking simultaneous data -transfer performance between various devices in the system, including -CPU-to-GPU and GPU-to-GPU (peer-to-peer). This helps identify potential -bottlenecks in data movement between the host system and the GPUs, or between -GPUs, which can impact end-to-end latency. - -.. _healthcheck-install-transferbench: - -1. To get started, use the instructions in the `TransferBench documentation - `_ - or use the following commands: - - .. code:: shell - - git clone https://github.com/ROCm/TransferBench.git - cd TransferBench - CC=hipcc make - -2. Run the suggested TransferBench tests -- see `TransferBench benchmarking - `_ - in the Instinct performance benchmarking documentation for instructions. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst deleted file mode 100644 index eec785b7b..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst +++ /dev/null @@ -1,359 +0,0 @@ -.. meta:: - :description: How to train a model using JAX MaxText for ROCm. - :keywords: ROCm, AI, LLM, train, jax, torch, Llama, flux, tutorial, docker - -****************************************** -Training a model with JAX MaxText on ROCm -****************************************** - -MaxText is a high-performance, open-source framework built on the Google JAX -machine learning library to train LLMs at scale. The MaxText framework for -ROCm is an optimized fork of the upstream -``__ enabling efficient AI workloads -on AMD MI300X series GPUs. - -The MaxText for ROCm training Docker image -provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs, -including essential components like JAX, XLA, ROCm libraries, and MaxText utilities. -It includes the following software components: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml - - {% set dockers = data.dockers %} - .. tab-set:: - - {% for docker in dockers %} - {% set jax_version = docker.components["JAX"] %} - - .. tab-item:: ``{{ docker.pull_tag }}`` - :sync: {{ docker.pull_tag }} - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - - {% endfor %} - {% if jax_version == "0.6.0" %} - .. note:: - - Shardy is a new config in JAX 0.6.0. You might get related errors if it's - not configured correctly. For now you can turn it off by setting - ``shardy=False`` during the training run. You can also follow the `migration - guide `__ to enable - it. - {% endif %} - - {% endfor %} - -MaxText with on ROCm provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- Flash Attention (FA) 3 -- with or without sequence input packing - -- GEMM tuning - -- Multi-node support - -- NANOO FP8 quantization support - -.. 
_amd-maxtext-model-support-v257: - -Supported models -================ - -The following models are pre-optimized for performance on AMD Instinct MI300 -series GPUs. Some instructions, commands, and available training -configurations in this documentation might vary by model -- select one to get -started. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - .. raw:: html - -
-
-
      <!-- Model and Variant selector dropdowns, generated from the YAML model_groups data:
           one Model option per model group and one Variant option per model. -->
- -.. note:: - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -Environment setup -================= - -This Docker image is optimized for specific model configurations outlined -as follows. Performance can vary for other training workloads, as AMD -doesn’t validate configurations and run conditions outside those described. - -Pull the Docker image ---------------------- - -Use the following command to pull the Docker image from Docker Hub. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml - - {% set dockers = data.dockers %} - .. tab-set:: - - {% for docker in dockers %} - {% set jax_version = docker.components["JAX"] %} - - .. tab-item:: JAX {{ jax_version }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endfor %} - -.. _amd-maxtext-multi-node-setup-v257: - -Multi-node configuration ------------------------- - -See :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your -environment for multi-node training. - -.. _amd-maxtext-get-started-v257: - -Benchmarking -============ - -Once the setup is complete, choose between two options to reproduce the -benchmark results: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml - - .. _vllm-benchmark-mad: - - {% set dockers = data.dockers %} - {% set model_groups = data.model_groups %} - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{model.mad_tag}} - - .. tab-set:: - - {% if model.mad_tag and "single-node" in model.doc_options %} - .. tab-item:: MAD-integrated benchmarking - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. Use this command to run the performance benchmark test on the {{ model.model }} model - using one GPU with the :literal:`{{model.precision}}` data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{model.mad_tag}} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/perf.csv/``. - {% endif %} - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required scripts - - Run the JAX MaxText benchmark tool independently by starting the - Docker container as shown in the following snippet. - - .. 
tab-set:: - {% for docker in dockers %} - {% set jax_version = docker.components["JAX"] %} - - .. tab-item:: JAX {{ jax_version }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - {% endfor %} - - {% if model.model_repo and "single-node" in model.doc_options %} - .. rubric:: Single node training - - 1. Set up environment variables. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN= - export HF_HOME= - - ``MAD_SECRETS_HFTOKEN`` is your Hugging Face access token to access models, tokenizers, and data. - See `User access tokens `__. - - ``HF_HOME`` is where ``huggingface_hub`` will store local data. See `huggingface_hub CLI `__. - If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. - Downloaded files typically get cached to ``~/.cache/huggingface``. - - 2. Launch the Docker container. - - .. tab-set:: - {% for docker in dockers %} - {% set jax_version = docker.components["JAX"] %} - - .. tab-item:: JAX {{ jax_version }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker run -it \ - --device=/dev/dri \ - --device=/dev/kfd \ - --network host \ - --ipc host \ - --group-add video \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - -v $HF_HOME:/hf_cache \ - -e HF_HOME=/hf_cache \ - -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN - --shm-size 64G \ - --name training_env \ - {{ docker.pull_tag }} - {% endfor %} - - 3. In the Docker container, clone the ROCm MAD repository and navigate to the - benchmark scripts directory at ``MAD/scripts/jax-maxtext``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/jax-maxtext - - 4. Run the setup scripts to install libraries and datasets needed - for benchmarking. - - .. code-block:: shell - - ./jax-maxtext_benchmark_setup.sh -m {{ model.model_repo }} - - 5. To run the training benchmark without quantization, use the following command: - - .. code-block:: shell - - ./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }} - - For quantized training, use the following command: - - .. code-block:: shell - - ./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }} -q nanoo_fp8 - - {% endif %} - {% if model.multinode_training_script and "multi-node" in model.doc_options %} - .. rubric:: Multi-node training - - The following examples use SLURM to run on multiple nodes. - - .. note:: - - The following scripts will launch the Docker container and run the - benchmark. Run them outside of any Docker container. - - 1. Make sure ``$HF_HOME`` is set before running the test. See - `ROCm benchmarking `__ - for more details on downloading the Llama models before running the - benchmark. - - 2. To run multi-node training for {{ model.model }}, - use the - `multi-node training script `__ - under the ``scripts/jax-maxtext/gpu-rocm/`` directory. - - 3. Run the multi-node training benchmark script. - - .. code-block:: shell - - sbatch -N {{ model.multinode_training_script }} - - {% else %} - .. rubric:: Multi-node training - - For multi-node training examples, choose a model from :ref:`amd-maxtext-model-support-v257` - with an available `multi-node training script `__. - {% endif %} - {% endfor %} - {% endfor %} - -Further reading -=============== - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. 
- -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/jax-maxtext-history` to find documentation for previous releases -of the ``ROCm/jax-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst deleted file mode 100644 index ebd55be17..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst +++ /dev/null @@ -1,994 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -****************************************** -Training a model with Megatron-LM on ROCm -****************************************** - -.. caution:: - - Primus with Megatron is designed to replace this ROCm Megatron-LM training workflow. - To learn how to migrate workloads from Megatron-LM to Primus with Megatron, - see :doc:`previous-versions/megatron-lm-primus-migration-guide`. - -The `Megatron-LM framework for ROCm `_ is -a specialized fork of the robust Megatron-LM, designed to enable efficient -training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series GPUs, Megatron-LM delivers enhanced -scalability, performance, and resource utilization for AI workloads. It is -purpose-built to support models like Llama, DeepSeek, and Mixtral, -enabling developers to train next-generation AI models more -efficiently. - -AMD provides ready-to-use Docker images for MI300X series GPUs containing -essential components, including PyTorch, ROCm libraries, and Megatron-LM -utilities. It contains the following software components to accelerate training -workloads: - -.. note:: - - This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with - Python 3.12 and Ubuntu 24.04, see the :doc:`previous ROCm Megatron-LM v25.6 Docker release `. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml - - {% set dockers = data.dockers %} - .. tab-set:: - - {% for docker in dockers %} - .. tab-item:: ``{{ docker.pull_tag }}`` - :sync: {{ docker.pull_tag }} - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - - {% endfor %} - {% endfor %} - - .. _amd-megatron-lm-model-support: - - Supported models - ================ - - The following models are supported for training performance benchmarking with Megatron-LM and ROCm - on AMD Instinct MI300X series GPUs. - Some instructions, commands, and training recommendations in this documentation might - vary by model -- select one to get started. - - {% set model_groups = data.model_groups %} - .. raw:: html - -
-
-
      <!-- Model and Variant selector dropdowns, generated from the YAML model_groups data:
           one Model option per model group and one Variant option per model. -->
- -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -.. _amd-megatron-lm-performance-measurements: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `__ -page provides reference throughput and latency measurements for training -popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `__ - only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. _mi300x-amd-megatron-lm-training: - -Environment setup -================= - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series GPUs with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements: - -Download the Docker image -------------------------- - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml - - {% set dockers = data.dockers %} - 1. Use the following command to pull the Docker image from Docker Hub. - - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in data.dockers %} - .. tab-item:: {{ docker.doc_name }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endfor %} - {% elif dockers|length == 1 %} - {% set docker = dockers[0] %} - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endif %} - 2. Launch the Docker container. - - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in dockers %} - .. tab-item:: {{ docker.doc_name }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 128G \ - --name megatron_training_env \ - {{ docker.pull_tag }} - - {% endfor %} - {% elif dockers|length == 1 %} - {% set docker = dockers[0] %} - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 128G \ - --name megatron_training_env \ - {{ docker.pull_tag }} - - {% endif %} - -3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it. - - .. 
code-block:: shell - - docker start megatron_training_env - docker exec -it megatron_training_env bash - -4. **Megatron-LM backward compatibility setup** -- this Docker is primarily intended for use with Primus, but it maintains Megatron-LM compatibility with limited support. - To roll back to using Megatron-LM, follow these steps: - - .. code-block:: shell - - cd /workspace/Megatron-LM/ - pip uninstall megatron-core - pip install -e . - -The Docker container hosts -``__ at verified commit ``e8e9edc``. - -.. _amd-megatron-lm-environment-setup: - -Configuration -============= - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b - - Update the ``train_llama3.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - Update the ``train_llama2.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. note:: - - See :ref:`Key options ` for more information on configuration options. - -Multi-node configuration ------------------------- - -Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node -training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands. - -.. _amd-megatron-lm-tokenizer: - -Tokenizer ---------- - -You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` as shown in the following examples. -If the tokenizer is not found, it'll be downloaded if publicly available. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - If you do not have Llama 3.3 tokenizer locally, you need to use your - personal Hugging Face access token ``HF_TOKEN`` to download the tokenizer. - See `Llama-3.3-70B-Instruct - `_. After you are - authorized, use your ``HF_TOKEN`` to download the tokenizer and set the - variable ``TOKENIZER_MODEL`` to the tokenizer path. - - .. code-block:: shell - - export HF_TOKEN= - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.3-70B-Instruct" - -.. 
container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-8B" - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-70B" - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - The training script uses either the ``Llama2Tokenizer`` or ``HuggingFaceTokenizer`` by default. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3" - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V2-Lite" - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Download the Mixtral tokenizer. - - .. code-block:: shell - - mkdir tokenizer - cd tokenizer - export HF_TOKEN= - wget --header="Authorization: Bearer $HF_TOKEN" -O ./tokenizer.model https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/tokenizer.model - - Use the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL=tokenizer/tokenizer.model - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="Qwen/Qwen2.5-7B" - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-72b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="Qwen/Qwen2.5-72B" - -Dataset options ---------------- - -You can use either mock data or real data for training. - -* Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - -* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Download the dataset -^^^^^^^^^^^^^^^^^^^^ - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b pyt_megatron_lm_train_llama-3.1-70b-proxy - - For Llama models, use the `prepare_dataset.sh - `_ script - to prepare your dataset. - To download the dataset, set the ``DATASET`` variable to the dataset you'd - like to use. Three datasets are supported: ``DATASET=wiki``, ``DATASET=fineweb``, and - ``DATASET=bookcorpus``. - - .. 
code-block:: shell - - DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for wiki-en dataset - DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for bookcorpus dataset - - ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer. - Remember to either pre-download the tokenizer or setup Hugging Face access - otherwise when needed -- see the :ref:`Tokenizer ` section. - - .. note:: - - When training set ``DATA_PATH`` to the specific file name prefix pointing to the ``.bin`` or ``.idx`` - as in the following example: - - .. code-block:: shell - - DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - cd .. - bash tools/run_make_pretraining_dataset_megatron.sh deepseek-datasets/SlimPajama.json DeepSeekV3Tokenizer text deepseek-datasets deepseek-ai/DeepSeek-V3 - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - cd .. - bash tools/run_make_pretraining_dataset_megatron.sh deepseek-datasets/SlimPajama.json DeepSeekV3Tokenizer text deepseek-datasets deepseek-ai/DeepSeek-V3 - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. 
code-block:: shell - - mkdir mixtral-datasets - cd mixtral-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/mixtral-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b pyt_megatron_lm_train_qwen2.5-72b - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. code-block:: shell - - mkdir -p temp/qwen-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/qwen-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. _amd-megatron-lm-run-training: - -Run training -============ - -Use the following example commands to set up the environment, configure -:ref:`key options `, and run training on -MI300X series GPUs with the AMD Megatron-LM environment. - -Single node training --------------------- - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - To run the training on a single node for Llama 3.3 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. code-block:: shell - - TOKENIZER_MODEL=meta-llama/Llama-3.3-70B-Instruct \ - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=8192 \ - MBS=2 \ - BS=16 \ - TE_FP8=0 \ - TP=1 \ - PP=1 \ - FSDP=1 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=128 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - For Llama 3.1 8B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=128 \ - TP=1 \ - TE_FP8=0 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - To run the training on a single node for Llama 3.1 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. 
code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - MBS=3 \ - BS=24 \ - TP=1 \ - TE_FP8=0 \ - FSDP=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b-proxy - - To run the training on a single node for Llama 3.1 70B with proxy, use the following command. - - .. code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - RECOMPUTE=1 \ - MBS=3 \ - BS=24 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=70 \ - FSDP=1 \ - TOTAL_ITERS=10 \ - NUM_LAYERS=40 \ - bash examples/llama/train_llama3.sh - - .. note:: - - Use two or more nodes to run the *full* Llama 70B model with FP8 precision. - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b - - To run training on a single node for Llama 2 7B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=4 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=7 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - - For Llama 2 7B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=4 \ - BS=256 \ - TP=1 \ - TE_FP8=0 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=7 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-2-70b - - To run the training on a single node for Llama 2 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - MBS=7 \ - BS=56 \ - TP=1 \ - TE_FP8=0 \ - FSDP=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - export NVTE_FUSED_ATTN_CK=0 - FORCE_BALANCE=true \ - RUN_ENV=cluster \ - MODEL_SIZE=671B \ - TRAIN_ITERS=50 \ - SEQ_LEN=4096 \ - NUM_LAYERS=3 \ - MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=32 \ - PR=bf16 \ - TP=1 PP=1 ETP=1 EP=8 \ - GEMM_TUNING=1 \ - NVTE_CK_USES_BWD_V3=1 \ - USE_GROUPED_GEMM=true MOE_USE_LEGACY_GROUPED_GEMM=true \ - GPT_LAYER_IN_TE=true \ - bash examples/deepseek_v3/train_deepseekv3.sh - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. 
code-block:: shell - - export NVTE_FUSED_ATTN_CK=0 - GEMM_TUNING=1 \ - PR=bf16 \ - MBS=4 \ - AC=none \ - SEQ_LEN=4096 \ - PAD_LEN=4096 \ - TRAIN_ITERS=20 \ - bash examples/deepseek_v2/train_deepseekv2.sh - - .. note:: - - Note that DeepSeek-V2-Lite is experiencing instability due to GPU memory access fault - for large iterations. - For stability, it's recommended to use Primus for this workload. - See :doc:`primus-megatron`. - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - TOKENIZER_MODEL= - RECOMPUTE_NUM_LAYERS=0 \ - TEE_OUTPUT=1 \ - MBS=1 \ - GBS=16 \ - TP_SIZE=1 \ - PP_SIZE=1 \ - AC=none \ - PR=bf16 \ - EP_SIZE=8 \ - ETP_SIZE=1 \ - SEQLEN=4096 \ - FORCE_BALANCE=true \ - MOCK_DATA=1 \ - RUN_ENV=cluster \ - MODEL_SIZE=8x7B \ - TRAIN_ITERS=50 \ - bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x22b-proxy - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel) with 4-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - TOKENIZER_MODEL= - RECOMPUTE_NUM_LAYERS=4 \ - TEE_OUTPUT=1 \ - MBS=1 \ - GBS=16 \ - TP_SIZE=1 \ - PP_SIZE=1 \ - AC=full \ - NUM_LAYERS=4 \ - PR=bf16 \ - EP_SIZE=8 \ - ETP_SIZE=1 \ - SEQLEN=8192 \ - FORCE_BALANCE=true \ - MOCK_DATA=1 \ - RUN_ENV=cluster \ - MODEL_SIZE=8x22B \ - TRAIN_ITERS=50 \ - bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b - - To run training on a single node for Qwen 2.5 7B BF16, use the following - command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh TP=1 \ - CP=1 \ - PP=1 \ - MBS=10 \ - BS=640 \ - TE_FP8=0 \ - MODEL_SIZE=7 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-7B - - For FP8, use the following command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh \ - TP=1 \ - CP=1 \ - PP=1 \ - MBS=10 \ - BS=640 \ - TE_FP8=1 \ - MODEL_SIZE=7 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-7B - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-72b - - To run the training on a single node for Qwen 2.5 72B BF16, use the following command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh \ - FSDP=1 \ - CP=1 \ - PP=1 \ - MBS=3 \ - BS=24 \ - TE_FP8=0 \ - MODEL_SIZE=72 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-72B \ - RECOMPUTE_ACTIVATIONS=full \ - CKPT_FORMAT=torch_dist - -.. _amd-megatron-lm-multi-node-examples: - -Multi-node training examples ----------------------------- - -To run training on multiple nodes, launch the Docker container on each node. -For example, for Llama 3 using a two node setup (``NODE0`` as the master node), -use these commands. - -* On the master node ``NODE0``: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - MASTER_ADDR=IP_NODE0 \ - NNODES=2 \ - NODE_RANK=0 \ - bash examples/llama/train_llama3.sh - -* On the worker node ``NODE1``: - - .. 
code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - MASTER_ADDR=IP_NODE0 \ - NNODES=2 \ - NODE_RANK=1 \ - bash examples/llama/train_llama3.sh - -Or, for DeepSeek-V3, an example script ``train_deepseek_v3_slurm.sh`` is -provided in -``__ to -enable training at scale under a SLURM environment. For example, to run -training on 16 nodes, try the following command: - -.. code-block:: shell - - sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh - -.. _amd-megatron-lm-benchmark-test-vars: - -Key options ------------ - -The benchmark tests support the following sets of variables. - -``TEE_OUTPUT`` - ``1`` to enable training logs or ``0`` to disable. - -``TE_FP8`` - ``0`` for B16 or ``1`` for FP8 -- ``0`` by default. - -``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - -``USE_FLASH_ATTN`` - ``1`` to enable Flash Attention. - -``FSDP`` - ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``, - ``--overlap-param-gather``, and ``--sequence-parallel`` are automatically disabled. - -``ENABLE_PROFILING`` - ``1`` to enable PyTorch profiling for performance analysis. - -``transformer-impl`` - ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE. - -``MODEL_SIZE`` - ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2, for example. - -``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - -``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data you provide. - -``MBS`` - Micro batch size. - -``BS`` - Global batch size. - -``TP`` / ``TP_SIZE`` - Tensor parallel (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on. - -``EP`` / ``EP_SIZE`` - Expert parallel for MoE models. - -``SEQ_LENGTH`` - Input sequence length. - -``PR`` - Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs. - -``AC`` - Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default. - -``NUM_LAYERS`` - Use reduced number of layers as a proxy model. - -``RECOMPUTE_NUM_LAYERS`` - Number of layers used for checkpointing recompute. - -Previous versions -================= - -See :doc:`previous-versions/megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst deleted file mode 100644 index de9e44a8c..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst +++ /dev/null @@ -1,188 +0,0 @@ -.. meta:: - :description: How to train a model using LLM Foundry for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -****************************************** -Training MPT-30B with LLM Foundry on ROCm -****************************************** - -MPT-30B is a 30-billion parameter decoder-style transformer-based model from -the Mosaic Pretrained Transformer (MPT) family -- learn more about it in -MosaicML's research blog `MPT-30B: Raising the bar for open-source foundation -models `_. - -ROCm and ``__ provide a pre-configured training -environment for the MPT-30B model using the ``rocm/pytorch-training:v25.5`` -base `Docker image `_ -and the `LLM Foundry `_ framework. 
-This environment packages the following software components to train -on AMD Instinct MI300X series accelerators: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.4 | -+--------------------------+--------------------------------+ -| PyTorch | 2.7.0a0+git6374332 | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0.post1 | -+--------------------------+--------------------------------+ - -Using this image, you can build, run, and test the training process -for MPT-30B with access to detailed logs and performance metrics. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -Getting started -=============== - -The following procedures help you set up the training environment in a -reproducible Docker container. This training environment is tailored for -training MPT-30B using LLM Foundry and the specific model configurations outlined. -Other configurations and run conditions outside those described in this -document are not validated. - -.. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - On your host machine, clone the ROCm Model Automation and Dashboarding - (``__) repository to a local directory and - install the required packages. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - Use this command to initiate the MPT-30B training benchmark. - - .. code-block:: shell - - madengine run \ - --tags pyt_mpt30b_training \ - --keep-model-dir \ - --live-output \ - --clean-docker-cache - - .. tip:: - - If you experience data download failures, set the - ``MAD_SECRETS_HFTOKEN`` variable to your Hugging Face access token. See - `User access tokens `_ - for details. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - - .. note:: - - For improved performance (training throughput), consider enabling TunableOp. - By default, ``pyt_mpt30b_training`` runs with TunableOp disabled. To enable it, - run ``madengine run`` with the ``--tunableop on`` argument or edit the - ``models.json`` configuration before running training. - - Although this might increase the initial training time, it can result in a performance gain. - - .. tab-item:: Standalone benchmarking - - To set up the training environment, clone the - ``__ repo and build the Docker image. In - this snippet, the image is named ``mosaic_mpt30_image``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - - docker build --build-arg MAD_SYSTEM_GPU_ARCHITECTURE=gfx942 -f docker/pyt_mpt30b_training.ubuntu.amd.Dockerfile -t mosaic_mpt30_image . - - Start a ``mosaic_mpt30_image`` container using the following command. - - .. 
code-block:: shell - - docker run -it --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --shm-size=8G mosaic_mpt30_image - - In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory at - ``/workspace/MAD/scripts/pyt_mpt30b_training``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pyt_mpt30b_training - - To initiate the training process, use the following command. This script uses the hyperparameters defined in - ``mpt-30b-instruct.yaml``. - - .. code-block:: shell - - source run.sh - - .. note:: - - For improved performance (training throughput), consider enabling TunableOp. - To enable it, add the ``--tunableop on`` flag. - - .. code-block:: shell - - source run.sh --tunableop on - - Although this might increase the initial training time, it can result in a performance gain. - -Interpreting the output -======================= - -The training output will be displayed in the terminal and simultaneously saved -to the ``output.txt`` file in the current directory. Key performance metrics will -also be extracted and appended to the ``perf_pyt_mpt30b_training.csv`` file. - -Key performance metrics include: - -- Training logs: Real-time display of loss metrics, accuracy, and training progress. - -- Model checkpoints: Periodically saved model snapshots for potential resume or evaluation. - -- Performance metrics: Detailed summaries of training speed and training loss metrics. - - - Performance (throughput/samples_per_sec) - - Overall throughput, measuring the total samples processed per second. Higher values indicate better hardware utilization. - - - Performance per device (throughput/samples_per_sec) - - Throughput on a per-device basis, showing how each GPU or CPU is performing. - - - Language Cross Entropy (metrics/train/LanguageCrossEntropy) - - Measures prediction accuracy. Lower cross entropy suggests the model’s output is closer to the expected distribution. - - - Training loss (loss/train/total) - - Overall training loss. A decreasing trend indicates the model is learning effectively. - -Further reading -=============== - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history.rst deleted file mode 100644 index e4d039356..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history.rst +++ /dev/null @@ -1,43 +0,0 @@ -:orphan: - -******************************************************** -JAX MaxText training performance testing version history -******************************************************** - -This table lists previous versions of the ROCm JAX MaxText Docker image for training -performance testing. For detailed information about available models for -benchmarking, see the version-specific documentation. -You can find tagged -previous releases of the ``ROCm/jax-training`` Docker image on `Docker Hub `_. - -.. 
list-table:: - :header-rows: 1 - - * - Image version - - Components - - Resources - - * - 25.7 (latest) - - - * ROCm 6.4.1 - * JAX 0.6.0, 0.5.0 - - - * :doc:`Documentation <../jax-maxtext>` - * `Docker Hub (JAX 0.6.0) `__ - * `Docker Hub (JAX 0.5.0) `__ - - * - 25.5 - - - * ROCm 6.3.4 - * JAX 0.4.35 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - 25.4 - - - * ROCm 6.3.0 - * JAX 0.4.31 - - - * :doc:`Documentation ` - * `Docker Hub `__ diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst deleted file mode 100644 index 8b8dd65bd..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst +++ /dev/null @@ -1,356 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using JAX MaxText for ROCm. - :keywords: ROCm, AI, LLM, train, jax, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with MaxText for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm JAX MaxText - training performance documentation. See :doc:`../jax-maxtext` for the latest version. - -MaxText is a high-performance, open-source framework built on the Google JAX -machine learning library to train LLMs at scale. The MaxText framework for -ROCm is an optimized fork of the upstream -``__ enabling efficient AI workloads -on AMD MI300X series accelerators. - -The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.4``) image -provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators, -including essential components like JAX, XLA, ROCm libraries, and MaxText utilities. -It includes the following software components: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.0 | -+--------------------------+--------------------------------+ -| JAX | 0.4.31 | -+--------------------------+--------------------------------+ -| Python | 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.12.0.dev0+f81a3eb | -+--------------------------+--------------------------------+ -| hipBLASLt | git78ec8622 | -+--------------------------+--------------------------------+ - -Supported features and models -============================= - -MaxText provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- Flash Attention (FA) 3 - -- GEMM tuning - -- Multi-node support - -.. _amd-maxtext-model-support-v254: - -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. - -* Llama 3.1 8B - -* Llama 3.1 70B - -* Llama 3 8B - -* Llama 3 70B - -* Llama 2 7B - -* Llama 2 70B - -* DeepSeek-V2-Lite - -.. note:: - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -Unsupported features --------------------- - -Currently, MaxText's default packed input format is not supported. Using this format -with the current Docker image results in incorrect attention calculations -across different input sequences. Support for packed input format is planned for a future release. 
- -System validation -================= - -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system -before starting training. - -Environment setup -================= - -This Docker image is optimized for specific model configurations outlined -as follows. Performance can vary for other training workloads, as AMD -doesn’t validate configurations and run conditions outside those described. - -.. _amd-maxtext-multi-node-setup-v254: - -Multi-node setup ----------------- - -For multi-node environments, ensure you have all the necessary packages for -your network device, such as, RDMA. If you're not using a multi-node setup -with RDMA, skip ahead to :ref:`amd-maxtext-download-docker-v254`. - -1. Install the following packages to build and install the RDMA driver. - - .. code-block:: shell - - sudo apt install iproute2 -y - sudo apt install -y linux-headers-"$(uname-r)" libelf-dev - sudo apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core strace libibmad5 libibnetdisc5 ibverbs-providers libibumad-dev libibumad3 libibverbs1 libnl-3-dev libnl-route-3-dev - - Refer to your NIC manufacturer's documentation for further steps on - compiling and installing the RoCE driver. For example, for Broadcom, - see `Compiling Broadcom NIC software from source `_ - in `Ethernet networking guide for AMD Instinct MI300X GPU clusters `_. - -2. Set the following environment variables. - - a. Master address - - Change `localhost` to the master node's resolvable hostname or IP address: - - .. code-block:: bash - - export MASTER_ADDR="${MASTER_ADDR:-localhost}" - - b. Number of nodes - - Set the number of nodes you want to train on (for example, ``2``, ``4``, or ``8``): - - .. code-block:: bash - - export NNODES="${NNODES:-1}" - - c. Node ranks - - Set the rank of each node (``0`` for master, ``1`` for the first worker node, and so on) - Node ranks should be unique across all nodes in the cluster. - - .. code-block:: bash - - export NODE_RANK="${NODE_RANK:-0}" - - d. Network interface - - Update the network interface in the script to match your system's network interface. To - find your network interface, run the following (outside of any Docker container): - - .. code-block:: bash - - ip a - - Look for an active interface with an IP address in the same subnet as - your other nodes. Then, update the following variable in the script, for - example: - - .. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - This variable specifies which network interface to use for inter-node communication. - Setting this variable to the incorrect interface can result in communication failures - or significantly reduced performance. - - e. RDMA interface - - Ensure the :ref:`required packages ` are installed on all nodes. - Then, set the RDMA interfaces to use for communication. - - .. code-block:: bash - - # If using Broadcom NIC - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - # If using Mellanox NIC - export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9 - -.. _amd-maxtext-download-docker-v254: - -Download the Docker image -------------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/jax-training:maxtext-v25.4 - -2. Run the Docker container. - - .. 
code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME/.ssh:/root/.ssh --shm-size 128G --name maxtext_training rocm/jax-training:maxtext-v25.4 - -.. _amd-maxtext-get-started-v254: - -Getting started -=============== - -The following examples demonstrate how to get started with single node -and multi-node training using the benchmarking scripts provided at -``__. - -.. important:: - - The provided scripts launch a Docker container and execute a benchmark. Ensure you run these commands outside of any existing Docker container. - -Before running any benchmarks, ensure the ``$HF_HOME`` environment variable is -set correctly and points to your Hugging Face cache directory. - -Single node training benchmarking examples ------------------------------------------- - -* Example 1: Single node training with Llama 2 7B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_7b.sh - - Run the single node training benchmark: - - IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama2_7b.sh - -* Example 2: Single node training with Llama 2 70B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_70b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama2_70b.sh - -* Example 3: Single node training with Llama 3 8B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_8b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama3_8b.sh - -* Example 4: Single node training with Llama 3 70B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_70b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.4" bash ./llama3_70b.sh - -* Example 5: Single node training with DeepSeek V2 16B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/deepseek_v2_16b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.4" bash ./deepseek_v2_16b.sh - - .. note:: - - The reported TFLOP/s by MaxText for DeepSeek is not accurate. Use - the tokens/s as a performance indicator. - -Multi-node training benchmarking examples ------------------------------------------ - -The following examples use SLURM for running on multiple nodes -- the commands might need to be adjusted for your -own cluster setup. - -* Example 1: Multi-node training with Llama 2 7B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_7b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama2_7b_multinode.sh - -* Example 2: Multi-node training with Llama 2 70B - - Download the benchmarking script: - - .. 
code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_70b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama2_70b_multinode.sh - -* Example 3: Multi-node training with Llama 3 8B model - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_8b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama3_8b_multinode.sh - -* Example 4: Multi-node training with Llama 3 70B model - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_70b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama3_70b_multinode.sh - -Previous versions -================= - -See :doc:`jax-maxtext-history` to find documentation for previous releases -of the ``ROCm/jax-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst deleted file mode 100644 index e1581423e..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst +++ /dev/null @@ -1,383 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using JAX MaxText for ROCm. - :keywords: ROCm, AI, LLM, train, jax, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with MaxText for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm JAX MaxText - training performance documentation. See :doc:`../jax-maxtext` for the latest version. - -MaxText is a high-performance, open-source framework built on the Google JAX -machine learning library to train LLMs at scale. The MaxText framework for -ROCm is an optimized fork of the upstream -``__ enabling efficient AI workloads -on AMD MI300X series accelerators. - -The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.5``) image -provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators, -including essential components like JAX, XLA, ROCm libraries, and MaxText utilities. -It includes the following software components: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.4 | -+--------------------------+--------------------------------+ -| JAX | 0.4.35 | -+--------------------------+--------------------------------+ -| Python | 3.10.12 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.12.0.dev0+b8b92dc | -+--------------------------+--------------------------------+ -| hipBLASLt | 0.13.0-ae9c477a | -+--------------------------+--------------------------------+ - -Supported features and models -============================= - -MaxText provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- Flash Attention (FA) 3 - -- GEMM tuning - -- Multi-node support - -.. 
_amd-maxtext-model-support-v255: - -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. - -* Llama 3.3 70B - -* Llama 3.1 8B - -* Llama 3.1 70B - -* Llama 3 8B - -* Llama 3 70B - -* Llama 2 7B - -* Llama 2 70B - -* DeepSeek-V2-Lite - -.. note:: - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -Unsupported features --------------------- - -Currently, MaxText's default packed input format is not supported. Using this format -with the current Docker image results in incorrect attention calculations -across different input sequences. Support for packed input format is planned for a future release. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -Environment setup -================= - -This Docker image is optimized for specific model configurations outlined -as follows. Performance can vary for other training workloads, as AMD -doesn’t validate configurations and run conditions outside those described. - -.. _amd-maxtext-multi-node-setup-v255: - -Multi-node setup ----------------- - -For multi-node environments, ensure you have all the necessary packages for -your network device, such as, RDMA. If you're not using a multi-node setup -with RDMA, skip ahead to :ref:`amd-maxtext-download-docker-v255`. - -1. Install the following packages to build and install the RDMA driver. - - .. code-block:: shell - - sudo apt install iproute2 -y - sudo apt install -y linux-headers-"$(uname-r)" libelf-dev - sudo apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core strace libibmad5 libibnetdisc5 ibverbs-providers libibumad-dev libibumad3 libibverbs1 libnl-3-dev libnl-route-3-dev - - Refer to your NIC manufacturer's documentation for further steps on - compiling and installing the RoCE driver. For example, for Broadcom, - see `Compiling Broadcom NIC software from source `_ - in `Ethernet networking guide for AMD Instinct MI300X GPU clusters `_. - -2. Set the following environment variables. - - a. Master address - - Change ``localhost`` to the master node's resolvable hostname or IP address: - - .. code-block:: bash - - export MASTER_ADDR="${MASTER_ADDR:-localhost}" - - b. Number of nodes - - Set the number of nodes you want to train on (for example, ``2``, ``4``, or ``8``): - - .. code-block:: bash - - export NNODES="${NNODES:-1}" - - c. Node ranks - - Set the rank of each node (``0`` for master, ``1`` for the first worker node, and so on) - Node ranks should be unique across all nodes in the cluster. - - .. code-block:: bash - - export NODE_RANK="${NODE_RANK:-0}" - - d. Network interface - - Update the network interface in the script to match your system's network interface. To - find your network interface, run the following (outside of any Docker container): - - .. 
code-block:: bash - - ip a - - Look for an active interface with an IP address in the same subnet as - your other nodes. Then, update the following variable in the script, for - example: - - .. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - This variable specifies which network interface to use for inter-node communication. - Setting this variable to the incorrect interface can result in communication failures - or significantly reduced performance. - - e. RDMA interface - - Ensure the :ref:`required packages ` are installed on all nodes. - Then, set the RDMA interfaces to use for communication. - - .. code-block:: bash - - # If using Broadcom NIC - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - # If using Mellanox NIC - export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9 - -.. _amd-maxtext-download-docker-v255: - -Pull the Docker image ---------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/jax-training:maxtext-v25.5 - -2. Use the following command to launch the Docker container. Note that the benchmarking scripts - used in the :ref:`following section ` automatically launch the Docker container - and execute the benchmark. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME/.ssh:/root/.ssh --shm-size 128G --name maxtext_training rocm/jax-training:maxtext-v25.5 - -.. _amd-maxtext-get-started-v255: - -Getting started -=============== - -The following examples demonstrate how to get started with single node -and multi-node training using the benchmarking scripts provided at -``__. - -.. important:: - - The provided scripts launch a Docker container and execute a benchmark. Ensure you run these commands outside of any existing Docker container. - -Before running any benchmarks, ensure the ``$HF_HOME`` environment variable is -set correctly and points to your Hugging Face cache directory. - -Single node training benchmarking examples ------------------------------------------- - -* Example 1: Single node training with Llama 2 7B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_7b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.5" bash ./llama2_7b.sh - -* Example 2: Single node training with Llama 2 70B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_70b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.5" bash ./llama2_70b.sh - -* Example 3: Single node training with Llama 3 8B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_8b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.5" bash ./llama3_8b.sh - -* Example 4: Single node training with Llama 3 70B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_70b.sh - - Run the single node training benchmark: - - .. 
code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.5" bash ./llama3_70b.sh - -* Example 5: Single node training with Llama 3.3 70B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3.3_70b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.5" bash ./llama3.3_70b.sh - -* Example 6: Single node training with DeepSeek V2 16B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/deepseek_v2_16b.sh - - Run the single node training benchmark: - - .. code-block:: shell - - IMAGE="rocm/jax-training:maxtext-v25.5" bash ./deepseek_v2_16b.sh - - .. note:: - - The reported TFLOP/s by MaxText for DeepSeek is not accurate. Use - the tokens/s as a performance indicator. - -Multi-node training benchmarking examples ------------------------------------------ - -The following examples use SLURM for running on multiple nodes -- the commands might need to be adjusted for your -own cluster setup. - -* Example 1: Multi-node training with Llama 2 7B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_7b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama2_7b_multinode.sh - -* Example 2: Multi-node training with Llama 2 70B - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama2_70b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama2_70b_multinode.sh - -* Example 3: Multi-node training with Llama 3 8B model - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_8b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama3_8b_multinode.sh - -* Example 4: Multi-node training with Llama 3 70B model - - Download the benchmarking script: - - .. code-block:: shell - - wget https://raw.githubusercontent.com/ROCm/maxtext/refs/heads/main/benchmarks/gpu-rocm/llama3_70b_multinode.sh - - Run the multi-node training benchmark. For example: - - .. code-block:: shell - - sbatch -N llama3_70b_multinode.sh - -Previous versions -================= - -See :doc:`jax-maxtext-history` to find documentation for previous releases -of the ``ROCm/jax-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-history.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-history.rst deleted file mode 100644 index 7f31d482a..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-history.rst +++ /dev/null @@ -1,77 +0,0 @@ -:orphan: - -******************************************************** -Megatron-LM training performance testing version history -******************************************************** - -This table lists previous versions of the ROCm Megatron-LM training Docker image for -inference performance testing. For detailed information about available models -for benchmarking, see the version-specific documentation. 
You can find tagged -previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub `__. - -.. list-table:: - :header-rows: 1 - - * - Image version - - Components - - Resources - - * - v25.8 (latest) - - - * ROCm 6.4.3 - * PyTorch 2.8.0a0+gitd06a406 - - - * :doc:`Primus Megatron documentation <../primus-megatron>` - * :doc:`Megatron-LM (legacy) documentation <../megatron-lm>` - * `Docker Hub (py310) `__ - - * - v25.7 - - - * ROCm 6.4.2 - * PyTorch 2.8.0a0+gitd06a406 - - - * :doc:`Primus Megatron documentation ` - * :doc:`Megatron-LM (legacy) documentation ` - * `Docker Hub (py310) `__ - - * - v25.6 - - - * ROCm 6.4.1 - * PyTorch 2.8.0a0+git7d205b2 - - - * :doc:`Documentation ` - * `Docker Hub (py312) `__ - * `Docker Hub (py310) `__ - - * - v25.5 - - - * ROCm 6.3.4 - * PyTorch 2.8.0a0+gite2f9759 - - - * :doc:`Documentation ` - * `Docker Hub (py312) `__ - * `Docker Hub (py310) `__ - - * - v25.4 - - - * ROCm 6.3.0 - * PyTorch 2.7.0a0+git637433 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - v25.3 - - - * ROCm 6.3.0 - * PyTorch 2.7.0a0+git637433 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - v24.12-dev - - - * ROCm 6.1.0 - * PyTorch 2.4.0 - - - * :doc:`Documentation ` - * `Docker Hub `__ diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide.rst deleted file mode 100644 index 409acdb95..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide.rst +++ /dev/null @@ -1,175 +0,0 @@ -:orphan: - -***************************************************************** -Migrating workloads to Primus (Megatron backend) from Megatron-LM -***************************************************************** - -Primus supports Megatron-Core as backend optimization library, -replacing ROCm Megatron-LM. This document outlines the steps to migrate -workload from ROCm Megatron-LM to Primus with the Megatron backend. - -Model architecture -================== - -ROCm Megatron-LM defines model architecture parameters in the training scripts; -for example, the Llama 3 8B model parameters are defined in -`examples/llama/train_llama3.sh `__ -as shown below: - -.. code-block:: bash - - HIDDEN_SIZE=4096 - FFN_HIDDEN_SIZE=14336 - NUM_LAYERS=32 - NUM_HEADS=32 - NUM_KV_HEADS=8 - -Primus defines the model architecture through model YAML configuration files -inside the ``primus/configs/models/megatron/`` repository. For example, Llama 3 8B -model architecture parameters are defined in -`primus/configs/models/megatron/llama3_8B.yaml `__ -as shown below: - -.. code-block:: yaml - - bases: - - llama3_base.yaml - - tokenizer_type: Llama3Tokenizer - tokenizer_model: meta-llama/Llama-3.1-8B - - ffn_hidden_size: 14336 - hidden_size: 4096 - num_attention_heads: 32 - num_layers: 32 - num_query_groups: 8 - -Primus' model config files follow a hierarchical design, meaning that new model -config YAMLs can inherit existing model config files by importing them as -bases. For example, -`llama3.1_8B.yaml `__ -uses ``llama3_8B.yaml`` as a base config and overrides few parameters, as shown below. -In this example, ``llama3.1_8B`` overrides the ``max_position_embeddings`` value: - -.. code-block:: yaml - - bases: - - llama3_8B.yaml - - tokenizer_type: Llama3Tokenizer - tokenizer_model: meta-llama/Llama-3.1-8B - - max_position_embeddings: 131072 - -.. 
tip:: - - Primus provides ``llama_base.yaml`` as the base configuration, which can be - used as bases for additional model architectures. For example, - `mixtral_base.yaml `__ - and - `deepseek_v3_base.yaml `__ - define ``llama_base.yaml`` as its base. - - .. code-block:: yaml - - # Example mixtral_base.yaml: - - bases: - - llama_base.yaml - - init_method_std: 0.01 - rotary_base: 1000000 - qk_layernorm: false - - group_query_attention: true - num_query_groups: 8 - - # moe parameters - num_experts: 8 - moe_router_topk: 2 - moe_router_load_balancing_type: aux_loss - moe_aux_loss_coeff: 1e-2 - moe_grouped_gemm: true - moe_token_dispatcher_type: alltoall - -It is recommended to add a new ``${MODEL_NAME}_base.yaml`` to add a new -category of model and define new models on top of it. For example, to add -Qwen2.5 models in Primus, we define -`qwen2.5_base.yaml `__ -and build -`qwen2.5_7B.yaml `__ -and -`qwen2.5_72B.yaml `__ -using ``qwen2.5_base.yaml`` as the base config. - -Training parameters -=================== - -ROCm Megatron-LM also defines the training parameters, like batch size, -tensor-parallelism, precision, as so on, in the training scripts. For example, -Llama3 8B model parameters are defined in -`examples/llama/train_llama3.sh `__ -as shown below: - -.. code-block:: bash - - TP="${TP:-8}" - PP="${PP:-1}" - CP="${CP:-1}" - MBS="${MBS:-1}" - BS="${BS:-8}" - -Primus defines the training parameters in top-level YAML files -- see -`examples/megatron/configs/ -`__. -For example, the `llama3.1_8B-pretrain.yaml -`__ -configuration imports the ``llama3.1_8B.yaml`` model architecture file. Users can then override -the default training parameters in ``llama3.1_8B-pretrain.yaml``. - -.. code-block:: yaml - - # model to run - model: llama3.1_8B.yaml # Model architecture yaml - overrides: - # log - # disable_wandb: false - # disable_tensorboard: false - stderr_sink_level: DEBUG - - log_avg_skip_iterations: 2 - log_avg_reset_interval: 50 - - train_iters: 50 - micro_batch_size: 2 - global_batch_size: 128 - - seq_length: 8192 - max_position_embeddings: 8192 - - lr: 1.0e-5 - min_lr: 0.0 - lr_warmup_iters: 2 - lr_decay_iters: null - lr_decay_style: cosine - weight_decay: 0.1 - adam_beta1: 0.9 - adam_beta2: 0.95 - eod_mask_loss: true - init_method_std: 0.008 - norm_epsilon: 1.0e-6 - -Backward compatibility with Megatron-LM -======================================= - -The Dockerized environment used for Primus maintains compatibility with Megatron-LM with -limited support. To roll back to using Megatron-LM, follow these steps. - -.. code-block:: shell - - cd /workspace/Megatron-LM/ - pip uninstall megatron-core - pip install -e . - -Once Megatron-LM is installed, follow :doc:`the documentation <../megatron-lm>` to run workloads as -usual. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst deleted file mode 100644 index c18b1dfea..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst +++ /dev/null @@ -1,516 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using ROCm Megatron-LM - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -************************************** -Training a model with ROCm Megatron-LM -************************************** - -.. 
caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../megatron-lm` for the latest version. - -.. _amd-megatron-lm: - -The ROCm Megatron-LM framework is a specialized fork of the robust Megatron-LM, designed to -enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X -accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI -workloads. It is purpose-built to :ref:`support models ` -like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater -efficiency. See the GitHub repository at ``__. - -For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential -components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the -following software to accelerate training workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.1 | -+--------------------------+--------------------------------+ -| PyTorch | 2.4.0 | -+--------------------------+--------------------------------+ -| PyTorch Lightning | 2.4.0 | -+--------------------------+--------------------------------+ -| Megatron Core | 0.9.0 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.5.0 | -+--------------------------+--------------------------------+ -| Flash Attention | v2.6 | -+--------------------------+--------------------------------+ -| Transformers | 4.44.0 | -+--------------------------+--------------------------------+ - -Supported features and models -============================= - -Megatron-LM provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- APEX - -- GEMM tuning - -- Torch.compile - -- 3D parallelism: TP + SP + CP - -- Distributed optimizer - -- Flash Attention (FA) 2 - -- Fused kernels - -- Pre-training - -.. _amd-megatron-lm-model-support-24-12: - -The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator. - -* Llama 2 7B - -* Llama 2 70B - -* Llama 3 8B - -* Llama 3 70B - -* Llama 3.1 8B - -* Llama 3.1 70B - -Prerequisite system validation steps -==================================== - -Complete the following system validation and optimization steps to set up your system before starting training. - -Disable NUMA auto-balancing ---------------------------- - -Generally, application performance can benefit from disabling NUMA auto-balancing. However, -it might be detrimental to performance with certain types of workloads. - -Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform -Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or -the output is ``1``, run the following command to disable NUMA auto-balancing. - -.. code-block:: shell - - sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - -See :ref:`System validation and optimization ` -for more information. - -Hardware verification with ROCm -------------------------------- - -Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz -instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable -GPU clocks. 
This setting will not be required for new IFWI releases with the production PRC feature.
-You can restore this setting to its default value with the ``rocm-smi -r`` command.
-
-Run the command:
-
-.. code-block:: shell
-
-   rocm-smi --setperfdeterminism 1900
-
-See `Hardware verification with ROCm `_ for more information.
-
-RCCL Bandwidth Test
--------------------
-
-ROCm Collective Communications Library (RCCL) is a standalone library of standard collective communication
-routines for GPUs. See the :doc:`RCCL documentation ` for more information. Before starting
-pre-training, running an RCCL bandwidth test helps ensure that the multi-GPU or multi-node setup is optimized
-for efficient distributed training.
-
-Running the RCCL bandwidth test helps verify that:
-
-- The GPUs can communicate across nodes or within a single node.
-
-- The interconnect (such as InfiniBand, Ethernet, or Infinity Fabric) is functioning as expected and
-  provides adequate bandwidth for communication.
-
-- No hardware setup or cabling issues affect communication between the GPUs.
-
-Tuning and optimizing hyperparameters
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-In distributed training, specific hyperparameters related to distributed communication can be tuned based on
-the results of the RCCL bandwidth test. These variables are already set in the Docker image:
-
-.. code-block:: shell
-
-   # force all RCCL streams to be high priority
-   export TORCH_NCCL_HIGH_PRIORITY=1
-
-   # specify which RDMA interfaces to use for communication
-   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
-
-   # define the Global ID index used in RoCE mode
-   export NCCL_IB_GID_INDEX=3
-
-   # avoid data corruption/mismatch issue that existed in past releases
-   export RCCL_MSCCL_ENABLE=0
-
-Running the RCCL Bandwidth Test
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Run the RCCL bandwidth test before launching training to confirm that system performance is
-sufficient to start training. RCCL is not included in the AMD Megatron-LM Docker
-image; follow the instructions in ``__ to get started.
-See :ref:`mi300x-rccl` for more information.
-
-Run on 8 GPUs (``-g 8``), scanning from 8 bytes to 10 GB:
-
-.. code-block:: shell
-
-   ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 8
-
-.. image:: /data/how-to/rocm-for-ai/rccl-tests-8-gpu.png
-   :width: 800
-
-Using one MPI process per GPU and ``-g 1`` for performance-oriented runs on both single-node and multi-node is
-recommended. So, a run on 8 GPUs looks something like:
-
-.. code-block:: shell
-
-   mpirun -np 8 --bind-to numa ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1
-
-.. image:: /data/how-to/rocm-for-ai/rccl-tests-1-mpi-process-per-gpu.png
-   :width: 800
-
-Running with one MPI process per GPU ensures a one-to-one mapping for CPUs and GPUs, which can be beneficial
-for smaller message sizes. This better represents the real-world use of RCCL in deep learning frameworks like
-PyTorch and TensorFlow.
-
-Use the following script to run the RCCL test for four MI300X GPU nodes. Modify paths and node addresses as needed.
-
-.. code-block:: shell
-
-   /home/$USER/ompi_for_gpu/ompi/bin/mpirun -np 32 -H tw022:8,tw024:8,tw010:8,tw015:8 \
-       --mca pml ucx \
-       --mca btl ^openib \
-       -x NCCL_SOCKET_IFNAME=ens50f0np0 \
-       -x NCCL_IB_HCA=rdma0:1,rdma1:1,rdma2:1,rdma3:1,rdma4:1,rdma5:1,rdma6:1,rdma7:1 \
-       -x NCCL_IB_GID_INDEX=3 \
-       -x NCCL_MIN_NCHANNELS=40 \
-       -x NCCL_DEBUG=version \
-       $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 8g -f 2 -g 1
-
-.. 
image:: /data/how-to/rocm-for-ai/rccl-tests-4-mi300x-gpu-nodes.png - :width: 800 - -.. _mi300x-amd-megatron-lm-training-v2412: - -Start training on MI300X accelerators -===================================== - -The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct -training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1. - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements-v2412: - -Download the Docker image and required packages ------------------------------------------------ - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/megatron-lm:24.12-dev - -2. Launch the Docker container. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $CACHE_DIR:/root/.cache --name megatron-dev-env rocm/megatron-lm:24.12-dev /bin/bash - -3. Clone the ROCm Megatron-LM repository to a local directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/Megatron-LM - cd Megatron-LM - - .. note:: - - This release is validated with ``ROCm/Megatron-LM`` commit `bb93ccb `_. - Checking out this specific commit is recommended for a stable and reproducible environment. - - .. code-block:: shell - - git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92 - -Prepare training datasets -------------------------- - -If you already have the preprocessed data, you can skip this section. - -Use the following command to process datasets. We use GPT data as an example. You may change the merge table, use an -end-of-document token, remove sentence splitting, and use the tokenizer type. - -.. code-block:: shell - - python tools/preprocess_data.py \ - --input my-corpus.json \ - --output-prefix my-gpt2 \ - --vocab-file gpt2-vocab.json \ - --tokenizer-type GPT2BPETokenizer \ - --merge-file gpt2-merges.txt \ - --append-eod - -In this case, the automatically generated output files are named ``my-gpt2_text_document.bin`` and -``my-gpt2_text_document.idx``. - -.. image:: /data/how-to/rocm-for-ai/prep-training-datasets-my-gpt2-text-document.png - :width: 800 - -.. _amd-megatron-lm-environment-setup-v2412: - -Environment setup ------------------ - -In the ``examples/llama`` directory of Megatron-LM, if you're working with Llama 2 7B or Llama 2 70 B, use the -``train_llama2.sh`` configuration script. Likewise, if you're working with Llama 3 or Llama 3.1, then use -``train_llama3.sh`` and update the configuration script accordingly. - -Network interface -^^^^^^^^^^^^^^^^^ - -To avoid connectivity issues, ensure the correct network interface is set in your training scripts. - -1. Run the following command to find the active network interface on your system. - - .. code-block:: shell - - ip a - -2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system’s network interface. For - example: - - .. code-block:: shell - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - export GLOO_SOCKET_IFNAME=ens50f0np0 - -Dataset options -^^^^^^^^^^^^^^^ - -You can use either mock data or real data for training. 
- -* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: shell - - DATA_DIR="/root/.cache/data" # Change to where your dataset is stored - - DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence - - .. code-block:: shell - - --data-path $DATA_PATH - - Ensure that the files are accessible inside the Docker container. - -* Mock data can be useful for testing and validation. If you're using mock data, replace ``--data-path $DATA_PATH`` with the ``--mock-data`` option. - - .. code-block:: shell - - --mock-data - -Tokenizer -^^^^^^^^^ - -Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama -models, this typically involves sub-word tokenization, where words are broken down into smaller units based on -a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a -fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to -handle a variety of input sequences, including unseen words or domain-specific terms. - -To train any of the Llama 2 models that this Docker image supports, use the ``Llama2Tokenizer``. - -To train any of Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``. -Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable. - -For example, if you're using the Llama 3.1 8B model: - -.. code-block:: shell - - TOKENIZER_MODEL=meta-llama/Llama-3.1-8B - -Run benchmark tests -------------------- - -.. note:: - - If you're running **multi node training**, update the following environment variables. They can - also be passed as command line arguments. - - * Change ``localhost`` to the master node's hostname: - - .. code-block:: shell - - MASTER_ADDR="${MASTER_ADDR:-localhost}" - - * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``): - - .. code-block:: shell - - NNODES="${NNODES:-1}" - - * Set the rank of each node (0 for master, 1 for the first worker node, and so on): - - .. code-block:: shell - - NODE_RANK="${NODE_RANK:-0}" - -* Use this command to run a performance benchmark test of any of the Llama 2 models that this Docker image supports (see :ref:`variables `). - - .. code-block:: shell - - {variables} bash examples/llama/train_llama2.sh - -* Use this command to run a performance benchmark test of any of the Llama 3 and Llama 3.1 models that this Docker image supports (see :ref:`variables `). - - .. code-block:: shell - - {variables} bash examples/llama/train_llama3.sh - -.. _amd-megatron-lm-benchmark-test-vars-v2412: - -The benchmark tests support the same set of variables: - -+--------------------------+-----------------------+-----------------------+ -| Name | Options | Description | -+==========================+=======================+=======================+ -| ``TEE_OUTPUT`` | 0 or 1 | 0: disable training | -| | | log | -| | | | -| | | 1: enable training | -| | | log | -+--------------------------+-----------------------+-----------------------+ -| ``MBS`` | | Micro batch size | -+--------------------------+-----------------------+-----------------------+ -| ``BS`` | | Batch size | -+--------------------------+-----------------------+-----------------------+ -| ``TP`` | 1, 2, 4, 8 | Tensor parallel | -+--------------------------+-----------------------+-----------------------+ -| ``TE_FP8`` | 0 or 1 | Datatype. | -| | | If it is set to 1, | -| | | FP8. 
| -| | | | -| | | If it is set to 0. | -| | | BP16 | -+--------------------------+-----------------------+-----------------------+ -| ``NO_TORCH_COMPILE`` | 0 or 1 | If it is set to 1, | -| | | enable torch.compile. | -| | | | -| | | If it is set to 0. | -| | | Disable torch.compile | -| | | (default) | -+--------------------------+-----------------------+-----------------------+ -| ``SEQ_LENGTH`` | | Input sequence length | -+--------------------------+-----------------------+-----------------------+ -| ``GEMM_TUNING`` | 0 or 1 | If it is set to 1, | -| | | enable gemm tuning. | -| | | | -| | | If it is set to 0, | -| | | disable gemm tuning | -+--------------------------+-----------------------+-----------------------+ -| ``USE_FLASH_ATTN`` | 0 or 1 | 0: disable flash | -| | | attention | -| | | | -| | | 1: enable flash | -| | | attention | -+--------------------------+-----------------------+-----------------------+ -| ``ENABLE_PROFILING`` | 0 or 1 | 0: disable torch | -| | | profiling | -| | | | -| | | 1: enable torch | -| | | profiling | -+--------------------------+-----------------------+-----------------------+ -| ``MODEL_SIZE`` | | The size of the mode: | -| | | 7B/70B, etc. | -+--------------------------+-----------------------+-----------------------+ -| ``TOTAL_ITERS`` | | Total number of | -| | | iterations | -+--------------------------+-----------------------+-----------------------+ -| ``transformer-impl`` | transformer_engine or | Enable transformer | -| | local | engine by default | -+--------------------------+-----------------------+-----------------------+ - -Benchmarking examples -^^^^^^^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Single node training - :sync: single - - Use this command to run training with Llama 2 7B model on a single node. You can specify MBS, BS, FP, - datatype, and so on. - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script `. - - See the sample output: - - .. image:: /data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png - :width: 800 - - .. tab-item:: Multi node training - :sync: multi - - Launch the Docker container on each node. - - In this example, run training with Llama 2 7B model on 2 nodes with specific MBS, BS, FP, datatype, and - so on. - - On the master node: - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - On the worker node: - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script `. - - Sample output for 2-node training: - - Master node: - - .. image:: /data/how-to/rocm-for-ai/2-node-training-master.png - :width: 800 - - Worker node: - - .. image:: /data/how-to/rocm-for-ai/2-node-training-worker.png - :width: 800 - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. 
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst deleted file mode 100644 index e039aff8a..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst +++ /dev/null @@ -1,536 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -****************************************** -Training a model with Megatron-LM for ROCm -****************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../megatron-lm` for the latest version. - -The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM, -designed to enable efficient training of large-scale language models on AMD -GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers -enhanced scalability, performance, and resource utilization for AI workloads. -It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and -DeepSeek, enabling developers to train next-generation AI models more -efficiently. See the GitHub repository at ``__. - -AMD provides a ready-to-use Docker image for MI300X accelerators containing -essential components, including PyTorch, ROCm libraries, and Megatron-LM -utilities. It contains the following software components to accelerate training -workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.0 | -+--------------------------+--------------------------------+ -| PyTorch | 2.7.0a0+git637433 | -+--------------------------+--------------------------------+ -| Python | 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.11 | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0 | -+--------------------------+--------------------------------+ -| hipBLASLt | git258a2162 | -+--------------------------+--------------------------------+ -| Triton | 3.1 | -+--------------------------+--------------------------------+ - -Supported features and models -============================= - -Megatron-LM provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- APEX - -- GEMM tuning - -- Torch.compile - -- 3D parallelism: TP + SP + CP - -- Distributed optimizer - -- Flash Attention (FA) 3 - -- Fused kernels - -- Pre-training - -.. _amd-megatron-lm-model-support-25-3: - -The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator. - -* Llama 2 7B - -* Llama 2 70B - -* Llama 3 8B - -* Llama 3 70B - -* Llama 3.1 8B - -* Llama 3.1 70B - -* DeepSeek-V2-Lite - -.. note:: - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -System validation -================= - -If you have already validated your system settings, skip this step. Otherwise, -complete the :ref:`system validation and optimization steps ` -to set up your system before starting training. - -Disable NUMA auto-balancing ---------------------------- - -Generally, application performance can benefit from disabling NUMA auto-balancing. 
However, -it might be detrimental to performance with certain types of workloads. - -Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform -Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or -the output is ``1``, run the following command to disable NUMA auto-balancing. - -.. code-block:: shell - - sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - -See :ref:`System validation and optimization ` -for more information. - -.. _mi300x-amd-megatron-lm-training-v253: - -Environment setup -================= - -The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct -training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2. - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements-v253: - -Download the Docker image -------------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/megatron-lm:v25.3 - -2. Launch the Docker container. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name megatron_training_env rocm/megatron-lm:v25.3 - -3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it. - - .. code-block:: shell - - docker start megatron_training_env - docker exec -it megatron_training_env bash - -The Docker container includes a pre-installed, verified version of Megatron-LM from the `release branch `_. - -.. _amd-megatron-lm-environment-setup-v253: - -Configuration scripts ---------------------- - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - If you're working with Llama 2 7B or Llama 2 70 B, use the ``train_llama2.sh`` configuration - script in the ``examples/llama`` directory of - ``__. - Likewise, if you're working with Llama 3 or Llama 3.1, then use ``train_llama3.sh`` and update - the configuration script accordingly. - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - Use the ``train_deepseek_v2.sh`` configuration script in the ``examples/deepseek_v2`` - directory of - ``__ - and update the configuration script accordingly. - -Network interface -^^^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - To avoid connectivity issues in multi-node deployments, ensure the correct network interface - is set in your training scripts. - - 1. Run the following command (outside the container) to find the active network interface on your system. - - .. code-block:: shell - - ip a - - 2. Update the ``NCCL_SOCKET_IFNAME`` and ``GLOO_SOCKET_IFNAME`` variables with your system’s network interface. For - example: - - .. code-block:: shell - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - export GLOO_SOCKET_IFNAME=ens50f0np0 - -Dataset options -^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - You can use either mock data or real data for training. - - * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. 
code-block:: bash - - MOCK_DATA=1 - - * If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_PATH=${DATA_PATH:-"/data/bookcorpus_text_sentence"} # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx - - You can use either mock data or real data for training. - - * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - - * If you're using a real dataset, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_DIR="/root/data/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Tokenizer -^^^^^^^^^ - -Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama -models, this typically involves sub-word tokenization, where words are broken down into smaller units based on -a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a -fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to -handle a variety of input sequences, including unseen words or domain-specific terms. - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - To train any of the Llama 2 models that :ref:`this Docker image supports `, use the ``Llama2Tokenizer``. - - To train any of Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``. - Set the Hugging Face model link in the ``TOKENIZER_MODEL`` variable. - - For example, if you're using the Llama 3.1 8B model: - - .. code-block:: shell - - TOKENIZER_MODEL=meta-llama/Llama-3.1-8B - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - To train any of the DeepSeek V2 models that :ref:`this Docker image supports `, use the ``DeepSeekV2Tokenizer``. - -Multi-node training -^^^^^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - If you're running multi-node training, update the following environment variables. They can - also be passed as command line arguments. - - * Change ``localhost`` to the master node's hostname: - - .. 
code-block:: shell - - MASTER_ADDR="${MASTER_ADDR:-localhost}" - - * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``): - - .. code-block:: shell - - NNODES="${NNODES:-1}" - - * Set the rank of each node (0 for master, 1 for the first worker node, and so on): - - .. code-block:: shell - - NODE_RANK="${NODE_RANK:-0}" - - * Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an - NFS directory) for multi-node runs: - - .. code-block:: shell - - DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs - - * For multi-node runs, make sure the correct network drivers are installed on the nodes. If - inside a Docker, either install the drivers inside the Docker container or pass the network - drivers from the host while creating the Docker container. - -Start training on AMD Instinct accelerators -=========================================== - -The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate -system performance, conduct training benchmarks, and achieve superior -performance for models like Llama 3.1 and Llama 2. This container should not be -expected to provide generalized performance across all training workloads. You -can expect the container to perform in the model configurations described in -the following section, but other configurations are not validated by AMD. - -Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI300X series -accelerators with the AMD Megatron-LM Docker image. - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - .. tab-set:: - - .. tab-item:: Single node training - :sync: single-node - - To run training on a single node, navigate to the Megatron-LM folder and use the - following command: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 bash examples/llama/train_llama3.sh - - .. tab-item:: Multi-node training - :sync: multi-node - - To run training on multiple nodes, launch the Docker container on each node. For example, for a two node setup (``NODE0`` as the master node), use these commands. - - * On the master node ``NODE0``: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh - - * On the worker node ``NODE1``: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh - - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - To run the training on a single node, go to ``/Megatron-LM`` folder and use the following command: - - .. code-block:: shell - - cd /workspace/Megatron-LM - GEMM_TUNING=1 PR=bf16 MBS=4 AC=none bash examples/deepseek_v2/train_deepseekv2.sh - -Key options ------------ - -.. _amd-megatron-lm-benchmark-test-vars-v253: - -The benchmark tests support the following sets of variables: - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - ``TEE_OUTPUT`` - ``1`` to enable training logs or ``0`` to disable. - - ``TE_FP8`` - ``0`` for BP16 (default) or ``1`` for FP8 GEMMs. - - ``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - - ``USE_FLASH_ATTN`` - ``1`` to enable Flash Attention. - - ``ENABLE_PROFILING`` - ``1`` to enable PyTorch profiling for performance analysis. 
- - ``transformer-impl`` - ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE. - - ``MODEL_SIZE`` - ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2. - - ``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - - ``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data provided by you. - - ``MBS`` - Micro batch size. - - ``BS`` - Global batch size. - - ``TP`` - Tensor parallel (``1``, ``2``, ``4``, ``8``). - - ``SEQ_LENGTH`` - Input sequence length. - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - ``PR`` - Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs. - - ``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - - ``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - - ``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data provided by you. - - ``MBS`` - Micro batch size. - - ``GBS`` - Global batch size. - -Benchmarking examples ---------------------- - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - .. tab-set:: - - .. tab-item:: Single node training - :sync: single-node - - Use this command to run training with Llama 2 7B model on a single node. You can specify MBS, BS, FP, - datatype, and so on. - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script `. - - See the sample output: - - .. image:: /data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png - :width: 800 - - .. tab-item:: Multi-node training - :sync: multi-node - - Launch the Docker container on each node. - - In this example, run training with Llama 2 7B model on 2 nodes with specific MBS, BS, FP, datatype, and - so on. - - On the master node: - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - On the worker node: - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script `. - - Sample output for 2-node training: - - Master node: - - .. image:: /data/how-to/rocm-for-ai/2-node-training-master.png - :width: 800 - - Worker node: - - .. image:: /data/how-to/rocm-for-ai/2-node-training-worker.png - :width: 800 - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst deleted file mode 100644 index 9d7c7ecd6..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst +++ /dev/null @@ -1,618 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -****************************************** -Training a model with Megatron-LM for ROCm -****************************************** - -.. 
caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../megatron-lm` for the latest version. - -The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM, -designed to enable efficient training of large-scale language models on AMD -GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers -enhanced scalability, performance, and resource utilization for AI workloads. -It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and -DeepSeek, enabling developers to train next-generation AI models more -efficiently. See the GitHub repository at ``__. - -AMD provides a ready-to-use Docker image for MI300X series accelerators containing -essential components, including PyTorch, ROCm libraries, and Megatron-LM -utilities. It contains the following software components to accelerate training -workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.0 | -+--------------------------+--------------------------------+ -| PyTorch | 2.7.0a0+git637433 | -+--------------------------+--------------------------------+ -| Python | 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.11 | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0 | -+--------------------------+--------------------------------+ -| hipBLASLt | git258a2162 | -+--------------------------+--------------------------------+ -| Triton | 3.1 | -+--------------------------+--------------------------------+ - -Supported features and models -============================= - -Megatron-LM provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- APEX - -- GEMM tuning - -- Torch.compile - -- 3D parallelism: TP + SP + CP - -- Distributed optimizer - -- Flash Attention (FA) 3 - -- Fused kernels - -- Pre-training - -.. _amd-megatron-lm-model-support-25-4: - -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. - -* Llama 3.1 8B - -* Llama 3.1 70B - -* Llama 3 8B - -* Llama 3 70B - -* Llama 2 7B - -* Llama 2 70B - -* DeepSeek-V2-Lite - -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -.. _amd-megatron-lm-performance-measurements-v254: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `__ -page provides reference throughput and latency measurements for training -popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `__ - only reflects the :doc:`latest version of this training benchmarking environment <../megatron-lm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system -before starting training. - -.. 
_mi300x-amd-megatron-lm-training-v254: - -Environment setup -================= - -The prebuilt ROCm Megatron-LM environment allows users to quickly validate system performance, conduct -training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2. - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements-v254: - -Download the Docker image -------------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/megatron-lm:v25.4 - -2. Launch the Docker container. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name megatron_training_env rocm/megatron-lm:v25.4 - -3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it. - - .. code-block:: shell - - docker start megatron_training_env - docker exec -it megatron_training_env bash - -The Docker container includes a pre-installed, verified version of the ROCm Megatron-LM development branch ``__ -(commit `fd6f01 `_). - -.. _amd-megatron-lm-environment-setup-v254: - -Configuration scripts ---------------------- - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - If you're working with Llama 2 7B or Llama 2 70 B, use the ``train_llama2.sh`` configuration - script in the ``examples/llama`` directory of - ``__. - Likewise, if you're working with Llama 3 or Llama 3.1, use ``train_llama3.sh`` and update - the configuration script accordingly. - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - Use the ``train_deepseek_v2.sh`` configuration script in the ``examples/deepseek_v2`` - directory of - ``__ - and update the configuration script accordingly. - -Network interface -^^^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - Update the network interface in the script to match your system's network interface. To - find your network interface, run the following (outside of any Docker container): - - .. code-block:: bash - - ip a - - Look for an active interface that has an IP address in the same subnet as - your other nodes. Then, update the following variables in the script, for - example: - - .. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - export GLOO_SOCKET_IFNAME=ens50f0np0 - -Dataset options -^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - You can use either mock data or real data for training. - - * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - - * If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - - To download the dataset, set the ``DATASET`` variable to the dataset you'd like to use. Two datasets are supported: ``DATASET=wiki`` and ``DATASET=bookcorpus``. 
- Use the following command to download the dataset. - - .. code-block:: shell - - DATASET=wiki bash examples/llama/prepare_dataset.sh # For wiki-en dataset - DATASET=bookcorpus bash examples/llama/prepare_dataset.sh # For bookcorpus dataset - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx - - You can use either mock data or real data for training. - - * Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - - * If you're using a real dataset, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_DIR="/root/data/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Tokenizer -^^^^^^^^^ - -Tokenization is the process of converting raw text into tokens that can be processed by the model. For Llama -models, this typically involves sub-word tokenization, where words are broken down into smaller units based on -a fixed vocabulary. The tokenizer is trained along with the model on a large corpus of text, and it learns a -fixed vocabulary that can represent a wide range of text from different domains. This allows Llama models to -handle a variety of input sequences, including unseen words or domain-specific terms. - -You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` as shown in the following examples. -If the tokenizer is not found, it'll be downloaded to the default tokenizer model path: ``${DATA_DIR}/tokenizer_llama3`` -or ``${DATA_DIR}/tokenizer_llama2``. - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - To train any of the Llama 2 models that :ref:`this Docker image supports `, use the ``Llama2Tokenizer`` - or the default ``HuggingFaceTokenizer``. - - To train any of Llama 3 and Llama 3.1 models that this Docker image supports, use the ``HuggingFaceTokenizer``. - Set the Hugging Face model path in the ``TOKENIZER_MODEL`` variable. - - For example, if you're using the Llama 3.1 8B model: - - .. code-block:: shell - - TOKENIZER_MODEL=meta-llama/Llama-3.1-8B - - .. note:: - - If you don't already have the Llama 3.1 tokenizer locally, set your - personal Hugging Face access token ``HF_TOKEN`` to download the - tokenizer. If you encounter the following error, set ``HF_TOKEN`` to - your access-authorized Hugging Face token. - - .. code-block:: shell - - OSError: You are trying to access a gated repo. 
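-
-            # "Gated repo" means the model files on Hugging Face require approval:
-            # request access on the model page, create a read-scoped access token,
-            # and export it as shown below before rerunning the script.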
- - # pass your HF_TOKEN - export HF_TOKEN=$your_personal_hf_token - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - To train any of the DeepSeek V2 models that :ref:`this Docker image supports `, use the ``DeepSeekV2Tokenizer``. - -Multi-node training -^^^^^^^^^^^^^^^^^^^ - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - If you're running multi-node training, update the following environment variables. They can - also be passed as command line arguments. - - * Change ``localhost`` to the master node's hostname: - - .. code-block:: shell - - MASTER_ADDR="${MASTER_ADDR:-localhost}" - - * Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``): - - .. code-block:: shell - - NNODES="${NNODES:-1}" - - * Set the rank of each node (0 for master, 1 for the first worker node, and so on): - - .. code-block:: shell - - NODE_RANK="${NODE_RANK:-0}" - - * Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an - NFS directory) for multi-node runs: - - .. code-block:: shell - - DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs - - * For multi-node runs, make sure the correct network drivers are installed on the nodes. If - inside a Docker container, either install the drivers inside the Docker container or pass the network - drivers from the host while creating the Docker container. - - .. code-block:: shell - - # Specify which RDMA interfaces to use for communication - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - -Start training on AMD Instinct accelerators -=========================================== - -The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate -system performance, conduct training benchmarks, and achieve superior -performance for models like Llama 3.1 and Llama 2. This container should not be -expected to provide generalized performance across all training workloads. You -can expect the container to perform in the model configurations described in -the following section, but other configurations are not validated by AMD. - -Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI300X series -accelerators with the AMD Megatron-LM Docker image. - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - .. tab-set:: - - .. tab-item:: Single node training - :sync: single-node - - To run training on a single node, navigate to the Megatron-LM folder and use one of the - following commands. - - - For Llama 3.1 8B FP8: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - - - For Llama 3.1 8B BF16: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=0 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - - - For Llama 2 7B FP8: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh - - - For Llama 2 7B BF16: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=0 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh - - To run training with FSDP2 enabled, add the ``FSDP=1`` argument. For example: - - - For Llama 3 70B BF16: - - .. 
code-block:: shell - - TEE_OUTPUT=1 MBS=3 BS=24 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=8192 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - - - For Llama 2 70B BF16: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=3 BS=56 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=4096 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh - - .. note:: - - It's suggested to use ``TP=1`` when FSDP is enabled for higher throughput. FSDP2 is not supported with pipeline parallelism, - expert parallelism, MCore's distributed optimizer, gradient accumulation fusion, and ``FP16`` precision. - - .. tab-item:: Multi-node training - :sync: multi-node - - To run training on multiple nodes, launch the Docker container on each node. For example, for a two node setup (``NODE0`` as the master node), use these commands. - - * On the master node ``NODE0``: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh - - * On the worker node ``NODE1``: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh - - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - To run the training on a single node, go to ``/Megatron-LM`` folder and use the following command: - - .. code-block:: shell - - cd /workspace/Megatron-LM - GEMM_TUNING=1 PR=bf16 MBS=4 AC=none SEQ_LEN=4096 PAD_LEN=4096 TRAIN_ITERS=50 bash examples/deepseek_v2/train_deepseekv2.sh - -Key options ------------ - -.. _amd-megatron-lm-benchmark-test-vars-v254: - -The benchmark tests support the following sets of variables: - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - ``TEE_OUTPUT`` - ``1`` to enable training logs or ``0`` to disable. - - ``TE_FP8`` - ``0`` for B16 or ``1`` for FP8 -- ``0`` by default. - - ``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - - ``USE_FLASH_ATTN`` - ``1`` to enable Flash Attention. - - ``FSDP`` - ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``, - ``--overlap-param-gather``, and ``--sequence-parallel`` are automaticallyu disabled. - - ``ENABLE_PROFILING`` - ``1`` to enable PyTorch profiling for performance analysis. - - ``transformer-impl`` - ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE. - - ``MODEL_SIZE`` - ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2. - - ``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - - ``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data you provide. - - ``MBS`` - Micro batch size. - - ``BS`` - Global batch size. - - ``TP`` - Tensor parallel (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on. - - ``SEQ_LENGTH`` - Input sequence length. - - .. tab-item:: DeepSeek V2 - :sync: deepseek - - ``PR`` - Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs. - - ``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - - ``TRAIN_ITERS`` - The total number of iterations. - - ``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data you provide. - - ``MBS`` - Micro batch size. - - ``GBS`` - Global batch size. - - ``SEQ_LEN`` - Input sequence length. - - ``AC`` - Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default. 
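-
-For example, for the Llama scripts, these options are combined as environment variables on the
-launch command. A single-node Llama 2 7B BF16 run with GEMM tuning enabled might look like the
-following sketch (illustrative values, not tuned recommendations):
-
-.. code-block:: shell
-
-   TEE_OUTPUT=1 GEMM_TUNING=1 TE_FP8=0 MBS=4 BS=256 TP=1 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh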
- -Benchmarking examples ---------------------- - -.. tab-set:: - - .. tab-item:: Llama - :sync: llama - - .. tab-set:: - - .. tab-item:: Single node training - :sync: single-node - - Use this command to run training with Llama 2 7B model on a single node. You can specify MBS, BS, FP, - datatype, and so on. - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script `. - - See the sample output: - - .. image:: /data/how-to/rocm-for-ai/llama2-7b-training-log-sample.png - :width: 800 - - .. tab-item:: Multi-node training - :sync: multi-node - - Launch the Docker container on each node. - - In this example, run training with Llama 2 7B model on 2 nodes with specific MBS, BS, FP, datatype, and - so on. - - On the master node: - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - On the worker node: - - .. code-block:: bash - - TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 - SEQ_LENGTH=4096 bash examples/llama/train_llama2.sh - - You can find the training logs at the location defined in ``$TRAIN_LOG`` in the :ref:`configuration script `. - - Sample output for 2-node training: - - Master node: - - .. image:: /data/how-to/rocm-for-ai/2-node-training-master.png - :width: 800 - - Worker node: - - .. image:: /data/how-to/rocm-for-ai/2-node-training-worker.png - :width: 800 - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst deleted file mode 100644 index f2a81b7c6..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst +++ /dev/null @@ -1,775 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -****************************************** -Training a model with Megatron-LM for ROCm -****************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../megatron-lm` for the latest version. - -The `Megatron-LM framework for ROCm `_ is -a specialized fork of the robust Megatron-LM, designed to enable efficient -training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced -scalability, performance, and resource utilization for AI workloads. It is -purpose-built to support models like Llama, DeepSeek, and Mixtral, -enabling developers to train next-generation AI models more -efficiently. - -AMD provides a ready-to-use Docker image for MI300X series accelerators containing -essential components, including PyTorch, ROCm libraries, and Megatron-LM -utilities. 
It contains the following software components to accelerate training -workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.4 | -+--------------------------+--------------------------------+ -| PyTorch | 2.8.0a0+gite2f9759 | -+--------------------------+--------------------------------+ -| Python | 3.12 or 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.13.0+bb061ade | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0 | -+--------------------------+--------------------------------+ -| hipBLASLt | 0.13.0-4f18bf6 | -+--------------------------+--------------------------------+ -| Triton | 3.3.0 | -+--------------------------+--------------------------------+ -| RCCL | 2.22.3 | -+--------------------------+--------------------------------+ - -Megatron-LM provides the following key features to train large language models efficiently: - -- Transformer Engine (TE) - -- APEX - -- GEMM tuning - -- Torch.compile - -- 3D parallelism: TP + SP + CP - -- Distributed optimizer - -- Flash Attention (FA) 3 - -- Fused kernels - -- Pre-training - -.. _amd-megatron-lm-model-support-v255: - -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.5-benchmark-models.yaml - - Supported models - ================ - - The following models are supported for training performance benchmarking with Megatron-LM and ROCm. - Some instructions, commands, and training recommendations in this documentation might - vary by model -- select one to get started. - - {% set model_groups = data["megatron-lm_benchmark"].model_groups %} - - .. raw:: html - -
         <!-- Model selector: "Model" buttons for each {{ model_group.group }} and "Model variant" buttons for each {{ model.model }}; HTML markup omitted -->
- -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -.. _amd-megatron-lm-performance-measurements-v255: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `__ -page provides reference throughput and latency measurements for training -popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `__ - only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. _mi300x-amd-megatron-lm-training-v255: - -Environment setup -================= - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements-v255: - -Download the Docker image -------------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. tab-set:: - - .. tab-item:: Ubuntu 24.04 + Python 3.12 - :sync: py312 - - .. code-block:: shell - - docker pull rocm/megatron-lm:v25.5_py312 - - .. tab-item:: Ubuntu 22.04 + Python 3.10 - :sync: py310 - - .. code-block:: shell - - docker pull rocm/megatron-lm:v25.5_py310 - -2. Launch the Docker container. - - .. tab-set:: - - .. tab-item:: Ubuntu 24.04 + Python 3.12 - :sync: py312 - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 128G --name megatron_training_env rocm/megatron-lm:v25.5_py312 - - - .. tab-item:: Ubuntu 22.04 + Python 3.10 - :sync: py310 - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 128G --name megatron_training_env rocm/megatron-lm:v25.5_py310 - -3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it. - - .. code-block:: shell - - docker start megatron_training_env - docker exec -it megatron_training_env bash - -The Docker container includes a pre-installed, verified version of the ROCm -Megatron-LM development branch -``__, including necessary -training scripts. - -.. _amd-megatron-lm-environment-setup-v255: - -Configuration -============= - -.. 
container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b - - Update the ``train_llama3.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - Update the ``train_llama2.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. note:: - - See :ref:`Key options ` for more information on configuration options. - -Network interface ------------------ - -Update the network interface in the script to match your system's network interface. To -find your network interface, run the following (outside of any Docker container): - -.. code-block:: bash - - ip a - -Look for an active interface that has an IP address in the same subnet as -your other nodes. Then, update the following variables in the script, for -example: - -.. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - export GLOO_SOCKET_IFNAME=ens50f0np0 - -.. _amd-megatron-lm-tokenizer-v255: - -Tokenizer ---------- - -You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` as shown in the following examples. -If the tokenizer is not found, it'll be downloaded if publicly available. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - If you do not have Llama 3.3 tokenizer locally, you need to use your - personal Hugging Face access token ``HF_TOKEN`` to download the tokenizer. - See `Llama-3.3-70B-Instruct - `_. After you are - authorized, use your ``HF_TOKEN`` to download the tokenizer and set the - variable ``TOKENIZER_MODEL`` to the tokenizer path. - - .. code-block:: shell - - export HF_TOKEN= - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.3-70B-Instruct" - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-8B" - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - The training script uses the ``HuggingFaceTokenizer``. 
Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-70B" - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - The training script uses either the ``Llama2Tokenizer`` or ``HuggingFaceTokenizer`` by default. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3" - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V2-Lite" - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Download the Mixtral tokenizer. - - .. code-block:: shell - - mkdir tokenizer - cd tokenizer - export HF_TOKEN= - wget --header="Authorization: Bearer $HF_TOKEN" -O ./tokenizer.model https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/tokenizer.model - - Use the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL=tokenizer/tokenizer.model - -Dataset options ---------------- - -You can use either mock data or real data for training. - -* Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - -* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Download the dataset -^^^^^^^^^^^^^^^^^^^^ - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - For Llama models, use the `prepare_dataset.sh - `_ script - to prepare your dataset. - To download the dataset, set the ``DATASET`` variable to the dataset you'd - like to use. Three datasets are supported: ``DATASET=wiki``, ``DATASET=fineweb``, and - ``DATASET=bookcorpus``. - - .. code-block:: shell - - DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for wiki-en dataset - DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for bookcorpus dataset - - ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer. - Remember to either pre-download the tokenizer or setup Hugging Face access - otherwise when needed -- see the :ref:`Tokenizer ` section. - - .. note:: - - When training set ``DATA_PATH`` to the specific file name prefix pointing to the ``.bin`` or ``.idx`` - as in the following example: - - .. code-block:: shell - - DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. 
code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/mmap_deepseekv2_datasets_text_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. code-block:: shell - - mkdir mixtral-datasets - cd mixtral-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/mixtral-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Multi-node configuration ------------------------- - -If you're running multi-node training, update the following environment variables. They can -also be passed as command line arguments. 
Refer to the following example configurations. - -* Change ``localhost`` to the master node's hostname: - - .. code-block:: shell - - MASTER_ADDR="${MASTER_ADDR:-localhost}" - -* Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``): - - .. code-block:: shell - - NNODES="${NNODES:-1}" - -* Set the rank of each node (0 for master, 1 for the first worker node, and so on): - - .. code-block:: shell - - NODE_RANK="${NODE_RANK:-0}" - -* Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an - NFS directory) for multi-node runs: - - .. code-block:: shell - - DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs - -* For multi-node runs, make sure the correct network drivers are installed on the nodes. If - inside a Docker container, either install the drivers inside the Docker container or pass the network - drivers from the host while creating the Docker container. - - .. code-block:: shell - - # Specify which RDMA interfaces to use for communication - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - -Getting started -=============== - -The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate -system performance, conduct training benchmarks, and achieve superior -performance for models like Llama, DeepSeek, and Mixtral. This container should not be -expected to provide generalized performance across all training workloads. You -can expect the container to perform in the model configurations described in -the following section, but other configurations are not validated by AMD. - -.. _amd-megatron-lm-run-training-v255: - -Run training ------------- - -Use the following example commands to set up the environment, configure -:ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. - -Single node training -^^^^^^^^^^^^^^^^^^^^ - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - To run the training on a single node for Llama 3.3 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 RECOMPUTE=1 SEQ_LENGTH=8192 MBS=2 BS=16 TE_FP8=0 TP=1 PP=1 FSDP=1 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - - Currently, FSDP is only compatible with BF16 precision. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - - For Llama 3.1 8B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=128 TP=1 TE_FP8=0 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - To run the training on a single node for Llama 3.1 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. 
code-block:: shell - - TEE_OUTPUT=1 MBS=3 BS=24 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=8192 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - - Currently, FSDP is only compatible with BF16 precision. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b - - To run training on a single node for Llama 2 7B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh - - For Llama 2 7B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=4 BS=256 TP=1 TE_FP8=0 SEQ_LENGTH=4096 MODEL_SIZE=7 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-2-70b - - To run the training on a single node for Llama 2 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=7 BS=56 TP=1 TE_FP8=0 FSDP=1 RECOMPUTE=1 SEQ_LENGTH=4096 MODEL_SIZE=70 TOTAL_ITERS=50 bash examples/llama/train_llama2.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - - Currently, FSDP is only compatible with BF16 precision. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - FORCE_BANLANCE=true \ - RUN_ENV=cluster \ - MODEL_SIZE=671B \ - TRAIN_ITERS=50 \ - SEQ_LEN=4096 \ - NUM_LAYERS=3 \ - MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=32 \ - PR=bf16 \ - TP=1 PP=1 ETP=1 EP=8 \ - GEMM_TUNING=1 \ - NVTE_CK_USES_BWD_V3=1 \ - USE_GROUPED_GEMM=true MOE_USE_LEGACY_GROUPED_GEMM=true \ - GPT_LAYER_IN_TE=true \ - bash examples/deepseek_v3/train_deepseekv3.sh - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - GEMM_TUNING=1 PR=bf16 MBS=4 AC=none SEQ_LEN=4096 PAD_LEN=4096 TRAIN_ITERS=50 bash examples/deepseek_v2/train_deepseekv2.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - RECOMPUTE_NUM_LAYERS=0 TEE_OUTPUT=1 MBS=1 GBS=16 TP_SIZE=1 PP_SIZE=1 AC=none PR=bf16 EP_SIZE=8 ETP_SIZE=1 SEQLEN=4096 FORCE_BALANCE=true MOCK_DATA=1 RUN_ENV=cluster MODEL_SIZE=8x7B TRAIN_ITERS=50 bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x22b-proxy - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel) with 4-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. 
code-block:: shell - - RECOMPUTE_NUM_LAYERS=4 TEE_OUTPUT=1 MBS=1 GBS=16 TP_SIZE=1 PP_SIZE=1 AC=full NUM_LAYERS=4 PR=bf16 EP_SIZE=8 ETP_SIZE=1 SEQLEN=8192 FORCE_BALANCE=true MOCK_DATA=1 RUN_ENV=cluster MODEL_SIZE=8x22B TRAIN_ITERS=50 bash examples/mixtral/train_mixtral_moe.sh - -Multi-node training -^^^^^^^^^^^^^^^^^^^ - -To run training on multiple nodes, launch the Docker container on each node. -For example, for Llama 3 using a two node setup (``NODE0`` as the master node), -use these commands. - -* On the master node ``NODE0``: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=0 bash examples/llama/train_llama3.sh - -* On the worker node ``NODE1``: - - .. code-block:: shell - - TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 MASTER_ADDR=IP_NODE0 NNODES=2 NODE_RANK=1 bash examples/llama/train_llama3.sh - -Or, for DeepSeek-V3, an example script ``train_deepseek_v3_slurm.sh`` is -provided in -``__ to -enable training at scale under a SLURM environment. For example, to run -training on 16 nodes, try the following command: - -.. code-block:: shell - - sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh - -.. _amd-megatron-lm-benchmark-test-vars-v255: - -Key options ------------ - -The benchmark tests support the following sets of variables. - -``TEE_OUTPUT`` - ``1`` to enable training logs or ``0`` to disable. - -``TE_FP8`` - ``0`` for B16 or ``1`` for FP8 -- ``0`` by default. - -``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - -``USE_FLASH_ATTN`` - ``1`` to enable Flash Attention. - -``FSDP`` - ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``, - ``--overlap-param-gather``, and ``--sequence-parallel`` are automatically disabled. - -``ENABLE_PROFILING`` - ``1`` to enable PyTorch profiling for performance analysis. - -``transformer-impl`` - ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE. - -``MODEL_SIZE`` - ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2, for example. - -``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - -``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data you provide. - -``MBS`` - Micro batch size. - -``BS`` - Global batch size. - -``TP`` / ``TP_SIZE`` - Tensor parallel (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on. - -``EP`` / ``EP_SIZE`` - Expert parallel for MoE models. - -``SEQ_LENGTH`` - Input sequence length. - -``PR`` - Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs. - -``AC`` - Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default. - -``NUM_LAYERS`` - Use reduced number of layers as a proxy model. - -``RECOMPUTE_NUM_LAYERS`` - Number of layers used for checkpointing recompute. - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst deleted file mode 100644 index 32d72311b..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst +++ /dev/null @@ -1,1041 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. 
- :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -****************************************** -Training a model with Megatron-LM for ROCm -****************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../megatron-lm` for the latest version. - -The `Megatron-LM framework for ROCm `__ is -a specialized fork of the robust Megatron-LM, designed to enable efficient -training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced -scalability, performance, and resource utilization for AI workloads. It is -purpose-built to support models like Llama, DeepSeek, and Mixtral, -enabling developers to train next-generation AI models more -efficiently. - -AMD provides ready-to-use Docker images for MI300X series accelerators containing -essential components, including PyTorch, ROCm libraries, and Megatron-LM -utilities. It contains the following software components to accelerate training -workloads: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.6-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in data.dockers %} - .. tab-item:: ``{{ docker.pull_tag }}`` - :sync: {{ docker.pull_tag }} - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - - {% endfor %} - {% endfor %} - {% elif dockers|length == 1 %} - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components %} - * - {{ component_name }} - - {{ component_version }} - - {% endfor %} - {% endif %} - - .. _amd-megatron-lm-model-support-v256: - - The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. - - Supported models - ================ - - The following models are supported for training performance benchmarking with Megatron-LM and ROCm. - Some instructions, commands, and training recommendations in this documentation might - vary by model -- select one to get started. - - {% set model_groups = data.model_groups %} - .. raw:: html - -
         <!-- Model selector: "Model" buttons for each {{ model_group.group }} and "Model variant" buttons for each {{ model.model }}; HTML markup omitted -->
- -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -.. _amd-megatron-lm-performance-measurements-v256: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `__ -page provides reference throughput and latency measurements for training -popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `__ - only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. _mi300x-amd-megatron-lm-training-v256: - -Environment setup -================= - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements-v256: - -Download the Docker image -------------------------- - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.6-benchmark-models.yaml - - {% set dockers = data.dockers %} - 1. Use the following command to pull the Docker image from Docker Hub. - - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in data.dockers %} - .. tab-item:: {{ docker.doc_name }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endfor %} - {% elif dockers|length == 1 %} - {% set docker = dockers[0] %} - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endif %} - 2. Launch the Docker container. - - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in data.dockers %} - .. tab-item:: {{ docker.doc_name }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 128G \ - --name megatron_training_env \ - {{ docker.pull_tag }} - - {% endfor %} - {% elif dockers|length == 1 %} - {% set docker = dockers[0] %} - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 128G \ - --name megatron_training_env \ - {{ docker.pull_tag }} - - {% endif %} - -3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it. - - .. 
code-block:: shell - - docker start megatron_training_env - docker exec -it megatron_training_env bash - -The Docker container includes a pre-installed, verified version of the ROCm -Megatron-LM development branch -``__, including necessary -training scripts. - -.. _amd-megatron-lm-environment-setup-v256: - -Configuration -============= - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b - - Update the ``train_llama3.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - Update the ``train_llama2.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. note:: - - See :ref:`Key options ` for more information on configuration options. - -Network interface ------------------ - -Update the network interface in the script to match your system's network interface. To -find your network interface, run the following (outside of any Docker container): - -.. code-block:: bash - - ip a - -Look for an active interface that has an IP address in the same subnet as -your other nodes. Then, update the following variables in the script, for -example: - -.. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - export GLOO_SOCKET_IFNAME=ens50f0np0 - -.. _amd-megatron-lm-tokenizer-v256: - -Tokenizer ---------- - -You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` as shown in the following examples. -If the tokenizer is not found, it'll be downloaded if publicly available. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - If you do not have Llama 3.3 tokenizer locally, you need to use your - personal Hugging Face access token ``HF_TOKEN`` to download the tokenizer. - See `Llama-3.3-70B-Instruct - `_. After you are - authorized, use your ``HF_TOKEN`` to download the tokenizer and set the - variable ``TOKENIZER_MODEL`` to the tokenizer path. - - .. code-block:: shell - - export HF_TOKEN= - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.3-70B-Instruct" - -.. 
container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-8B" - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-70B" - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - The training script uses either the ``Llama2Tokenizer`` or ``HuggingFaceTokenizer`` by default. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3" - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V2-Lite" - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Download the Mixtral tokenizer. - - .. code-block:: shell - - mkdir tokenizer - cd tokenizer - export HF_TOKEN= - wget --header="Authorization: Bearer $HF_TOKEN" -O ./tokenizer.model https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/tokenizer.model - - Use the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL=tokenizer/tokenizer.model - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="Qwen/Qwen2.5-7B" - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-72b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="Qwen/Qwen2.5-72B" - -Dataset options ---------------- - -You can use either mock data or real data for training. - -* Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - -* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Download the dataset -^^^^^^^^^^^^^^^^^^^^ - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b pyt_megatron_lm_train_llama-3.1-70b-proxy - - For Llama models, use the `prepare_dataset.sh - `_ script - to prepare your dataset. - To download the dataset, set the ``DATASET`` variable to the dataset you'd - like to use. Three datasets are supported: ``DATASET=wiki``, ``DATASET=fineweb``, and - ``DATASET=bookcorpus``. - - .. 
code-block:: shell - - DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for wiki-en dataset - DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for bookcorpus dataset - - ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer. - Remember to either pre-download the tokenizer or setup Hugging Face access - otherwise when needed -- see the :ref:`Tokenizer ` section. - - .. note:: - - When training set ``DATA_PATH`` to the specific file name prefix pointing to the ``.bin`` or ``.idx`` - as in the following example: - - .. code-block:: shell - - DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - cd .. - bash tools/run_make_pretraining_dataset_megatron.sh deepseek-datasets/SlimPajama.json DeepSeekV3Tokenizer text deepseek-datasets deepseek-ai/DeepSeek-V3 - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - cd .. - bash tools/run_make_pretraining_dataset_megatron.sh deepseek-datasets/SlimPajama.json DeepSeekV3Tokenizer text deepseek-datasets deepseek-ai/DeepSeek-V3 - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. 
code-block:: shell - - mkdir mixtral-datasets - cd mixtral-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/mixtral-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b pyt_megatron_lm_train_qwen2.5-72b - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. code-block:: shell - - mkdir -p temp/qwen-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/qwen-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Multi-node configuration ------------------------- - -If you're running multi-node training, update the following environment variables. They can -also be passed as command line arguments. Refer to the following example configurations. - -* Change ``localhost`` to the master node's hostname: - - .. code-block:: shell - - MASTER_ADDR="${MASTER_ADDR:-localhost}" - -* Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``): - - .. code-block:: shell - - NNODES="${NNODES:-1}" - -* Set the rank of each node (0 for master, 1 for the first worker node, and so on): - - .. code-block:: shell - - NODE_RANK="${NODE_RANK:-0}" - -* Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an - NFS directory) for multi-node runs: - - .. code-block:: shell - - DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs - -* For multi-node runs, make sure the correct network drivers are installed on the nodes. If - inside a Docker container, either install the drivers inside the Docker container or pass the network - drivers from the host while creating the Docker container. - - .. code-block:: shell - - # Specify which RDMA interfaces to use for communication - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - -.. _amd-megatron-lm-run-training-v256: - -Run training -============ - -Use the following example commands to set up the environment, configure -:ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. - -Single node training --------------------- - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - To run the training on a single node for Llama 3.3 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. 
code-block:: shell - - TOKENIZER_MODEL=meta-llama/Llama-3.3-70B-Instruct \ - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=8192 \ - MBS=2 \ - BS=16 \ - TE_FP8=0 \ - TP=1 \ - PP=1 \ - FSDP=1 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=128 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - For Llama 3.1 8B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=128 \ - TP=1 \ - TE_FP8=0 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - To run the training on a single node for Llama 3.1 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - MBS=3 \ - BS=24 \ - TP=1 \ - TE_FP8=0 \ - FSDP=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b-proxy - - To run the training on a single node for Llama 3.1 70B with proxy, use the following command. - - .. code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - RECOMPUTE=1 \ - MBS=3 \ - BS=24 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=70 \ - FSDP=1 \ - TOTAL_ITERS=10 \ - NUM_LAYERS=40 \ - bash examples/llama/train_llama3.sh - - .. note:: - - Use two or more nodes to run the *full* Llama 70B model with FP8 precision. - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b - - To run training on a single node for Llama 2 7B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=4 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=7 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - - For Llama 2 7B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=4 \ - BS=256 \ - TP=1 \ - TE_FP8=0 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=7 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-2-70b - - To run the training on a single node for Llama 2 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. 
code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - MBS=7 \ - BS=56 \ - TP=1 \ - TE_FP8=0 \ - FSDP=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - export NVTE_FUSED_ATTN_CK=0 - FORCE_BALANCE=true \ - RUN_ENV=cluster \ - MODEL_SIZE=671B \ - TRAIN_ITERS=50 \ - SEQ_LEN=4096 \ - NUM_LAYERS=3 \ - MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=32 \ - PR=bf16 \ - TP=1 PP=1 ETP=1 EP=8 \ - GEMM_TUNING=1 \ - NVTE_CK_USES_BWD_V3=1 \ - USE_GROUPED_GEMM=true MOE_USE_LEGACY_GROUPED_GEMM=true \ - GPT_LAYER_IN_TE=true \ - bash examples/deepseek_v3/train_deepseekv3.sh - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - export NVTE_FUSED_ATTN_CK=0 - GEMM_TUNING=1 \ - PR=bf16 \ - MBS=4 \ - AC=none \ - SEQ_LEN=4096 \ - PAD_LEN=4096 \ - TRAIN_ITERS=50 \ - bash examples/deepseek_v2/train_deepseekv2.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - TOKENIZER_MODEL= - RECOMPUTE_NUM_LAYERS=0 \ - TEE_OUTPUT=1 \ - MBS=1 \ - GBS=16 \ - TP_SIZE=1 \ - PP_SIZE=1 \ - AC=none \ - PR=bf16 \ - EP_SIZE=8 \ - ETP_SIZE=1 \ - SEQLEN=4096 \ - FORCE_BALANCE=true \ - MOCK_DATA=1 \ - RUN_ENV=cluster \ - MODEL_SIZE=8x7B \ - TRAIN_ITERS=50 \ - bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x22b-proxy - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel) with 4-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - TOKENIZER_MODEL= - RECOMPUTE_NUM_LAYERS=4 \ - TEE_OUTPUT=1 \ - MBS=1 \ - GBS=16 \ - TP_SIZE=1 \ - PP_SIZE=1 \ - AC=full \ - NUM_LAYERS=4 \ - PR=bf16 \ - EP_SIZE=8 \ - ETP_SIZE=1 \ - SEQLEN=8192 \ - FORCE_BALANCE=true \ - MOCK_DATA=1 \ - RUN_ENV=cluster \ - MODEL_SIZE=8x22B \ - TRAIN_ITERS=50 \ - bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b - - To run training on a single node for Qwen 2.5 7B BF16, use the following - command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh TP=1 \ - CP=1 \ - PP=1 \ - MBS=10 \ - BS=640 \ - TE_FP8=0 \ - MODEL_SIZE=7 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-7B - - For FP8, use the following command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh \ - TP=1 \ - CP=1 \ - PP=1 \ - MBS=10 \ - BS=640 \ - TE_FP8=1 \ - MODEL_SIZE=7 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-7B - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-72b - - To run the training on a single node for Qwen 2.5 72B BF16, use the following command. 
- - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh \ - FSDP=1 \ - CP=1 \ - PP=1 \ - MBS=3 \ - BS=24 \ - TE_FP8=0 \ - MODEL_SIZE=72 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-72B \ - RECOMPUTE_ACTIVATIONS=full \ - CKPT_FORMAT=torch_dist - -Multi-node training examples ----------------------------- - -To run training on multiple nodes, launch the Docker container on each node. -For example, for Llama 3 using a two node setup (``NODE0`` as the master node), -use these commands. - -* On the master node ``NODE0``: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - MASTER_ADDR=IP_NODE0 \ - NNODES=2 \ - NODE_RANK=0 \ - bash examples/llama/train_llama3.sh - -* On the worker node ``NODE1``: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - MASTER_ADDR=IP_NODE0 \ - NNODES=2 \ - NODE_RANK=1 \ - bash examples/llama/train_llama3.sh - -Or, for DeepSeek-V3, an example script ``train_deepseek_v3_slurm.sh`` is -provided in -``__ to -enable training at scale under a SLURM environment. For example, to run -training on 16 nodes, try the following command: - -.. code-block:: shell - - sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh - -.. _amd-megatron-lm-benchmark-test-vars-v256: - -Key options ------------ - -The benchmark tests support the following sets of variables. - -``TEE_OUTPUT`` - ``1`` to enable training logs or ``0`` to disable. - -``TE_FP8`` - ``0`` for B16 or ``1`` for FP8 -- ``0`` by default. - -``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - -``USE_FLASH_ATTN`` - ``1`` to enable Flash Attention. - -``FSDP`` - ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``, - ``--overlap-param-gather``, and ``--sequence-parallel`` are automatically disabled. - -``ENABLE_PROFILING`` - ``1`` to enable PyTorch profiling for performance analysis. - -``transformer-impl`` - ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE. - -``MODEL_SIZE`` - ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2, for example. - -``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - -``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data you provide. - -``MBS`` - Micro batch size. - -``BS`` - Global batch size. - -``TP`` / ``TP_SIZE`` - Tensor parallel (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on. - -``EP`` / ``EP_SIZE`` - Expert parallel for MoE models. - -``SEQ_LENGTH`` - Input sequence length. - -``PR`` - Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs. - -``AC`` - Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default. - -``NUM_LAYERS`` - Use reduced number of layers as a proxy model. - -``RECOMPUTE_NUM_LAYERS`` - Number of layers used for checkpointing recompute. - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. 
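
Relating back to the multi-node training examples above, the only value that differs between the two
launch commands is ``NODE_RANK``. The loop below is a minimal sketch, not part of the Megatron-LM
scripts: it assumes passwordless SSH to every node, that the ``megatron_training_env`` container is
already running on each of them, and placeholder host names (``NODE0``, ``NODE1``) and repository
path. It simply issues the documented command on each node with the matching rank.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper that launches the two-node Llama 3 example shown above.
   # Host names, SSH access, and the repository path inside the container are assumptions.
   set -euo pipefail

   HOSTS=(NODE0 NODE1)        # index 0 is the master node
   MASTER_ADDR=IP_NODE0       # replace with the master node's reachable address

   for RANK in "${!HOSTS[@]}"; do
     ssh -n "${HOSTS[$RANK]}" \
       "docker exec megatron_training_env bash -c 'cd /workspace/Megatron-LM && \
        TEE_OUTPUT=1 MBS=2 BS=256 TP=1 TE_FP8=1 SEQ_LENGTH=8192 MODEL_SIZE=8 \
        MASTER_ADDR=${MASTER_ADDR} NNODES=${#HOSTS[@]} NODE_RANK=${RANK} \
        bash examples/llama/train_llama3.sh'" &
   done
   wait

At larger scale, a scheduler such as SLURM (as in the DeepSeek-V3 ``sbatch`` example above) is the
more robust option; the loop is only meant to make the ``NODE_RANK`` bookkeeping explicit.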
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst deleted file mode 100644 index b804abdf7..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst +++ /dev/null @@ -1,1044 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -****************************************** -Training a model with Megatron-LM for ROCm -****************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../megatron-lm` for the latest version. - - The ROCm Megatron-LM framework now has limited support with this Docker - environment; it now focuses on Primus with Megatron-Core. See :doc:`../primus-megatron`. - - To learn how to migrate your existing workloads to Primus with Megatron-Core, - see :doc:`megatron-lm-primus-migration-guide`. - -The `Megatron-LM framework for ROCm `_ is -a specialized fork of the robust Megatron-LM, designed to enable efficient -training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced -scalability, performance, and resource utilization for AI workloads. It is -purpose-built to support models like Llama, DeepSeek, and Mixtral, -enabling developers to train next-generation AI models more -efficiently. - -AMD provides ready-to-use Docker images for MI300X series accelerators containing -essential components, including PyTorch, ROCm libraries, and Megatron-LM -utilities. It contains the following software components to accelerate training -workloads: - -.. note:: - - This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with - Python 3.12 and Ubuntu 24.04, see the :doc:`previous ROCm Megatron-LM v25.6 Docker release `. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.7-benchmark-models.yaml - - {% set dockers = data.dockers %} - .. tab-set:: - - {% for docker in dockers %} - .. tab-item:: ``{{ docker.pull_tag }}`` - :sync: {{ docker.pull_tag }} - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - - {% endfor %} - {% endfor %} - - .. _amd-megatron-lm-model-support-v257: - - Supported models - ================ - - The following models are supported for training performance benchmarking with Megatron-LM and ROCm - on AMD Instinct MI300X series accelerators. - Some instructions, commands, and training recommendations in this documentation might - vary by model -- select one to get started. - - {% set model_groups = data.model_groups %} - .. raw:: html - -
- (Raw HTML model selector: a "Model" button for each {{ model_group.group }} and a "Variant" button for each {{ model.model }} in the selected group.)
- -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -.. _amd-megatron-lm-performance-measurements-v257: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `__ -page provides reference throughput and latency measurements for training -popular AI models. - -.. important:: - - The performance data presented in - `Performance results with AMD ROCm software `__ - only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. _mi300x-amd-megatron-lm-training-v257: - -Environment setup -================= - -Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker -image. - -.. _amd-megatron-lm-requirements-v257: - -Download the Docker image -------------------------- - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.7-benchmark-models.yaml - - {% set dockers = data.dockers %} - 1. Use the following command to pull the Docker image from Docker Hub. - - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in data.dockers %} - .. tab-item:: {{ docker.doc_name }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endfor %} - {% elif dockers|length == 1 %} - {% set docker = dockers[0] %} - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - {% endif %} - 2. Launch the Docker container. - - {% if dockers|length > 1 %} - .. tab-set:: - - {% for docker in dockers %} - .. tab-item:: {{ docker.doc_name }} - :sync: {{ docker.pull_tag }} - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 128G \ - --name megatron_training_env \ - {{ docker.pull_tag }} - - {% endfor %} - {% elif dockers|length == 1 %} - {% set docker = dockers[0] %} - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 128G \ - --name megatron_training_env \ - {{ docker.pull_tag }} - - {% endif %} - -3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it. - - .. 
code-block:: shell - - docker start megatron_training_env - docker exec -it megatron_training_env bash - -4. **Megatron-LM backward compatibility setup** -- this Docker is primarily intended for use with Primus, but it maintains Megatron-LM compatibility with limited support. - To roll back to using Megatron-LM, follow these steps: - - .. code-block:: shell - - cd /workspace/Megatron-LM/ - pip uninstall megatron-core - pip install -e . - -The Docker container hosts -``__ at verified commit ``e8e9edc``. - -.. _amd-megatron-lm-environment-setup-v257: - -Configuration -============= - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b - - Update the ``train_llama3.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - Update the ``train_llama2.sh`` configuration script in the ``examples/llama`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral`` - directory of - ``__ to configure your training run. - Options can also be passed as command line arguments as described in :ref:`Run training `. - -.. note:: - - See :ref:`Key options ` for more information on configuration options. - -Network interface ------------------ - -Update the network interface in the script to match your system's network interface. To -find your network interface, run the following (outside of any Docker container): - -.. code-block:: bash - - ip a - -Look for an active interface that has an IP address in the same subnet as -your other nodes. Then, update the following variables in the script, for -example: - -.. code-block:: bash - - export NCCL_SOCKET_IFNAME=ens50f0np0 - - export GLOO_SOCKET_IFNAME=ens50f0np0 - -.. _amd-megatron-lm-tokenizer-v257: - -Tokenizer ---------- - -You can assign the path of an existing tokenizer to the ``TOKENIZER_MODEL`` as shown in the following examples. -If the tokenizer is not found, it'll be downloaded if publicly available. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - If you do not have Llama 3.3 tokenizer locally, you need to use your - personal Hugging Face access token ``HF_TOKEN`` to download the tokenizer. - See `Llama-3.3-70B-Instruct - `_. After you are - authorized, use your ``HF_TOKEN`` to download the tokenizer and set the - variable ``TOKENIZER_MODEL`` to the tokenizer path. - - .. 
code-block:: shell - - export HF_TOKEN= - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.3-70B-Instruct" - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-8B" - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="meta-llama/Llama-3.1-70B" - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b - - The training script uses either the ``Llama2Tokenizer`` or ``HuggingFaceTokenizer`` by default. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3" - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="deepseek-ai/DeepSeek-V2-Lite" - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - Download the Mixtral tokenizer. - - .. code-block:: shell - - mkdir tokenizer - cd tokenizer - export HF_TOKEN= - wget --header="Authorization: Bearer $HF_TOKEN" -O ./tokenizer.model https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/tokenizer.model - - Use the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL=tokenizer/tokenizer.model - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="Qwen/Qwen2.5-7B" - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-72b - - The training script uses the ``HuggingFaceTokenizer``. Set ``TOKENIZER_MODEL`` to the appropriate Hugging Face model path. - - .. code-block:: shell - - TOKENIZER_MODEL="Qwen/Qwen2.5-72B" - -Dataset options ---------------- - -You can use either mock data or real data for training. - -* Mock data can be useful for testing and validation. Use the ``MOCK_DATA`` variable to toggle between mock and real data. The default - value is ``1`` for enabled. - - .. code-block:: bash - - MOCK_DATA=1 - -* If you're using a real dataset, update the ``DATA_PATH`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 - - DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Download the dataset -^^^^^^^^^^^^^^^^^^^^ - -.. 
container:: model-doc pyt_megatron_lm_train_llama-3.3-70b pyt_megatron_lm_train_llama-3.1-8b pyt_megatron_lm_train_llama-3.1-70b pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b pyt_megatron_lm_train_llama-3.1-70b-proxy - - For Llama models, use the `prepare_dataset.sh - `_ script - to prepare your dataset. - To download the dataset, set the ``DATASET`` variable to the dataset you'd - like to use. Three datasets are supported: ``DATASET=wiki``, ``DATASET=fineweb``, and - ``DATASET=bookcorpus``. - - .. code-block:: shell - - DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for wiki-en dataset - DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for bookcorpus dataset - - ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer. - Remember to either pre-download the tokenizer or setup Hugging Face access - otherwise when needed -- see the :ref:`Tokenizer ` section. - - .. note:: - - When training set ``DATA_PATH`` to the specific file name prefix pointing to the ``.bin`` or ``.idx`` - as in the following example: - - .. code-block:: shell - - DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - cd .. - bash tools/run_make_pretraining_dataset_megatron.sh deepseek-datasets/SlimPajama.json DeepSeekV3Tokenizer text deepseek-datasets deepseek-ai/DeepSeek-V3 - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - If you don't already have the dataset, download the DeepSeek dataset using the following - commands: - - .. code-block:: shell - - mkdir deepseek-datasets - cd deepseek-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/SlimPajama.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-train.json - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/deepseek-datasets/alpaca_zh-valid.json - cd .. - bash tools/run_make_pretraining_dataset_megatron.sh deepseek-datasets/SlimPajama.json DeepSeekV3Tokenizer text deepseek-datasets deepseek-ai/DeepSeek-V3 - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/deepseek-datasets" # Change to where your dataset is stored - -.. 
container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. code-block:: shell - - mkdir mixtral-datasets - cd mixtral-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/mistral-datasets/wudao_mistralbpe_content_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/mixtral-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b pyt_megatron_lm_train_qwen2.5-72b - - If you don't already have the dataset, download the Mixtral dataset using the following - commands: - - .. code-block:: shell - - mkdir -p temp/qwen-datasets - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.bin - wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.idx - - To train on this data, update the ``DATA_DIR`` variable to point to the location of your dataset. - - .. code-block:: bash - - MOCK_DATA=0 # Train on real data - - DATA_DIR="/qwen-datasets" # Change to where your dataset is stored - - Ensure that the files are accessible inside the Docker container. - -Multi-node configuration ------------------------- - -If you're running multi-node training, update the following environment variables. They can -also be passed as command line arguments. Refer to the following example configurations. - -* Change ``localhost`` to the master node's hostname: - - .. code-block:: shell - - MASTER_ADDR="${MASTER_ADDR:-localhost}" - -* Set the number of nodes you want to train on (for instance, ``2``, ``4``, ``8``): - - .. code-block:: shell - - NNODES="${NNODES:-1}" - -* Set the rank of each node (0 for master, 1 for the first worker node, and so on): - - .. code-block:: shell - - NODE_RANK="${NODE_RANK:-0}" - -* Set ``DATA_CACHE_PATH`` to a common directory accessible by all the nodes (for example, an - NFS directory) for multi-node runs: - - .. code-block:: shell - - DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs - -* For multi-node runs, make sure the correct network drivers are installed on the nodes. If - inside a Docker container, either install the drivers inside the Docker container or pass the network - drivers from the host while creating the Docker container. - - .. code-block:: shell - - # Specify which RDMA interfaces to use for communication - export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 - -.. _amd-megatron-lm-run-training-v257: - -Run training -============ - -Use the following example commands to set up the environment, configure -:ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. - -Single node training --------------------- - -.. container:: model-doc pyt_megatron_lm_train_llama-3.3-70b - - To run the training on a single node for Llama 3.3 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. 
- For example, use the following command: - - .. code-block:: shell - - TOKENIZER_MODEL=meta-llama/Llama-3.3-70B-Instruct \ - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=8192 \ - MBS=2 \ - BS=16 \ - TE_FP8=0 \ - TP=1 \ - PP=1 \ - FSDP=1 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-8b - - To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=128 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - For Llama 3.1 8B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=128 \ - TP=1 \ - TE_FP8=0 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b - - To run the training on a single node for Llama 3.1 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - MBS=3 \ - BS=24 \ - TP=1 \ - TE_FP8=0 \ - FSDP=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama3.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b-proxy - - To run the training on a single node for Llama 3.1 70B with proxy, use the following command. - - .. code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - RECOMPUTE=1 \ - MBS=3 \ - BS=24 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=70 \ - FSDP=1 \ - TOTAL_ITERS=10 \ - NUM_LAYERS=40 \ - bash examples/llama/train_llama3.sh - - .. note:: - - Use two or more nodes to run the *full* Llama 70B model with FP8 precision. - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_llama-2-7b - - To run training on a single node for Llama 2 7B FP8, navigate to the Megatron-LM folder and use the - following command. - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=4 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=7 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - - For Llama 2 7B BF16, use the following command: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=4 \ - BS=256 \ - TP=1 \ - TE_FP8=0 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=7 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - -.. container:: model-doc pyt_megatron_lm_train_llama-2-70b - - To run the training on a single node for Llama 2 70B BF16 with FSDP-v2 enabled, add the ``FSDP=1`` argument. - For example, use the following command: - - .. 
code-block:: shell - - CKPT_FORMAT=torch_dist \ - TEE_OUTPUT=1 \ - MBS=7 \ - BS=56 \ - TP=1 \ - TE_FP8=0 \ - FSDP=1 \ - RECOMPUTE=1 \ - SEQ_LENGTH=4096 \ - MODEL_SIZE=70 \ - TOTAL_ITERS=50 \ - bash examples/llama/train_llama2.sh - - .. note:: - - It is suggested to use ``TP=1`` when FSDP is enabled for higher - throughput. FSDP-v2 is not supported with pipeline parallelism, expert - parallelism, MCore's distributed optimizer, gradient accumulation fusion, - or FP16. - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy - - To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - export NVTE_FUSED_ATTN_CK=0 - FORCE_BALANCE=true \ - RUN_ENV=cluster \ - MODEL_SIZE=671B \ - TRAIN_ITERS=50 \ - SEQ_LEN=4096 \ - NUM_LAYERS=3 \ - MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=32 \ - PR=bf16 \ - TP=1 PP=1 ETP=1 EP=8 \ - GEMM_TUNING=1 \ - NVTE_CK_USES_BWD_V3=1 \ - USE_GROUPED_GEMM=true MOE_USE_LEGACY_GROUPED_GEMM=true \ - GPT_LAYER_IN_TE=true \ - bash examples/deepseek_v3/train_deepseekv3.sh - -.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b - - To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - export NVTE_FUSED_ATTN_CK=0 - GEMM_TUNING=1 \ - PR=bf16 \ - MBS=4 \ - AC=none \ - SEQ_LEN=4096 \ - PAD_LEN=4096 \ - TRAIN_ITERS=50 \ - bash examples/deepseek_v2/train_deepseekv2.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel), - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - TOKENIZER_MODEL= - RECOMPUTE_NUM_LAYERS=0 \ - TEE_OUTPUT=1 \ - MBS=1 \ - GBS=16 \ - TP_SIZE=1 \ - PP_SIZE=1 \ - AC=none \ - PR=bf16 \ - EP_SIZE=8 \ - ETP_SIZE=1 \ - SEQLEN=4096 \ - FORCE_BALANCE=true \ - MOCK_DATA=1 \ - RUN_ENV=cluster \ - MODEL_SIZE=8x7B \ - TRAIN_ITERS=50 \ - bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_mixtral-8x22b-proxy - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel) with 4-layer proxy, - navigate to the Megatron-LM folder and use the following command. - - .. code-block:: shell - - TOKENIZER_MODEL= - RECOMPUTE_NUM_LAYERS=4 \ - TEE_OUTPUT=1 \ - MBS=1 \ - GBS=16 \ - TP_SIZE=1 \ - PP_SIZE=1 \ - AC=full \ - NUM_LAYERS=4 \ - PR=bf16 \ - EP_SIZE=8 \ - ETP_SIZE=1 \ - SEQLEN=8192 \ - FORCE_BALANCE=true \ - MOCK_DATA=1 \ - RUN_ENV=cluster \ - MODEL_SIZE=8x22B \ - TRAIN_ITERS=50 \ - bash examples/mixtral/train_mixtral_moe.sh - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-7b - - To run training on a single node for Qwen 2.5 7B BF16, use the following - command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh TP=1 \ - CP=1 \ - PP=1 \ - MBS=10 \ - BS=640 \ - TE_FP8=0 \ - MODEL_SIZE=7 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-7B - - For FP8, use the following command. - - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh \ - TP=1 \ - CP=1 \ - PP=1 \ - MBS=10 \ - BS=640 \ - TE_FP8=1 \ - MODEL_SIZE=7 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-7B - -.. container:: model-doc pyt_megatron_lm_train_qwen2.5-72b - - To run the training on a single node for Qwen 2.5 72B BF16, use the following command. 
- - .. code-block:: shell - - bash examples/qwen/train_qwen2.sh \ - FSDP=1 \ - CP=1 \ - PP=1 \ - MBS=3 \ - BS=24 \ - TE_FP8=0 \ - MODEL_SIZE=72 \ - SEQ_LENGTH=2048 \ - TOTAL_ITERS=50 \ - MOCK_DATA=1 \ - TOKENIZER_MODEL=Qwen/Qwen2.5-72B \ - RECOMPUTE_ACTIVATIONS=full \ - CKPT_FORMAT=torch_dist - -Multi-node training examples ----------------------------- - -To run training on multiple nodes, launch the Docker container on each node. -For example, for Llama 3 using a two node setup (``NODE0`` as the master node), -use these commands. - -* On the master node ``NODE0``: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - MASTER_ADDR=IP_NODE0 \ - NNODES=2 \ - NODE_RANK=0 \ - bash examples/llama/train_llama3.sh - -* On the worker node ``NODE1``: - - .. code-block:: shell - - TEE_OUTPUT=1 \ - MBS=2 \ - BS=256 \ - TP=1 \ - TE_FP8=1 \ - SEQ_LENGTH=8192 \ - MODEL_SIZE=8 \ - MASTER_ADDR=IP_NODE0 \ - NNODES=2 \ - NODE_RANK=1 \ - bash examples/llama/train_llama3.sh - -Or, for DeepSeek-V3, an example script ``train_deepseek_v3_slurm.sh`` is -provided in -``__ to -enable training at scale under a SLURM environment. For example, to run -training on 16 nodes, try the following command: - -.. code-block:: shell - - sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh - -.. _amd-megatron-lm-benchmark-test-vars-v257: - -Key options ------------ - -The benchmark tests support the following sets of variables. - -``TEE_OUTPUT`` - ``1`` to enable training logs or ``0`` to disable. - -``TE_FP8`` - ``0`` for B16 or ``1`` for FP8 -- ``0`` by default. - -``GEMM_TUNING`` - ``1`` to enable GEMM tuning, which boosts performance by using the best GEMM kernels. - -``USE_FLASH_ATTN`` - ``1`` to enable Flash Attention. - -``FSDP`` - ``1`` to enable PyTorch FSDP2. If FSDP is enabled, ``--use-distributed-optimizer``, - ``--overlap-param-gather``, and ``--sequence-parallel`` are automatically disabled. - -``ENABLE_PROFILING`` - ``1`` to enable PyTorch profiling for performance analysis. - -``transformer-impl`` - ``transformer_engine`` to use the Transformer Engine (TE) or ``local`` to disable TE. - -``MODEL_SIZE`` - ``8B`` or ``70B`` for Llama 3 and 3.1. ``7B`` or ``70B`` for Llama 2, for example. - -``TOTAL_ITERS`` - The total number of iterations -- ``10`` by default. - -``MOCK_DATA`` - ``1`` to use mock data or ``0`` to use real data you provide. - -``MBS`` - Micro batch size. - -``BS`` - Global batch size. - -``TP`` / ``TP_SIZE`` - Tensor parallel (``1``, ``2``, ``4``, ``8``). ``TP`` is disabled when ``FSDP`` is turned on. - -``EP`` / ``EP_SIZE`` - Expert parallel for MoE models. - -``SEQ_LENGTH`` - Input sequence length. - -``PR`` - Precision for training. ``bf16`` for BF16 (default) or ``fp8`` for FP8 GEMMs. - -``AC`` - Activation checkpointing (``none``, ``sel``, or ``full``) -- ``sel`` by default. - -``NUM_LAYERS`` - Use reduced number of layers as a proxy model. - -``RECOMPUTE_NUM_LAYERS`` - Number of layers used for checkpointing recompute. - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. 
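-
-For multi-node runs, the multi-node configuration variables described earlier
-can be combined with the same per-node launch commands shown above. The
-following sketch (illustrative values only) launches the worker node while also
-setting a shared data cache path and the RDMA interfaces; substitute the master
-node address, interface list, and batch sizes for your cluster.
-
-.. code-block:: shell
-
-   # Illustrative worker-node launch (rank 1 of 2); values are placeholders
-   export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
-   DATA_CACHE_PATH=/root/cache \
-   MASTER_ADDR=IP_NODE0 \
-   NNODES=2 \
-   NODE_RANK=1 \
-   TEE_OUTPUT=1 \
-   MBS=2 \
-   BS=256 \
-   TP=1 \
-   TE_FP8=1 \
-   SEQ_LENGTH=8192 \
-   MODEL_SIZE=8 \
-   bash examples/llama/train_llama3.sh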
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst deleted file mode 100644 index 86ccd2e17..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst +++ /dev/null @@ -1,604 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -******************************************** -Training a model with Primus and Megatron-LM -******************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm Megatron-LM - training performance documentation. See :doc:`../primus-megatron` for the latest version. - -`Primus `__ is a unified and flexible -LLM training framework designed to streamline training. It streamlines LLM -training on AMD Instinct accelerators using a modular, reproducible configuration paradigm. -Primus is backend-agnostic and supports multiple training engines -- including Megatron. - -.. note:: - - Primus with the Megatron backend is intended to replace ROCm - Megatron-LM in this Dockerized training environment. To learn how to migrate - workloads from Megatron-LM to Primus with Megatron, see - :doc:`megatron-lm-primus-migration-guide`. - -For ease of use, AMD provides a ready-to-use Docker image for MI300 series accelerators -containing essential components for Primus and Megatron-LM. - -.. note:: - - This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with - Python 3.12 and Ubuntu 24.04, see the :doc:`previous ROCm Megatron-LM v25.6 Docker release `. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.7-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -.. _amd-primus-megatron-lm-model-support-v257: - -Supported models -================ - -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. -Some instructions, commands, and training examples in this documentation might -vary by model -- select one to get started. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.7-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - .. raw:: html - -
- (Raw HTML model selector: a "Model" button for each {{ model_group.group }} and a "Variant" button for each {{ model.model }} in the selected group.)
- -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. _mi300x-amd-primus-megatron-lm-training-v257: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.7-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - - Environment setup - ================= - - Use the following instructions to set up the environment, configure the script to train models, and - reproduce the benchmark results on MI300X series accelerators with the ``{{ docker.pull_tag }}`` image. - - .. _amd-primus-megatron-lm-requirements-v257: - - Download the Docker image - ------------------------- - - 1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - 2. Launch the Docker container. - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - --shm-size 128G \ - --name primus_training_env \ - {{ docker.pull_tag }} - -3. Use these commands if you exit the ``primus_training_env`` container and need to return to it. - - .. code-block:: shell - - docker start primus_training_env - docker exec -it primus_training_env bash - -The Docker container hosts verified release tag ``v0.1.0-rc1`` of the `Primus -`__ repository. - -.. _amd-primus-megatron-lm-environment-setup-v257: - -Configuration -============= - -Primus defines a training configuration in YAML for each model in -`examples/megatron/configs `__. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.7-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - {% for model_group in model_groups %} - {% for model in model_group.models %} - .. container:: model-doc {{ model.mad_tag }} - - To update training parameters for {{ model.model }}, you can update ``examples/megatron/configs/{{ model.config_name }}``. - Note that training configuration YAML files for other models follow this naming convention. - - {% endfor %} - {% endfor %} - -.. note:: - - See :ref:`Key options ` for more information on configuration options. - -Dataset options ---------------- - -You can use either mock data or real data for training. - -* Mock data can be useful for testing and validation. Use the ``mock_data`` field to toggle between mock and real data. The default - value is ``true`` for enabled. - - .. code-block:: yaml - - mock_data: true - -* If you're using a real dataset, update the ``train_data_path`` field to point to the location of your dataset. - - .. 
code-block:: bash - - mock_data: false - train_data_path: /path/to/your/dataset - - Ensure that the files are accessible inside the Docker container. - -.. _amd-primus-megatron-lm-tokenizer-v257: - -Tokenizer ---------- - -In Primus, each model uses a tokenizer from Hugging Face. For example, Llama -3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and -``tokenizer_type: Llama3Tokenizer`` defined in the `llama3.1-8B model -`__ -definition. As such, you need to set the ``HF_TOKEN`` environment variable with -right permissions to access the tokenizer for each model. - -.. code-block:: bash - - # Export your HF_TOKEN in the workspace - export HF_TOKEN= - -.. _amd-primus-megatron-lm-run-training-v257: - -Run training -============ - -Use the following example commands to set up the environment, configure -:ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. - -Single node training --------------------- - -To run training on a single node, navigate to ``/workspace/Primus`` and use the following setup command: - -.. code-block:: shell - - pip install -r requirements.txt - export HSA_NO_SCRATCH_RECLAIM=1 - export NVTE_CK_USES_BWD_V3=1 - -Once setup is complete, run the appropriate training command. - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b - - To run pre-training for Llama 3.3 70B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 16 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b - - To run pre-training for Llama 3.1 8B FP8, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 \ - --fp8 hybrid - - For Llama 3.1 8B BF16, use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \ - bash ./examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b - - To run pre-training for Llama 3.1 70B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 - - To run the training on a single node for Llama 3.1 70B FP8 with proxy, use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 \ - --num_layers 40 \ - --fp8 hybrid \ - --no_fp8_weight_transpose_cache true - - .. note:: - - Use two or more nodes to run the *full* Llama 70B model with FP8 precision. - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b - - To run pre-training for Llama 2 7B FP8, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 \ - --fp8 hybrid - - To run pre-training for Llama 2 7B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \ - bash ./examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b - - To run pre-training for Llama 2 70B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh --train_iters 50 - -.. 
container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy - - To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/deepseek_v3-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --num_layers 3 \ - --moe_layer_freq 1 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b - - To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/deepseek_v2_lite-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --global_batch_size 256 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel), - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \ - bash examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel) with 4-layer proxy, - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --num_layers 4 \ - --pipeline_model_parallel_size 1 \ - --micro_batch_size 1 \ - --global_batch_size 16 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b - - To run training on a single node for Qwen 2.5 7B BF16, use the following - command: - - .. code-block:: shell - - EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \ - bash examples/run_pretrain.sh --train_iters 50 - - For FP8, use the following command. - - .. code-block:: shell - - EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --train_iters 50 \ - --fp8 hybrid - -.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b - - To run the training on a single node for Qwen 2.5 72B BF16, use the following command. - - .. code-block:: shell - - EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \ - bash examples/run_pretrain.sh --train_iters 50 - -Multi-node training examples ----------------------------- - -To run training on multiple nodes, you can use the -`run_slurm_pretrain.sh `__ -to launch the multi-node workload. Use the following steps to setup your environment: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.7-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - - .. code-block:: shell - - cd /workspace/Primus/ - export DOCKER_IMAGE={{ docker.pull_tag }} - export HF_TOKEN= - export HSA_NO_SCRATCH_RECLAIM=1 - export NVTE_CK_USES_BWD_V3=1 - export NCCL_IB_HCA= # specify which RDMA interfaces to use for communication - export NCCL_SOCKET_IFNAME= # your Network Interface - export GLOO_SOCKET_IFNAME= # your Network Interface - export NCCL_IB_GID_INDEX=3 # Set InfiniBand GID index for NCCL communication. Default is 3 for ROCE - -.. note:: - - * Make sure correct network drivers are installed on the nodes. If inside a Docker, either install the drivers inside the Docker container or pass the network drivers from the host while creating Docker container. - * If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect. 
However, since NICs can vary accross different cluster, it is encouraged to explicitly export your NCCL parameters for the cluster. - * To find your network interface, you can use ``ip a``. - * To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices. - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b - - To train Llama 3.3 70B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 4 \ - --global_batch_size 256 \ - --recompute_num_layers 80 \ - --no_fp8_weight_transpose_cache true \ - --fp8 hybrid - - To train Llama 3.3 70B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 1 \ - --global_batch_size 256 \ - --recompute_num_layers 12 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b - - To train Llama 3.1 8B FP8 on 8 nodes, run: - - .. code-block:: shell - - # Adjust the training parameters. For e.g., `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case - NNODES=8 EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \ - bash ./examples/run_slurm_pretrain.sh \ - --global_batch_size 1024 \ - --fp8 hybrid - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b - - To train Llama 3.1 70B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 4 \ - --global_batch_size 256 \ - --recompute_num_layers 80 \ - --no_fp8_weight_transpose_cache true \ - --fp8 hybrid - - To train Llama 3.1 70B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 1 \ - --global_batch_size 256 \ - --recompute_num_layers 12 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b - - To train Llama 2 8B FP8 on 8 nodes, run: - - .. code-block:: shell - - # Adjust the training parameters. For e.g., `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case - NNODES=8 EXP=examples/megatron/configs/llama2_7B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh --global_batch_size 2048 --fp8 hybrid - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b - - To train Llama 2 70B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 10 \ - --global_batch_size 640 \ - --recompute_num_layers 80 \ - --no_fp8_weight_transpose_cache true \ - --fp8 hybrid - - To train Llama 2 70B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \ - bash ./examples/run_slurm_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 1536 \ - --recompute_num_layers 12 - -.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b - - To train Mixtral 8x7B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 256 - -.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b - - To train Qwen2.5 72B FP8 on 8 nodes, run: - - .. 
code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 8 \ - --global_batch_size 512 \ - --recompute_num_layers 80 \ - --no_fp8_weight_transpose_cache true \ - --fp8 hybrid - -.. _amd-primus-megatron-lm-benchmark-test-vars-v257: - -Key options ------------ - -The following are key options to take note of - -fp8 - ``hybrid`` enables FP8 GEMMs. - -use_torch_fsdp2 - ``use_torch_fsdp2: 1`` enables torch fsdp-v2. If FSDP is enabled, - set ``use_distributed_optimizer`` and ``overlap_param_gather`` to ``false``. - -profile - To enable PyTorch profiling, set these parameters: - - .. code-block:: yaml - - profile: true - use_pytorch_profiler: true - profile_step_end: 7 - profile_step_start: 6 - -train_iters - The total number of iterations (default: 50). - -mock_data - True by default. - -micro_batch_size - Micro batch size. - -global_batch_size - Global batch size. - -recompute_granularity - For activation checkpointing. - -num_layers - For using a reduced number of layers as with proxy models. - -Previous versions -================= - -See :doc:`megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history.rst deleted file mode 100644 index 16d5e3f9d..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history.rst +++ /dev/null @@ -1,66 +0,0 @@ -:orphan: - -**************************************************** -PyTorch training performance testing version history -**************************************************** - -This table lists previous versions of the ROCm PyTorch training Docker image for -inference performance testing. For detailed information about available models -for benchmarking, see the version-specific documentation. You can find tagged -previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub `_. - -.. list-table:: - :header-rows: 1 - - * - Image version - - Components - - Resources - - * - v25.8 (latest) - - - * ROCm 6.4.3 - * PyTorch 2.8.0a0+gitd06a406 - - - * :doc:`Primus PyTorch Training documentation <../primus-pytorch>` - * :doc:`PyTorch training (legacy) documentation <../pytorch-training>` - * `Docker Hub `__ - - * - v25.7 - - - * ROCm 6.4.2 - * PyTorch 2.8.0a0+gitd06a406 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - v25.6 - - - * ROCm 6.3.4 - * PyTorch 2.8.0a0+git7d205b2 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - v25.5 - - - * ROCm 6.3.4 - * PyTorch 2.7.0a0+git637433 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - v25.4 - - - * ROCm 6.3.0 - * PyTorch 2.7.0a0+git637433 - - - * :doc:`Documentation ` - * `Docker Hub `__ - - * - v25.3 - - - * ROCm 6.3.0 - * PyTorch 2.7.0a0+git637433 - - - * :doc:`Documentation ` - * `Docker Hub `__ diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst deleted file mode 100644 index a0e31be9e..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst +++ /dev/null @@ -1,353 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using PyTorch for ROCm. 
- :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with PyTorch for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm PyTorch - training performance documentation. See :doc:`../pytorch-training` for the latest version. - -PyTorch is an open-source machine learning framework that is widely used for -model training with GPU-optimized components for transformer-based models. - -The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.3``) image -provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following -software components to accelerate training workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.0 | -+--------------------------+--------------------------------+ -| PyTorch | 2.7.0a0+git637433 | -+--------------------------+--------------------------------+ -| Python | 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.11 | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0 | -+--------------------------+--------------------------------+ -| hipBLASLt | git258a2162 | -+--------------------------+--------------------------------+ -| Triton | 3.1 | -+--------------------------+--------------------------------+ - -.. _amd-pytorch-training-model-support-v253: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator. - -* Llama 3.1 8B - -* Llama 3.1 70B - -* FLUX.1-dev - -.. note:: - - Only these models are supported in the following steps. - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -System validation -================= - -If you have already validated your system settings, skip this step. Otherwise, -complete the :ref:`system validation and optimization steps ` -to set up your system before starting training. - -Disable NUMA auto-balancing ---------------------------- - -Generally, application performance can benefit from disabling NUMA auto-balancing. However, -it might be detrimental to performance with certain types of workloads. - -Run the command ``cat /proc/sys/kernel/numa_balancing`` to check your current NUMA (Non-Uniform -Memory Access) settings. Output ``0`` indicates this setting is disabled. If there is no output or -the output is ``1``, run the following command to disable NUMA auto-balancing. - -.. code-block:: shell - - sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - -See :ref:`System validation and optimization ` -for more information. - -Environment setup -================= - -This Docker image is optimized for specific model configurations outlined -below. Performance can vary for other training workloads, as AMD -doesn’t validate configurations and run conditions outside those described. - -Download the Docker image -------------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/pytorch-training:v25.3 - -2. Run the Docker container. - - .. 
code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.3 - -3. Use these commands if you exit the ``training_env`` container and need to return to it. - - .. code-block:: shell - - docker start training_env - docker exec -it training_env bash - -4. In the Docker container, clone the ``__ repository and navigate to the benchmark scripts directory. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch-train - -Prepare training datasets and dependencies ------------------------------------------- - -The following benchmarking examples may require downloading models and datasets -from Hugging Face. To ensure successful access to gated repos, set your -``HF_TOKEN``. - -Run the setup script to install libraries and datasets needed for benchmarking. - -.. code-block:: shell - - ./pytorch_benchmark_setup.sh - -``pytorch_benchmark_setup.sh`` installs the following libraries: - -.. list-table:: - :header-rows: 1 - - * - Library - - Benchmark model - - Reference - - * - ``accelerate`` - - Llama 3.1 8B, FLUX - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - Llama 3.1 8B, 70B, FLUX - - `Hugging Face Datasets `_ 3.2.0 - - * - ``torchdata`` - - Llama 3.1 70B - - `TorchData `_ - - * - ``tomli`` - - Llama 3.1 70B - - `Tomli `_ - - * - ``tiktoken`` - - Llama 3.1 70B - - `tiktoken `_ - - * - ``blobfile`` - - Llama 3.1 70B - - `blobfile `_ - - * - ``tabulate`` - - Llama 3.1 70B - - `tabulate `_ - - * - ``wandb`` - - Llama 3.1 70B - - `Weights & Biases `_ - - * - ``sentencepiece`` - - Llama 3.1 70B, FLUX - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - Llama 3.1 70 B, FLUX - - `TensorBoard `_ 2.18.0 - - * - ``csvkit`` - - FLUX - - `csvkit `_ 2.0.1 - - * - ``deepspeed`` - - FLUX - - `DeepSpeed `_ 0.16.2 - - * - ``diffusers`` - - FLUX - - `Hugging Face Diffusers `_ 0.31.0 - - * - ``GitPython`` - - FLUX - - `GitPython `_ 3.1.44 - - * - ``opencv-python-headless`` - - FLUX - - `opencv-python-headless `_ 4.10.0.84 - - * - ``peft`` - - FLUX - - `PEFT `_ 0.14.0 - - * - ``protobuf`` - - FLUX - - `Protocol Buffers `_ 5.29.2 - - * - ``pytest`` - - FLUX - - `PyTest `_ 8.3.4 - - * - ``python-dotenv`` - - FLUX - - `python-dotenv `_ 1.0.1 - - * - ``seaborn`` - - FLUX - - `Seaborn `_ 0.13.2 - - * - ``transformers`` - - FLUX - - `Transformers `_ 4.47.0 - -``pytorch_benchmark_setup.sh`` downloads the following models from Hugging Face: - -* `meta-llama/Llama-3.1-70B-Instruct `_ - -* `black-forest-labs/FLUX.1-dev `_ - -Along with the following datasets: - -* `WikiText `_ - -* `bghira/pseudo-camera-10k `_ - -Start training on AMD Instinct accelerators -=========================================== - -The prebuilt PyTorch with ROCm training environment allows users to quickly validate -system performance, conduct training benchmarks, and achieve superior -performance for models like Llama 3.1 and Llama 2. This container should not be -expected to provide generalized performance across all training workloads. You -can expect the container to perform in the model configurations described in -the following section, but other configurations are not validated by AMD. 
- -Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI300X series -accelerators with the AMD PyTorch training Docker image. - -Once your environment is set up, use the following commands and examples to start benchmarking. - -Pretraining ------------ - -To start the pretraining benchmark, use the following command with the -appropriate options. See the following list of options and their descriptions. - -.. code-block:: shell - - ./pytorch_benchmark_report.sh -t $training_mode -m $model_repo -p $datatype -s $sequence_length - -Options and available models -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - * - ``$training_mode`` - - ``pretrain`` - - Benchmark pretraining - - * - - - ``finetune_fw`` - - Benchmark full weight fine-tuning (Llama 3.1 70B with BF16) - - * - - - ``finetune_lora`` - - Benchmark LoRA fine-tuning (Llama 3.1 70B with BF16) - - * - ``$datatype`` - - FP8 or BF16 - - Only Llama 3.1 8B supports FP8 precision. - - * - ``$model_repo`` - - Llama-3.1-8B - - `Llama 3.1 8B `_ - - * - - - Llama-3.1-70B - - `Llama 3.1 70B `_ - - * - - - Flux - - `FLUX.1 [dev] `_ - -Fine-tuning ------------ - -To start the fine-tuning benchmark, use the following command. It will run the benchmarking example of Llama 2 70B -with the WikiText dataset using the AMD fork of `torchtune `_. - -.. code-block:: shell - - ./pytorch_benchmark_report.sh -t {finetune_fw, finetune_lora} -p BF16 -m Llama-3.1-70B - -Benchmarking examples ---------------------- - -Here are some examples of how to use the command. - -* Example 1: Llama 3.1 70B with BF16 precision with `torchtitan `_. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Llama-3.1-70B -s 8192 - -* Example 2: Llama 3.1 8B with FP8 precision using Transformer Engine (TE) and Hugging Face Accelerator. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p FP8 -m Llama-3.1-70B -s 8192 - -* Example 3: FLUX.1-dev with BF16 precision with FluxBenchmark. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Flux - -* Example 4: Torchtune full weight fine-tuning with Llama 3.1 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.1-70B - -* Example 5: Torchtune LoRA fine-tuning with Llama 3.1 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.1-70B - -Previous versions -================= - -See :doc:`pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst deleted file mode 100644 index 772c6dc4a..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst +++ /dev/null @@ -1,397 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using PyTorch for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with PyTorch for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm PyTorch - training performance documentation. 
See :doc:`../pytorch-training` for the latest version. - -PyTorch is an open-source machine learning framework that is widely used for -model training with GPU-optimized components for transformer-based models. - -The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.4``) image -provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following -software components to accelerate training workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.0 | -+--------------------------+--------------------------------+ -| PyTorch | 2.7.0a0+git637433 | -+--------------------------+--------------------------------+ -| Python | 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.11 | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0 | -+--------------------------+--------------------------------+ -| hipBLASLt | git258a2162 | -+--------------------------+--------------------------------+ -| Triton | 3.1 | -+--------------------------+--------------------------------+ - -.. _amd-pytorch-training-model-support-v254: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. - -* Llama 3.1 8B - -* Llama 3.1 70B - -* Llama 2 70B - -* FLUX.1-dev - -.. note:: - - Only these models are supported in the following steps. - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -.. _amd-pytorch-training-performance-measurements-v254: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and latency measurements for training -popular AI models. - -.. note:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system -before starting training. - -Environment setup -================= - -This Docker image is optimized for specific model configurations outlined -below. Performance can vary for other training workloads, as AMD -doesn’t validate configurations and run conditions outside those described. - -Download the Docker image -------------------------- - -1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/pytorch-training:v25.4 - -2. Run the Docker container. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.4 - -3. Use these commands if you exit the ``training_env`` container and need to return to it. - - .. code-block:: shell - - docker start training_env - docker exec -it training_env bash - -4. 
In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory - ``/workspace/MAD/scripts/pytorch_train``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch_train - -Prepare training datasets and dependencies ------------------------------------------- - -The following benchmarking examples require downloading models and datasets -from Hugging Face. To ensure successful access to gated repos, set your -``HF_TOKEN``. - -.. code-block:: shell - - export HF_TOKEN=$your_personal_hugging_face_access_token - -Run the setup script to install libraries and datasets needed for benchmarking. - -.. code-block:: shell - - ./pytorch_benchmark_setup.sh - -``pytorch_benchmark_setup.sh`` installs the following libraries: - -.. list-table:: - :header-rows: 1 - - * - Library - - Benchmark model - - Reference - - * - ``accelerate`` - - Llama 3.1 8B, FLUX - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - Llama 3.1 8B, 70B, FLUX - - `Hugging Face Datasets `_ 3.2.0 - - * - ``torchdata`` - - Llama 3.1 70B - - `TorchData `_ - - * - ``tomli`` - - Llama 3.1 70B - - `Tomli `_ - - * - ``tiktoken`` - - Llama 3.1 70B - - `tiktoken `_ - - * - ``blobfile`` - - Llama 3.1 70B - - `blobfile `_ - - * - ``tabulate`` - - Llama 3.1 70B - - `tabulate `_ - - * - ``wandb`` - - Llama 3.1 70B - - `Weights & Biases `_ - - * - ``sentencepiece`` - - Llama 3.1 70B, FLUX - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - Llama 3.1 70 B, FLUX - - `TensorBoard `_ 2.18.0 - - * - ``csvkit`` - - FLUX - - `csvkit `_ 2.0.1 - - * - ``deepspeed`` - - FLUX - - `DeepSpeed `_ 0.16.2 - - * - ``diffusers`` - - FLUX - - `Hugging Face Diffusers `_ 0.31.0 - - * - ``GitPython`` - - FLUX - - `GitPython `_ 3.1.44 - - * - ``opencv-python-headless`` - - FLUX - - `opencv-python-headless `_ 4.10.0.84 - - * - ``peft`` - - FLUX - - `PEFT `_ 0.14.0 - - * - ``protobuf`` - - FLUX - - `Protocol Buffers `_ 5.29.2 - - * - ``pytest`` - - FLUX - - `PyTest `_ 8.3.4 - - * - ``python-dotenv`` - - FLUX - - `python-dotenv `_ 1.0.1 - - * - ``seaborn`` - - FLUX - - `Seaborn `_ 0.13.2 - - * - ``transformers`` - - FLUX - - `Transformers `_ 4.47.0 - -``pytorch_benchmark_setup.sh`` downloads the following models from Hugging Face: - -* `meta-llama/Llama-3.1-70B-Instruct `_ - -* `black-forest-labs/FLUX.1-dev `_ - -Along with the following datasets: - -* `WikiText `_ - -* `UltraChat 200k `_ - -* `bghira/pseudo-camera-10k `_ - -Getting started -=============== - -The prebuilt PyTorch with ROCm training environment allows users to quickly validate -system performance, conduct training benchmarks, and achieve superior -performance for models like Llama 3.1 and Llama 2. This container should not be -expected to provide generalized performance across all training workloads. You -can expect the container to perform in the model configurations described in -the following section, but other configurations are not validated by AMD. - -Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI325X and MI300X -accelerators with the AMD PyTorch training Docker image. - -Once your environment is set up, use the following commands and examples to start benchmarking. - -Pretraining ------------ - -To start the pretraining benchmark, use the following command with the -appropriate options. See the following list of options and their descriptions. - -.. 
code-block:: shell - - ./pytorch_benchmark_report.sh -t $training_mode -m $model_repo -p $datatype -s $sequence_length - -Options and available models -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - * - ``$training_mode`` - - ``pretrain`` - - Benchmark pretraining - - * - - - ``finetune_fw`` - - Benchmark full weight fine-tuning (Llama 3.1 70B with BF16) - - * - - - ``finetune_lora`` - - Benchmark LoRA fine-tuning (Llama 3.1 70B with BF16) - - * - - - ``HF_finetune_lora`` - - Benchmark LoRA fine-tuning with Hugging Face PEFT (Llama 2 70B with BF16) - - * - ``$datatype`` - - ``FP8`` or ``BF16`` - - Only Llama 3.1 8B supports FP8 precision. - - * - ``$model_repo`` - - ``Llama-3.1-8B`` - - `Llama 3.1 8B `_ - - * - - - ``Llama-3.1-70B`` - - `Llama 3.1 70B `_ - - * - - - ``Llama-2-70B`` - - `Llama 2 70B `_ - - * - - - ``Flux`` - - `FLUX.1 [dev] `_ - - * - ``$sequence_length`` - - Sequence length for the language model. - - Between 2048 and 8192. 8192 by default. - -.. note:: - - Occasionally, downloading the Flux dataset might fail. In the event of this - error, manually download it from Hugging Face at - `black-forest-labs/FLUX.1-dev `_ - and save it to `/workspace/FluxBenchmark`. This ensures that the test script can access - the required dataset. - -Fine-tuning ------------ - -To start the fine-tuning benchmark, use the following command. It will run the benchmarking example of Llama 3.1 70B -with the WikiText dataset using the AMD fork of `torchtune `_. - -.. code-block:: shell - - ./pytorch_benchmark_report.sh -t {finetune_fw, finetune_lora} -p BF16 -m Llama-3.1-70B - -Use the following command to run the benchmarking example of Llama 2 70B with the UltraChat 200k dataset using -`Hugging Face PEFT `_. - -.. code-block:: shell - - ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B - -Benchmarking examples ---------------------- - -Here are some examples of how to use the command. - -* Example 1: Llama 3.1 70B with BF16 precision with `torchtitan `_. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Llama-3.1-70B -s 8192 - -* Example 2: Llama 3.1 8B with FP8 precision using Transformer Engine (TE) and Hugging Face Accelerator. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p FP8 -m Llama-3.1-70B -s 8192 - -* Example 3: FLUX.1-dev with BF16 precision with FluxBenchmark. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Flux - -* Example 4: Torchtune full weight fine-tuning with Llama 3.1 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.1-70B - -* Example 5: Torchtune LoRA fine-tuning with Llama 3.1 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.1-70B - -* Example 6: Hugging Face PEFT LoRA fine-tuning with Llama 2 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B - -Previous versions -================= - -See :doc:`pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. 
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst deleted file mode 100644 index e68a1092b..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst +++ /dev/null @@ -1,444 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using PyTorch for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with PyTorch for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - performance benchmark documentation. See :doc:`../pytorch-training` for the latest version. - -PyTorch is an open-source machine learning framework that is widely used for -model training with GPU-optimized components for transformer-based models. - -The `PyTorch for ROCm training Docker `_ -(``rocm/pytorch-training:v25.5``) image -provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following -software components to accelerate training workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.4 | -+--------------------------+--------------------------------+ -| PyTorch | 2.7.0a0+git637433 | -+--------------------------+--------------------------------+ -| Python | 3.10 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.12.0.dev0+25a33da | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0 | -+--------------------------+--------------------------------+ -| hipBLASLt | git53b53bf | -+--------------------------+--------------------------------+ -| Triton | 3.2.0 | -+--------------------------+--------------------------------+ - -.. _amd-pytorch-training-model-support-v255: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. - -* Llama 3.3 70B - -* Llama 3.1 8B - -* Llama 3.1 70B - -* Llama 2 70B - -* FLUX.1-dev - -.. note:: - - Only these models are supported in the following steps. - - Some models, such as Llama 3, require an external license agreement through - a third party (for example, Meta). - -.. _amd-pytorch-training-performance-measurements-v255: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and latency measurements for training -popular AI models. - -.. note:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. 
Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -This Docker image is optimized for specific model configurations outlined -below. Performance can vary for other training workloads, as AMD -doesn’t validate configurations and run conditions outside those described. - -Benchmarking -============ - -Once the setup is complete, choose between two options to start benchmarking: - -.. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - For example, use this command to run the performance benchmark test on the Llama 3.1 8B model - using one GPU with the float16 data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - python3 tools/run_models.py --tags pyt_train_llama-3.1-8b --keep-model-dir --live-output --timeout 28800 - - The available models for MAD-integrated benchmarking are: - - * ``pyt_train_llama-3.3-70b`` - - * ``pyt_train_llama-3.1-8b`` - - * ``pyt_train_llama-3.1-70b`` - - * ``pyt_train_flux`` - - MAD launches a Docker container with the name - ``container_ci-pyt_train_llama-3.1-8b``, for example. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/perf.csv``. - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required packages - - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull rocm/pytorch-training:v25.5 - - Run the Docker container. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.5 - - Use these commands if you exit the ``training_env`` container and need to return to it. - - .. code-block:: shell - - docker start training_env - docker exec -it training_env bash - - In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory - ``/workspace/MAD/scripts/pytorch_train``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch_train - - .. rubric:: Prepare training datasets and dependencies - - The following benchmarking examples require downloading models and datasets - from Hugging Face. To ensure successful access to gated repos, set your - ``HF_TOKEN``. - - .. code-block:: shell - - export HF_TOKEN=$your_personal_hugging_face_access_token - - Run the setup script to install libraries and datasets needed for benchmarking. - - .. code-block:: shell - - ./pytorch_benchmark_setup.sh - - ``pytorch_benchmark_setup.sh`` installs the following libraries: - - .. 
list-table:: - :header-rows: 1 - - * - Library - - Benchmark model - - Reference - - * - ``accelerate`` - - Llama 3.1 8B, FLUX - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - Llama 3.1 8B, 70B, FLUX - - `Hugging Face Datasets `_ 3.2.0 - - * - ``torchdata`` - - Llama 3.1 70B - - `TorchData `_ - - * - ``tomli`` - - Llama 3.1 70B - - `Tomli `_ - - * - ``tiktoken`` - - Llama 3.1 70B - - `tiktoken `_ - - * - ``blobfile`` - - Llama 3.1 70B - - `blobfile `_ - - * - ``tabulate`` - - Llama 3.1 70B - - `tabulate `_ - - * - ``wandb`` - - Llama 3.1 70B - - `Weights & Biases `_ - - * - ``sentencepiece`` - - Llama 3.1 70B, FLUX - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - Llama 3.1 70 B, FLUX - - `TensorBoard `_ 2.18.0 - - * - ``csvkit`` - - FLUX - - `csvkit `_ 2.0.1 - - * - ``deepspeed`` - - FLUX - - `DeepSpeed `_ 0.16.2 - - * - ``diffusers`` - - FLUX - - `Hugging Face Diffusers `_ 0.31.0 - - * - ``GitPython`` - - FLUX - - `GitPython `_ 3.1.44 - - * - ``opencv-python-headless`` - - FLUX - - `opencv-python-headless `_ 4.10.0.84 - - * - ``peft`` - - FLUX - - `PEFT `_ 0.14.0 - - * - ``protobuf`` - - FLUX - - `Protocol Buffers `_ 5.29.2 - - * - ``pytest`` - - FLUX - - `PyTest `_ 8.3.4 - - * - ``python-dotenv`` - - FLUX - - `python-dotenv `_ 1.0.1 - - * - ``seaborn`` - - FLUX - - `Seaborn `_ 0.13.2 - - * - ``transformers`` - - FLUX - - `Transformers `_ 4.47.0 - - ``pytorch_benchmark_setup.sh`` downloads the following models from Hugging Face: - - * `meta-llama/Llama-3.1-70B-Instruct `_ - - * `black-forest-labs/FLUX.1-dev `_ - - Along with the following datasets: - - * `WikiText `_ - - * `UltraChat 200k `_ - - * `bghira/pseudo-camera-10k `_ - - .. rubric:: Pretraining - - To start the pretraining benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t $training_mode -m $model_repo -p $datatype -s $sequence_length - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - * - ``$training_mode`` - - ``pretrain`` - - Benchmark pretraining - - * - - - ``finetune_fw`` - - Benchmark full weight fine-tuning (Llama 3.1 70B with BF16) - - * - - - ``finetune_lora`` - - Benchmark LoRA fine-tuning (Llama 3.1 70B with BF16) - - * - - - ``HF_finetune_lora`` - - Benchmark LoRA fine-tuning with Hugging Face PEFT (Llama 2 70B with BF16) - - * - ``$datatype`` - - ``FP8`` or ``BF16`` - - Only Llama 3.1 8B supports FP8 precision. - - * - ``$model_repo`` - - ``Llama-3.3-70B`` - - `Llama 3.3 70B `_ - - * - - - ``Llama-3.1-8B`` - - `Llama 3.1 8B `_ - - * - - - ``Llama-3.1-70B`` - - `Llama 3.1 70B `_ - - * - - - ``Llama-2-70B`` - - `Llama 2 70B `_ - - * - - - ``Flux`` - - `FLUX.1 [dev] `_ - - * - ``$sequence_length`` - - Sequence length for the language model. - - Between 2048 and 8192. 8192 by default. - - .. note:: - - Occasionally, downloading the Flux dataset might fail. In the event of this - error, manually download it from Hugging Face at - `black-forest-labs/FLUX.1-dev `_ - and save it to `/workspace/FluxBenchmark`. This ensures that the test script can access - the required dataset. - - .. rubric:: Fine-tuning - - To start the fine-tuning benchmark, use the following command. It will run the benchmarking example of Llama 3.1 70B - with the WikiText dataset using the AMD fork of `torchtune `_. - - .. 
code-block:: shell - - ./pytorch_benchmark_report.sh -t {finetune_fw, finetune_lora} -p BF16 -m Llama-3.1-70B - - Use the following command to run the benchmarking example of Llama 2 70B with the UltraChat 200k dataset using - `Hugging Face PEFT `_. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B - - .. rubric:: Benchmarking examples - - Here are some example commands to get started pretraining and fine-tuning with various model configurations. - - * Example 1: Llama 3.1 70B with BF16 precision with `torchtitan `_. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Llama-3.1-70B -s 8192 - - * Example 2: Llama 3.1 8B with FP8 precision using Transformer Engine (TE) and Hugging Face Accelerator. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p FP8 -m Llama-3.1-70B -s 8192 - - * Example 3: FLUX.1-dev with BF16 precision with FluxBenchmark. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Flux - - * Example 4: Torchtune full weight fine-tuning with Llama 3.1 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.1-70B - - * Example 5: Torchtune LoRA fine-tuning with Llama 3.1 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.1-70B - - * Example 6: Torchtune full weight fine-tuning with Llama-3.3-70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.3-70B - - * Example 7: Torchtune LoRA fine-tuning with Llama-3.3-70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.3-70B - - * Example 8: Torchtune QLoRA fine-tuning with Llama-3.3-70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t finetune_qlora -p BF16 -m Llama-3.3-70B - - * Example 9: Hugging Face PEFT LoRA fine-tuning with Llama 2 70B - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B - -Previous versions -================= - -See :doc:`pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst deleted file mode 100644 index f9bc57a43..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst +++ /dev/null @@ -1,456 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using PyTorch for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with PyTorch for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - performance benchmark documentation. See :doc:`../pytorch-training` for the latest version. - -PyTorch is an open-source machine learning framework that is widely used for -model training with GPU-optimized components for transformer-based models. - -The `PyTorch for ROCm training Docker `_ -(``rocm/pytorch-training:v25.6``) image provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. 
It includes the following software components to accelerate -training workloads: - -+--------------------------+--------------------------------+ -| Software component | Version | -+==========================+================================+ -| ROCm | 6.3.4 | -+--------------------------+--------------------------------+ -| PyTorch | 2.8.0a0+git7d205b2 | -+--------------------------+--------------------------------+ -| Python | 3.10.17 | -+--------------------------+--------------------------------+ -| Transformer Engine | 1.14.0+2f85f5f2 | -+--------------------------+--------------------------------+ -| Flash Attention | 3.0.0.post1 | -+--------------------------+--------------------------------+ -| hipBLASLt | 0.15.0-8c6919d | -+--------------------------+--------------------------------+ -| Triton | 3.3.0 | -+--------------------------+--------------------------------+ - -.. _amd-pytorch-training-model-support-v256: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.6-benchmark-models.yaml - - {% set unified_docker = data.unified_docker.latest %} - {% set model_groups = data.model_groups %} - - .. raw:: html - -
-
-
Workload
-
- {% for model_group in model_groups %} -
{{ model_group.group }}
- {% endfor %} -
-
- -
-
Model
-
- {% for model_group in model_groups %} - {% set models = model_group.models %} - {% for model in models %} - {% if models|length % 3 == 0 %} -
{{ model.model }}
- {% else %} -
{{ model.model }}
- {% endif %} - {% endfor %} - {% endfor %} -
-
-
- - .. note:: - - Some models require an external license agreement through a third party (for example, Meta). - - .. _amd-pytorch-training-performance-measurements-v256: - - Performance measurements - ======================== - - To evaluate performance, the - `Performance results with AMD ROCm software `_ - page provides reference throughput and latency measurements for training - popular AI models. - - .. note:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. - - System validation - ================= - - Before running AI workloads, it's important to validate that your AMD hardware is configured - correctly and performing optimally. - - If you have already validated your system settings, including aspects like NUMA auto-balancing, you - can skip this step. Otherwise, complete the procedures in the :ref:`System validation and - optimization ` guide to properly configure your system settings - before starting training. - - To test for optimal performance, consult the recommended :ref:`System health benchmarks - `. This suite of tests will help you verify and fine-tune your - system's configuration. - - This Docker image is optimized for specific model configurations outlined - below. Performance can vary for other training workloads, as AMD - doesn’t validate configurations and run conditions outside those described. - - Benchmarking - ============ - - Once the setup is complete, choose between two options to start benchmarking: - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - For example, use this command to run the performance benchmark test on the {{ model.model }} model - using one GPU with the {{ model.precision }} data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{ model.mad_tag }} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{ model.mad_tag }}``, for example. The latency and throughput reports of the - model are collected in the following path: ``~/MAD/perf.csv``. - - {% endfor %} - {% endfor %} - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required packages - - Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - Run the Docker container. - - .. code-block:: shell - - docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env {{ unified_docker.pull_tag }} - - Use these commands if you exit the ``training_env`` container and need to return to it. - - .. 
code-block:: shell - - docker start training_env - docker exec -it training_env bash - - In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory - ``/workspace/MAD/scripts/pytorch_train``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch_train - - .. rubric:: Prepare training datasets and dependencies - - The following benchmarking examples require downloading models and datasets - from Hugging Face. To ensure successful access to gated repos, set your - ``HF_TOKEN``. - - .. code-block:: shell - - export HF_TOKEN=$your_personal_hugging_face_access_token - - Run the setup script to install libraries and datasets needed for benchmarking. - - .. code-block:: shell - - ./pytorch_benchmark_setup.sh - - .. container:: model-doc pyt_train_llama-3.1-8b - - ``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``accelerate`` - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - .. container:: model-doc pyt_train_llama-3.1-70b - - ``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - * - ``torchdata`` - - `TorchData `_ - - * - ``tomli`` - - `Tomli `_ - - * - ``tiktoken`` - - `tiktoken `_ - - * - ``blobfile`` - - `blobfile `_ - - * - ``tabulate`` - - `tabulate `_ - - * - ``wandb`` - - `Weights & Biases `_ - - * - ``sentencepiece`` - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - `TensorBoard `_ 2.18.0 - - .. container:: model-doc pyt_train_flux - - ``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``accelerate`` - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - * - ``sentencepiece`` - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - `TensorBoard `_ 2.18.0 - - * - ``csvkit`` - - `csvkit `_ 2.0.1 - - * - ``deepspeed`` - - `DeepSpeed `_ 0.16.2 - - * - ``diffusers`` - - `Hugging Face Diffusers `_ 0.31.0 - - * - ``GitPython`` - - `GitPython `_ 3.1.44 - - * - ``opencv-python-headless`` - - `opencv-python-headless `_ 4.10.0.84 - - * - ``peft`` - - `PEFT `_ 0.14.0 - - * - ``protobuf`` - - `Protocol Buffers `_ 5.29.2 - - * - ``pytest`` - - `PyTest `_ 8.3.4 - - * - ``python-dotenv`` - - `python-dotenv `_ 1.0.1 - - * - ``seaborn`` - - `Seaborn `_ 0.13.2 - - * - ``transformers`` - - `Transformers `_ 4.47.0 - - ``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face: - - * `bghira/pseudo-camera-10k `_ - - {% for model_group in model_groups %} - {% for model in model_group.models %} - {% if model_group.tag == "pre-training" and model.mad_tag in ["pyt_train_llama-3.1-8b", "pyt_train_llama-3.1-70b", "pyt_train_flux"] %} - - .. container:: model-doc {{ model.mad_tag }} - - .. rubric:: Pretraining - - To start the pre-training benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain -m {{ model.model_repo }} -p $datatype -s $sequence_length - - .. 
list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - {% if model.mad_tag == "pyt_train_llama-3.1-8b" %} - * - ``$datatype`` - - ``BF16`` or ``FP8`` - - Only Llama 3.1 8B supports FP8 precision. - {% else %} - * - ``$datatype`` - - ``BF16`` - - Only Llama 3.1 8B supports FP8 precision. - {% endif %} - - * - ``$sequence_length`` - - Sequence length for the language model. - - Between 2048 and 8192. 8192 by default. - - {% if model.mad_tag == "pyt_train_flux" %} - .. container:: model-doc {{ model.mad_tag }} - - .. note:: - - Occasionally, downloading the Flux dataset might fail. In the event of this - error, manually download it from Hugging Face at - `black-forest-labs/FLUX.1-dev `_ - and save it to `/workspace/FluxBenchmark`. This ensures that the test script can access - the required dataset. - {% endif %} - {% endif %} - - {% if model_group.tag == "fine-tuning" %} - .. container:: model-doc {{ model.mad_tag }} - - .. rubric:: Fine-tuning - - To start the fine-tuning benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t $training_mode -m {{ model.model_repo }} -p BF16 -s $sequence_length - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - * - ``$training_mode`` - - ``finetune_fw`` - - Full weight fine-tuning (BF16 supported) - - * - - - ``finetune_lora`` - - LoRA fine-tuning (BF16 supported) - - * - - - ``finetune_qlora`` - - QLoRA fine-tuning (BF16 supported) - - * - - - ``HF_finetune_lora`` - - LoRA fine-tuning with Hugging Face PEFT - - * - ``$datatype`` - - ``BF16`` - - All models support BF16. - - * - ``$sequence_length`` - - Between 2048 and 16384. - - Sequence length for the language model. - - .. note:: - - {{ model.model }} currently supports the following fine-tuning methods: - - {% for method in model.training_modes %} - * ``{{ method }}`` - {% endfor %} - {% if model.training_modes|length < 4 %} - - The upstream `torchtune `_ repository - does not currently provide YAML configuration files for other combinations of - model to fine-tuning method - However, you can still configure your own YAML files to enable support for - fine-tuning methods not listed here by following existing patterns in the - ``/workspace/torchtune/recipes/configs`` directory. - {% endif %} - {% endif %} - {% endfor %} - {% endfor %} - - .. rubric:: Benchmarking examples - - For examples of benchmarking commands, see ``__. - -Further reading -=============== - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst deleted file mode 100644 index 43b9a02e5..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst +++ /dev/null @@ -1,567 +0,0 @@ -:orphan: - -.. 
meta:: - :description: How to train a model using PyTorch for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with PyTorch for ROCm -************************************** - -.. caution:: - - This documentation does not reflect the latest version of ROCm vLLM - performance benchmark documentation. See :doc:`../pytorch-training` for the latest version. - -PyTorch is an open-source machine learning framework that is widely used for -model training with GPU-optimized components for transformer-based models. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.7-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - The `PyTorch for ROCm training Docker <{{ docker.docker_hub_url }}>`__ - (``{{ docker.pull_tag }}``) image provides a prebuilt optimized environment for fine-tuning and pretraining a - model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate - training workloads: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -.. _amd-pytorch-training-model-support-v257: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. -Some instructions, commands, and training recommendations in this documentation might -vary by model -- select one to get started. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.7-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - .. raw:: html - -
-
-
Model
-
- {% for model_group in model_groups %} -
{{ model_group.group }}
- {% endfor %} -
-
- -
-
Variant
-
- {% for model_group in model_groups %} - {% set models = model_group.models %} - {% for model in models %} - {% if models|length % 3 == 0 %} -
{{ model.model }}
- {% else %} -
{{ model.model }}
- {% endif %} - {% endfor %} - {% endfor %} -
-
-
- - - .. _amd-pytorch-training-supported-training-modes-v257: - - The following table lists supported training modes per model. - - .. dropdown:: Supported training modes - - .. list-table:: - :header-rows: 1 - - * - Model - - Supported training modes - - {% for model_group in model_groups %} - {% set models = model_group.models %} - {% for model in models %} - * - {{ model.model }} - - ``{{ model.training_modes | join('``, ``') }}`` - - {% endfor %} - {% endfor %} - - .. note:: - - Some model and fine-tuning combinations are not listed. This is - because the `upstream torchtune repository `__ - doesn't provide default YAML configurations for them. - For advanced usage, you can create a custom configuration to enable - unlisted fine-tuning methods by using an existing file in the - ``/workspace/torchtune/recipes/configs`` directory as a template. - -.. _amd-pytorch-training-performance-measurements-v257: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and latency measurements for training -popular AI models. - -.. note:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -This Docker image is optimized for specific model configurations outlined -below. Performance can vary for other training workloads, as AMD -doesn’t test configurations and run conditions outside those described. - -Run training -============ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.7-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - Once the setup is complete, choose between two options to start benchmarking training: - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - 2. For example, use this command to run the performance benchmark test on the {{ model.model }} model - using one node with the {{ model.precision }} data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{ model.mad_tag }} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{ model.mad_tag }}``. 
The latency and throughput reports of the - model are collected in ``~/MAD/perf.csv``. - - {% endfor %} - {% endfor %} - - .. tab-item:: Standalone benchmarking - - .. rubric:: Download the Docker image and required packages - - 1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - 2. Run the Docker container. - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --network host \ - --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 64G \ - --name training_env \ - {{ unified_docker.pull_tag }} - - Use these commands if you exit the ``training_env`` container and need to return to it. - - .. code-block:: shell - - docker start training_env - docker exec -it training_env bash - - 3. In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory - ``/workspace/MAD/scripts/pytorch_train``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch_train - - .. rubric:: Prepare training datasets and dependencies - - 1. The following benchmarking examples require downloading models and datasets - from Hugging Face. To ensure successful access to gated repos, set your - ``HF_TOKEN``. - - .. code-block:: shell - - export HF_TOKEN=$your_personal_hugging_face_access_token - - 2. Run the setup script to install libraries and datasets needed for benchmarking. - - .. code-block:: shell - - ./pytorch_benchmark_setup.sh - - .. container:: model-doc pyt_train_llama-3.1-8b - - ``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``accelerate`` - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - .. container:: model-doc pyt_train_llama-3.1-70b - - ``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - * - ``torchdata`` - - `TorchData `_ - - * - ``tomli`` - - `Tomli `_ - - * - ``tiktoken`` - - `tiktoken `_ - - * - ``blobfile`` - - `blobfile `_ - - * - ``tabulate`` - - `tabulate `_ - - * - ``wandb`` - - `Weights & Biases `_ - - * - ``sentencepiece`` - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - `TensorBoard `_ 2.18.0 - - .. container:: model-doc pyt_train_flux - - ``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX: - - .. 
list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``accelerate`` - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - * - ``sentencepiece`` - - `SentencePiece `_ 0.2.0 - - * - ``tensorboard`` - - `TensorBoard `_ 2.18.0 - - * - ``csvkit`` - - `csvkit `_ 2.0.1 - - * - ``deepspeed`` - - `DeepSpeed `_ 0.16.2 - - * - ``diffusers`` - - `Hugging Face Diffusers `_ 0.31.0 - - * - ``GitPython`` - - `GitPython `_ 3.1.44 - - * - ``opencv-python-headless`` - - `opencv-python-headless `_ 4.10.0.84 - - * - ``peft`` - - `PEFT `_ 0.14.0 - - * - ``protobuf`` - - `Protocol Buffers `_ 5.29.2 - - * - ``pytest`` - - `PyTest `_ 8.3.4 - - * - ``python-dotenv`` - - `python-dotenv `_ 1.0.1 - - * - ``seaborn`` - - `Seaborn `_ 0.13.2 - - * - ``transformers`` - - `Transformers `_ 4.47.0 - - ``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face: - - * `bghira/pseudo-camera-10k `_ - - {% for model_group in model_groups %} - {% for model in model_group.models %} - {% set training_modes = model.training_modes %} - {% set training_mode_descs = { - "pretrain": "Benchmark pre-training.", - "HF_pretrain": "Llama 3.1 8B pre-training with FP8 precision." - } %} - {% set available_modes = training_modes | select("in", ["pretrain", "HF_pretrain"]) | list %} - {% if available_modes %} - - .. container:: model-doc {{ model.mad_tag }} - - .. rubric:: Pre-training - - To start the pre-training benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \ - -m {{ model.model_repo }} \ - -p $datatype \ - -s $sequence_length - - {% if model.mad_tag == "pyt_train_flux" %} - .. container:: model-doc {{ model.mad_tag }} - - .. note:: - - Currently, FLUX models are not supported out-of-the-box on {{ unified_docker.pull_tag }}. - To use FLUX, refer to the previous version of the ``pytorch-training`` Docker: :doc:`pytorch-training-v25.6` - - Occasionally, downloading the Flux dataset might fail. In the event of this - error, manually download it from Hugging Face at - `black-forest-labs/FLUX.1-dev `_ - and save it to `/workspace/FluxBenchmark`. This ensures that the test script can access - the required dataset. - {% endif %} - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - {% for mode in available_modes %} - * - {% if loop.first %}``$training_mode``{% endif %} - - ``{{ mode }}`` - - {{ training_mode_descs[mode] }} - {% endfor %} - - * - ``$datatype`` - - ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %} - - Only Llama 3.1 8B supports FP8 precision. - - * - ``$sequence_length`` - - Sequence length for the language model. - - Between 2048 and 8192. 8192 by default. - {% endif %} - - {% set training_mode_descs = { - "finetune_fw": "Full weight fine-tuning (BF16 and FP8 supported).", - "finetune_lora": "LoRA fine-tuning (BF16 supported).", - "finetune_qlora": "QLoRA fine-tuning (BF16 supported).", - "HF_finetune_lora": "LoRA fine-tuning with Hugging Face PEFT.", - } %} - {% set available_modes = training_modes | select("in", ["finetune_fw", "finetune_lora", "finetune_qlora", "HF_finetune_lora"]) | list %} - {% if available_modes %} - .. container:: model-doc {{ model.mad_tag }} - - .. 
rubric:: Fine-tuning - - To start the fine-tuning benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - See :ref:`supported training modes `. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t $training_mode \ - -m {{ model.model_repo }} \ - -p $datatype \ - -s $sequence_length - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - {% for mode in available_modes %} - * - {% if loop.first %}``$training_mode``{% endif %} - - ``{{ mode }}`` - - {{ training_mode_descs[mode] }} - {% endfor %} - - * - ``$datatype`` - - ``BF16``{% if "finetune_fw" in available_modes %} or ``FP8``{% endif %} - - All models support BF16.{% if "finetune_fw" in available_modes %} FP8 is only available for full weight fine-tuning.{% endif %} - - * - ``$sequence_length`` - - Between 2048 and 16384. - - Sequence length for the language model. - - {% if model.mad_tag in ["pyt_train_llama3.2-vision-11b", "pyt_train_llama-3.2-vision-90b"] %} - .. note:: - - For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B), - use the following torchtune commit for compatibility: - - .. code-block:: shell - - git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e - - {% elif model.mad_tag in ["pyt_train_llama-2-7b", "pyt_train_llama-2-13b", "pyt_train_llama-2-70b"] %} - .. note:: - - You might encounter the following error with Llama 2: ``ValueError: seq_len (16384) of - input tensor should be smaller than max_seq_len (4096)``. - This error indicates that an input sequence is longer than the model's maximum context window. - - Ensure your tokenized input does not exceed the model's ``max_seq_len`` (4096 - tokens in this case). You can resolve this by truncating the input or splitting - it into smaller chunks before passing it to the model. - - Note on reproducibility: The results in this guide are based on - commit ``b4c98ac`` from the upstream - ``__ repository. For the - latest updates, you can use the main branch. - - {% endif %} - {% endif %} - {% endfor %} - {% endfor %} - - .. rubric:: Benchmarking examples - - For examples of benchmarking commands, see ``__. - -Multi-node training -------------------- - -Pre-training -~~~~~~~~~~~~ - -Multi-node training with torchtitan is supported. The provided SLURM script is pre-configured for Llama 3 70B. - -To launch the training job on a SLURM cluster for Llama 3 70B, run the following commands from the MAD repository. - -.. code-block:: shell - - # In the MAD repository - cd scripts/pytorch_train - sbatch run_slurm_train.sh - -Fine-tuning -~~~~~~~~~~~ - -Multi-node training with torchtune is supported. The provided SLURM script is pre-configured for Llama 3.3 70B. - -To launch the training job on a SLURM cluster for Llama 3.3 70B, run the following commands from the MAD repository. - -.. code-block:: shell - - huggingface-cli login # Get access to HF Llama model space - huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally - # In the MAD repository - cd scripts/pytorch_train - sbatch Torchtune_Multinode.sh - -.. note:: - - Information regarding benchmark setup: - - * By default, Llama 3.3 70B is fine-tuned using ``alpaca_dataset``. - * You can adjust the torchtune `YAML configuration file - `__ - if you're using a different model. - * The number of nodes and other parameters can be tuned in the SLURM script ``Torchtune_Multinode.sh``. 
- * Set the ``mounting_paths`` inside the SLURM script. - -Once the run is finished, you can find the log files in the ``result_torchtune/`` directory. - -Further reading -=============== - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst deleted file mode 100644 index 65ac5e50c..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst +++ /dev/null @@ -1,665 +0,0 @@ -.. meta:: - :description: How to train a model using Megatron-LM for ROCm. - :keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch - -******************************************** -Training a model with Primus and Megatron-LM -******************************************** - -`Primus `__ is a unified and flexible -LLM training framework designed to streamline training. It streamlines LLM -training on AMD Instinct GPUs using a modular, reproducible configuration paradigm. -Primus is backend-agnostic and supports multiple training engines -- including Megatron. - -.. note:: - - Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM training ` workflow. - To learn how to migrate workloads from Megatron-LM to Primus with Megatron, - see :doc:`previous-versions/megatron-lm-primus-migration-guide`. - -For ease of use, AMD provides a ready-to-use Docker image for MI300 series GPUs -containing essential components for Primus and Megatron-LM. This Docker is powered by Primus -Turbo optimizations for performance; this release adds support for Primus Turbo -with optimized attention and grouped GEMM kernels. - -.. note:: - - This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with - Python 3.12 and Ubuntu 24.04, see the :doc:`previous ROCm Megatron-LM v25.6 Docker release `. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -.. _amd-primus-megatron-lm-model-support: - -Supported models -================ - -The following models are pre-optimized for performance on AMD Instinct MI300X series GPUs. -Some instructions, commands, and training examples in this documentation might -vary by model -- select one to get started. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - .. raw:: html - -
-
-
Model
-
- {% for model_group in model_groups %} -
{{ model_group.group }}
- {% endfor %} -
-
- -
-
Variant
-
- {% for model_group in model_groups %} - {% set models = model_group.models %} - {% for model in models %} - {% if models|length % 3 == 0 %} -
{{ model.model }}
- {% else %} -
{{ model.model }}
- {% endif %} - {% endfor %} - {% endfor %} -
-
-
- -.. note:: - - Some models, such as Llama, require an external license agreement through - a third party (for example, Meta). - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -.. _mi300x-amd-primus-megatron-lm-training: - -Environment setup -================= - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - - Use the following instructions to set up the environment, configure the script to train models, and - reproduce the benchmark results on MI300X series GPUs with the ``{{ docker.pull_tag }}`` image. - - .. _amd-primus-megatron-lm-requirements: - -Pull the Docker image -===================== - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - - 1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ docker.pull_tag }} - - 2. Launch the Docker container. - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --device /dev/infiniband \ - --network host --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - --shm-size 128G \ - --name primus_training_env \ - {{ docker.pull_tag }} - -3. Use these commands if you exit the ``primus_training_env`` container and need to return to it. - - .. code-block:: shell - - docker start primus_training_env - docker exec -it primus_training_env bash - -The Docker container hosts verified commit ``927a717`` of the `Primus -`__ repository. - -.. _amd-primus-megatron-lm-environment-setup: - -Configuration -============= - -Primus defines a training configuration in YAML for each model in -`examples/megatron/configs `__. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml - - {% set model_groups = data.model_groups %} - {% for model_group in model_groups %} - {% for model in model_group.models %} - .. container:: model-doc {{ model.mad_tag }} - - To update training parameters for {{ model.model }}, you can update ``examples/megatron/configs/{{ model.config_name }}``. - Note that training configuration YAML files for other models follow this naming convention. - - {% endfor %} - {% endfor %} - -.. note:: - - See :ref:`Key options ` for more information on configuration options. - -Dataset options ---------------- - -You can use either mock data or real data for training. - -* Mock data can be useful for testing and validation. Use the ``mock_data`` field to toggle between mock and real data. The default - value is ``true`` for enabled. - - .. code-block:: yaml - - mock_data: true - -* If you're using a real dataset, update the ``train_data_path`` field to point to the location of your dataset. - - .. 
code-block:: bash - - mock_data: false - train_data_path: /path/to/your/dataset - - Ensure that the files are accessible inside the Docker container. - -.. _amd-primus-megatron-lm-tokenizer: - -Tokenizer ---------- - -Set the ``HF_TOKEN`` environment variable with -right permissions to access the tokenizer for each model. - -.. code-block:: bash - - # Export your HF_TOKEN in the workspace - export HF_TOKEN= - -.. note:: - - In Primus, each model uses a tokenizer from Hugging Face. For example, Llama - 3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and - ``tokenizer_type: Llama3Tokenizer`` defined in the `llama3.1-8B model - `__ - definition. - -.. _amd-primus-megatron-lm-run-training: - -Run training -============ - -Use the following example commands to set up the environment, configure -:ref:`key options `, and run training on -MI300X series GPUs with the AMD Megatron-LM environment. - -Single node training --------------------- - -To run training on a single node, navigate to ``/workspace/Primus`` and use the following setup command: - -.. code-block:: shell - - pip install -r requirements.txt - export HSA_NO_SCRATCH_RECLAIM=1 - export NVTE_CK_USES_BWD_V3=1 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Llama 3.3 70B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run pre-training for Llama 3.3 70B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 16 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Llama 3.1 8B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run pre-training for Llama 3.1 8B FP8, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 \ - --fp8 hybrid - - For Llama 3.1 8B BF16, use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \ - bash ./examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Llama 3.1 70B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run pre-training for Llama 3.1 70B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 - - To run the training on a single node for Llama 3.1 70B FP8 with proxy, use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 \ - --num_layers 40 \ - --fp8 hybrid - - .. note:: - - Use two or more nodes to run the *full* Llama 70B model with FP8 precision. - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Llama 2 7B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. 
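   Before launching a run, it can help to confirm which configuration file the
   model uses and glance at its defaults. This is an optional check rather than
   part of the Primus workflow, and it assumes you are in ``/workspace/Primus``
   inside the container.

   .. code-block:: shell

      # List the Llama 2 configs shipped with Primus
      ls examples/megatron/configs | grep -i llama2

      # Preview the first lines of the Llama 2 7B training config
      sed -n '1,40p' examples/megatron/configs/llama2_7B-pretrain.yaml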
- - To run pre-training for Llama 2 7B FP8, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \ - bash ./examples/run_pretrain.sh \ - --train_iters 50 \ - --fp8 hybrid - - To run pre-training for Llama 2 7B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \ - bash ./examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Llama 2 70B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run pre-training for Llama 2 70B BF16, run: - - .. code-block:: shell - - EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \ - bash ./examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to DeepSeek-V3. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/deepseek_v3-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --num_layers 3 \ - --moe_layer_freq 1 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to DeepSeek-V2-Lite. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/deepseek_v2_lite-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --global_batch_size 256 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Mixtral 8x7B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run training on a single node for Mixtral 8x7B (MoE with expert parallel), - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \ - bash examples/run_pretrain.sh --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Mixtral 8x22B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run training on a single node for Mixtral 8x22B (MoE with expert parallel) with 4-layer proxy, - use the following command: - - .. code-block:: shell - - EXP=examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --num_layers 4 \ - --pipeline_model_parallel_size 1 \ - --micro_batch_size 1 \ - --global_batch_size 16 \ - --train_iters 50 - -.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Qwen 2.5 7B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. 
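   While any of the runs below is active, you can watch GPU utilization and
   memory from a second shell attached to the same container. This is a general
   monitoring suggestion rather than a Primus-specific step; it assumes the
   ``primus_training_env`` container name used earlier in this guide.

   .. code-block:: shell

      # From the host, attach a second shell to the running training container
      docker exec -it primus_training_env bash

      # Inside the container, refresh the GPU utilization and memory summary
      # every few seconds (re-run rocm-smi manually if watch is unavailable)
      watch -n 5 rocm-smi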
- - To run training on a single node for Qwen 2.5 7B BF16, use the following - command: - - .. code-block:: shell - - EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \ - bash examples/run_pretrain.sh --train_iters 50 - - For FP8, use the following command. - - .. code-block:: shell - - EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \ - bash examples/run_pretrain.sh \ - --train_iters 50 \ - --fp8 hybrid - -.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b - - Once setup is complete, run the appropriate training command. - The following run commands are tailored to Qwen 2.5 72B. - See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model. - - To run the training on a single node for Qwen 2.5 72B BF16, use the following command. - - .. code-block:: shell - - EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \ - bash examples/run_pretrain.sh --train_iters 50 - -.. _amd-primus-megatron-multi-node-examples: - -Multi-node training examples ----------------------------- - -Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node -training. - -To run training on multiple nodes, you can use the -`run_slurm_pretrain.sh `__ -to launch the multi-node workload. Use the following steps to setup your environment: - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - - .. code-block:: shell - - cd /workspace/Primus/ - export DOCKER_IMAGE={{ docker.pull_tag }} - export HF_TOKEN= - export HSA_NO_SCRATCH_RECLAIM=1 - export NVTE_CK_USES_BWD_V3=1 - export NCCL_IB_HCA= # specify which RDMA interfaces to use for communication - export NCCL_SOCKET_IFNAME= # your Network Interface - export GLOO_SOCKET_IFNAME= # your Network Interface - export NCCL_IB_GID_INDEX=3 # Set InfiniBand GID index for NCCL communication. Default is 3 for ROCE - -.. note:: - - * Make sure correct network drivers are installed on the nodes. If inside a Docker, either install the drivers inside the Docker container or pass the network drivers from the host while creating Docker container. - * If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect. However, since NICs can vary accross different cluster, it is encouraged to explicitly export your NCCL parameters for the cluster. - * To find your network interface, you can use ``ip a``. - * To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices. - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b - - To train Llama 3.3 70B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 1 \ - --global_batch_size 256 \ - --recompute_num_layers 80 \ - --fp8 hybrid - - To train Llama 3.3 70B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 1 \ - --global_batch_size 256 \ - --recompute_num_layers 12 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b - - To train Llama 3.1 8B FP8 on 8 nodes, run: - - .. code-block:: shell - - # Adjust the training parameters. 
For e.g., `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case - NNODES=8 EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \ - bash ./examples/run_slurm_pretrain.sh \ - --global_batch_size 1024 \ - --fp8 hybrid - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b - - To train Llama 3.1 70B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 1 \ - --global_batch_size 256 \ - --recompute_num_layers 80 \ - --fp8 hybrid - - To train Llama 3.1 70B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 1 \ - --global_batch_size 256 \ - --recompute_num_layers 12 - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b - - To train Llama 2 8B FP8 on 8 nodes, run: - - .. code-block:: shell - - # Adjust the training parameters. For e.g., `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case - NNODES=8 EXP=examples/megatron/configs/llama2_7B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh --global_batch_size 2048 --fp8 hybrid - -.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b - - To train Llama 2 70B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 256 \ - --recompute_num_layers 80 \ - --fp8 hybrid - - To train Llama 2 70B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \ - bash ./examples/run_slurm_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 1536 \ - --recompute_num_layers 12 - -.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b - - To train Mixtral 8x7B BF16 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 2 \ - --global_batch_size 256 - -.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b - - To train Qwen2.5 72B FP8 on 8 nodes, run: - - .. code-block:: shell - - NNODES=8 EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \ - bash examples/run_slurm_pretrain.sh \ - --micro_batch_size 4 \ - --global_batch_size 256 \ - --recompute_num_layers 80 \ - --fp8 hybrid - -.. _amd-primus-megatron-lm-benchmark-test-vars: - -Key options ------------ - -The following are key options to take note of - -fp8 - ``hybrid`` enables FP8 GEMMs. - -use_torch_fsdp2 - ``use_torch_fsdp2: 1`` enables torch fsdp-v2. If FSDP is enabled, - set ``use_distributed_optimizer`` and ``overlap_param_gather`` to ``false``. - -profile - To enable PyTorch profiling, set these parameters: - - .. code-block:: yaml - - profile: true - use_pytorch_profiler: true - profile_step_end: 7 - profile_step_start: 6 - -train_iters - The total number of iterations (default: 50). - -mock_data - True by default. - -micro_batch_size - Micro batch size. - -global_batch_size - Global batch size. - -recompute_granularity - For activation checkpointing. - -num_layers - For using a reduced number of layers as with proxy models. - -Further reading -=============== - -- For an introduction to Primus, see `Primus: A Lightweight, Unified Training - Framework for Large Models on AMD GPUs `__. 
- -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/megatron-lm-history` to find documentation for previous releases -of the ``ROCm/megatron-lm`` Docker image. - -This training environment now uses Primus with Megatron as the primary -configuration. Limited support for the legacy ROCm Megatron-LM is still -available; see the :doc:`megatron-lm` documentation. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst deleted file mode 100644 index 5c99776ee..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst +++ /dev/null @@ -1,308 +0,0 @@ -.. meta:: - :description: How to train a model using PyTorch for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -**************************************** -Training a model with Primus and PyTorch -**************************************** - -`Primus `__ is a unified and flexible -LLM training framework designed to streamline training. It streamlines LLM -training on AMD Instinct GPUs using a modular, reproducible configuration paradigm. -Primus now supports the PyTorch torchtitan backend. - -.. note:: - - Primus with the PyTorch torchtitan backend is designed to replace the :doc:`ROCm PyTorch training ` workflow. - See :doc:`pytorch-training` to see steps to run workloads without Primus. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - For ease of use, AMD provides a ready-to-use Docker image -- ``{{ - docker.pull_tag }}`` -- for MI300X series GPUs containing essential - components for Primus and PyTorch training with - Primus Turbo optimizations. - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -.. _amd-primus-pytorch-model-support-v258: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. -Some instructions, commands, and training recommendations in this documentation might -vary by model -- select one to get started. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - .. raw:: html - -
-      <!-- Model selector dropdown: one option per supported model, generated from model_groups -->
- -.. seealso:: - - For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models, - see the documentation :doc:`pytorch-training` (without Primus) - -.. _amd-primus-pytorch-performance-measurements-v258: - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -This Docker image is optimized for specific model configurations outlined -below. Performance can vary for other training workloads, as AMD -doesn’t test configurations and run conditions outside those described. - -Pull the Docker image -===================== - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - - Use the following command to pull the `Docker image <{{ unified_docker.docker_hub_url }}>`_ from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - -Run training -============ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - Once the setup is complete, choose between the following two workflows to start benchmarking training. - For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus). - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - The following run command is tailored to {{ model.model }}. - See :ref:`amd-primus-pytorch-model-support-v258` to switch to another available model. - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. For example, use this command to run the performance benchmark test on the {{ model.model }} model - using one node with the {{ model.precision }} data type on the host machine. - - .. code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{ model.mad_tag }} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the - model are collected in ``~/MAD/perf.csv``. - - {% endfor %} - {% endfor %} - - .. tab-item:: Standalone benchmarking - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - The following run commands are tailored to {{ model.model }}. - See :ref:`amd-primus-pytorch-model-support-v258` to switch to another available model. - - .. rubric:: Download the Docker image and required packages - - 1. 
Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - 2. Run the Docker container. - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --network host \ - --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 64G \ - --name training_env \ - {{ unified_docker.pull_tag }} - - Use these commands if you exit the ``training_env`` container and need to return to it. - - .. code-block:: shell - - docker start training_env - docker exec -it training_env bash - - 3. In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory - ``/workspace/MAD/scripts/pytorch_train``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch_train - - .. rubric:: Prepare training datasets and dependencies - - 1. The following benchmarking examples require downloading models and datasets - from Hugging Face. To ensure successful access to gated repos, set your - ``HF_TOKEN``. - - .. code-block:: shell - - export HF_TOKEN=$your_personal_hugging_face_access_token - - 2. Run the setup script to install libraries and datasets needed for benchmarking. - - .. code-block:: shell - - ./pytorch_benchmark_setup.sh - - .. rubric:: Pretraining - - To start the pretraining benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t pretrain \ - -m {{ model.model_repo }} \ - -p $datatype \ - -s $sequence_length - - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - {% for mode in available_modes %} - * - {% if loop.first %}``$training_mode``{% endif %} - - ``{{ mode }}`` - - {{ training_mode_descs[mode] }} - {% endfor %} - - * - ``$datatype`` - - ``BF16``{% if model.mad_tag == "primus_pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %} - - Currently, only Llama 3.1 8B supports FP8 precision. - - * - ``$sequence_length`` - - Sequence length for the language model. - - Between 2048 and 8192. 8192 by default. - - .. rubric:: Benchmarking examples - - Use the following command to run train {{ model.model }} with BF16 precision using Primus torchtitan. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -m {{ model.model_repo }} - - To train {{ model.model }} with FP8 precision, use the following command. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -m {{ model.model_repo }} -p FP8 - {% endfor %} - {% endfor %} - -Further reading -=============== - -- For an introduction to Primus, see `Primus: A Lightweight, Unified Training - Framework for Large Models on AMD GPUs `__. - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. 
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst deleted file mode 100644 index 88842418e..000000000 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst +++ /dev/null @@ -1,588 +0,0 @@ -:orphan: - -.. meta:: - :description: How to train a model using PyTorch for ROCm. - :keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker - -************************************** -Training a model with PyTorch on ROCm -************************************** - -.. note:: - - Primus with the PyTorch torchtitan backend is designed to replace :doc:`ROCm PyTorch training ` workflow. - See :doc:`primus-pytorch` for details. - -PyTorch is an open-source machine learning framework that is widely used for -model training with GPU-optimized components for transformer-based models. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml - - {% set dockers = data.dockers %} - {% set docker = dockers[0] %} - The `PyTorch for ROCm training Docker <{{ docker.docker_hub_url }}>`__ - (``{{ docker.pull_tag }}``) image provides a prebuilt optimized environment for fine-tuning and pretraining a - model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate - training workloads: - - .. list-table:: - :header-rows: 1 - - * - Software component - - Version - - {% for component_name, component_version in docker.components.items() %} - * - {{ component_name }} - - {{ component_version }} - {% endfor %} - -.. _amd-pytorch-training-model-support: - -Supported models -================ - -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. -Some instructions, commands, and training recommendations in this documentation might -vary by model -- select one to get started. - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - .. raw:: html - -
-      <!-- Model and Variant selector dropdowns: options generated from model_groups and their models -->
- - - .. _amd-pytorch-training-supported-training-modes: - - The following table lists supported training modes per model. - - .. dropdown:: Supported training modes - - .. list-table:: - :header-rows: 1 - - * - Model - - Supported training modes - - {% for model_group in model_groups %} - {% set models = model_group.models %} - {% for model in models %} - {% if model.training_modes %} - * - {{ model.model }} - - ``{{ model.training_modes | join('``, ``') }}`` - - {% endif %} - {% endfor %} - {% endfor %} - - .. note:: - - Some model and fine-tuning combinations are not listed. This is - because the `upstream torchtune repository `__ - doesn't provide default YAML configurations for them. - For advanced usage, you can create a custom configuration to enable - unlisted fine-tuning methods by using an existing file in the - ``/workspace/torchtune/recipes/configs`` directory as a template. - -.. _amd-pytorch-training-performance-measurements: - -Performance measurements -======================== - -To evaluate performance, the -`Performance results with AMD ROCm software `_ -page provides reference throughput and latency measurements for training -popular AI models. - -.. note:: - - The performance data presented in - `Performance results with AMD ROCm software `_ - should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X GPUs or ROCm software. - -System validation -================= - -Before running AI workloads, it's important to validate that your AMD hardware is configured -correctly and performing optimally. - -If you have already validated your system settings, including aspects like NUMA auto-balancing, you -can skip this step. Otherwise, complete the procedures in the :ref:`System validation and -optimization ` guide to properly configure your system settings -before starting training. - -To test for optimal performance, consult the recommended :ref:`System health benchmarks -`. This suite of tests will help you verify and fine-tune your -system's configuration. - -This Docker image is optimized for specific model configurations outlined -below. Performance can vary for other training workloads, as AMD -doesn’t test configurations and run conditions outside those described. - -Run training -============ - -.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml - - {% set unified_docker = data.dockers[0] %} - {% set model_groups = data.model_groups %} - - Once the setup is complete, choose between two options to start benchmarking training: - - .. tab-set:: - - .. tab-item:: MAD-integrated benchmarking - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - The following run command is tailored to {{ model.model }}. - See :ref:`amd-pytorch-training-model-support` to switch to another available model. - - 1. Clone the ROCm Model Automation and Dashboarding (``__) repository to a local - directory and install the required packages on the host machine. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD - pip install -r requirements.txt - - 2. For example, use this command to run the performance benchmark test on the {{ model.model }} model - using one node with the {{ model.precision }} data type on the host machine. - - .. 
code-block:: shell - - export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models" - madengine run \ - --tags {{ model.mad_tag }} \ - --keep-model-dir \ - --live-output \ - --timeout 28800 - - MAD launches a Docker container with the name - ``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the - model are collected in ``~/MAD/perf.csv``. - - {% endfor %} - {% endfor %} - - .. tab-item:: Standalone benchmarking - - {% for model_group in model_groups %} - {% for model in model_group.models %} - - .. container:: model-doc {{ model.mad_tag }} - - The following commands are tailored to {{ model.model }}. - See :ref:`amd-pytorch-training-model-support` to switch to another available model. - - {% endfor %} - {% endfor %} - - .. rubric:: Download the Docker image and required packages - - 1. Use the following command to pull the Docker image from Docker Hub. - - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} - - 2. Run the Docker container. - - .. code-block:: shell - - docker run -it \ - --device /dev/dri \ - --device /dev/kfd \ - --network host \ - --ipc host \ - --group-add video \ - --cap-add SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --privileged \ - -v $HOME:$HOME \ - -v $HOME/.ssh:/root/.ssh \ - --shm-size 64G \ - --name training_env \ - {{ unified_docker.pull_tag }} - - Use these commands if you exit the ``training_env`` container and need to return to it. - - .. code-block:: shell - - docker start training_env - docker exec -it training_env bash - - 3. In the Docker container, clone the ``__ - repository and navigate to the benchmark scripts directory - ``/workspace/MAD/scripts/pytorch_train``. - - .. code-block:: shell - - git clone https://github.com/ROCm/MAD - cd MAD/scripts/pytorch_train - - .. rubric:: Prepare training datasets and dependencies - - 1. The following benchmarking examples require downloading models and datasets - from Hugging Face. To ensure successful access to gated repos, set your - ``HF_TOKEN``. - - .. code-block:: shell - - export HF_TOKEN=$your_personal_hugging_face_access_token - - 2. Run the setup script to install libraries and datasets needed for benchmarking. - - .. code-block:: shell - - ./pytorch_benchmark_setup.sh - - .. container:: model-doc pyt_train_llama-3.1-8b - - ``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``accelerate`` - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - .. container:: model-doc pyt_train_llama-3.1-70b - - ``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B: - - .. list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``datasets`` - - `Hugging Face Datasets `_ 3.2.0 - - * - ``torchdata`` - - `TorchData `__ - - * - ``tomli`` - - `Tomli `__ - - * - ``tiktoken`` - - `tiktoken `__ - - * - ``blobfile`` - - `blobfile `__ - - * - ``tabulate`` - - `tabulate `__ - - * - ``wandb`` - - `Weights & Biases `__ - - * - ``sentencepiece`` - - `SentencePiece `__ 0.2.0 - - * - ``tensorboard`` - - `TensorBoard `__ 2.18.0 - - .. container:: model-doc pyt_train_flux - - ``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX: - - .. 
list-table:: - :header-rows: 1 - - * - Library - - Reference - - * - ``accelerate`` - - `Hugging Face Accelerate `_ - - * - ``datasets`` - - `Hugging Face Datasets `__ 3.2.0 - - * - ``sentencepiece`` - - `SentencePiece `__ 0.2.0 - - * - ``tensorboard`` - - `TensorBoard `__ 2.18.0 - - * - ``csvkit`` - - `csvkit `__ 2.0.1 - - * - ``deepspeed`` - - `DeepSpeed `__ 0.16.2 - - * - ``diffusers`` - - `Hugging Face Diffusers `__ 0.31.0 - - * - ``GitPython`` - - `GitPython `__ 3.1.44 - - * - ``opencv-python-headless`` - - `opencv-python-headless `__ 4.10.0.84 - - * - ``peft`` - - `PEFT `__ 0.14.0 - - * - ``protobuf`` - - `Protocol Buffers `__ 5.29.2 - - * - ``pytest`` - - `PyTest `__ 8.3.4 - - * - ``python-dotenv`` - - `python-dotenv `__ 1.0.1 - - * - ``seaborn`` - - `Seaborn `__ 0.13.2 - - * - ``transformers`` - - `Transformers `__ 4.47.0 - - ``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face: - - * `bghira/pseudo-camera-10k `__ - - {% for model_group in model_groups %} - {% for model in model_group.models %} - {% set training_modes = model.training_modes %} - {% set training_mode_descs = { - "pretrain": "Benchmark pre-training.", - "HF_pretrain": "Llama 3.1 8B pre-training with FP8 precision." - } %} - {% set available_modes = training_modes | select("in", ["pretrain", "HF_pretrain"]) | list %} - {% if available_modes %} - - .. container:: model-doc {{ model.mad_tag }} - - .. rubric:: Pre-training - - To start the pre-training benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \ - -m {{ model.model_repo }} \ - -p $datatype \ - -s $sequence_length - - {% if model.mad_tag == "pyt_train_flux" %} - .. container:: model-doc {{ model.mad_tag }} - - .. note:: - - Currently, FLUX models are not supported out-of-the-box on {{ unified_docker.pull_tag }}. - To use FLUX, refer to ``rocm/pytorch-training`` Docker: :doc:`previous-versions/pytorch-training-v25.6` - - Occasionally, downloading the Flux dataset might fail. In the event of this - error, manually download it from Hugging Face at - `black-forest-labs/FLUX.1-dev `_ - and save it to `/workspace/FluxBenchmark`. This ensures that the test script can access - the required dataset. - {% endif %} - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - {% for mode in available_modes %} - * - {% if loop.first %}``$training_mode``{% endif %} - - ``{{ mode }}`` - - {{ training_mode_descs[mode] }} - {% endfor %} - - * - ``$datatype`` - - ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %} - - Only Llama 3.1 8B supports FP8 precision. - - * - ``$sequence_length`` - - Sequence length for the language model. - - Between 2048 and 8192. 8192 by default. - {% endif %} - - {% set training_mode_descs = { - "finetune_fw": "Full weight fine-tuning (BF16 and FP8 supported).", - "finetune_lora": "LoRA fine-tuning (BF16 supported).", - "finetune_qlora": "QLoRA fine-tuning (BF16 supported).", - "HF_finetune_lora": "LoRA fine-tuning with Hugging Face PEFT.", - } %} - {% set available_modes = training_modes | select("in", ["finetune_fw", "finetune_lora", "finetune_qlora", "HF_finetune_lora"]) | list %} - {% if available_modes %} - .. container:: model-doc {{ model.mad_tag }} - - .. 
rubric:: Fine-tuning - - To start the fine-tuning benchmark, use the following command with the - appropriate options. See the following list of options and their descriptions. - See :ref:`supported training modes `. - - .. code-block:: shell - - ./pytorch_benchmark_report.sh -t $training_mode \ - -m {{ model.model_repo }} \ - -p $datatype \ - -s $sequence_length - - .. list-table:: - :header-rows: 1 - - * - Name - - Options - - Description - - {% for mode in available_modes %} - * - {% if loop.first %}``$training_mode``{% endif %} - - ``{{ mode }}`` - - {{ training_mode_descs[mode] }} - {% endfor %} - - * - ``$datatype`` - - ``BF16``{% if "finetune_fw" in available_modes %} or ``FP8``{% endif %} - - All models support BF16.{% if "finetune_fw" in available_modes %} FP8 is only available for full weight fine-tuning.{% endif %} - - * - ``$sequence_length`` - - Between 2048 and 16384. - - Sequence length for the language model. - - {% if model.mad_tag in ["pyt_train_llama3.2-vision-11b", "pyt_train_llama-3.2-vision-90b"] %} - .. note:: - - For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B), - use the following torchtune commit for compatibility: - - .. code-block:: shell - - git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e - - {% elif model.mad_tag in ["pyt_train_llama-2-7b", "pyt_train_llama-2-13b", "pyt_train_llama-2-70b"] %} - .. note:: - - You might encounter the following error with Llama 2: ``ValueError: seq_len (16384) of - input tensor should be smaller than max_seq_len (4096)``. - This error indicates that an input sequence is longer than the model's maximum context window. - - Ensure your tokenized input does not exceed the model's ``max_seq_len`` (4096 - tokens in this case). You can resolve this by truncating the input or splitting - it into smaller chunks before passing it to the model. - - Note on reproducibility: The results in this guide are based on - commit ``b4c98ac`` from the upstream - ``__ repository. For the - latest updates, you can use the main branch. - - {% endif %} - {% endif %} - {% endfor %} - {% endfor %} - - .. rubric:: Benchmarking examples - - For examples of benchmarking commands, see ``__. - -.. _amd-pytorch-training-multinode-examples: - -Multi-node training -------------------- - -Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node -training. See :ref:`rocm-for-ai-multi-node-setup-pyt-train-example` for example Slurm run commands. - -Pre-training -~~~~~~~~~~~~ - -Multi-node training with torchtitan is supported. The provided SLURM script is pre-configured for Llama 3 70B. - -To launch the training job on a SLURM cluster for Llama 3 70B, run the following commands from the MAD repository. - -.. code-block:: shell - - # In the MAD repository - cd scripts/pytorch_train - sbatch run_slurm_train.sh - -Fine-tuning -~~~~~~~~~~~ - -Multi-node training with torchtune is supported. The provided SLURM script is pre-configured for Llama 3.3 70B. - -To launch the training job on a SLURM cluster for Llama 3.3 70B, run the following commands from the MAD repository. - -.. code-block:: shell - - huggingface-cli login # Get access to HF Llama model space - huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally - # In the MAD repository - cd scripts/pytorch_train - sbatch Torchtune_Multinode.sh - -.. 
note:: - - Information regarding benchmark setup: - - * By default, Llama 3.3 70B is fine-tuned using ``alpaca_dataset``. - * You can adjust the torchtune `YAML configuration file - `__ - if you're using a different model. - * The number of nodes and other parameters can be tuned in the SLURM script ``Torchtune_Multinode.sh``. - * Set the ``mounting_paths`` inside the SLURM script. - -Once the run is finished, you can find the log files in the ``result_torchtune/`` directory. - -Further reading -=============== - -- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - -- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. - -- For a list of other ready-made Docker images for AI with ROCm, see - `AMD Infinity Hub `_. - -Previous versions -================= - -See :doc:`previous-versions/pytorch-training-history` to find documentation for previous releases -of the ``ROCm/pytorch-training`` Docker image. diff --git a/docs/how-to/rocm-for-ai/training/index.rst b/docs/how-to/rocm-for-ai/training/index.rst deleted file mode 100644 index 7f2ce1d97..000000000 --- a/docs/how-to/rocm-for-ai/training/index.rst +++ /dev/null @@ -1,34 +0,0 @@ -.. meta:: - :description: How to use ROCm for training models - :keywords: ROCm, LLM, training, GPUs, training model, scaling model, usage, tutorial - -======================= -Use ROCm for training -======================= - -Training models is the process of teaching a computer program to recognize patterns in data. This involves providing the computer with large amounts of labeled data and allowing it to learn from that data, adjusting the model's parameters. - -The process of training models is computationally intensive, requiring specialized hardware like GPUs to accelerate computations and reduce training time. Training models on AMD GPUs involves leveraging the parallel processing capabilities of these GPUs to significantly speed up the model training process in machine learning and deep learning tasks. - -Training models on AMD GPUs with the ROCm™ software platform allows you to use the powerful parallel processing capabilities and efficient compute resource management, significantly improving training time and overall performance in machine learning applications. - -The ROCm software platform makes it easier to train models on AMD GPUs while maintaining compatibility with existing code and tools. The platform also provides features like multi-GPU support, allowing for scaling and parallelization of model training across multiple GPUs to enhance performance. - -The AI Developer Hub contains `AMD ROCm tutorials `_ for -training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs. - -In this guide, you'll learn about: - -- Training a model - - - :doc:`With Primus (Megatron-LM backend) ` - - - :doc:`With Megatron-LM ` - - - :doc:`With PyTorch ` - - - :doc:`With JAX MaxText ` - - - :doc:`With LLM Foundry ` - -- :doc:`Scaling model training ` diff --git a/docs/how-to/rocm-for-ai/training/scale-model-training.rst b/docs/how-to/rocm-for-ai/training/scale-model-training.rst deleted file mode 100644 index 405206aaa..000000000 --- a/docs/how-to/rocm-for-ai/training/scale-model-training.rst +++ /dev/null @@ -1,135 +0,0 @@ -.. 
meta:: - :description: How to scale and accelerate model training - :keywords: ROCm, AI, LLM, train, fine-tune, deploy, FSDP, DeepSpeed, LLaMA, tutorial - -********************** -Scaling model training -********************** - -To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, a single accelerator or GPU cannot store all the -model parameters required for training. This immense scale presents a fundamental challenge: no single GPU or -accelerator can simultaneously store and process the entire model's parameters during training. PyTorch -provides an answer to this computational constraint through its distributed training frameworks. - -.. _rocm-for-ai-pytorch-distributed: - -PyTorch distributed -=================== - -Features in ``torch.distributed`` are categorized into three main components: - -- `Distributed data-parallel training - `_ (DDP) - -- `RPC-Based distributed training `_ (RPC) - -- `Collective communication `_ - -In this topic, the focus is on the distributed data-parallelism strategy as it’s the most popular. To get started with DDP, -you need to first understand how to coordinate the model and its training data across multiple accelerators or GPUs. - -The DDP workflow on multiple accelerators or GPUs is as follows: - -#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and - the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples. - -#. Copy the model to every device so each can process its local batches independently. - -#. Run a forward pass, then a backward pass, and output the gradient of the weights with respect to the loss of the - model for that local batch. This happens in parallel on multiple devices. - -#. Synchronize the local gradients computed by each device and combine them to update the model weights. The updated - weights are then redistributed to each device. - -In DDP training, each process or worker owns a replica of the model and processes a batch of data, and then the reducer uses -``allreduce`` to sum up gradients over different workers. - -See the following developer blogs for more in-depth explanations and examples. - -* `Multi GPU training with DDP — PyTorch Tutorials `_ - -* `Building a decoder transformer model on AMD GPUs — ROCm Blogs - `_ - -.. _rocm-for-ai-pytorch-fsdp: - -PyTorch FSDP ------------- - -As noted in :ref:`PyTorch distributed `, DDP model weights and optimizer states -are evenly replicated across all workers. Fully Sharded Data Parallel (FSDP) is a type of data parallelism that shards -model parameters, optimizer states, and gradients across DDP ranks. - -When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes -training some very large models feasible by allowing larger models or batch sizes to fit on-device. However, this -comes with the cost of increased communication volume. The communication overhead is reduced by internal optimizations -like overlapping communication and computation. - -For a high-level overview of how FSDP works, review `Getting started with Fully Sharded Data Parallel -`_. - -For detailed training steps, see `PyTorch FSDP examples -`_. - -.. _rocm-for-ai-deepspeed: - -DeepSpeed ---------- - -`DeepSpeed `_ offers system innovations that make large-scale deep learning training effective, -efficient, and easy to use. 
Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, and so on fall under -the training pillar. - -See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs -`_ for a detailed example of -training with DeepSpeed on an AMD accelerator or GPU. - -.. _rocm-for-ai-automatic-mixed-precision: - -Automatic mixed precision (AMP) -------------------------------- - -As models increase in size, so do the time and memory needed to train them; their cost also increases. Any measure we -can take to reduce training time and memory usage through `automatic mixed precision -`_ (AMP) is highly beneficial for most use cases. - -See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs -`_ -for more information about running AMP on an AMD accelerator. - -.. _rocm-for-ai-fine-tune: - -Fine-tuning your model -====================== - -ROCm supports multiple techniques for :ref:`optimizing fine-tuning `, for -example, LoRA, QLoRA, PEFT, and FSDP. - -Learn more about challenges and solutions for model fine-tuning in :doc:`../fine-tuning/index`. - -The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU. - -* Fine-tuning Llama2 with LoRA - - * `Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering - `_ - -* Fine-tuning Llama2 with QLoRA - - * `Enhancing LLM accessibility: A deep dive into QLoRA through fine-tuning Llama 2 on a single AMD GPU - `_ - -* Fine-tuning a BERT-based LLM for a text classification task using JAX - - * `LLM distributed supervised fine-tuning with JAX - `_ - -* Fine-tuning StarCoder using PEFT - - * `Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs - `_ - -* Recipes for fine-tuning Llama2 and 3 with ``llama-recipes`` - - * `meta-llama/llama-recipes: Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods to cover - single/multi-node GPUs `_ diff --git a/docs/how-to/rocm-for-hpc/index.rst b/docs/how-to/rocm-for-hpc/index.rst deleted file mode 100644 index 9545b9a04..000000000 --- a/docs/how-to/rocm-for-hpc/index.rst +++ /dev/null @@ -1,281 +0,0 @@ -.. meta:: - :description: How to use ROCm for high-performance computing (HPC). - :keywords: ROCm, AI, high performance computing, HPC, science, scientific - -****************** -Using ROCm for HPC -****************** - -The ROCm open-source software stack is optimized to extract high-performance -computing (HPC) workload performance from AMD Instinct™ accelerators -while maintaining compatibility with industry software frameworks. - -ROCm enhances support and access for developers by providing streamlined and -improved tools that significantly increase productivity. Being open-source, ROCm -fosters innovation, differentiation, and collaboration within the developer -community, making it a powerful and accessible solution for leveraging the full -potential of AMD accelerators' capabilities in diverse computational -applications. - -* For more information, see :doc:`What is ROCm? <../../what-is-rocm>`. - -* For guidance on installing ROCm, see :doc:`rocm-install-on-linux:index`. See - the :doc:`Compatibility matrix <../../compatibility/compatibility-matrix>` for details on hardware - and operating system support. - -Some of the most popular HPC frameworks are part of the ROCm platform, including -those to help parallelize operations across multiple accelerators and servers, -handle memory hierarchies, and solve linear systems. - -.. 
image:: ../../data/how-to/rocm-for-hpc/hpc-stack-2024_6_20.png - :align: center - :alt: Software and hardware ecosystem surrounding ROCm and AMD Instinct for HPC - -The following catalog of GPU-accelerated solutions includes a vast set of -platform-compatible HPC applications, including those for astrophysics, climate -and weather, computational chemistry, computational fluid dynamics, earth -science, genomics, geophysics, molecular dynamics, and physics computing. - -Refer to the resources in the following table for instructions on building, -running, and deploying these applications on ROCm-capable systems with AMD -Instinct accelerators. Each build container provides parameters to specify -different source code branches, release versions of ROCm, OpenMPI, UCX, and -Ubuntu versions. - -.. _hpc-apps: - -.. - Reduce font size of HPC app descriptions slightly. - -.. raw:: html - - - -.. container:: - :name: hpc-apps-table - - .. list-table:: - :header-rows: 1 - :stub-columns: 1 - :widths: 2 2 5 - - * - Application domain - - HPC application - - Description - - * - Physics - - `Chroma `_ - - The Chroma package supports data-parallel programming constructs for lattice - field theory and in particular lattice QCD. It uses the SciDAC QDP++ data-parallel - programming (in C++) that presents a single high-level code image to the user, - but can generate highly optimized code for many architectural systems including - single node workstations, multi and many-core nodes, clusters of nodes via - QMP, and classic vector computers. - - * - - - `MILC `_ - - The MILC Code is a set of research codes developed by MIMD Lattice Computation - (MILC) collaboration for doing simulations of four dimensional SU(3) lattice gauge - theory on MIMD parallel machines scaling from single-processor workstations - to HPC systems. The MILC Code is publicly available for research purposes. - Publications of work done using this code or derivatives of this code should - acknowledge this use. - - * - - - `QUDA `_ - - Library designed for efficient lattice QCD computations on - accelerators. It includes optimized Dirac operators and a variety of - fermion solvers and conjugate gradient (CG) implementations, enhancing - performance and accuracy in lattice QCD simulations. - - * - - - `PIConGPU `_ - - PIConGPU (Particle-in-cell on Graphics Processing Units) is an Open Source - simulations framework for plasma and laser-plasma physics used to develop - advanced particle accelerators for radiation therapy of cancer, high energy - physics and photon science. - - * - Astrophysics - - `Cholla `_ - - An astrophysical simulation code developed for the extreme environments - encountered in astrophysical systems. - - * - Geophysics - - `SPECFEM3D Cartesian `_ - - SPECFEM3D Cartesian simulates acoustic (fluid), elastic (solid), coupled - acoustic/elastic, poroelastic or seismic wave propagation in any type of - conforming mesh of hexahedra (structured or not.) It can, for instance, - model seismic waves propagating in sedimentary basins or any other - regional geological model following earthquakes. It can also be used - for non-destructive testing or for ocean acoustics. - - * - Molecular dynamics - - `Amber `_ - - Amber is a suite of biomolecular simulation programs. It is a set of molecular mechanical force fields for - simulating biomolecules. Amber is also a package of molecular simulation - programs which includes source code and demos. 
- - * - - - `GROMACS with HIP (AMD implementation) `_ - - GROMACS is a versatile package to perform molecular dynamics, i.e. - simulate the Newtonian equations of motion for systems with hundreds - to millions of particles. This AMD container is based on a released - version of GROMACS modified by AMD. This container only supports up - to a 8 GPU configuration - - * - - - `LAMMPS `_ - - LAMMPS is a classical molecular dynamics code with a focus on materials - modeling. It's an acronym for Large-scale Atomic/Molecular Massively - Parallel Simulator. - - * - Computational fluid dynamics - - `Ansys Fluent `_ - - Ansys Fluent is an advanced computational fluid dynamics (CFD) tool for - simulating and analyzing fluid flow, heat transfer, and related phenomena in complex systems. - It offers a range of powerful features for detailed and accurate modeling of various physical - processes, including turbulence, chemical reactions, and multiphase flows. - - * - - - `NEKO `_ - - Neko is a portable framework for high-order spectral element flow simulations. - Written in modern Fortran, Neko adopts an object-oriented approach, allowing - multi-tier abstractions of the solver stack and facilitating various hardware - backends ranging from general-purpose processors, CUDA and HIP enabled - accelerators to SX-Aurora vector processors. - - * - - - `nekRS `_ - - nekRS is an open-source Navier Stokes solver based on the spectral element - method targeting classical processors and accelerators like GPUs. - - * - - - `OpenFOAM `_ - - OpenFOAM is a free, open-source computational fluid dynamics (CFD) - tool developed primarily by OpenCFD Ltd. It has a large user - base across most areas of engineering and science, from both commercial and - academic organizations. OpenFOAM has extensive features to solve - anything from complex fluid flows involving chemical reactions, turbulence, and - heat transfer, to acoustics, solid mechanics, and electromagnetics. - - * - - - `PeleC `_ - - PeleC is an adaptive mesh refinement(AMR) solver for compressible reacting flows. - - * - - - `Simcenter Star-CCM+ `_ - - Simcenter Star-CCM+ is a comprehensive computational fluid dynamics (CFD) and multiphysics - simulation tool developed by Siemens Digital Industries Software. It is designed to - help engineers and researchers analyze and optimize the performance of products and - systems across various industries. - - * - Quantum Monte Carlo Simulation - - `QMCPACK `_ - - QMCPACK is an open-source production-level many-body ab initio Quantum - Monte Carlo code for computing the electronic structure of atoms, molecules, 2D - nanomaterials and solids. The solid-state capabilities include metallic systems - as well as insulators. QMCPACK is expected to run well on workstations through - to the latest generation supercomputers. Besides high performance, particular - emphasis is placed on code quality and reproducibility. - - * - Climate and weather - - `MPAS `_ - - The Model for Prediction Across Scales (MPAS) is a collaborative project for - developing atmosphere, ocean, and other earth-system simulation components - for use in climate, regional climate, and weather studies. - - * - Energy, Oil, and Gas - - `DevitoPRO `_ - - DevitoPRO is an advanced extension of the open-source Devito platform with added - features tailored for high-demand production workflows. It supports - high-performance computing (HPC) needs, especially in seismic imaging and inversion. 
- It is used to perform optimized finite difference (FD) computations - from high-level symbolic problem definitions. DevitoPro performs automated - code generation and Just-In-time (JIT) compilation based on symbolic equations - defined in SymPy to create and execute highly optimized Finite Difference stencil - kernels on multiple computer platforms. - - * - - - `ECHELON `_ - - ECHELON by Stone Ridge Technology is a reservoir simulation tool. With - fast processing, it retains precise accuracy and preserves legacy simulator results. - Faster reservoir simulation enables reservoir engineers to produce many realizations, - address larger models, and use advanced physics. It opens new workflows based on - ensemble methodologies for history matching and forecasting that yield - increased accuracy and more predictive results. - - * - Benchmark - - `rocHPL `_ - - HPL, or High-Performance Linpack, is a benchmark which solves a uniformly - random system of linear equations and reports floating-point execution rate. - - * - - - `rocHPL-MxP `_ - - Benchmark that highlights the convergence of HPC and AI workloads by - solving a system of linear equations using novel, mixed-precision - algorithms. - - * - - - `HPCG `_ - - HPCG, or the High Performance Conjugate Gradient Benchmark complements - the High Performance LINPACK (HPL) benchmark. The computational and data - access patterns of HPCG are designed to closely match a broad set of important - applications not represented by HPL, and to incentivize computer system - designers to invest in capabilities that will benefit the collective performance - of these applications. - - * - Tools and libraries - - `AMD ROCm with OpenMPI container `_ - - Base container for GPU-aware MPI with ROCm for HPC applications. This - project provides a boilerplate for building and running a Docker - container with ROCm supporting GPU-aware MPI implementations using - OpenMPI or UCX. - - * - - - `AMD ROCm with MPICH container `_ - - Base container for GPU-aware MPI with ROCm for HPC applications. This - project provides a boilerplate for building and running a Docker - container with ROCm supporting GPU-aware MPI implementations using MPICH. - - * - - - `Kokkos `_ - - Kokkos is a programming model in C++ for writing performance portable - applications for use across HPC platforms. It provides abstractions for both - parallel execution of code and data management. Kokkos is designed to target - complex node architectures with N-level memory hierarchies and multiple types - of execution resources. - - * - - - `PyFR `_ - - PyFR is an open-source Python based framework for solving advection-diffusion - type problems on streaming architectures using the Flux Reconstruction approach of - Huynh. The framework is designed to solve a range of governing systems on mixed - unstructured grids containing various element types. It is also designed to target a - range of hardware platforms via use of an in-built domain specific language derived - from the Mako templating engine. - - * - - - `PETSc `_ - - Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures - and routines for the scalable (parallel) solution of scientific applications modeled by partial - differential equations. It supports MPI, GPUs through CUDA, HIP, and OpenCL, - as well as hybrid MPI-GPU parallelism. It also supports the NEC-SX Tsubasa Vector Engine. - PETSc also includes the Toolkit for Advanced Optimization (TAO) library. 
- - * - - - `RAJA `_ - - RAJA is a library of C++ software abstractions, primarily developed at Lawrence - Livermore National Laboratory (LLNL), that enables architecture and programming - model portability for HPC applications. - - * - - - `Trilinos `_ - - The Trilinos Project is an effort to develop algorithms and enabling technologies - within an object-oriented software framework for the solution of large-scale, - complex multi-physics engineering and scientific problems. - -To learn about ROCm for AI applications, see :doc:`../rocm-for-ai/index`. diff --git a/docs/how-to/setting-cus.rst b/docs/how-to/setting-cus.rst deleted file mode 100644 index 7e923da23..000000000 --- a/docs/how-to/setting-cus.rst +++ /dev/null @@ -1,42 +0,0 @@ -.. meta:: - :description: Setting the number of CUs - :keywords: CU, CUs, number of CUs, compute units - -.. _settings-cus-reference: - -************************************************************* -Setting the number of compute units -************************************************************* - -The GPU driver provides two environment variables to set the number of CUs used: - -- ``HSA_CU_MASK`` -- ``ROC_GLOBAL_CU_MASK`` - -The ``ROC_GLOBAL_CU_MASK`` variable sets the CU mask on queues created by HIP or OpenCL runtimes. The ``HSA_CU_MASK`` variable sets the mask on a lower level of queue creation in the driver. It also sets the mask on the queues being profiled. - -.. tip:: - - When using GPUs to accelerate compute workloads, it sometimes becomes necessary to configure the hardware's usage of compute units (CU). This is a more advanced option, so please read this page before experimentation. - -The environment variables have the following syntax: - -:: - - ID = [0-9][0-9]* ex. base 10 numbers - ID_list = (ID | ID-ID)[, (ID | ID-ID)]* ex. 0,2-4,7 - GPU_list = ID_list ex. 0,2-4,7 - CU_list = 0x[0-F]* | ID_list ex. 0x337F OR 0,2-4,7 - CU_Set = GPU_list : CU_list ex. 0,2-4,7:0-15,32-47 OR 0,2-4,7:0x337F - HSA_CU_MASK = CU_Set [; CU_Set]* ex. 0,2-4,7:0-15,32-47; 3-9:0x337F - -The GPU indices are taken post ``ROCR_VISIBLE_DEVICES`` reordering. The listed or masked CUs are enabled for listed GPUs, and the others are disabled. Unlisted GPUs are not be affected, and their CUs are enabled. - -The variable parsing stops when a syntax error occurs. The erroneous set and the following are ignored. Repeating GPU or CU IDs results in a syntax error. Specifying a mask with no usable CUs (CU_list is 0x0) results in a syntax error. To exclude GPU devices, use ``ROCR_VISIBLE_DEVICES``. - -.. note:: - - These environment variables only affect ROCm software, not graphics applications. - -Not all CU configurations are valid on all devices. For example, on devices where two CUs can be combined into a WGP (for kernels running in WGP mode), it’s not valid to disable only a single CU in a WGP. - diff --git a/docs/how-to/system-debugging.md b/docs/how-to/system-debugging.md deleted file mode 100644 index 99435445b..000000000 --- a/docs/how-to/system-debugging.md +++ /dev/null @@ -1,68 +0,0 @@ ---- -myst: - html_meta: - "description": "Learn more about common system-level debugging measures for ROCm." 
- "keywords": "env, var, sys, PCIe, troubleshooting, admin, error" ---- - -# System debugging - -## ROCm language and system-level debug, flags, and environment variables - -Kernel options to avoid: the Ethernet port getting renamed every time you change graphics cards, `net.ifnames=0 biosdevname=0` - -## ROCr error code - -* 2 Invalid Dimension -* 4 Invalid Group Memory -* 8 Invalid (or Null) Code -* 32 Invalid Format -* 64 Group is too large -* 128 Out of VGPRs -* 0x80000000 Debug Options - -## Command to dump firmware version and get Linux kernel version - -`sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info` - -`uname -a` - -## Debug flags - -Debug messages when developing/debugging base ROCm driver. You could enable the printing from `libhsakmt.so` by setting an environment variable, `HSAKMT_DEBUG_LEVEL`. Available debug levels are 3-7. The higher level you set, the more messages will print. - -* `export HSAKMT_DEBUG_LEVEL=3` : Only pr_err() prints. - -* `export HSAKMT_DEBUG_LEVEL=4` : pr_err() and pr_warn() print. - -* `export HSAKMT_DEBUG_LEVEL=5` : We currently do not implement “notice”. Setting to 5 is same as setting to 4. - -* `export HSAKMT_DEBUG_LEVEL=6` : pr_err(), pr_warn(), and pr_info print. - -* `export HSAKMT_DEBUG_LEVEL=7` : Everything including pr_debug prints. - -## ROCr level environment variables for debug - -`HSA_ENABLE_SDMA=0` - -`HSA_ENABLE_INTERRUPT=0` - -`HSA_SVM_GUARD_PAGES=0` - -`HSA_DISABLE_CACHE=1` - -## Turn off page retry on GFX9/Vega devices - -`sudo -s` - -`echo 1 > /sys/module/amdkfd/parameters/noretry` - -## HIP environment variables 3.x - -### OpenCL debug flags - -`AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)` - -## PCIe-debug - -For information on how to debug and profile HIP applications, see {doc}`hip:how-to/debugging` diff --git a/docs/how-to/system-optimization/index.rst b/docs/how-to/system-optimization/index.rst deleted file mode 100644 index d8e46b52a..000000000 --- a/docs/how-to/system-optimization/index.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. meta:: - :description: Learn about AMD hardware optimization for HPC-specific and workstation workloads. - :keywords: high-performance computing, HPC, Instinct accelerators, Radeon, - tuning, tuning guide, AMD, ROCm - -******************* -System optimization -******************* - -This guide outlines system setup and tuning suggestions for AMD hardware to -optimize performance for specific types of workloads or use-cases. The contents are structured according to the hardware: - -.. grid:: 2 - - .. grid-item-card:: AMD RDNA - - * :doc:`AMD RDNA2 system optimization ` - - .. grid-item-card:: AMD Instinct - - * `AMD Instinct MI300X `_ - * `AMD Instinct MI300A `_ - * `AMD Instinct MI200 `_ - * `AMD Instinct MI100 `_ - - - - diff --git a/docs/how-to/system-optimization/w6000-v620.md b/docs/how-to/system-optimization/w6000-v620.md deleted file mode 100644 index 8d5af6404..000000000 --- a/docs/how-to/system-optimization/w6000-v620.md +++ /dev/null @@ -1,200 +0,0 @@ ---- -myst: - html_meta: - "description": "Learn about system settings and performance tuning for RDNA2-based GPUs." - "keywords": "RDNA2, workstation, desktop, BIOS, installation, Radeon, pro, v620, w6000" ---- -:orphan: - -# AMD RDNA2 system optimization - -## Workstation workloads - -Workstation workloads, much like those for HPC, have a unique set of -requirements: a blend of both graphics and compute, certification, stability and -others. 
- -The document covers specific software requirements and processes needed to use -these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning -tasks. - -The main purpose of this document is to help users utilize the RDNA™ 2 GPUs to -their full potential. - -```{list-table} - :header-rows: 1 - :stub-columns: 1 - - * - System Guide - - - Architecture reference - - - White papers - - * - [System settings](#system-settings) - - - [AMD RDNA 2 instruction set architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf) - - - [RDNA 2 architecture](https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf) -``` - -## System settings - -This chapter reviews system settings that are required to configure the system -for ROCm virtualization on RDNA2-based AMD Radeon™ PRO GPUs. Installing ROCm on -Bare Metal follows the routine ROCm -{doc}`installation procedure`. - -To enable ROCm virtualization on V620, one has to setup Single Root I/O -Virtualization (SR-IOV) in the BIOS via setting found in the following -({ref}`bios-settings`). A tested configuration can be followed in -({ref}`os-settings`). - -:::{attention} -SR-IOV is supported on V620 and unsupported on W6800. -::: - -(bios-settings)= - -### System BIOS settings - -```{list-table} Settings for the system BIOS in an ASrock platform. -:header-rows: 1 -:name: v620-bios - -* - - Advanced / North Bridge Configuration - - IOMMU - - Enabled - - Input-output Memory Management Unit -* - - Advanced / North Bridge Configuration - - ACS Enable - - Enabled - - Access Control Service -* - - Advanced / PCIe/PCI/PnP Configuration - - SR-IOV Support - - Enabled - - Single Root I/O Virtualization -* - - Advanced / ACPI settings - - PCI AER Support - - Enabled - - Advanced Error Reporting -``` - -To set up the host, update SBIOS to version 1.2a. 
- -(os-settings)= - -### Operating system settings - -```{list-table} System Configuration Prerequisites -:header-rows: 1 -:name: v620-prereq - -* - - Server - - [SMC 4124](https://www.supermicro.com/en/Aplus/system/4U/4124/AS-4124GS-TNR.cfm) [AS -4124GS-TNR] -* - - Host OS - - Ubuntu 20.04.3 LTS -* - - Host Kernel - - 5.4.0-97-generic -* - - CPU - - AMD EPYC 7552 48-Core Processor -* - - GPU - - RDNA2 V620 (D603GLXE) -* - - SBIOS - - Version SMC_r_1.2a -* - - VBIOS - - 113-D603GLXE-077 -* - - Guest OS 1 - - Ubuntu 20.04.5 LTS -* - - Guest OS 2 - - RHEL 9.0 -* - - GIM Driver - - gim-dkms_1.0.0.1234577_all -* - - VM CPU Cores - - 32 -* - - VM RAM - - 64 GB -``` - -Install the following Kernel-based Virtual Machine (KVM) hypervisor packages: - -```shell -sudo apt-get -y install qemu-kvm qemu-utils bridge-utils virt-manager gir1.2-spiceclientgtk* gir1.2-spice-client-gtk* libvirt-daemon-system dnsmasq-base -sudo virsh net-start default  # to enable the virtual network by default -``` - -Enable the input-output memory management unit (IOMMU) in GRUB settings by adding the following line to `/etc/default/grub`: - -```none -GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on" for AMD CPU -``` - -Update GRUB and reboot: - -```shell -sudo update-grub -sudo reboot -``` - -Install the GPU-IOV Module (GIM, where IOV is I/O Virtualization) driver and -follow the steps below. - -```shell -sudo dpkg -i -sudo reboot -# Load Host Driver to Create 1VF -sudo modprobe gim vf_num=1 -# Note: If the GIM driver loaded successfully, you should see "gim info:(gim_init:213) *****Running GIM*****" in dmesg -lspci -d 1002: -``` - -This should output something like: - -```none -01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478 -02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479 -03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73a1 -03:02.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73ae → VF -``` - -### Guest OS installation - -First, assign the GPU virtual function (VF) to the VM using the following steps. - -1. Shut down the VM. - -2. Run `virt-manager`. - -3. In the **Virtual Machine Manager** GUI, select the **VM** and click **Open**. - - ![Virtual Machine Manager](../../data/how-to/tuning-guides/tuning014.png "Virtual Machine Manager") - -4. In the VM GUI, go to **Show Virtual Hardware Details > Add Hardware** to - configure hardware. - - ![Show virtual hardware details](../../data/how-to/tuning-guides/tuning015.png "Virtual Machine Manager: show virtual hardware details") - -5. Go to **Add Hardware > PCI Host Device > VF** and click **Finish**. - - ![VF Selection](../../data/how-to/tuning-guides/tuning016.png "VF Selection") - -Then start the VM. - -Finally, install ROCm on the virtual machine (VM). For detailed instructions, -refer to the {doc}`Linux install guide`. diff --git a/docs/how-to/tuning-guides/mi300x/index.rst b/docs/how-to/tuning-guides/mi300x/index.rst deleted file mode 100644 index 61bc3c4ee..000000000 --- a/docs/how-to/tuning-guides/mi300x/index.rst +++ /dev/null @@ -1,20 +0,0 @@ -:orphan: - -.. meta:: - :description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
- :keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning - -************************ -AMD MI300X tuning guides -************************ - -The tuning guides in this section provide a comprehensive summary of the -necessary steps to properly configure your system for AMD Instinct™ MI300X -accelerators. They include detailed instructions on system settings and -application tuning suggestions to help you fully leverage the capabilities of -these accelerators, thereby achieving optimal performance. - -* :doc:`/how-to/rocm-for-ai/inference-optimization/workload` -* `AMD Instinct MI300X system optimization `_ - - diff --git a/docs/reference/api-libraries.md b/docs/reference/api-libraries.md deleted file mode 100644 index e64a963aa..000000000 --- a/docs/reference/api-libraries.md +++ /dev/null @@ -1,70 +0,0 @@ - - - - - - -# ROCm libraries - -::::{grid} 1 2 2 2 -:gutter: 3 -:class-container: rocm-doc-grid - -:::{grid-item-card} Machine Learning and Computer Vision -:class-body: rocm-card-banner rocm-hue-3 - -(artificial-intelligence-apis)= - -* {doc}`Composable Kernel ` -* {doc}`MIGraphX ` -* {doc}`MIOpen ` -* {doc}`MIVisionX ` -* {doc}`rocAL ` -* {doc}`rocDecode ` -* {doc}`rocPyDecode ` -* {doc}`rocJPEG ` -* {doc}`ROCm Performance Primitives (RPP) ` -::: - -:::{grid-item-card} Primitives -:class-body: rocm-card-banner rocm-hue-12 - -(cpp-primitives)= - -* {doc}`hipCUB ` -* {doc}`hipTensor ` -* {doc}`rocPRIM ` -* {doc}`rocThrust ` -::: - -:::{grid-item-card} Communication -:class-body: rocm-card-banner rocm-hue-7 - -(communication-libraries)= - -* {doc}`RCCL ` -* {doc}`rocSHMEM ` -::: - -:::{grid-item-card} Math -:class-body: rocm-card-banner rocm-hue-6 - -(math-apis)= - -* [half](https://github.com/ROCm/half) -* {doc}`hipBLAS ` / {doc}`rocBLAS ` -* {doc}`hipBLASLt ` -* {doc}`hipFFT ` / {doc}`rocFFT ` -* {doc}`hipfort ` -* {doc}`hipRAND ` / {doc}`rocRAND ` -* {doc}`hipSOLVER ` / {doc}`rocSOLVER ` -* {doc}`hipSPARSE ` / {doc}`rocSPARSE ` -* {doc}`hipSPARSELt ` -* {doc}`rocALUTION ` -* {doc}`rocWMMA ` -* {doc}`Tensile ` -::: - -:::: diff --git a/docs/reference/gpu-arch-specs.rst b/docs/reference/gpu-arch-specs.rst deleted file mode 100644 index 4f695dfdb..000000000 --- a/docs/reference/gpu-arch-specs.rst +++ /dev/null @@ -1,1076 +0,0 @@ -.. meta:: - :description: AMD Instinct™ accelerator, AMD Radeon PRO™, and AMD Radeon™ GPU architecture information - :keywords: Instinct, Radeon, accelerator, GCN, CDNA, RDNA, GPU, architecture, VRAM, Compute Units, Cache, Registers, LDS, Register File - -AMD GPU hardware specifications -=============================== - -The following tables provide an overview of the hardware specifications for AMD Instinct™ accelerators, and AMD Radeon™ PRO and Radeon™ GPUs. - -For more information about ROCm hardware compatibility, see the ROCm `Compatibility matrix `_. - -.. tab-set:: - - .. tab-item:: AMD Instinct accelerators - - .. 
list-table:: - :header-rows: 1 - :name: instinct-arch-spec-table - - * - - Model - - Architecture - - LLVM target name - - VRAM (GiB) - - Compute Units - - Wavefront Size - - LDS (KiB) - - L3 Cache (MiB) - - L2 Cache (MiB) - - L1 Vector Cache (KiB) - - L1 Scalar Cache (KiB) - - L1 Instruction Cache (KiB) - - VGPR File (KiB) - - SGPR File (KiB) - - GFXIP Major version - - GFXIP Minor version - * - - MI355X - - CDNA4 - - gfx950 - - 288 - - 256 (32 per XCD) - - 64 - - 160 - - 256 - - 32 (4 per XCD) - - 32 - - 16 per 2 CUs - - 64 per 2 CUs - - 512 - - 12.5 - - 9 - - 5 - * - - MI350X - - CDNA4 - - gfx950 - - 288 - - 256 (32 per XCD) - - 64 - - 160 - - 256 - - 32 (4 per XCD) - - 32 - - 16 per 2 CUs - - 64 per 2 CUs - - 512 - - 12.5 - - 9 - - 5 - * - - MI325X - - CDNA3 - - gfx942 - - 256 - - 304 (38 per XCD) - - 64 - - 64 - - 256 - - 32 (4 per XCD) - - 32 - - 16 per 2 CUs - - 64 per 2 CUs - - 512 - - 12.5 - - 9 - - 4 - * - - MI300X - - CDNA3 - - gfx942 - - 192 - - 304 (38 per XCD) - - 64 - - 64 - - 256 - - 32 (4 per XCD) - - 32 - - 16 per 2 CUs - - 64 per 2 CUs - - 512 - - 12.5 - - 9 - - 4 - * - - MI300A - - CDNA3 - - gfx942 - - 128 - - 228 (38 per XCD) - - 64 - - 64 - - 256 - - 24 (4 per XCD) - - 32 - - 16 per 2 CUs - - 64 per 2 CUs - - 512 - - 12.5 - - 9 - - 4 - * - - MI250X - - CDNA2 - - gfx90a - - 128 - - 220 (110 per GCD) - - 64 - - 64 - - - - 16 (8 per GCD) - - 16 - - 16 per 2 CUs - - 32 per 2 CUs - - 512 - - 12.5 - - 9 - - 0 - * - - MI250 - - CDNA2 - - gfx90a - - 128 - - 208 (104 per GCD) - - 64 - - 64 - - - - 16 (8 per GCD) - - 16 - - 16 per 2 CUs - - 32 per 2 CUs - - 512 - - 12.5 - - 9 - - 0 - * - - MI210 - - CDNA2 - - gfx90a - - 64 - - 104 - - 64 - - 64 - - - - 8 - - 16 - - 16 per 2 CUs - - 32 per 2 CUs - - 512 - - 12.5 - - 9 - - 0 - * - - MI100 - - CDNA - - gfx908 - - 32 - - 120 - - 64 - - 64 - - - - 8 - - 16 - - 16 per 3 CUs - - 32 per 3 CUs - - 256 VGPR and 256 AccVGPR - - 12.5 - - 9 - - 0 - * - - MI60 - - GCN5.1 - - gfx906 - - 32 - - 64 - - 64 - - 64 - - - - 4 - - 16 - - 16 per 3 CUs - - 32 per 3 CUs - - 256 - - 12.5 - - 9 - - 0 - * - - MI50 (32GB) - - GCN5.1 - - gfx906 - - 32 - - 60 - - 64 - - 64 - - - - 4 - - 16 - - 16 per 3 CUs - - 32 per 3 CUs - - 256 - - 12.5 - - 9 - - 0 - * - - MI50 (16GB) - - GCN5.1 - - gfx906 - - 16 - - 60 - - 64 - - 64 - - - - 4 - - 16 - - 16 per 3 CUs - - 32 per 3 CUs - - 256 - - 12.5 - - 9 - - 0 - * - - MI25 - - GCN5.0 - - gfx900 - - 16  - - 64 - - 64 - - 64  - - - - 4  - - 16  - - 16 per 3 CUs - - 32 per 3 CUs - - 256 - - 12.5 - - 9 - - 0 - * - - MI8 - - GCN3.0 - - gfx803 - - 4 - - 64 - - 64 - - 64 - - - - 2 - - 16 - - 16 per 4 CUs - - 32 per 4 CUs - - 256 - - 12.5 - - 8 - - 0 - * - - MI6 - - GCN4.0 - - gfx803 - - 16 - - 36 - - 64 - - 64 - - - - 2 - - 16 - - 16 per 4 CUs - - 32 per 4 CUs - - 256 - - 12.5 - - 8 - - 0 - - .. tab-item:: AMD Radeon PRO GPUs - - .. 
list-table:: - :header-rows: 1 - :name: radeon-pro-arch-spec-table - - * - - Model - - Architecture - - LLVM target name - - - VRAM (GiB) - - Compute Units - - Wavefront Size - - LDS (KiB) - - Infinity Cache (MiB) - - L2 Cache (MiB) - - Graphics L1 Cache (KiB) - - L0 Vector Cache (KiB) - - L0 Scalar Cache (KiB) - - L0 Instruction Cache (KiB) - - VGPR File (KiB) - - SGPR File (KiB) - - GFXIP Major version - - GFXIP Minor version - * - - Radeon AI PRO R9700 - - RDNA4 - - gfx1201 - - 32 - - 64 - - 32 or 64 - - 128 - - 64 - - 8 - - N/A - - 32 - - 16 - - 32 - - 768 - - 32 - - 12 - - 0 - * - - Radeon PRO V710 - - RDNA3 - - gfx1101 - - 28 - - 54 - - 32 or 64 - - 128 - - 56 - - 4 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon PRO W7900 Dual Slot - - RDNA3 - - gfx1100 - - 48 - - 96 - - 32 or 64 - - 128 - - 96 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon PRO W7900 - - RDNA3 - - gfx1100 - - 48 - - 96 - - 32 or 64 - - 128 - - 96 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon PRO W7800 48GB - - RDNA3 - - gfx1100 - - 48 - - 70 - - 32 or 64 - - 128 - - 96 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon PRO W7800 - - RDNA3 - - gfx1100 - - 32 - - 70 - - 32 or 64 - - 128 - - 64 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon PRO W7700 - - RDNA3 - - gfx1101 - - 16 - - 48 - - 32 or 64 - - 128 - - 64 - - 4 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon PRO W6800 - - RDNA2 - - gfx1030 - - 32 - - 60 - - 32 or 64 - - 128 - - 128 - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon PRO W6600 - - RDNA2 - - gfx1032 - - 8 - - 28 - - 32 or 64 - - 128 - - 32 - - 2 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon PRO V620 - - RDNA2 - - gfx1030 - - 32 - - 72 - - 32 or 64 - - 128 - - 128 - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon Pro W5500 - - RDNA - - gfx1012 - - 8 - - 22 - - 32 or 64 - - 128 - - - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 20 - - 10 - - 1 - * - - Radeon Pro VII - - GCN5.1 - - gfx906 - - 16 - - 60 - - 64 - - 64 - - - - 4 - - - - 16 - - 16 per 3 CUs - - 32 per 3 CUs - - 256 - - 12.5 - - 9 - - 0 - - .. tab-item:: AMD Radeon GPUs - - .. 
list-table:: - :header-rows: 1 - :name: radeon-arch-spec-table - - * - - Model - - Architecture - - LLVM target name - - VRAM (GiB) - - Compute Units - - Wavefront Size - - LDS (KiB) - - Infinity Cache (MiB) - - L2 Cache (MiB) - - Graphics L1 Cache (KiB) - - L0 Vector Cache (KiB) - - L0 Scalar Cache (KiB) - - L0 Instruction Cache (KiB) - - VGPR File (KiB) - - SGPR File (KiB) - - GFXIP Major version - - GFXIP Minor version - * - - Radeon RX 9070 XT - - RDNA4 - - gfx1201 - - 16 - - 64 - - 32 or 64 - - 128 - - 64 - - 8 - - N/A - - 32 - - 16 - - 32 - - 768 - - 32 - - 12 - - 0 - * - - Radeon RX 9070 GRE - - RDNA4 - - gfx1201 - - 16 - - 48 - - 32 or 64 - - 128 - - 48 - - 6 - - N/A - - 32 - - 16 - - 32 - - 768 - - 32 - - 12 - - 0 - * - - Radeon RX 9070 - - RDNA4 - - gfx1201 - - 16 - - 56 - - 32 or 64 - - 128 - - 64 - - 8 - - N/A - - 32 - - 16 - - 32 - - 768 - - 32 - - 12 - - 0 - * - - Radeon RX 9060 XT - - RDNA4 - - gfx1200 - - 16 - - 32 - - 32 or 64 - - 128 - - 32 - - 4 - - N/A - - 32 - - 16 - - 32 - - 768 - - 32 - - 12 - - 0 - * - - Radeon RX 9060 - - RDNA4 - - gfx1200 - - 8 - - 28 - - 32 or 64 - - 128 - - 32 - - 4 - - N/A - - 32 - - 16 - - 32 - - 768 - - 32 - - 12 - - 0 - * - - Radeon RX 7900 XTX - - RDNA3 - - gfx1100 - - 24 - - 96 - - 32 or 64 - - 128 - - 96 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon RX 7900 XT - - RDNA3 - - gfx1100 - - 20 - - 84 - - 32 or 64 - - 128 - - 80 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon RX 7900 GRE - - RDNA3 - - gfx1100 - - 16 - - 80 - - 32 or 64 - - 128 - - 64 - - 6 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon RX 7800 XT - - RDNA3 - - gfx1101 - - 16 - - 60 - - 32 or 64 - - 128 - - 64 - - 4 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon RX 7700 XT - - RDNA3 - - gfx1101 - - 12 - - 54 - - 32 or 64 - - 128 - - 48 - - 4 - - 256 - - 32 - - 16 - - 32 - - 768 - - 32 - - 11 - - 0 - * - - Radeon RX 7600 - - RDNA3 - - gfx1102 - - 8 - - 32 - - 32 or 64 - - 128 - - 32 - - 2 - - 256 - - 32 - - 16 - - 32 - - 512 - - 32 - - 11 - - 0 - * - - Radeon RX 6950 XT - - RDNA2 - - gfx1030 - - 16 - - 80 - - 32 or 64 - - 128 - - 128 - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6900 XT - - RDNA2 - - gfx1030 - - 16 - - 80 - - 32 or 64 - - 128 - - 128 - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6800 XT - - RDNA2 - - gfx1030 - - 16 - - 72 - - 32 or 64 - - 128 - - 128 - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6800 - - RDNA2 - - gfx1030 - - 16 - - 60 - - 32 or 64 - - 128 - - 128 - - 4 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6750 XT - - RDNA2 - - gfx1031 - - 12 - - 40 - - 32 or 64 - - 128 - - 96 - - 3 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6700 XT - - RDNA2 - - gfx1031 - - 12 - - 40 - - 32 or 64 - - 128 - - 96 - - 3 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6700 - - RDNA2 - - gfx1031 - - 10 - - 36 - - 32 or 64 - - 128 - - 80 - - 3 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6650 XT - - RDNA2 - - gfx1032 - - 8 - - 32 - - 32 or 64 - - 128 - - 32 - - 2 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6600 XT - - RDNA2 - - gfx1032 - - 8 - - 32 - - 32 or 64 - - 128 - - 32 - - 2 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon RX 6600 - - RDNA2 - - gfx1032 - - 8 - - 28 - - 32 or 64 - - 128 - 
- 32 - - 2 - - 128 - - 16 - - 16 - - 32 - - 512 - - 32 - - 10 - - 3 - * - - Radeon VII - - GCN5.1 - - gfx906 - - 16 - - 60 - - 64 - - 64 per CU - - - - 4 - - - - 16 - - 16 per 3 CUs - - 32 per 3 CUs - - 256 - - 12.5 - - 9 - - 0 - -Glossary -======== - -For more information about the terms used, see the -:ref:`specific documents and guides `, or -:doc:`Understanding the HIP programming model`. - -**LLVM target name** - -Argument to pass to clang in ``--offload-arch`` to compile code for the given -architecture. - -**VRAM** - -Amount of memory available on the GPU. - -**Compute Units** - -Number of compute units on the GPU. - -**Wavefront Size** - -Amount of work items that execute in parallel on a single compute unit. This -is equivalent to the warp size in HIP. - -**LDS** - -The Local Data Share (LDS) is a low-latency, high-bandwidth scratch pad -memory. It is local to the compute units, and can be shared by all work items -in a work group. In HIP, the LDS can be used for shared memory, which is -shared by all threads in a block. - -**L3 Cache (CDNA/GCN only)** - -Size of the level 3 cache. Shared by all compute units on the same GPU. Caches -data and instructions. Similar to the Infinity Cache on RDNA architectures. - -**Infinity Cache (RDNA only)** - -Size of the infinity cache. Shared by all compute units on the same GPU. Caches -data and instructions. Similar to the L3 Cache on CDNA/GCN architectures. - -**L2 Cache** - -Size of the level 2 cache. Shared by all compute units on the same GCD. Caches -data and instructions. - -**Graphics L1 Cache (RDNA only)** - -An additional cache level that only exists in RDNA architectures. Local to a -shader array. - -**L1 Vector Cache (CDNA/GCN only)** - -Size of the level 1 vector data cache. Local to a compute unit. This is the L0 -vector cache in RDNA architectures. - -**L1 Scalar Cache (CDNA/GCN only)** - -Size of the level 1 scalar data cache. Usually shared by several compute -units. This is the L0 scalar cache in RDNA architectures. - -**L1 Instruction Cache (CDNA/GCN only)** - -Size of the level 1 instruction cache. Usually shared by several compute -units. This is the L0 instruction cache in RDNA architectures. - -**L0 Vector Cache (RDNA only)** - -Size of the level 0 vector data cache. Local to a compute unit. This is the L1 -vector cache in CDNA/GCN architectures. - -**L0 Scalar Cache (RDNA only)** - -Size of the level 0 scalar data cache. Usually shared by several compute -units. This is the L1 scalar cache in CDNA/GCN architectures. - -**L0 Instruction Cache (RDNA only)** - -Size of the level 0 instruction cache. Usually shared by several compute -units. This is the L1 instruction cache in CDNA/GCN architectures. - -**VGPR File** - -Size of the Vector General Purpose Register (VGPR) file and. It holds data used in -vector instructions. -GPUs with matrix cores also have AccVGPRs, which are Accumulation General -Purpose Vector Registers, used specifically in matrix instructions. - -**SGPR File** - -Size of the Scalar General Purpose Register (SGPR) file. Holds data used in -scalar instructions. - -**GFXIP** - -GFXIP (Graphics IP) is a versioning system used by AMD to identify the GPU -architecture and its instruction set. It helps categorize different generations -of GPUs and their feature sets. - -**GFXIP major version** - -Defines the GPU's core instruction set and architecture, which determines -compatibility with software stacks such as HIP and OpenCL. 
For example, a GFXIP -11 major version corresponds to the RDNA 3 (Navi 3x) architecture, influencing -driver support and available compute features. - -**GFXIP minor version** - -Represents specific variations within a GFXIP major version and affects feature sets, -optimizations, and driver behavior in software stacks such as HIP and OpenCL. Different -GPU models within the same major version can have unique capabilities, impacting -performance and supported instructions. - -**GCD** - -Graphics Compute Die. - -**XCD** - -Accelerator Complex Die. diff --git a/docs/reference/gpu-atomics-operation.rst b/docs/reference/gpu-atomics-operation.rst deleted file mode 100644 index faa9a8320..000000000 --- a/docs/reference/gpu-atomics-operation.rst +++ /dev/null @@ -1,801 +0,0 @@ -.. meta:: - :description: AMD Instinct accelerator, AMD Radeon PRO, and AMD Radeon GPU - atomics operations information - :keywords: Atomics operations, atomic bitwise functions, atomics add, atomics - subtraction, atomics exchange, atomics min, atomics max - -.. _hw_atomics_operation_support: - -Hardware atomics operation support -================================================================================ - -:ref:`Atomic operations ` guarantee that the operation is -completed as an indivisible unit, preventing race conditions where simultaneous -access to the same memory location could lead to incorrect or undefined -behavior. - -This topic summarizes the support of atomic read-modify-write -(atomicRMW) operations on AMD GPUs and accelerators. This includes gfx9, gfx10, -gfx11, and gfx12 targets and the following AMD Instinct™ series: - -- MI100 - -- MI200 - -- MI300 - -- MI350 - -Atomic operation behavior is affected by the memory location, memory -granularity, and scope of the operation. - -Memory locations: - -- :ref:`Device memory `, that is, VRAM, the RAM on a discrete - GPU device or in framebuffer carveout for APUs. This includes peer-device - memory within an Infinity Fabric™ hive. - -- :ref:`Host memory `: in DRAM associated with the CPU (or - peer device memory using PCIe® (PCI Express) peer-to-peer). This can be one of two sub-types: - - - Migratable memory: memory that is currently residing in host DRAM, but - which can be migrated back to device memory. For example, - ``hipMallocManaged()`` or :ref:`unified memory ` - allocations. - - - :ref:`Pinned memory `: memory that is in host memory - and cannot be migrated to the device (not necessarily pinned to a particular - physical address, but can't be moved to device memory). ``hipHostMalloc()``, - for example. - -Memory granularity or :ref:`coherence `: - -- Coarse-grained memory - - - This memory can be used for device-scope synchronization during the - execution of a single GPU kernel. Any system-scope atomics sent to this type - of memory will not achieve system-scope coherency and will instead be - downgraded to device-scope as per the programming model. - - - This type of memory is only available on AMD GPUs. - -- Fine-grained memory - - - This memory can be used for device and system-scope synchronization during - the execution of a single GPU kernel. - -Scopes of operations: - -- Device-scope or agent-scope - - - This atomic should happen atomically from the point of view of every thread - within the device that the atomic-executing thread is in. - -- System-scope - - - This atomic should happen atomically from the point of view of every thread - in all devices and in the CPUs.
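In HIP source code, the scope is selected at the call site: the default atomic
builtins (for example, ``atomicAdd``) act at device (agent) scope, while the
``*_system`` variants request system scope. The following minimal sketch is an
illustration only; the kernel and pointer names are hypothetical, and it assumes
both counters reside in memory that is visible to every observer the chosen
scope is meant to cover.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void count_events(unsigned int* device_scope_counter,
                                unsigned int* system_scope_counter)
   {
       // Device (agent) scope: atomic with respect to all threads on this GPU.
       atomicAdd(device_scope_counter, 1u);

       // System scope: atomic with respect to all GPUs and the CPUs, provided
       // the memory granularity and interconnect support it (see below).
       atomicAdd_system(system_scope_counter, 1u);
   }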
- -Support summary -================================================================================ - -AMD Instinct accelerators --------------------------------------------------------------------------------- - -**MI300 and MI350 series** - -- All atomicRMW operations are forwarded out to the Infinity Fabric. -- Infinity Fabric supports common integer and bitwise atomics, FP32 atomic add, - packed-FP16 atomic add, packed-BF16 atomic add, and FP64 add, min, and max. -- In discrete GPUs (dGPUs), if the data is stored in host memory, the atomic - will be forwarded from the Infinity Fabric to PCIe. -- If the PCIe bus does not support the requested atomic, the GPU's PCIe - controller changes it into a load-op-store sequence. All waves on the chip - submitting atomics to that address will stall waiting for the load-op-store. - It will seem like atomics to the wave, but the CPU sees it as a non-atomic - load-op-store sequence. This downgrades system-scope atomics to device-scope. - -**MI200 series** - -- L2 cache and Infinity Fabric both support common integer and bitwise atomics. -- L2 cache supports FP32 atomic add, packed-FP16 atomic add, and FP64 add, - min, and max. -- The Infinity Fabric does not support FP32 atomic add, packed-FP16 atomic add, - and FP64 add, min, and max atomics and these commands cannot be sent to the - Infinity Fabric. -- Coarse-grained memory is marked as cacheable, and atomic operations will be - processed in the L2 cache. -- Fine-grained memory is marked write-uncacheable through the page tables. -- Atomics that hit write-uncached memory are forwarded to the Infinity Fabric. -- If the uncached data is stored in host memory on a PCIe system, the atomic - will be forwarded from Infinity Fabric to PCIe. Any atomic not supported by - the PCIe bus will be a NOP and give incorrect result. -- If the uncached data is stored in host memory on an A+A system (system with - AMD CPU and AMD GPU connected via Infinity Fabric), the atomic operation will - be forwarded to the remote location and will succeed if supported by Infinity - Fabric. -- If the float atomics access write-uncached memory, they cannot be forwarded to - the Infinity Fabric, resulting in a NOP and an incorrect outcome. - -**MI100** - -- L2 cache and Infinity Fabric both support common integer and bitwise atomics. -- L2 cache supports no returns (NoReturn) versions of packed-FP16 and FP32 - atomic adds, that cannot return data. -- The Infinity Fabric does not support packed-FP16 or FP32 atomic adds, - preventing these commands from being transmitted through it. -- Coarse-grained memory is marked as cacheable, and atomic operations will be - processed in the L2 cache. -- Fine-grained memory is marked uncacheable through the page tables. -- Atomics that hit uncached memory are forwarded to the Infinity Fabric. -- If the uncached data is stored in host memory, the atomic will be forwarded - from Infinity Fabric to PCIe. Any atomic not supported by the PCIe bus will - be a NOP and give incorrect result. -- If an float atomic add hits uncached memory, it cannot be forwarded to the - Infinity Fabric so it will NOP and give incorrect result. - -AMD gfx generic targets --------------------------------------------------------------------------------- - -**gfx9** - -- L2 cache and Infinity Fabric both support common integer and bitwise atomics. -- Coarse-grained memory is marked as cacheable, and atomic operations will be - processed in the L2 cache. -- Fine-grained memory is marked uncacheable through the page tables. 
-- Atomics that hit uncached memory are forwarded to the Infinity Fabric. -- In a dGPU: if the uncached data is stored in host memory, the atomic will be - forwarded from Infinity Fabric to PCIe. Any atomic not supported by the PCIe - bus will be a NOP and give incorrect result. - -**gfx10** - -- L2 cache and Infinity Fabric both support common integer and bitwise atomics. -- Coarse-grained memory is marked as cacheable, and atomic operations will be - processed in the L2 cache. -- Fine-grained memory is marked uncacheable through the page tables. -- Atomics that hit uncached memory are forwarded to the Infinity Fabric. -- In a dGPU: if the uncached data is stored in host memory, the atomic will be - forwarded from Infinity Fabric to PCIe. Any atomic not supported by the PCIe - bus will be a NOP and give incorrect result. -- Supports floating-point atomic min/max. -- The Infinity Fabric does not support floating-point atomic min/max atomics - and these commands cannot be sent to the Infinity Fabric. -- If the floating-point atomics hit uncached memory, they cannot be forwarded to - the Infinity Fabric, so they will NOP and give incorrect result. - -**gfx11** - -- L2 cache and Infinity Fabric both support common integer and bitwise atomics. -- L2 cache supports FP32 atomic add, min and max. -- The Infinity Fabric does not support FP32 atomic add, min and max atomics and - these commands cannot be sent to the Infinity Fabric. -- Coarse-grained memory is marked as cacheable, and atomic operations will be - processed in the L2 cache. -- Fine-grained memory is marked uncacheable through the page tables. -- Atomics that hit write-uncached memory are forwarded to the Infinity Fabric. -- In a dGPU: if the uncached data is stored in host memory, the atomic will be - forwarded from Infinity Fabric to PCIe. Any atomic not supported by the PCIe - bus will be a NOP and give incorrect result. -- If the float atomics hit uncached memory, they cannot be forwarded to the - Infinity Fabric, so they will NOP and give incorrect result. - -**gfx12** - -- L2 cache and Infinity Fabric both support common integer and bitwise atomics. - -- L2 cache and Infinity Fabric both also support FP32 atomic add, min and max, - and packed-FP16 atomic add, and packed-BF16 atomic add. - -- Coarse-grained memory is marked as cacheable, and atomic operations will be - processed in the L2 cache. - -- Fine-grained device memory is marked uncacheable through the page tables. - - - Atomics that hit write-uncached memory are forwarded to the Infinity Fabric. - -- Fine-grained system memory is marked as cacheable through the page tables. - - - Device-scope atomic operations will be processed in the L2 cache. - - - System-scope atomic operations will bypass the L2 cache and be forwarded to - the Infinity Fabric. - -- Atomics that hit write-uncached memory are forwarded to the Infinity Fabric. - -- In dGPUs, if the data is stored in host memory, the atomic will be forwarded - from the Infinity Fabric to PCIe. - -- If the PCIe bus does not support the requested atomic, the GPU's PCIe - controller changes it into a load-op-store sequence. All waves on the chip - submitting atomics to that address will stall waiting for the load-op-store. - It will seem like atomics to the wave, but the CPU sees it as a non-atomic - load-op-store sequence. This downgrades system-scope atomics to device-scope.
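Because the path an atomic takes depends on how the destination memory was
allocated, the allocation call is often the deciding factor. The sketch below is
a hedged illustration rather than part of the original guide: it assumes a
discrete GPU, uses hypothetical names, and contrasts coarse-grained device
memory from ``hipMalloc`` with fine-grained pinned host memory from
``hipHostMalloc``. Per the summaries above, the float ``atomicAdd`` to the
device buffer is serviced in the L2 cache, while the same operation on the
pinned host buffer is routed toward the Infinity Fabric and PCIe, where it may
be scope-downgraded or become a NOP on targets without FP32 atomic add support
on that path.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void accumulate(float* sum) { atomicAdd(sum, 1.0f); }

   int main()
   {
       float* device_sum = nullptr;  // coarse-grained device memory (VRAM)
       float* pinned_sum = nullptr;  // fine-grained, pinned host memory

       hipMalloc(&device_sum, sizeof(float));
       hipHostMalloc(&pinned_sum, sizeof(float), hipHostMallocCoherent);

       // Coarse-grained VRAM: handled by the L2 cache on the targets above.
       hipLaunchKernelGGL(accumulate, dim3(1), dim3(64), 0, 0, device_sum);

       // Fine-grained host memory: forwarded toward Infinity Fabric / PCIe,
       // where unsupported float atomics may be downgraded or dropped.
       hipLaunchKernelGGL(accumulate, dim3(1), dim3(64), 0, 0, pinned_sum);

       hipDeviceSynchronize();
       hipHostFree(pinned_sum);
       hipFree(device_sum);
       return 0;
   }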
- -GPUs atomics support -================================================================================ - -This section presents a series of tables that show the level of atomic -operations support for the different hardware devices described above, and -different datatypes, different operations and different scopes. - -Hardware atomics support refers to the ability of GPUs to natively perform -atomic operations—special low-level operations that ensure data consistency when -multiple threads access and modify memory concurrently. - -CAS (Compare-and-Swap) atomic support refers to the hardware or software -capability to perform an atomic Compare-and-Swap operation. - -PCIe atomics are a feature of the PCIe interface that enable -atomic operations between devices and hosts across the PCIe bus. For further -information, please check `How ROCm uses PCIe atomics `_. - -The tables that follow show the correctness of atomics operations on the -hardware using the following notations: - -- ✅: Produces the correct answer. - -- ⚠️: Produces the correct answer, but works only at a weaker scope. - -- ❌: The atomics operation fails. - -The tables show the different types of atomic operations used by specific -devices: - -- Native: Computes the correct result using a hardware-native atomic - instruction. - -- CAS: Generates the correct result, but the atomic operation is implemented by - the compiler for this ISA using a compare-and-swap emulation loop. - -- ✅ NoReturn: Produces the correct correct result but does not precisely - conform to the atomic API. - -- Scope Downgrade: Generates the correct result but operates at a weaker scope - than requested. For example, if a user specifies a system-scope atomic, the - operation may only function at the device scope. - -- NOP: The atomic operation is not executed on the target location, and the - requesting thread receives back 0 as a return value. - -- n/a: The atomic type is not supported and cannot be executed on the specific - hardware. - -The tables selectors or options are the following: - -- Highest level option: - - - "HW atomics", where software attempts to use hardware atomics. - - - "CAS emulation", where software attempts to use CAS emulation. - -- Second-level option: - - - "No PCIe atomics" means the system does not support PCIe atomics between - the accelerator and peer/host-memory. - - - "PCIe atomics" means the system supports PCIe atomics between the - accelerator and peer/host-memory. - -- The third-level option is the memory granularity of the memory target. - -- The final option is the scope of atomic access. - -Integer atomics operations --------------------------------------------------------------------------------- - -The integer type atomic operations that are supported by different hardware. - -- 32 bit integer - - - Add - - - Subtract - - - Min - - - Max - - - IncDec - -- 64 bit integer - - - Add - - - Min - - - Max - -AMD Instinct accelerators -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The integer type atomic operations that are supported by different AMD -Instinct accelerators listed in the following table. - -.. - -.. The relative path not working in datatemplate, that's why we also need the absolute path of docs folder. - -.. datatemplate:nodata:: - - {% set ns = namespace(offset=2, previous_csv='') %} - - .. tab-set:: - {% for (atomics_type_text, atomics_type_key) in config.html_context['atomics_type'] %} - .. tab-item:: {{ atomics_type_text }} - :sync: {{ atomics_type_key }} - - .. 
tab-set:: - {% for (pcie_type_text, pcie_type_key) in config.html_context['pcie_type'] %} - .. tab-item:: {{ pcie_type_text }} - :sync: {{ pcie_type_key }} - - .. tab-set:: - {% for (memory_type_text, memory_type_key) in config.html_context['memory_type'] %} - .. tab-item:: {{ memory_type_text }} - :sync: {{ memory_type_key }} - - .. tab-set:: - {% for (granularity_type_text, granularity_type_key) in config.html_context['granularity_type'] %} - .. tab-item:: {{ granularity_type_text }} - :sync: {{ granularity_type_key }} - - .. tab-set:: - {% for (scope_type_text, scope_type_key) in config.html_context['scope_type'] %} - .. tab-item:: {{ scope_type_text }} - :sync: {{ scope_type_key }} - - {# Build the CSV file path for this branch #} - {% set current_csv = "data/reference/gpu-atomics-operation/" - ~ atomics_type_key ~ "_" ~ pcie_type_key ~ "_instinct.csv" %} - {# If we have a new CSV file, reset the offset #} - {% if current_csv != ns.previous_csv %} - {% set ns.offset = 2 %} - {% endif %} - {% set ns.previous_csv = current_csv %} - - {# Compute the row numbers for this leaf #} - {% set start = ns.offset %} - {% set end = ns.offset + 8 %} - - .. csv-to-list-table:: - :file: {{ current_csv }} - :rows: {{ start }}-{{ end }} - - {# Update the offset: block (9 rows) plus gap (18 rows) #} - {% set ns.offset = ns.offset + 9 + 18 %} - - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - -.. - -AMD gfx generic targets -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The integer type atomic operations that are supported by different gfx generic -targets listed in the following table. - -.. - -.. The relative path not working in datatemplate, that's why we also need the absolute path of docs folder. - -.. datatemplate:nodata:: - - {% set ns = namespace(offset=2, previous_csv='') %} - - .. tab-set:: - {% for (atomics_type_text, atomics_type_key) in config.html_context['atomics_type'] %} - .. tab-item:: {{ atomics_type_text }} - :sync: {{ atomics_type_key }} - - .. tab-set:: - {% for (pcie_type_text, pcie_type_key) in config.html_context['pcie_type'] %} - .. tab-item:: {{ pcie_type_text }} - :sync: {{ pcie_type_key }} - - .. tab-set:: - {% for (memory_type_text, memory_type_key) in config.html_context['memory_type'] %} - .. tab-item:: {{ memory_type_text }} - :sync: {{ memory_type_key }} - - .. tab-set:: - {% for (granularity_type_text, granularity_type_key) in config.html_context['granularity_type'] %} - .. tab-item:: {{ granularity_type_text }} - :sync: {{ granularity_type_key }} - - .. tab-set:: - {% for (scope_type_text, scope_type_key) in config.html_context['scope_type'] %} - .. tab-item:: {{ scope_type_text }} - :sync: {{ scope_type_key }} - - {# Build the CSV file path for this branch #} - {% set current_csv = "data/reference/gpu-atomics-operation/" - ~ atomics_type_key ~ "_" ~ pcie_type_key ~ "_gfx.csv" %} - {# If we switch CSV files, reset the offset to 2 (to skip the header row) #} - {% if current_csv != ns.previous_csv %} - {% set ns.offset = 2 %} - {% endif %} - {% set ns.previous_csv = current_csv %} - - {# Determine the increment based on atomics_type_key #} - {% if atomics_type_key == "hw-atomics" %} - {% set increment = 20 %} - {% elif atomics_type_key == "cas-atomics" %} - {% set increment = 18 %} - {% endif %} - - {# Compute start and end rows (end is inclusive) #} - {% set start = ns.offset %} - {% set end = ns.offset + 8 %} - - .. 
csv-to-list-table:: - :file: {{ current_csv }} - :rows: {{ start }}-{{ end }} - - {# Update the offset for the next table in this CSV #} - {% set ns.offset = ns.offset + 9 + increment %} - - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - -.. - -Bitwise atomics operations --------------------------------------------------------------------------------- - -The bitwise atomic operations that are supported by different hardware. - -- 32 bit bitwise - - - Exchange - - - Compare-and-Swap (CAS) - - - AND - - - OR - - - XOR - -- 64 bit bitwise - - - Exchange - - - CAS - - - AND - - - OR - - - XOR - - -.. note:: - - 128-bit bitwise Exchange and CAS are not supported on AMD GPUs - -AMD Instinct accelerators -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The bitwise atomic operations that are supported by different AMD Instinct -accelerators listed in the following table. - -.. - -.. The relative path not working in datatemplate, that's why we also need the absolute path of docs folder. - -.. datatemplate:nodata:: - - {% set ns = namespace(offset=19, previous_csv='') %} - - .. tab-set:: - {% for (atomics_type_text, atomics_type_key) in config.html_context['atomics_type'] %} - .. tab-item:: {{ atomics_type_text }} - :sync: {{ atomics_type_key }} - - .. tab-set:: - {% for (pcie_type_text, pcie_type_key) in config.html_context['pcie_type'] %} - .. tab-item:: {{ pcie_type_text }} - :sync: {{ pcie_type_key }} - - .. tab-set:: - {% for (memory_type_text, memory_type_key) in config.html_context['memory_type'] %} - .. tab-item:: {{ memory_type_text }} - :sync: {{ memory_type_key }} - - .. tab-set:: - {% for (granularity_type_text, granularity_type_key) in config.html_context['granularity_type'] %} - .. tab-item:: {{ granularity_type_text }} - :sync: {{ granularity_type_key }} - - .. tab-set:: - {% for (scope_type_text, scope_type_key) in config.html_context['scope_type'] %} - .. tab-item:: {{ scope_type_text }} - :sync: {{ scope_type_key }} - - {# Build the CSV file path for this branch #} - {% set current_csv = "data/reference/gpu-atomics-operation/" - ~ atomics_type_key ~ "_" ~ pcie_type_key ~ "_instinct.csv" %} - {# If we have a new CSV file, reset the offset #} - {% if current_csv != ns.previous_csv %} - {% set ns.offset = 19 %} - {% endif %} - {% set ns.previous_csv = current_csv %} - - {# Compute the row numbers for this leaf #} - {% set start = ns.offset %} - {% set end = ns.offset + 9 %} - - .. csv-to-list-table:: - :file: {{ current_csv }} - :rows: {{ start }}-{{ end }} - - {# Update the offset: block (10 rows) plus gap (17 rows) #} - {% set ns.offset = ns.offset + 10 + 17 %} - - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - -.. - -AMD gfx generic targets -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The bitwise atomic operations that are supported by different AMD gfx generic -targets listed in the following table. - -.. - -.. The relative path not working in datatemplate, that's why we also need the absolute path of docs folder. - -.. datatemplate:nodata:: - - {% set ns = namespace(offset=19, previous_csv='') %} - - .. tab-set:: - {% for (atomics_type_text, atomics_type_key) in config.html_context['atomics_type'] %} - .. tab-item:: {{ atomics_type_text }} - :sync: {{ atomics_type_key }} - - .. tab-set:: - {% for (pcie_type_text, pcie_type_key) in config.html_context['pcie_type'] %} - .. tab-item:: {{ pcie_type_text }} - :sync: {{ pcie_type_key }} - - .. 
tab-set:: - {% for (memory_type_text, memory_type_key) in config.html_context['memory_type'] %} - .. tab-item:: {{ memory_type_text }} - :sync: {{ memory_type_key }} - - .. tab-set:: - {% for (granularity_type_text, granularity_type_key) in config.html_context['granularity_type'] %} - .. tab-item:: {{ granularity_type_text }} - :sync: {{ granularity_type_key }} - - .. tab-set:: - {% for (scope_type_text, scope_type_key) in config.html_context['scope_type'] %} - .. tab-item:: {{ scope_type_text }} - :sync: {{ scope_type_key }} - - {# Build the CSV file path for this branch #} - {% set current_csv = "data/reference/gpu-atomics-operation/" - ~ atomics_type_key ~ "_" ~ pcie_type_key ~ "_gfx.csv" %} - {# If we switch CSV files, reset the offset to 2 (to skip the header row) #} - {% if current_csv != ns.previous_csv %} - {% set ns.offset = 19 %} - {% endif %} - {% set ns.previous_csv = current_csv %} - - {# Determine the increment based on atomics_type_key #} - {% if atomics_type_key == "hw-atomics" %} - {% set increment = 19 %} - {% elif atomics_type_key == "cas-atomics" %} - {% set increment = 17 %} - {% endif %} - - {# Compute start and end rows (end is inclusive) #} - {% set start = ns.offset %} - {% set end = ns.offset + 9 %} - - .. csv-to-list-table:: - :file: {{ current_csv }} - :rows: {{ start }}-{{ end }} - - {# Update the offset for the next table in this CSV #} - {% set ns.offset = ns.offset + 10 + increment %} - - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - -.. - -Float atomics operations --------------------------------------------------------------------------------- - -The float types atomic operations that are supported by different hardware. - -- 32-bit IEEE 754 floating point ('single precision', FP32) - - - Add - - - Min - - - Max - -- 64-bit IEEE 754 floating point ('double precision', FP64) - - - Add - - - Min - - - Max - -- 16-bit IEEE 754 floating point ('half precision", FP16) - - - Add - -- 2xPacked 16-bit IEEE 754 floating point ('half precision', FP16) - - - Add - -- BrainFloat-16 floating point (BF16) - - - Add - -- 2xPacked BrainFloat-16 floating point (BF16) - - - Add - -AMD Instinct accelerators -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The float type atomic operations that are supported by different AMD Instinct -accelerators listed in the following table. - -.. - -.. The relative path not working in datatemplate, that's why we also need the absolute path of docs folder. - -.. datatemplate:nodata:: - - {% set ns = namespace(offset=11, previous_csv='') %} - - .. tab-set:: - {% for (atomics_type_text, atomics_type_key) in config.html_context['atomics_type'] %} - .. tab-item:: {{ atomics_type_text }} - :sync: {{ atomics_type_key }} - - .. tab-set:: - {% for (pcie_type_text, pcie_type_key) in config.html_context['pcie_type'] %} - .. tab-item:: {{ pcie_type_text }} - :sync: {{ pcie_type_key }} - - .. tab-set:: - {% for (memory_type_text, memory_type_key) in config.html_context['memory_type'] %} - .. tab-item:: {{ memory_type_text }} - :sync: {{ memory_type_key }} - - .. tab-set:: - {% for (granularity_type_text, granularity_type_key) in config.html_context['granularity_type'] %} - .. tab-item:: {{ granularity_type_text }} - :sync: {{ granularity_type_key }} - - .. tab-set:: - {% for (scope_type_text, scope_type_key) in config.html_context['scope_type'] %} - .. 
tab-item:: {{ scope_type_text }} - :sync: {{ scope_type_key }} - - {# Build the CSV file path for this branch #} - {% set current_csv = "data/reference/gpu-atomics-operation/" - ~ atomics_type_key ~ "_" ~ pcie_type_key ~ "_instinct.csv" %} - {# If we have a new CSV file, reset the offset #} - {% if current_csv != ns.previous_csv %} - {% set ns.offset = 11 %} - {% endif %} - {% set ns.previous_csv = current_csv %} - - {# Compute the row numbers for this leaf #} - {% set start = ns.offset %} - {% set end = ns.offset + 7 %} - - .. csv-to-list-table:: - :file: {{ current_csv }} - :rows: {{ start }}-{{ end }} - - {# Update the offset: block (8 rows) plus gap (19 rows) #} - {% set ns.offset = ns.offset + 8 + 19 %} - - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - -.. - -AMD gfx generic targets -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The float types atomic operations that are supported by different AMD gfx -generic targets listed in the following table. - -.. - -.. The relative path not working in datatemplate, that's why we also need the absolute path of docs folder. - -.. datatemplate:nodata:: - - {% set ns = namespace(offset=11, previous_csv='') %} - - .. tab-set:: - {% for (atomics_type_text, atomics_type_key) in config.html_context['atomics_type'] %} - .. tab-item:: {{ atomics_type_text }} - :sync: {{ atomics_type_key }} - - .. tab-set:: - {% for (pcie_type_text, pcie_type_key) in config.html_context['pcie_type'] %} - .. tab-item:: {{ pcie_type_text }} - :sync: {{ pcie_type_key }} - - .. tab-set:: - {% for (memory_type_text, memory_type_key) in config.html_context['memory_type'] %} - .. tab-item:: {{ memory_type_text }} - :sync: {{ memory_type_key }} - - .. tab-set:: - {% for (granularity_type_text, granularity_type_key) in config.html_context['granularity_type'] %} - .. tab-item:: {{ granularity_type_text }} - :sync: {{ granularity_type_key }} - - .. tab-set:: - {% for (scope_type_text, scope_type_key) in config.html_context['scope_type'] %} - .. tab-item:: {{ scope_type_text }} - :sync: {{ scope_type_key }} - - {# Build the CSV file path for this branch #} - {% set current_csv = "data/reference/gpu-atomics-operation/" - ~ atomics_type_key ~ "_" ~ pcie_type_key ~ "_gfx.csv" %} - {# If we switch CSV files, reset the offset to 2 (to skip the header row) #} - {% if current_csv != ns.previous_csv %} - {% set ns.offset = 11 %} - {% endif %} - {% set ns.previous_csv = current_csv %} - - {# Determine the increment based on atomics_type_key #} - {% if atomics_type_key == "hw-atomics" %} - {% set increment = 21 %} - {% elif atomics_type_key == "cas-atomics" %} - {% set increment = 19 %} - {% endif %} - - {# Compute start and end rows (end is inclusive) #} - {% set start = ns.offset %} - {% set end = ns.offset + 7 %} - - .. csv-to-list-table:: - :file: {{ current_csv }} - :rows: {{ start }}-{{ end }} - - {# Update the offset for the next table in this CSV #} - {% set ns.offset = ns.offset + 8 + increment %} - - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - {% endfor %} - -.. \ No newline at end of file diff --git a/docs/reference/graph-safe-support.rst b/docs/reference/graph-safe-support.rst deleted file mode 100644 index 44283e732..000000000 --- a/docs/reference/graph-safe-support.rst +++ /dev/null @@ -1,111 +0,0 @@ -.. meta:: - :description: This page lists supported graph safe ROCm libraries. 
- :keywords: AMD, ROCm, HIP, hipGRAPH - -******************************************************************************** -Graph-safe support for ROCm libraries -******************************************************************************** - -HIP graph-safe libraries operate safely in HIP execution graphs. -:ref:`hip:how_to_HIP_graph` are an alternative way of executing tasks on a GPU -that can provide performance benefits over launching kernels using the standard -method via streams. - -Functions and routines from graph-safe libraries shouldn’t result in issues like -race conditions, deadlocks, or unintended dependencies. - -The following table shows whether a ROCm library is graph-safe. - -.. list-table:: - :header-rows: 1 - - * - - ROCm library - - Graph safe support - * - - `Composable Kernel `_ - - ❌ - * - - `hipBLAS `_ - - ✅ - * - - `hipBLASLt `_ - - ⚠️ - * - - `hipCUB `_ - - ✅ - * - - `hipFFT `_ - - ✅ (see :ref:`details `) - * - - `hipRAND `_ - - ✅ - * - - `hipSOLVER `_ - - ⚠️ (experimental) - * - - `hipSPARSE `_ - - ✅ - * - - `hipSPARSELt `_ - - ⚠️ (experimental) - * - - `hipTensor `_ - - ❌ - * - - `MIOpen `_ - - ❌ - * - - `RCCL `_ - - ✅ - * - - `rocAL `_ - - ❌ - * - - `rocALUTION `_ - - ❌ - * - - `rocBLAS `_ - - ✅ (see :doc:`details `) - * - - `rocDecode `_ - - ❌ - * - - `rocFFT `_ - - ✅ (see :ref:`details `) - * - - `rocHPCG `_ - - ❌ - * - - `rocJPEG `_ - - ❌ - * - - `rocPRIM `_ - - ✅ - * - - `rocRAND `_ - - ✅ - * - - `rocSOLVER `_ - - ⚠️ (experimental) - * - - `rocSPARSE `_ - - ⚠️ (experimental) - * - - `rocThrust `_ - - ❌ (see :doc:`details `) - * - - `rocWMMA `_ - - ❌ - * - - `RPP `_ - - ⚠️ - * - - `Tensile `_ - - ✅ - -✅: full support - -⚠️: partial support - -❌: not supported diff --git a/docs/reference/precision-support.rst b/docs/reference/precision-support.rst deleted file mode 100644 index 8ee81e4b3..000000000 --- a/docs/reference/precision-support.rst +++ /dev/null @@ -1,1207 +0,0 @@ -.. meta:: - :description: Supported data types of AMD GPUs and libraries in ROCm. - :keywords: precision, data types, HIP types, int8, float8, float8 (E4M3), - float8 (E5M2), bfloat8, float16, half, bfloat16, tensorfloat32, - float, float32, float64, double, AMD data types, HIP data types, - ROCm precision, ROCm data types - -************************************************************* -Data types and precision support -************************************************************* - -This topic summarizes the data types supported on AMD GPUs and accelerators and -ROCm libraries, along with corresponding :doc:`HIP ` data types. - -Integral types -============== - -The signed and unsigned integral types supported by ROCm are listed in -the following table. - -.. list-table:: - :header-rows: 1 - :widths: 15,35,50 - - * - - Type name - - HIP type - - Description - * - - int8 - - ``int8_t``, ``uint8_t`` - - A signed or unsigned 8-bit integer - * - - int16 - - ``int16_t``, ``uint16_t`` - - A signed or unsigned 16-bit integer - * - - int32 - - ``int32_t``, ``uint32_t`` - - A signed or unsigned 32-bit integer - * - - int64 - - ``int64_t``, ``uint64_t`` - - A signed or unsigned 64-bit integer - -.. _precision_support_floating_point_types: - -Floating-point types -==================== - -The floating-point types supported by ROCm are listed in the following table. - -.. image:: ../data/about/compatibility/floating-point-data-types.png - :alt: Supported floating-point types - -.. 
list-table:: - :header-rows: 1 - :widths: 15,25,60 - - * - - Type name - - HIP type - - Description - - * - - float4 (E2M1) - - | ``__hip_fp4_e2m1`` - - A 4-bit floating-point number with **E2M1** bit layout, as described - in :doc:`low precision floating point types page `. - - * - - float6 (E3M2) - - | ``__hip_fp6_e3m2`` - - A 6-bit floating-point number with **E3M2** bit layout, as described - in :doc:`low precision floating point types page `. - - * - - float6 (E2M3) - - | ``__hip_fp6_e2m3`` - - A 6-bit floating-point number with **E2M3** bit layout, as described - in :doc:`low precision floating point types page `. - - * - - float8 (E4M3) - - | ``__hip_fp8_e4m3_fnuz``, - | ``__hip_fp8_e4m3`` - - An 8-bit floating-point number with **E4M3** bit layout, as described in :doc:`low precision floating point types page `. - The FNUZ variant has expanded range with no infinity or signed zero (NaN represented as negative zero), - while the OCP variant follows the Open Compute Project specification. - - * - - float8 (E5M2) - - | ``__hip_fp8_e5m2_fnuz``, - | ``__hip_fp8_e5m2`` - - An 8-bit floating-point number with **E5M2** bit layout, as described in :doc:`low precision floating point types page `. - The FNUZ variant has expanded range with no infinity or signed zero (NaN represented as negative zero), - while the OCP variant follows the Open Compute Project specification. - - * - - float16 - - ``half`` - - A 16-bit floating-point number that conforms to the IEEE 754-2008 - half-precision storage format. - - * - - bfloat16 - - ``bfloat16`` - - A shortened 16-bit version of the IEEE 754 single-precision storage - format. - - * - - tensorfloat32 - - Not available - - A floating-point number that occupies 32 bits or less of storage, - providing improved range compared to half (16-bit) format, at - (potentially) greater throughput than single-precision (32-bit) formats. - - * - - float32 - - ``float`` - - A 32-bit floating-point number that conforms to the IEEE 754 - single-precision storage format. - - * - - float64 - - ``double`` - - A 64-bit floating-point number that conforms to the IEEE 754 - double-precision storage format. - -.. note:: - - * The float8 and tensorfloat32 types are internal types used in calculations - in Matrix Cores and can be stored in any type of the same size. - - * CDNA3 natively supports FP8 FNUZ (E4M3 and E5M2), which differs from the customized - FP8 format used with NVIDIA H100 - (`FP8 Formats for Deep Learning `_). - - * In some AMD documents and articles, float8 (E5M2) is referred to as bfloat8. - - * The :doc:`low precision floating point types page ` - describes how to use these types in HIP with examples. - -Level of support definitions -============================ - -In the following sections, icons represent the level of support. These icons, -described in the following table, are also used in the library data type support -pages. - -.. list-table:: - :header-rows: 1 - - * - - Icon - - Definition - - * - - NA - - Not applicable - - * - - ❌ - - Not supported - - * - - ⚠️ - - Partial support - - * - - ✅ - - Full support - -.. note:: - - * Full support means that the type is supported natively or with hardware - emulation. - - * Native support means that the operations for that type are implemented in - hardware. Types that are not natively supported are emulated with the - available hardware. The performance of non-natively supported types can - differ from the full instruction throughput rate. 
For example, 16-bit - integer operations can be performed on the 32-bit integer ALUs at full rate; - however, 64-bit integer operations might need several instructions on the - 32-bit integer ALUs. - - * Any type can be emulated by software, but this page does not cover such - cases. - -Data type support by hardware architecture -========================================== - -AMD's GPU lineup spans multiple architecture generations: - -* CDNA1 such as MI100 -* CDNA2 such as MI210, MI250, and MI250X -* CDNA3 such as MI300A, MI300X, and MI325X -* CDNA4 such as MI350X and MI355X -* RDNA2 such as PRO W6800 and PRO V620 -* RDNA3 such as RX 7900XT and RX 7900XTX -* RDNA4 such as RX 9070 and RX 9070XT - -HIP C++ type implementation support ------------------------------------ - -The HIP C++ types available on different hardware platforms are listed in the -following table. - -.. list-table:: - :header-rows: 1 - - * - - HIP C++ Type - - CDNA1 - - CDNA2 - - CDNA3 - - CDNA4 - - RDNA2 - - RDNA3 - - RDNA4 - - * - - ``int8_t``, ``uint8_t`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``int16_t``, ``uint16_t`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``int32_t``, ``uint32_t`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``int64_t``, ``uint64_t`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``__hip_fp4_e2m1`` - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ❌ - - * - - ``__hip_fp6_e2m3`` - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ❌ - - * - - ``__hip_fp6_e3m2`` - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ❌ - - * - - ``__hip_fp8_e4m3_fnuz`` - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - ``__hip_fp8_e5m2_fnuz`` - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - ``__hip_fp8_e4m3`` - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ✅ - - * - - ``__hip_fp8_e5m2`` - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - ❌ - - ✅ - - * - - ``half`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``bfloat16`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``float`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - ``double`` - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - -.. note:: - - Library support for specific data types is contingent upon hardware support. - Even if a ROCm library indicates support for a particular data type, that type - will only be fully functional if the underlying hardware architecture (as shown - in the table above) also supports it. For example, fp8 types are only available - on architectures shown with a checkmark in the relevant rows. - -Compute units support ---------------------- - -The following table lists data type support for compute units. - -.. tab-set:: - - .. tab-item:: Integral types - :sync: integral-type - - .. list-table:: - :header-rows: 1 - - * - - Type name - - int8 - - int16 - - int32 - - int64 - - * - - CDNA1 - - ✅ - - ✅ - - ✅ - - ✅ - - * - - CDNA2 - - ✅ - - ✅ - - ✅ - - ✅ - - * - - CDNA3 - - ✅ - - ✅ - - ✅ - - ✅ - - * - - CDNA4 - - ✅ - - ✅ - - ✅ - - ✅ - - * - - RDNA2 - - ✅ - - ✅ - - ✅ - - ✅ - - * - - RDNA3 - - ✅ - - ✅ - - ✅ - - ✅ - - * - - RDNA4 - - ✅ - - ✅ - - ✅ - - ✅ - - .. tab-item:: Low precision floating-point types - :sync: floating-point-type-low - - .. 
list-table:: - :header-rows: 1 - - * - - Type name - - float4 - - float6 (E2M3) - - float6 (E3M2) - - float8 (E4M3) - - float8 (E5M2) - - * - - CDNA1 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA2 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA3 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA4 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA2 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA3 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA4 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - .. tab-item:: High precision floating-point types - :sync: floating-point-type-high - - .. list-table:: - :header-rows: 1 - - * - - Type name - - float16 - - bfloat16 - - tensorfloat32 - - float32 - - float64 - - * - - CDNA1 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - CDNA2 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - CDNA3 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - CDNA4 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - RDNA2 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - RDNA3 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - RDNA4 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - -Matrix core support -------------------- - -The following table lists data type support for AMD GPU matrix cores. - -.. tab-set:: - - .. tab-item:: Integral types - :sync: integral-type - - .. list-table:: - :header-rows: 1 - - * - - Type name - - int8 - - int16 - - int32 - - int64 - - * - - CDNA1 - - ✅ - - ❌ - - ❌ - - ❌ - - * - - CDNA2 - - ✅ - - ❌ - - ❌ - - ❌ - - * - - CDNA3 - - ✅ - - ❌ - - ❌ - - ❌ - - * - - CDNA4 - - ✅ - - ❌ - - ❌ - - ❌ - - * - - RDNA2 - - ✅ - - ❌ - - ❌ - - ❌ - - * - - RDNA3 - - ✅ - - ❌ - - ❌ - - ❌ - - * - - RDNA4 - - ✅ - - ❌ - - ❌ - - ❌ - - .. tab-item:: Low precision floating-point types - :sync: floating-point-type-low - - .. list-table:: - :header-rows: 1 - - * - - Type name - - float4 - - float6 (E2M3) - - float6 (E3M2) - - float8 (E4M3) - - float8 (E5M2) - - * - - CDNA1 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA2 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA3 - - ❌ - - ❌ - - ❌ - - ✅ - - ✅ - - * - - CDNA4 - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - RDNA2 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA3 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA4 - - ❌ - - ❌ - - ❌ - - ✅ - - ✅ - - .. tab-item:: High precision floating-point types - :sync: floating-point-type-high - - .. list-table:: - :header-rows: 1 - - * - - Type name - - float16 - - bfloat16 - - tensorfloat32 - - float32 - - float64 - - * - - CDNA1 - - ✅ - - ✅ - - ❌ - - ✅ - - ❌ - - * - - CDNA2 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - CDNA3 - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - CDNA4 - - ✅ - - ✅ - - ✅ - - ✅ - - ✅ - - * - - RDNA2 - - ✅ - - ✅ - - ❌ - - ❌ - - ❌ - - * - - RDNA3 - - ✅ - - ✅ - - ❌ - - ❌ - - ❌ - - * - - RDNA4 - - ✅ - - ✅ - - ❌ - - ❌ - - ❌ - -Atomic operations support -------------------------- - -The following table lists which data types are supported for atomic -operations on AMD GPUs. The atomics operation type behavior is affected by the -memory locations, memory granularity, or scope of operations. For detailed -various support of atomic read-modify-write (atomicRMW) operations collected on -the :ref:`Hardware atomics operation support ` -page. - -.. tab-set:: - - .. tab-item:: Integral types - :sync: integral-type - - .. list-table:: - :header-rows: 1 - - * - - Type name - - int8 - - int16 - - int32 - - int64 - * - - CDNA1 - - ❌ - - ❌ - - ✅ - - ✅ - * - - CDNA2 - - ❌ - - ❌ - - ✅ - - ✅ - * - - CDNA3 - - ❌ - - ❌ - - ✅ - - ✅ - - * - - RDNA3 - - ❌ - - ❌ - - ✅ - - ✅ - - * - - RDNA4 - - ❌ - - ❌ - - ✅ - - ✅ - - .. tab-item:: Low precision floating-point types - :sync: floating-point-type-low - - .. 
list-table:: - :header-rows: 1 - - * - - Type name - - float4 - - float6 (E2M3) - - float6 (E3M2) - - float8 (E4M3) - - float8 (E5M2) - - * - - CDNA1 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA2 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA3 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - CDNA4 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA2 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA3 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - * - - RDNA4 - - ❌ - - ❌ - - ❌ - - ❌ - - ❌ - - .. tab-item:: High precision floating-point types - :sync: floating-point-type-high - - .. list-table:: - :header-rows: 1 - - * - - Type name - - 2 x float16 - - 2 x bfloat16 - - tensorfloat32 - - float32 - - float64 - - * - - CDNA1 - - ✅ - - ✅ - - ❌ - - ✅ - - ❌ - - * - - CDNA2 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - CDNA3 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - CDNA4 - - ✅ - - ✅ - - ❌ - - ✅ - - ✅ - - * - - RDNA2 - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - * - - RDNA3 - - ❌ - - ❌ - - ❌ - - ✅ - - ❌ - - * - - RDNA4 - - ✅ - - ✅ - - ❌ - - ✅ - - ❌ - -.. note:: - - You can emulate atomic operations using software for cases that are not - natively supported. Software-emulated atomic operations have a high negative - performance impact when they frequently access the same memory address. - -Data type support in ROCm libraries -=================================== - -ROCm library support for int8, float8 (E4M3), float8 (E5M2), int16, float16, -bfloat16, int32, tensorfloat32, float32, int64, and float64 is listed in the -following tables. - -Libraries input/output type support ------------------------------------ - -The following tables list ROCm library support for specific input and output -data types. Select a library from the below table to view the supported data -types. - -.. datatemplate:yaml:: /data/reference/precision-support/precision-support.yaml - - {% set library_groups = data.library_groups %} - - .. raw:: html - -
-      <!-- "Category" drop-down selector: one option per group, labelled {{ group.group }} -->
-      <!-- "Library" drop-down selector: one option per library in each group, labelled {{ library.name }} -->
diff --git a/docs/reference/rocm-tools.md b/docs/reference/rocm-tools.md deleted file mode 100644 index 6f3c1fbfd..000000000 --- a/docs/reference/rocm-tools.md +++ /dev/null @@ -1,72 +0,0 @@ - - - - - - -# ROCm tools, compilers, and runtimes - -::::{grid} 1 2 2 2 -:gutter: 3 -:class-container: rocm-doc-grid - -:::{grid-item-card} System Management -:class-body: rocm-card-banner rocm-hue-1 - -(system-tools)= - -* {doc}`AMD SMI ` -* {doc}`ROCm Data Center Tool ` -* {doc}`rocminfo ` -* {doc}`ROCm SMI ` -* {doc}`ROCm Validation Suite ` -::: - -:::{grid-item-card} Performance -:class-body: rocm-card-banner rocm-hue-6 - -(performance-tools)= - -* {doc}`ROCm Bandwidth Test ` -* {doc}`ROCm Compute Profiler ` -* {doc}`ROCm Systems Profiler ` -* {doc}`ROCProfiler ` -* {doc}`ROCprofiler-SDK ` -* {doc}`ROCTracer ` -::: - -:::{grid-item-card} Development -:class-body: rocm-card-banner rocm-hue-1 - -(development-tools)= - -* {doc}`ROCm CMake ` -* {doc}`HIPIFY ` -* {doc}`ROCdbgapi ` -* {doc}`ROCm Debugger (ROCgdb) ` -* {doc}`ROCr Debug Agent ` -::: - -:::{grid-item-card} Compilers -:class-body: rocm-card-banner rocm-hue-8 - -(compilers)= - -* {doc}`ROCm Compilers ` -* {doc}`HIPCC ` -* [FLANG](https://github.com/ROCm/flang/) -::: - -:::{grid-item-card} Runtimes -:class-body: rocm-card-banner rocm-hue-12 - -(runtimes)= - -* {doc}`AMD Compute Language Runtime (CLR) ` -* {doc}`HIP ` -* {doc}`ROCR-Runtime ` -::: - -:::: diff --git a/docs/release/versions.md b/docs/release/versions.md deleted file mode 100644 index c224b8c9d..000000000 --- a/docs/release/versions.md +++ /dev/null @@ -1,55 +0,0 @@ -:orphan: - - - - - - - -# ROCm release history - -| Version | Release date | -| ------- | ------------ | -| [7.0.2](https://rocm.docs.amd.com/en/docs-7.0.2/) | October 10, 2025 | -| [7.0.1](https://rocm.docs.amd.com/en/docs-7.0.1/) | September 17, 2025 | -| [7.0.0](https://rocm.docs.amd.com/en/docs-7.0.0/) | September 16, 2025 | -| [6.4.3](https://rocm.docs.amd.com/en/docs-6.4.3/) | August 7, 2025 | -| [6.4.2](https://rocm.docs.amd.com/en/docs-6.4.2/) | July 21, 2025 | -| [6.4.1](https://rocm.docs.amd.com/en/docs-6.4.1/) | May 21, 2025 | -| [6.4.0](https://rocm.docs.amd.com/en/docs-6.4.0/) | April 11, 2025 | -| [6.3.3](https://rocm.docs.amd.com/en/docs-6.3.3/) | February 19, 2025 | -| [6.3.2](https://rocm.docs.amd.com/en/docs-6.3.2/) | January 28, 2025 | -| [6.3.1](https://rocm.docs.amd.com/en/docs-6.3.1/) | December 20, 2024 | -| [6.3.0](https://rocm.docs.amd.com/en/docs-6.3.0/) | December 3, 2024 | -| [6.2.4](https://rocm.docs.amd.com/en/docs-6.2.4/) | November 6, 2024 | -| [6.2.2](https://rocm.docs.amd.com/en/docs-6.2.2/) | September 27, 2024 | -| [6.2.1](https://rocm.docs.amd.com/en/docs-6.2.1/) | September 20, 2024 | -| [6.2.0](https://rocm.docs.amd.com/en/docs-6.2.0/) | August 2, 2024 | -| [6.1.5](https://rocm.docs.amd.com/en/docs-6.1.5/) | March 13, 2025 | -| [6.1.2](https://rocm.docs.amd.com/en/docs-6.1.2/) | June 4, 2024 | -| [6.1.1](https://rocm.docs.amd.com/en/docs-6.1.1/) | May 8, 2024 | -| [6.1.0](https://rocm.docs.amd.com/en/docs-6.1.0/) | Apr 16, 2024 | -| [6.0.2](https://rocm.docs.amd.com/en/docs-6.0.2/) | Jan 31, 2024 | -| [6.0.0](https://rocm.docs.amd.com/en/docs-6.0.0/) | Dec 15, 2023 | -| [5.7.1](https://rocm.docs.amd.com/en/docs-5.7.1/) | Oct 13, 2023 | -| [5.7.0](https://rocm.docs.amd.com/en/docs-5.7.0/) | Sep 15, 2023 | -| [5.6.1](https://rocm.docs.amd.com/en/docs-5.6.1/) | Aug 29, 2023 | -| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 | -| 
[5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 | -| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 | -| [5.4.3](https://rocm.docs.amd.com/en/docs-5.4.3/) | Feb 7, 2023 | -| [5.4.2](https://rocm.docs.amd.com/en/docs-5.4.2/) | Jan 13, 2023 | -| [5.4.1](https://rocm.docs.amd.com/en/docs-5.4.1/) | Dec 15, 2022 | -| [5.4.0](https://rocm.docs.amd.com/en/docs-5.4.0/) | Nov 30, 2022 | -| [5.3.3](https://rocm.docs.amd.com/en/docs-5.3.3/) | Nov 17, 2022 | -| [5.3.2](https://rocm.docs.amd.com/en/docs-5.3.2/) | Nov 9, 2022 | -| [5.3.0](https://rocm.docs.amd.com/en/docs-5.3.0/) | Oct 4, 2022 | -| [5.2.3](https://rocm.docs.amd.com/en/docs-5.2.3/) | Aug 18, 2022 | -| [5.2.1](https://rocm.docs.amd.com/en/docs-5.2.1/) | Jul 21, 2022 | -| [5.2.0](https://rocm.docs.amd.com/en/docs-5.2.0/) | Jun 28, 2022 | -| [5.1.3](https://rocm.docs.amd.com/en/docs-5.1.3/) | May 20, 2022 | -| [5.1.1](https://rocm.docs.amd.com/en/docs-5.1.1/) | Apr 8, 2022 | -| [5.1.0](https://rocm.docs.amd.com/en/docs-5.1.0/) | Mar 30, 2022 | -| [5.0.2](https://rocm.docs.amd.com/en/docs-5.0.2/) | Mar 4, 2022 | -| [5.0.1](https://rocm.docs.amd.com/en/docs-5.0.1/) | Feb 16, 2022 | -| [5.0.0](https://rocm.docs.amd.com/en/docs-5.0.0/) | Feb 9, 2022 | diff --git a/docs/what-is-rocm.rst b/docs/what-is-rocm.rst deleted file mode 100644 index b8ade66c5..000000000 --- a/docs/what-is-rocm.rst +++ /dev/null @@ -1,159 +0,0 @@ - .. meta:: - :description: What is ROCm - :keywords: ROCm components, ROCm projects, introduction, ROCm, AMD, runtimes, compilers, tools, libraries, API - -*********************************************************** -What is ROCm? -*********************************************************** - -ROCm is a software stack, composed primarily of open-source software, that -provides the tools for programming AMD Graphics Processing Units (GPUs), from -low-level kernels to high-level end-user applications. - -.. image:: data/rocm-software-stack-7_0_0.jpg - :width: 800 - :alt: AMD's ROCm software stack and enabling technologies. - :align: center - -Specifically, ROCm provides the tools for -:doc:`HIP (Heterogeneous-computing Interface for Portability) `, -OpenCL and OpenMP. These include compilers, libraries for high-level -functions, debuggers, profilers and runtimes. - -ROCm components -=============================================== - -ROCm consists of the following components. For information on the license associated with each component, -see :doc:`ROCm licensing <./about/license>`. - -Libraries ------------------------------------------------ - -Machine Learning & Computer Vision -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
csv-table:: - :header: "Component", "Description" - - ":doc:`Composable Kernel `", "Provides a programming model for writing performance critical kernels for machine learning workloads across multiple architectures" - ":doc:`MIGraphX `", "Graph inference engine that accelerates machine learning model inference" - ":doc:`MIOpen `", "An open source deep-learning library" - ":doc:`MIVisionX `", "Set of comprehensive computer vision and machine learning libraries, utilities, and applications" - ":doc:`ROCm Performance Primitives (RPP) `", "Comprehensive high-performance computer vision library for AMD processors with HIP/OpenCL/CPU back-ends" - ":doc:`rocAL `", "An augmentation library designed to decode and process images and videos" - ":doc:`rocDecode `", "High-performance SDK for access to video decoding features on AMD GPUs" - ":doc:`rocJPEG `", "Library for decoding JPG images on AMD GPUs" - ":doc:`rocPyDecode `", "Provides access to rocDecode APIs in both Python and C/C++ languages" - -.. note:: - - `rocCV `_ is an efficient GPU-accelerated library for image pre- and post-processing. rocCV is in an early access state. Using it on production workloads is not recommended. - -Communication -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Component", "Description" - - ":doc:`RCCL `", "Standalone library that provides multi-GPU and multi-node collective communication primitives" - ":doc:`rocSHMEM `", "An intra-kernel networking library that provides GPU-centric networking through an OpenSHMEM-like interface" - -Math -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Component", "Description" - - "`half `_", "C++ header-only library that provides an IEEE 754 conformant, 16-bit half-precision floating-point type, along with corresponding arithmetic operators, type conversions, and common mathematical functions" - ":doc:`hipBLAS `", "BLAS-marshaling library that supports :doc:`rocBLAS ` and cuBLAS backends" - ":doc:`hipBLASLt `", "Provides general matrix-matrix operations with a flexible API and extends functionalities beyond traditional BLAS library" - ":doc:`hipFFT `", "Fast Fourier transforms (FFT)-marshalling library that supports rocFFT or cuFFT backends" - ":doc:`hipfort `", "Fortran interface library for accessing GPU Kernels" - ":doc:`hipRAND `", "Ports CUDA applications that use the cuRAND library into the HIP layer" - ":doc:`hipSOLVER `", "An LAPACK-marshalling library that supports :doc:`rocSOLVER ` and cuSOLVER backends" - ":doc:`hipSPARSE `", "SPARSE-marshalling library that supports :doc:`rocSPARSE ` and cuSPARSE backends" - ":doc:`hipSPARSELt `", "SPARSE-marshalling library with multiple supported backends" - ":doc:`rocALUTION `", "Sparse linear algebra library for exploring fine-grained parallelism on ROCm runtime and toolchains" - ":doc:`rocBLAS `", "BLAS implementation (in the HIP programming language) on the ROCm runtime and toolchains" - ":doc:`rocFFT `", "Software library for computing fast Fourier transforms (FFTs) written in HIP" - ":doc:`rocRAND `", "Provides functions that generate pseudorandom and quasirandom numbers" - ":doc:`rocSOLVER `", "An implementation of LAPACK routines on ROCm software, implemented in the HIP programming language and optimized for AMD's latest discrete GPUs" - ":doc:`rocSPARSE `", "Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language)" - ":doc:`rocWMMA `", "C++ library for accelerating 
mixed-precision matrix multiply-accumulate (MMA) operations" - ":doc:`Tensile `", "Creates benchmark-driven backend libraries for GEMMs, GEMM-like problems, and general N-dimensional tensor contractions" - -Primitives -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Component", "Description" - - ":doc:`hipCUB `", "Thin header-only wrapper library on top of :doc:`rocPRIM ` or CUB that allows project porting using the CUB library to the HIP layer" - ":doc:`hipTensor `", "AMD's C++ library for accelerating tensor primitives based on the composable kernel library" - ":doc:`rocPRIM `", "Header-only library for HIP parallel primitives" - ":doc:`rocThrust `", "Parallel algorithm library" - -Tools ------------------------------------------------ - -System Management -^^^^^^^^^^^^^^^^^ - -.. csv-table:: - :header: "Component", "Description" - - ":doc:`AMD SMI `", "System management interface to control AMD GPU settings, monitor performance, and retrieve device and process information" - ":doc:`ROCm Data Center Tool `", "Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments" - ":doc:`rocminfo `", "Reports system information" - ":doc:`ROCm SMI `", "C library for Linux that provides a user space interface for applications to monitor and control GPU applications" - ":doc:`ROCm Validation Suite `", "Detects and troubleshoots common problems affecting AMD GPUs running in a high-performance computing environment" - -Performance -^^^^^^^^^^^ - -.. csv-table:: - :header: "Component", "Description" - - ":doc:`ROCm Bandwidth Test `", "Captures the performance characteristics of buffer copying and kernel read/write operations" - ":doc:`ROCm Compute Profiler `", "Kernel-level profiling for machine learning and high performance computing (HPC) workloads" - ":doc:`ROCm Systems Profiler `", "Comprehensive profiling and tracing of applications running on the CPU or the CPU and GPU" - ":doc:`ROCProfiler `", "Profiling tool for HIP applications" - ":doc:`ROCprofiler-SDK `", "Toolkit for developing analysis tools for profiling and tracing GPU compute applications. This toolkit is in beta and subject to change" - ":doc:`ROCTracer `", "Intercepts runtime API calls and traces asynchronous activity" - -.. note:: - - `ROCprof Compute Viewer `_ is a tool for visualizing and analyzing GPU thread trace data collected using :doc:`rocprofv3 `. - Note that `ROCprof Compute Viewer `_ is in an early access state. Running production workloads is not recommended. - -Development -^^^^^^^^^^^ - -.. csv-table:: - :header: "Component", "Description" - - ":doc:`HIPIFY `", "Translates CUDA source code into portable HIP C++" - ":doc:`ROCm CMake `", "Collection of CMake modules for common build and development tasks" - ":doc:`ROCdbgapi `", "ROCm debugger API library" - ":doc:`ROCm Debugger (ROCgdb) `", "Source-level debugger for Linux, based on the GNU Debugger (GDB)" - ":doc:`ROCr Debug Agent `", "Prints the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running" - -Compilers ------------------------------------------------ - -.. 
csv-table:: - :header: "Component", "Description" - - ":doc:`HIPCC `", "Compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure" - ":doc:`ROCm compilers `", "ROCm LLVM compiler infrastructure" - "`FLANG `_", "An out-of-tree Fortran compiler targeting LLVM" - -Runtimes ------------------------------------------------ - -.. csv-table:: - :header: "Component", "Description" - - ":doc:`AMD Compute Language Runtime (CLR) `", "Contains source code for AMD's compute language runtimes: HIP and OpenCL" - ":doc:`HIP `", "AMD's GPU programming language extension and the GPU runtime" - ":doc:`ROCR-Runtime `", "User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents"
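
The graph-safe support table earlier in this changeset indicates which ROCm libraries can be called while a HIP stream is being captured into a graph. As a minimal sketch of that pattern (illustrative only, not taken from the deleted pages; it uses the public HIP graph and hipBLAS APIs, the hipBLAS handle and the device buffers ``d_A``, ``d_B``, ``d_C`` of ``n * n`` floats are assumed to be set up by the caller, and error checking is omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <hipblas/hipblas.h>

   // Sketch: capture a graph-safe hipBLAS GEMM into a HIP graph and replay it.
   void capture_and_replay_gemm(hipblasHandle_t handle, int n,
                                const float* d_A, const float* d_B, float* d_C)
   {
       hipStream_t stream;
       hipStreamCreate(&stream);
       hipblasSetStream(handle, stream);   // library work must go to the captured stream

       const float alpha = 1.0f, beta = 0.0f;

       hipGraph_t graph;
       hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
       hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, n, n, n,
                    &alpha, d_A, n, d_B, n, &beta, d_C, n);   // recorded, not executed yet
       hipStreamEndCapture(stream, &graph);

       hipGraphExec_t graph_exec;
       hipGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
       hipGraphLaunch(graph_exec, stream);  // replay the captured GEMM
       hipStreamSynchronize(stream);

       hipGraphExecDestroy(graph_exec);
       hipGraphDestroy(graph);
       hipStreamDestroy(stream);
   }

The key point is that all library work is issued to the captured stream via ``hipblasSetStream``; once instantiated, the graph can be launched repeatedly without re-recording the GEMM.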
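
The precision-support tables above pair type names such as float16 and bfloat16 with HIP C++ types. The following device-code sketch (illustrative only; it assumes the conversion helpers declared in ``hip/hip_fp16.h`` and the ``hip_bfloat16`` type with float conversions from ``hip/hip_bfloat16.h``) round-trips a ``float`` value through both 16-bit storage formats:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <hip/hip_fp16.h>      // __half, __float2half, __half2float
   #include <hip/hip_bfloat16.h>  // hip_bfloat16 with float conversions

   // Round-trip each input value through float16 and bfloat16 storage.
   __global__ void roundtrip(const float* in, float* out_fp16, float* out_bf16, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           __half h = __float2half(in[i]);        // float32 -> float16
           out_fp16[i] = __half2float(h);         // float16 -> float32

           hip_bfloat16 b(in[i]);                 // float32 -> bfloat16
           out_bf16[i] = static_cast<float>(b);   // bfloat16 -> float32
       }
   }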
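
Similarly, the atomic-operations tables list per-architecture support for atomic read-modify-write on 32-bit and 64-bit types. A small kernel using the corresponding ``atomicAdd`` overloads might look like the sketch below (illustrative only; where a type and architecture combination is not natively supported, the operation may be emulated, with the performance caveats noted on the precision-support page):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Accumulate per-thread contributions into shared totals with atomics.
   __global__ void accumulate(const float* x, int n,
                              float* total_f32, double* total_f64,
                              unsigned long long* count)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           atomicAdd(total_f32, x[i]);                       // float32 atomic add
           atomicAdd(total_f64, static_cast<double>(x[i]));  // float64 atomic add
           atomicAdd(count, 1ull);                           // 64-bit integer atomic add
       }
   }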