diff --git a/CHANGELOG.md b/CHANGELOG.md index 6c068afb0..27f48060e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4116,7 +4116,7 @@ memory partition modes upon an invalid argument return from memory partition mod - JSON output plugin for `rocprofv2`. The JSON file matches Google Trace Format making it easy to load on Perfetto, Chrome tracing, or Speedscope. For Speedscope, use `--disable-json-data-flows` option as speedscope doesn't work with data flows. - `--no-serialization` flag to disable kernel serialization when `rocprofv2` is in counter collection mode. This allows `rocprofv2` to avoid deadlock when profiling certain programs in counter collection mode. -- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 accelerator +- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 GPU - New HIP APIs with struct defined inside union. - Early checks to confirm the eligibility of ELF file in ATT plugin - Support for kernel name filtering in `rocprofv2` @@ -4140,18 +4140,18 @@ memory partition modes upon an invalid argument return from memory partition mod #### Resolved issues -- Bandwidth measurement in AMD Instinct MI300 accelerator +- Bandwidth measurement in AMD Instinct MI300 GPU - Perfetto plugin issue of `roctx` trace not getting displayed - `--help` for counter collection - Signal management issues in `queue.cpp` - Perfetto tracks for multi-GPU - Perfetto plugin usage with `rocsys` - Incorrect number of columns in the output CSV files for counter collection and kernel tracing -- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 accelerator +- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 GPU - Build errors thrown during parsing of unions - The system hang caused while running `--kernel-trace` with Perfetto for certain applications - Missing profiler records issue caused while running `--trace-period` -- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 accelerator +- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 GPU - Segmentation fault on Navi32 @@ -5548,7 +5548,7 @@ See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub. intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This issue is fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/rocprofiler-compute/issues/347). -- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error: +- Omniperf might not work with AMD Instinct MI300 GPUs out of the box, resulting in the following error: "*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the environment variable `export ROCPROF=rocprofv2`. @@ -5664,7 +5664,7 @@ See [issue #3498](https://github.com/ROCm/ROCm/issues/3498) on GitHub. #### Optimized -* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator. +* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 GPU. 
#### Removed diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst index c16240e32..661c935dd 100644 --- a/docs/compatibility/compatibility-matrix.rst +++ b/docs/compatibility/compatibility-matrix.rst @@ -10,7 +10,7 @@ Use this matrix to view the ROCm compatibility and system requirements across su You can also refer to the :ref:`past versions of ROCm compatibility matrix`. -Accelerators and GPUs listed in the following table support compute workloads (no display +GPUs listed in the following table support compute workloads (no display information or graphics). If you’re using ROCm with AMD Radeon GPUs or Ryzen APUs for graphics workloads, see the :docs:`Use ROCm on Radeon and Ryzen ` to verify compatibility and system requirements. diff --git a/docs/compatibility/ml-compatibility/jax-compatibility.rst b/docs/compatibility/ml-compatibility/jax-compatibility.rst index 479256c48..121e4d126 100644 --- a/docs/compatibility/ml-compatibility/jax-compatibility.rst +++ b/docs/compatibility/ml-compatibility/jax-compatibility.rst @@ -79,7 +79,7 @@ Use cases and recommendations * The `MI300X workload optimization guide `_ provides detailed guidance on optimizing workloads for the AMD Instinct MI300X - accelerator using ROCm. The page is aimed at helping users achieve optimal + GPU using ROCm. The page is aimed at helping users achieve optimal performance for deep learning and other high-performance computing tasks on the MI300X GPU. diff --git a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst b/docs/compatibility/ml-compatibility/pytorch-compatibility.rst index 245296532..19901b7cd 100644 --- a/docs/compatibility/ml-compatibility/pytorch-compatibility.rst +++ b/docs/compatibility/ml-compatibility/pytorch-compatibility.rst @@ -73,9 +73,9 @@ Use cases and recommendations * The :doc:`Instinct MI300X workload optimization guide ` provides detailed guidance on optimizing workloads for the AMD Instinct MI300X - accelerator using ROCm. This guide helps users achieve optimal performance for + GPU using ROCm. This guide helps users achieve optimal performance for deep learning and other high-performance computing tasks on the MI300X - accelerator. + GPU. * The :doc:`Inception with PyTorch documentation ` describes how PyTorch integrates with ROCm for AI workloads It outlines the @@ -417,7 +417,7 @@ Key features and enhancements for PyTorch 2.7 with ROCm 7.0 - Expanded GPU architecture support: Provides optimized support for newer GPU architectures, including gfx1200 and gfx1201 with preferred hipBLASLt backend - selection, along with improvements for gfx950 and gfx1100 series GPUs. + selection, along with improvements for gfx950 and gfx1100 Series GPUs. - Advanced Triton Integration: AOTriton 0.10b introduces official support for gfx950 and gfx1201, along with experimental support for gfx1101, gfx1151, diff --git a/docs/compatibility/ml-compatibility/taichi-compatibility.rst b/docs/compatibility/ml-compatibility/taichi-compatibility.rst index 58bbbd4f5..5fb2b9708 100644 --- a/docs/compatibility/ml-compatibility/taichi-compatibility.rst +++ b/docs/compatibility/ml-compatibility/taichi-compatibility.rst @@ -30,8 +30,8 @@ visual effects in film and gaming, and general-purpose computing. 
Supported devices and features =============================================================================== -There is support through the ROCm software stack for all Taichi GPU features on AMD Instinct MI250X and MI210X series GPUs with the exception of Taichi’s GPU rendering system, CGUI. -AMD Instinct MI300X series GPUs will be supported by November. +There is support through the ROCm software stack for all Taichi GPU features on AMD Instinct MI250X and MI210X Series GPUs with the exception of Taichi’s GPU rendering system, CGUI. +AMD Instinct MI300X Series GPUs will be supported by November. .. _taichi-recommendations: diff --git a/docs/conceptual/gpu-arch.md b/docs/conceptual/gpu-arch.md index f14a39421..2074e0cc3 100644 --- a/docs/conceptual/gpu-arch.md +++ b/docs/conceptual/gpu-arch.md @@ -13,22 +13,22 @@ :gutter: 1 :::{grid-item-card} -**AMD Instinct MI300 series** +**AMD Instinct MI300 Series** -Review hardware aspects of the AMD Instinct™ MI300 series of GPU accelerators and the CDNA™ 3 +Review hardware aspects of the AMD Instinct™ MI300 Series GPUs and the CDNA™ 3 architecture. * [AMD Instinct™ MI300 microarchitecture](./gpu-arch/mi300.md) * [AMD Instinct MI300/CDNA3 ISA](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf) * [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf) * [MI300 performance counters](./gpu-arch/mi300-mi200-performance-counters.rst) -* [MI350 series performance counters](./gpu-arch/mi350-performance-counters.rst) +* [MI350 Series performance counters](./gpu-arch/mi350-performance-counters.rst) ::: :::{grid-item-card} -**AMD Instinct MI200 series** +**AMD Instinct MI200 Series** -Review hardware aspects of the AMD Instinct™ MI200 series of GPU accelerators and the CDNA™ 2 +Review hardware aspects of the AMD Instinct™ MI200 Series GPUs and the CDNA™ 2 architecture. * [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md) @@ -41,7 +41,7 @@ architecture. :::{grid-item-card} **AMD Instinct MI100** -Review hardware aspects of the AMD Instinct™ MI100 series of GPU accelerators and the CDNA™ 1 +Review hardware aspects of the AMD Instinct™ MI100 Series GPUs and the CDNA™ 1 architecture. * [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md) diff --git a/docs/conceptual/gpu-arch/mi100.md b/docs/conceptual/gpu-arch/mi100.md index f98d4c0db..8fa4a11d8 100644 --- a/docs/conceptual/gpu-arch/mi100.md +++ b/docs/conceptual/gpu-arch/mi100.md @@ -1,14 +1,14 @@ --- myst: html_meta: - "description lang=en": "Learn about the AMD Instinct MI100 series architecture." + "description lang=en": "Learn about the AMD Instinct MI100 Series architecture." "keywords": "Instinct, MI100, microarchitecture, AMD, ROCm" --- # AMD Instinct™ MI100 microarchitecture The following image shows the node-level architecture of a system that -comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators. +comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ GPUs. The two EPYC processors are connected to each other with the AMD Infinity™ fabric which provides a high-bandwidth (up to 18 GT/sec) and coherent links such that each processor can access the available node memory as a single @@ -18,29 +18,29 @@ available to connect the processors plus one PCIe Gen 4 x16 link per processor can attach additional I/O devices such as the host adapters for the network fabric. 
-![Structure of a single GCD in the AMD Instinct MI100 accelerator](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.") +![Structure of a single GCD in the AMD Instinct MI100 GPU](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ GPUs.") In a typical node configuration, each processor can host up to four AMD -Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec, +Instinct™ GPUs that are attached using PCIe Gen 4 links at 16 GT/sec, which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive -of four accelerators can participate in a fully connected, coherent AMD -Instinct™ fabric that connects the four accelerators using 23 GT/sec AMD +of four GPUs can participate in a fully connected, coherent AMD +Instinct™ fabric that connects the four GPUs using 23 GT/sec AMD Infinity fabric links that run at a higher frequency than the inter-processor links. This inter-GPU link can be established in certified server systems if the GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity -Fabric™ bridge for the AMD Instinct™ accelerators. +Fabric™ bridge for the AMD Instinct™ GPUs. ## Microarchitecture -The microarchitecture of the AMD Instinct accelerators is based on the AMD CDNA +The microarchitecture of the AMD Instinct GPUs is based on the AMD CDNA architecture, which targets compute applications such as high-performance computing (HPC) and AI & machine learning (ML) that run on everything from individual servers to the world's largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance. -![Structure of the AMD Instinct accelerator (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct accelerator (MI100 generation)") +![Structure of the AMD Instinct GPU (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct GPU (MI100 generation)") -The above image shows the AMD Instinct accelerator with its PCIe Gen 4 x16 +The above image shows the AMD Instinct GPU with its PCIe Gen 4 x16 link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host processor(s). It also shows the three AMD Infinity Fabric ports that provide high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local @@ -48,7 +48,7 @@ hive. On the left and right of the floor plan, the High Bandwidth Memory (HBM) attaches via the GPU memory controller. The MI100 generation of the AMD -Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total +Instinct GPU offers four stacks of HBM generation 2 (HBM2) for a total of 32GB with a 4,096bit-wide memory interface. The peak memory bandwidth of the attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz. @@ -64,7 +64,7 @@ Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS ![Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture](../../data/conceptual/gpu-arch/image006.png "An MI100 compute unit with detailed SIMD view of the AMD CDNA architecture") The preceding image shows the block diagram of a single CU of an AMD Instinct™ -MI100 accelerator and summarizes how instructions flow through the execution +MI100 GPU and summarizes how instructions flow through the execution engines. 
The CU fetches the instructions via a 32KB instruction cache and moves them forward to execution via a dispatcher. The CU can handle up to ten wavefronts at a time and feed their instructions into the execution unit. The diff --git a/docs/conceptual/gpu-arch/mi250.md b/docs/conceptual/gpu-arch/mi250.md index 47f7f8b1b..0a308ad88 100644 --- a/docs/conceptual/gpu-arch/mi250.md +++ b/docs/conceptual/gpu-arch/mi250.md @@ -1,13 +1,13 @@ --- myst: html_meta: - "description lang=en": "Learn about the AMD Instinct MI250 series architecture." + "description lang=en": "Learn about the AMD Instinct MI250 Series architecture." "keywords": "Instinct, MI250, microarchitecture, AMD, ROCm" --- # AMD Instinct™ MI250 microarchitecture -The microarchitecture of the AMD Instinct MI250 accelerators is based on the +The microarchitecture of the AMD Instinct MI250 GPU is based on the AMD CDNA 2 architecture that targets compute applications such as HPC, artificial intelligence (AI), and machine learning (ML) and that run on everything from individual servers to the world’s largest exascale @@ -40,7 +40,7 @@ execution units (also called matrix cores), which are geared toward executing matrix operations like matrix-matrix multiplications. For FP64, the peak performance of these units amounts to 90.5 TFLOPS. -![Structure of a single GCD in the AMD Instinct MI250 accelerator.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 accelerator.") +![Structure of a single GCD in the AMD Instinct MI250 GPU.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 GPU.") ```{list-table} Peak-performance capabilities of the MI250 OAM for different data types. :header-rows: 1 @@ -84,16 +84,9 @@ performance of these units amounts to 90.5 TFLOPS. - 362.1 ``` -The above table summarizes the aggregated peak performance of the AMD -Instinct MI250 OCP Open Accelerator Modules (OAM, OCP is short for Open Compute -Platform) and its two GCDs for different data types and execution units. The -middle column lists the peak performance (number of data elements processed in a -single instruction) of a single compute unit if a SIMD (or matrix) instruction -is being retired in each clock cycle. The third column lists the theoretical -peak performance of the OAM module. The theoretical aggregated peak memory -bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD). +The above table summarizes the aggregated peak performance of the AMD Instinct MI250 Open Compute Platform (OCP) Open Accelerator Module (OAM) and its two GCDs for different data types and execution units. The middle column lists the peak performance (number of data elements processed in a single instruction) of a single compute unit if a SIMD (or matrix) instruction is being retired in each clock cycle. The third column lists the theoretical peak performance of the OAM module. The theoretical aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD). -![Dual-GCD architecture of the AMD Instinct MI250 accelerators](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 accelerators") +![Dual-GCD architecture of the AMD Instinct MI250 GPUs](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 GPUs") The following image shows the block diagram of an OAM package that consists of two GCDs, each of which constitutes one GPU device in the system.
The two @@ -105,18 +98,18 @@ between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of ## Node-level architecture The following image shows the node-level architecture of a system that is -based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host +based on the AMD Instinct MI250 GPU. The MI250 OAMs attach to the host system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe x16 link to the host part of the system. Depending on the server platform, the GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch . Note that some platforms may offer an x8 interface to the GCDs, which reduces the available host-to-GPU bandwidth. -![Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor") +![Block diagram of AMD Instinct MI250 GPUs with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 GPUs with 3rd Generation AMD EPYC processor") The preceding image shows the node-level architecture of a system with AMD EPYC processors in a dual-socket configuration and four AMD Instinct MI250 -accelerators. The MI250 OAMs attach to the host processors system via PCIe Gen 4 +GPUs. The MI250 OAMs attach to the host processors system via PCIe Gen 4 x16 links (yellow lines). Depending on the system design, a PCIe switch may exist to make more PCIe lanes available for additional components like network interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to diff --git a/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst b/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst index 8598d2c7c..3777b3d32 100644 --- a/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst +++ b/docs/conceptual/gpu-arch/mi300-mi200-performance-counters.rst @@ -1,16 +1,16 @@ .. meta:: - :description: MI300 and MI200 series performance counters and metrics + :description: MI300 and MI200 Series performance counters and metrics :keywords: MI300, MI200, performance counters, command processor counters *************************************************************************************************** -MI300 and MI200 series performance counters and metrics +MI300 and MI200 Series performance counters and metrics *************************************************************************************************** This document lists and describes the hardware performance counters and derived metrics available for the AMD Instinct™ MI300 and MI200 GPU. You can also access this information using the :doc:`ROCprofiler-SDK `. -MI300 and MI200 series performance counters +MI300 and MI200 Series performance counters =============================================================== Series performance counters include the following categories: @@ -27,7 +27,7 @@ The following sections provide additional details for each category. .. note:: - Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with + Preliminary validation of all MI300 and MI200 Series performance counters is in progress. Those with an asterisk (*) require further evaluation. .. 
_command-processor-counters: @@ -171,7 +171,7 @@ Instruction mix "``SQ_INSTS_SMEM``", "Instr", "Number of scalar memory instructions issued" "``SQ_INSTS_SMEM_NORM``", "Instr", "Number of scalar memory instructions normalized to match ``smem_level`` issued" "``SQ_INSTS_FLAT``", "Instr", "Number of flat instructions issued" - "``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled." + "``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 Series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled." "``SQ_INSTS_LDS``", "Instr", "Number of LDS instructions issued **(MI200: includes flat; MI300: does not include flat)**" "``SQ_INSTS_GDS``", "Instr", "Number of global data share instructions issued" "``SQ_INSTS_EXP_GDS``", "Instr", "Number of EXP and global data share instructions excluding skipped export instructions issued" @@ -396,9 +396,9 @@ Texture cache per pipe counters "``TCP_UTCL1_TRANSLATION_MISS[n]``", "Req", "Number of unified translation cache (L1) translation misses", "0-15" "``TCP_UTCL1_PERMISSION_MISS[n]``", "Req", "Number of unified translation cache (L1) permission misses", "0-15" "``TCP_TOTAL_CACHE_ACCESSES[n]``", "Req", "Number of vector L1d cache accesses including hits and misses", "0-15" - "``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15" - "``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15" - "``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15" + "``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 Series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15" + "``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 Series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15" + "``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 Series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15" "``TCP_TCC_READ_REQ[n]``", "Req", "Number of read requests to L2 cache", "0-15" "``TCP_TCC_WRITE_REQ[n]``", "Req", "Number of write requests to L2 cache", "0-15" "``TCP_TCC_ATOMIC_WITH_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache with return", "0-15" @@ -560,7 +560,7 @@ Note the following: ``TCC_TAG_STALL[n]``, probes can stall the pipeline at a variety of places. There is no single point that can accurately measure the total stalls -MI300 and MI200 series derived metrics list +MI300 and MI200 Series derived metrics list ============================================================== .. csv-table:: diff --git a/docs/conceptual/gpu-arch/mi300.md b/docs/conceptual/gpu-arch/mi300.md index f5b66ceae..416b81859 100644 --- a/docs/conceptual/gpu-arch/mi300.md +++ b/docs/conceptual/gpu-arch/mi300.md @@ -1,21 +1,21 @@ --- myst: html_meta: - "description lang=en": "Learn about the AMD Instinct MI300 series architecture." + "description lang=en": "Learn about the AMD Instinct MI300 Series architecture." 
"keywords": "Instinct, MI300X, MI300A, microarchitecture, AMD, ROCm" --- -# AMD Instinct™ MI300 series microarchitecture +# AMD Instinct™ MI300 Series microarchitecture -The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3 +The AMD Instinct MI300 Series GPUs are based on the AMD CDNA 3 architecture which was designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine -learning (ML) workloads. The AMD Instinct MI300 series accelerators are well-suited for extreme scalability and compute performance, running +learning (ML) workloads. The AMD Instinct MI300 Series GPUs are well-suited for extreme scalability and compute performance, running on everything from individual servers to the world’s largest exascale supercomputers. -With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the +With the MI300 Series, AMD is introducing the Accelerator Complex Die (XCD), which contains the GPU computational elements of the processor along with the lower levels of the cache hierarchy. -The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series. +The following image depicts the structure of a single XCD in the AMD Instinct MI300 GPU Series. ```{figure} ../../data/shared/xcd-sys-arch.png --- @@ -39,7 +39,7 @@ infrastructure) using the AMD Infinity Fabric™ technology as interconnect. The Matrix Cores inside the CDNA 3 CUs have significant improvements, emphasizing AI and machine learning, enhancing throughput of existing data types while adding support for new data types. CDNA 2 Matrix Cores support FP16 and BF16, while offering INT8 for inference. Compared to MI250X -accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a +GPUs, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a performance gain of 6.8 times for INT8. FP8 has a performance gain of 16 times compared to FP32, while TF32 has a gain of 4 times compared to FP32. @@ -105,7 +105,7 @@ name: mi300-arch alt: align: center --- -MI300 series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs. +MI300 Series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs. ``` ## Node-level architecture @@ -116,11 +116,11 @@ name: mi300-node align: center --- -MI300 series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIEe switches via retimers and HGX connectors. +MI300 Series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIEe switches via retimers and HGX connectors. ``` The image above shows the node-level architecture of a system with AMD EPYC processors in a -dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the +dual-socket configuration and eight AMD Instinct MI300X GPUs. The MI300X OAMs attach to the host system via PCIe Gen 5 x16 links (yellow lines). The GPUs are using seven high-bandwidth, low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system. diff --git a/docs/conceptual/gpu-arch/mi350-performance-counters.rst b/docs/conceptual/gpu-arch/mi350-performance-counters.rst index fe103c17c..c6bcd1e5f 100644 --- a/docs/conceptual/gpu-arch/mi350-performance-counters.rst +++ b/docs/conceptual/gpu-arch/mi350-performance-counters.rst @@ -1,12 +1,12 @@ .. 
meta:: - :description: MI355 series performance counters and metrics + :description: MI355 Series performance counters and metrics :keywords: MI355, MI355X, MI3XX *********************************** -MI350 series performance counters +MI350 Series performance counters *********************************** -This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 accelerators. These counters are available for profiling using `ROCprofiler-SDK `_ and `ROCm Compute Profiler `_. +This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 GPUs. These counters are available for profiling using `ROCprofiler-SDK `_ and `ROCm Compute Profiler `_. The following sections list the performance counters based on the IP blocks. diff --git a/docs/conf.py b/docs/conf.py index b097dc26e..5a6298e04 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -234,7 +234,7 @@ suppress_warnings = ["autosectionlabel.*"] html_context = { "project_path" : {project_path}, - "gpu_type" : [('AMD Instinct accelerators', 'intrinsic'), ('AMD gfx families', 'gfx'), ('NVIDIA families', 'nvidia') ], + "gpu_type" : [('AMD Instinct GPUs', 'intrinsic'), ('AMD gfx families', 'gfx'), ('NVIDIA families', 'nvidia') ], "atomics_type" : [('HW atomics', 'hw-atomics'), ('CAS emulation', 'cas-atomics')], "pcie_type" : [('No PCIe atomics', 'nopcie'), ('PCIe atomics', 'pcie')], "memory_type" : [('Device DRAM', 'device-dram'), ('Migratable Host DRAM', 'migratable-host-dram'), ('Pinned Host DRAM', 'pinned-host-dram')], diff --git a/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv index b375886eb..1c3f222e3 100644 --- a/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/cas-atomics_nopcie_instinct.csv @@ -1,4 +1,4 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series 32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS diff --git a/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv index 2e0e40fc1..533571e8c 100644 --- a/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/cas-atomics_pcie_instinct.csv @@ -1,4 +1,4 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series 32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS 32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS diff --git a/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv index 483684089..2bf490375 100644 --- a/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/hw-atomics_nopcie_instinct.csv @@ -1,4 +1,4 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series 32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native 32 bit 
atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native 32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native diff --git a/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv b/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv index 5ea596069..b915ef992 100644 --- a/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv +++ b/docs/data/reference/gpu-atomics-operation/hw-atomics_pcie_instinct.csv @@ -1,4 +1,4 @@ -Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series +Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series 32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native 32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native 32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native diff --git a/docs/how-to/deep-learning-rocm.rst b/docs/how-to/deep-learning-rocm.rst index fb21328f8..8a535ea54 100644 --- a/docs/how-to/deep-learning-rocm.rst +++ b/docs/how-to/deep-learning-rocm.rst @@ -10,7 +10,7 @@ Deep learning frameworks provide environments for machine learning, training, fi ROCm offers a complete ecosystem for developing and running deep learning applications efficiently. It also provides ROCm-compatible versions of popular frameworks and libraries, such as PyTorch, TensorFlow, JAX, and others. -The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs and accelerators. +The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs. The table below summarizes information about ROCm-enabled deep learning frameworks. It includes details on ROCm compatibility and third-party tool support, installation steps and options, and links to GitHub resources. For a complete list of supported framework versions on ROCm, see the :doc:`Compatibility matrix <../compatibility/compatibility-matrix>` topic. diff --git a/docs/how-to/gpu-performance/mi300x.rst b/docs/how-to/gpu-performance/mi300x.rst index 5c13ea177..d82698c24 100644 --- a/docs/how-to/gpu-performance/mi300x.rst +++ b/docs/how-to/gpu-performance/mi300x.rst @@ -1,5 +1,5 @@ .. meta:: - :description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance. + :description: How to configure MI300X GPUs to fully leverage their capabilities and achieve optimal performance. :keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning ************************************** @@ -7,11 +7,11 @@ AMD Instinct MI300X performance guides ************************************** The following performance guides provide essential guidance on the necessary -steps to properly `configure your system for AMD Instinct™ MI300X accelerators +steps to properly `configure your system for AMD Instinct™ MI300X GPUs `_. They include detailed instructions on system settings and application :doc:`workload tuning ` to -help you leverage the maximum capabilities of these accelerators and achieve +help you leverage the maximum capabilities of these GPUs and achieve superior performance. * `AMD Instinct MI300X system optimization `__ @@ -19,9 +19,9 @@ superior performance. your AMD Instinct MI300X system for performance. 
* :doc:`/how-to/rocm-for-ai/inference-optimization/workload` covers steps to - optimize the performance of AMD Instinct MI300X series accelerators for HPC + optimize the performance of AMD Instinct MI300X Series GPUs for HPC and deep learning operations. * :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm` introduces a preconfigured environment for LLM inference, designed to help you test performance with - popular models on AMD Instinct MI300X series accelerators. + popular models on AMD Instinct MI300X Series GPUs. diff --git a/docs/how-to/programming_guide.rst b/docs/how-to/programming_guide.rst index 54ebc93e9..c002b0295 100644 --- a/docs/how-to/programming_guide.rst +++ b/docs/how-to/programming_guide.rst @@ -25,7 +25,7 @@ execute on AMD GPUs while maintaining compatibility with CUDA-based systems. OpenCL (Open Computing Language) is an open standard for cross-platform, parallel programming of diverse processors. ROCm supports OpenCL for developers who want to use standard frameworks across different hardware platforms, -including CPUs, GPUs, and other accelerators. For more information, see +including CPUs, GPUs, and APUs. For more information, see `OpenCL `_. Python bindings can be found at https://github.com/ROCm/hip-python. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst b/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst index 3542706db..0d540988b 100644 --- a/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst +++ b/docs/how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference.rst @@ -11,10 +11,10 @@ Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and ROCm-aware versions of :doc:`deep learning frameworks <../../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX. -Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for +Single-accelerator systems, such as a machine equipped with a single GPU, are commonly used for smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`. -Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are +Multi-accelerator systems, on the other hand, consist of multiple GPUs working in parallel. These systems are typically used in LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`. diff --git a/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst b/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst index 0d6ba2b9c..83ec3927b 100644 --- a/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst +++ b/docs/how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst @@ -3,11 +3,11 @@ :keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference, accelerators, PyTorch, HuggingFace, torchtune ***************************************************** -Fine-tuning and inference using multiple accelerators +Fine-tuning and inference using multiple GPUs ***************************************************** This section explains how to fine-tune a model on a multi-accelerator system. 
See -:doc:`Single-accelerator fine-tuning ` for a single accelerator or GPU setup. +:doc:`Single-accelerator fine-tuning ` for a single GPU setup. .. _fine-tuning-llms-multi-gpu-env: @@ -20,7 +20,7 @@ This section was tested using the following hardware and software environment. :stub-columns: 1 * - Hardware - - 4 AMD Instinct MI300X accelerators + - 4 AMD Instinct MI300X GPUs * - Software - ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10 @@ -40,13 +40,13 @@ Setting up the base implementation environment :doc:`PyTorch installation guide `. For consistent installation, it’s recommended to use official ROCm prebuilt Docker images with the framework pre-installed. -#. In the Docker container, check the availability of ROCM-capable accelerators using the following command. +#. In the Docker container, check the availability of ROCm-capable GPUs using the following command. .. code-block:: shell rocm-smi --showproductname -#. Check that your accelerators are available to PyTorch. +#. Check that your GPUs are available to PyTorch. .. code-block:: python @@ -66,7 +66,7 @@ Setting up the base implementation environment .. tip:: During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal. - This tool helps you see shows which accelerators or GPUs are involved. + This tool helps you see which GPUs are involved. .. _fine-tuning-llms-multi-gpu-hugging-face-accelerate: @@ -74,9 +74,9 @@ Setting up the base implementation environment Hugging Face Accelerate for fine-tuning and inference =========================================================== -`Hugging Face Accelerate `_ is a library that simplifies turning raw -PyTorch code for a single accelerator into code for multiple accelerators for LLM fine-tuning and inference. It is -integrated with `Transformers `_ allowing you to scale your PyTorch +`Hugging Face Accelerate `__ is a library that simplifies turning raw +PyTorch code for a single GPU into code for multiple GPUs for LLM fine-tuning and inference. It is +integrated with `Transformers `__, so you can scale your PyTorch code while maintaining performance and flexibility. As a brief example of model fine-tuning and inference using multiple GPUs, let's use Transformers and load in the Llama @@ -107,7 +107,7 @@ Now, it's important to adjust how you load the model. Add the ``device_map`` par (``"auto"``, ``"balanced"``, ``"balanced_low_0"``, ``"sequential"``). It's recommended to set the ``device_map`` parameter to ``“auto”`` to allow Accelerate to automatically and - efficiently allocate the model given the available resources (4 accelerators in this case). + efficiently allocate the model given the available resources (four GPUs in this case). When you have more GPU memory available than the model size, here is the difference between each ``device_map`` option: @@ -130,8 +130,8 @@ After loading the model in this way, the model is fully ready to use the resourc torchtune for fine-tuning and inference ============================================= -`torchtune `_ is a PyTorch-native library for easy single and multi-accelerator or -GPU model fine-tuning and inference with LLMs. +`torchtune `_ is a PyTorch-native library for easy single and multi-GPU +model fine-tuning and inference with LLMs. #. Install torchtune using pip.
diff --git a/docs/how-to/rocm-for-ai/fine-tuning/overview.rst b/docs/how-to/rocm-for-ai/fine-tuning/overview.rst index f5dea82a4..fd7d096c3 100644 --- a/docs/how-to/rocm-for-ai/fine-tuning/overview.rst +++ b/docs/how-to/rocm-for-ai/fine-tuning/overview.rst @@ -30,7 +30,7 @@ The challenge of fine-tuning models However, the computational cost of fine-tuning is still high, especially for complex models and large datasets, which poses distinct challenges related to substantial computational and memory requirements. This might be a barrier for -accelerators or GPUs with low computing power or limited device memory resources. +GPUs with low computing power or limited device memory resources. For example, suppose we have a language model with 7 billion (7B) parameters, represented by a weight matrix :math:`W`. During backpropagation, the model needs to learn a :math:`ΔW` matrix, which updates the original weights to minimize the @@ -84,8 +84,8 @@ Walkthrough =========== To demonstrate the benefits of LoRA and the ideal compute compatibility of using PEFT and TRL libraries on AMD -ROCm-compatible accelerators and GPUs, let's step through a comprehensive implementation of the fine-tuning process -using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X accelerators. +ROCm-compatible GPUs, let's step through a comprehensive implementation of the fine-tuning process +using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X GPUs. Before starting, review and understand the key components of this walkthrough: diff --git a/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst b/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst index dc66d9a75..0aca524da 100644 --- a/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst +++ b/docs/how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst @@ -3,12 +3,11 @@ :keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference, SFTTrainer **************************************************** -Fine-tuning and inference using a single accelerator +Fine-tuning and inference using a single GPU **************************************************** This section explains model fine-tuning and inference techniques on a single-accelerator system. See -:doc:`Multi-accelerator fine-tuning ` for a setup with multiple accelerators or -GPUs. +:doc:`Multi-accelerator fine-tuning ` for a setup with multiple GPUs. .. _fine-tuning-llms-single-gpu-env: @@ -21,7 +20,7 @@ This section was tested using the following hardware and software environment. :stub-columns: 1 * - Hardware - - AMD Instinct MI300X accelerator + - AMD Instinct MI300X GPU * - Software - ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10 @@ -41,7 +40,7 @@ Setting up the base implementation environment :doc:`PyTorch installation guide `. For a consistent installation, it’s recommended to use official ROCm prebuilt Docker images with the framework pre-installed. -#. In the Docker container, check the availability of ROCm-capable accelerators using the following command. +#. In the Docker container, check the availability of ROCm-capable GPUs using the following command. .. 
code-block:: shell @@ -53,14 +52,14 @@ Setting up the base implementation environment ============================ ROCm System Management Interface ============================ ====================================== Product Info ====================================== - GPU[0] : Card series: AMD Instinct MI300X OAM + GPU[0] : Card Series: AMD Instinct MI300X OAM GPU[0] : Card model: 0x74a1 GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI] GPU[0] : Card SKU: MI3SRIOV ========================================================================================== ================================== End of ROCm SMI Log =================================== -#. Check that your accelerators are available to PyTorch. +#. Check that your GPUs are available to PyTorch. .. code-block:: python @@ -502,9 +501,9 @@ Let's look at achieving model inference using these types of models. # Token generation print(pipe("What is a large language model?")[0]["generated_text"]) -If using multiple accelerators, see +If using multiple GPUs, see :ref:`Multi-accelerator fine-tuning and inference ` to explore -popular libraries that simplify fine-tuning and inference in a multi-accelerator system. +popular libraries that simplify fine-tuning and inference in a multiple-GPU system. Read more about inference frameworks like vLLM and Hugging Face TGI in :doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`. diff --git a/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst b/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst index 2136efa84..e3b9a761a 100644 --- a/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst +++ b/docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst @@ -45,7 +45,7 @@ ROCm provides two different implementations of Flash Attention 2 modules. They c # Install from source git clone https://github.com/ROCm/flash-attention.git cd flash-attention/ - GPU_ARCHS=gfx942 python setup.py install #MI300 series + GPU_ARCHS=gfx942 python setup.py install #MI300 Series Hugging Face Transformers can easily deploy the CK Flash Attention 2 module by passing an argument ``attn_implementation="flash_attention_2"`` in the ``from_pretrained`` class. @@ -526,7 +526,7 @@ follow these instructions: python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning split_table_batched_embeddings_test.py To run the FBGEMM_GPU ``uvm`` test, use these commands. These tests only support the AMD MI210 and -more recent accelerators. +more recent GPUs. .. code-block:: shell diff --git a/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst b/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst index 8a14f42cf..f46c60729 100644 --- a/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst +++ b/docs/how-to/rocm-for-ai/inference-optimization/model-quantization.rst @@ -7,7 +7,7 @@ Model quantization techniques ***************************** Quantization reduces the model size compared to its native full-precision version, making it easier to fit large models -onto accelerators or GPUs with limited memory usage. This section explains how to perform LLM quantization using AMD Quark, GPTQ +onto GPUs with limited memory usage. This section explains how to perform LLM quantization using AMD Quark, GPTQ and bitsandbytes on AMD Instinct hardware. .. 
_quantize-llms-quark: @@ -311,7 +311,7 @@ ExLlama-v2 support ExLlama is a Python/C++/CUDA implementation of the Llama model that is designed for faster inference with 4-bit GPTQ weights. The ExLlama kernel is activated by default when users create a ``GPTQConfig`` object. To -boost inference speed even further on Instinct accelerators, use the ExLlama-v2 +boost inference speed even further on Instinct GPUs, use the ExLlama-v2 kernels by configuring the ``exllama_config`` parameter as the following. .. code-block:: python @@ -332,7 +332,7 @@ The `ROCm-aware bitsandbytes `_ library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizer, matrix multiplication, and 8-bit and 4-bit quantization functions. The library includes quantization primitives for 8-bit and 4-bit operations through ``bitsandbytes.nn.Linear8bitLt`` and ``bitsandbytes.nn.Linear4bit`` and 8-bit optimizers through the -``bitsandbytes.optim`` module. These modules are supported on AMD Instinct accelerators. +``bitsandbytes.optim`` module. These modules are supported on AMD Instinct GPUs. Installing bitsandbytes ----------------------- diff --git a/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md b/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md index cc68823f3..75d151430 100644 --- a/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md +++ b/docs/how-to/rocm-for-ai/inference-optimization/optimizing-with-composable-kernel.md @@ -9,13 +9,13 @@ myst: The AMD ROCm Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions. -This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X accelerators using CK. +This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X GPUs using CK. ## High-level overview: a CK GEMM instance GEMM is a fundamental block in linear algebra, machine learning, and deep neural networks. It is defined as the operation: {math}`E = α \times (A \times B) + β \times (D)`, with A and B as matrix inputs, α and β as scalar inputs, and D as a pre-existing matrix. -Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct accelerators for the computation of GEMM. The implementation of the instance contains two phases: +Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. 
The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct GPUs for the computation of GEMM. The implementation of the instance contains two phases: - [Template parameter definition](#template-parameter-definition) - [Instantiating and running the templated kernel](#instantiating-and-running-the-templated-kernel) @@ -108,7 +108,7 @@ These parameters include Block Size, M/N/K Per Block, M/N per XDL, AK1, BK1, etc - Block Size determines the number of threads in the thread block. - M/N/K Per Block determines the size of tile that each thread block is responsible for calculating. -- M/N Per XDL refers to M/N size for Instinct accelerator Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis. +- M/N Per XDL refers to M/N size for Instinct GPU Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis. - A/B K1 is related to the data type. It can be any value ranging from 1 to K Per Block. To achieve the optimal load/store performance, 128bit per load is suggested. In addition, the A/B loading parameters must be changed accordingly to match the A/B K1 value; otherwise, it will result in compilation errors. Conditions for achieving computational load balancing on different hardware platforms can vary. @@ -133,7 +133,7 @@ Templated kernel launching consists of kernel instantiation, making arguments by ## Developing fused INT8 kernels for SmoothQuant models -[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct accelerators are shown in the following table. +[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct GPUs are shown in the following table. :::{table} Functionalities used to implement SmoothQuant model inference. @@ -164,7 +164,7 @@ The CK library contains many fundamental instances that implement different func Second, consider whether the format of input data meets your actual calculation needs. For SQ models, the 8-bit integer data format (INT8) is applied for matrix calculations. -Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct accelerators after being compiled and cannot run on Radeon-series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances. +Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct GPUs after being compiled and cannot run on Radeon-Series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances. Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_kernel/tree/develop/example/24_batched_gemm) as the fundamental instance to implement the functionalities in the previous table. 
@@ -435,7 +435,7 @@ The implementation architecture of running SmoothQuant models on MI300X GPUs is ### Figure 7 ================ --> ```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg -The implementation architecture of running SmoothQuant models on AMD MI300X accelerators. +The implementation architecture of running SmoothQuant models on AMD MI300X GPUs. ``` For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-smoothquant), each decoder layer contains three major components: attention calculation, layer normalization, and linear transformation in fully connected layers. The corresponding implementation classes for these components are: @@ -447,21 +447,21 @@ For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-s These classes' underlying implementation logits will harness the functions in previous table. Note that for the example, the `LayerNormQ` module is implemented by the torch native module. Testing environment: -The hardware platform used for testing equips with 256 AMD EPYC 9534 64-Core Processor, 8 AMD Instinct MI300X accelerators and 1.5T memory. The testing was done in a publicly available Docker image from Docker Hub: +The hardware platform used for testing is equipped with 256 AMD EPYC 9534 64-Core Processor, 8 AMD Instinct MI300X GPUs and 1.5T memory. The testing was done in a publicly available Docker image from Docker Hub: [`rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`](https://hub.docker.com/layers/rocm/pytorch/rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2/images/sha256-f6ea7cee8aae299c7f6368187df7beed29928850c3929c81e6f24b34271d652b) The tested models are OPT-1.3B, 2.7B, 6.7B and 13B FP16 models and the corresponding SmoothQuant INT8 OPT models were obtained from Hugging Face. Note that since the default values were used for the tunable parameters of the fundamental instance, the performance of the INT8 kernel is suboptimal. -Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size. +Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X GPU. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size. ```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg -Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. +Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X GPU. ``` For accuracy comparisons between the original FP16 and INT8 models, the evaluation is done by using the first 1,000 samples from the LAMBADA dataset's validation set.
We employ the same Last Token Prediction Accuracy method introduced in [SmoothQuant Real-INT8 Inference for PyTorch](https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_opt_real_int8_demo.ipynb) as our evaluation metric. The comparison results are shown in Table 2. @@ -482,4 +482,4 @@ CK provides a rich set of template parameters for generating flexible accelerate CK supports multiple instruction sets of AMD Instinct GPUs, operator fusion and different data precisions. Its composability helps users quickly construct operator performance verification. -With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD accelerator platforms. +With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD GPU platforms. diff --git a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst index bc9463f58..30e86e277 100644 --- a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst +++ b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst @@ -1,15 +1,15 @@ .. meta:: - :description: Learn about workload tuning on AMD Instinct MI300X accelerators for optimal performance. + :description: Learn about workload tuning on AMD Instinct MI300X GPUs for optimal performance. :keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm, environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL, - MIOpen, accelerator, GPU, resource utilization + MIOpen, GPU, resource utilization ***************************************** AMD Instinct MI300X workload optimization ***************************************** This document provides guidelines for optimizing the performance of AMD -Instinct™ MI300X accelerators, with a particular focus on GPU kernel +Instinct™ MI300X GPUs, with a particular focus on GPU kernel programming, high-performance computing (HPC), and deep learning operations using PyTorch. It delves into specific workloads such as :ref:`model inference `, offering strategies to @@ -25,7 +25,7 @@ Workload tuning strategy By following a structured approach, you can systematically address performance issues and enhance the efficiency of your workloads on AMD Instinct -MI300X accelerators. +MI300X GPUs. Measure the current workload ---------------------------- @@ -86,7 +86,7 @@ Optimize model inference with vLLM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ vLLM provides tools and techniques specifically designed for efficient model -inference on AMD Instinct MI300X accelerators. See :ref:`fine-tuning-llms-vllm` +inference on AMD Instinct MI300X GPUs. See :ref:`fine-tuning-llms-vllm` for installation guidance. Optimizing performance with vLLM involves configuring tensor parallelism, leveraging advanced features, and ensuring efficient execution. Here’s how to optimize vLLM performance: @@ -239,7 +239,7 @@ benchmarking process. With AMD's profiling tools, developers are able to gain important insight into how efficiently their application is using hardware resources and effectively diagnose potential bottlenecks contributing to poor performance. Developers -working with AMD Instinct accelerators have multiple tools depending on their specific profiling needs; these include: +working with AMD Instinct GPUs have multiple tools depending on their specific profiling needs; these include: * :ref:`ROCProfiler ` @@ -257,11 +257,11 @@ metrics, commonly called *performance counters*. 
These counters quantify the per showcasing which pieces of the computational pipeline and memory hierarchy are being utilized. Your ROCm installation contains a script or executable command called ``rocprof`` which provides the ability to list all -available hardware counters for your specific accelerator or GPU, and run applications while collecting counters during +available hardware counters for your specific GPU, and run applications while collecting counters during their execution. This ``rocprof`` utility also depends on the :doc:`ROCTracer and ROC-TX libraries `, giving it the -ability to collect timeline traces of the accelerator software stack as well as user-annotated code regions. +ability to collect timeline traces of the GPU software stack as well as user-annotated code regions. .. note:: @@ -276,16 +276,16 @@ ROCm Compute Profiler ^^^^^^^^^^^^^^^^^^^^^ :doc:`ROCm Compute Profiler ` is a system performance profiler for high-performance computing (HPC) and -machine learning (ML) workloads using Instinct accelerators. Under the hood, ROCm Compute Profiler uses +machine learning (ML) workloads using Instinct GPUs. Under the hood, ROCm Compute Profiler uses :ref:`ROCProfiler ` to collect hardware performance counters. The ROCm Compute Profiler tool performs system profiling based on all approved hardware counters for Instinct -accelerator architectures. It provides high level performance analysis features including System Speed-of-Light, IP +GPU architectures. It provides high level performance analysis features including System Speed-of-Light, IP block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more. ROCm Compute Profiler takes the guesswork out of profiling by removing the need to provide text input files with lists of counters to collect and analyze raw CSV output files as is the case with ROCProfiler. Instead, ROCm Compute Profiler automates the collection of all available hardware counters in one command and provides graphical interfaces to help users understand and -analyze bottlenecks and stressors for their computational workloads on AMD Instinct accelerators. +analyze bottlenecks and stressors for their computational workloads on AMD Instinct GPUs. .. note:: @@ -411,7 +411,7 @@ for additional performance tips. :ref:`fine-tuning-llms-vllm` describes vLLM usage with ROCm. ROCm provides a prebuilt optimized Docker image for validating the performance -of LLM inference with vLLM on MI300X series accelerators. The Docker image includes +of LLM inference with vLLM on MI300X Series GPUs. The Docker image includes ROCm, vLLM, and PyTorch. For more information, see :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. @@ -449,7 +449,7 @@ Maximizing vLLM instances on a single node The general guideline is to maximize per-node throughput by running as many vLLM instances as possible. However, running too many instances might lead to insufficient memory for the KV-cache, which can affect performance. -The Instinct MI300X accelerator is equipped with 192GB of HBM3 memory capacity and bandwidth. +The Instinct MI300X GPU is equipped with 192 GB of HBM3 memory capacity and bandwidth. For models that fit in one GPU -- to maximize the accumulated throughput -- you can run as many as eight vLLM instances simultaneously on one MI300X node (with eight GPUs). To do so, use the GPU isolation environment variable ``CUDA_VISIBLE_DEVICES``. 
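For illustration, the following sketch starts one single-GPU vLLM API server per device on an eight-GPU MI300X node, using ``CUDA_VISIBLE_DEVICES`` for GPU isolation as described above. The model path and ports are placeholders, and the entry point mirrors the vLLM commands used elsewhere in this documentation; adjust both for your vLLM version and deployment.

.. code-block:: shell

   # Start eight vLLM API servers, one per MI300X GPU on the node.
   # /path/to/model is a placeholder for a model that fits in a single GPU.
   MODEL=/path/to/model
   for GPU in $(seq 0 7); do
       CUDA_VISIBLE_DEVICES=${GPU} \
       python -m vllm.entrypoints.api_server \
           --model "${MODEL}" \
           --dtype float16 \
           --port $((8000 + GPU)) &
   done
   wait  # keep the shell attached to the running servers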
@@ -468,7 +468,7 @@ The total throughput achieved by running ``N`` instances of vLLM is generally mu single vLLM instance across ``N`` GPUs simultaneously (that is, configuring ``tensor_parallel_size`` as N or using the ``-tp`` N option, where ``1 < N ≤ 8``). -vLLM on MI300X accelerators can run a variety of model weights, including Llama 2 (7b, 13b, 70b), Llama 3 (8b, 70b), Qwen2 (7b, 72b), Mixtral-8x7b, Mixtral-8x22b, and so on. +vLLM on MI300X GPUs can run a variety of model weights, including Llama 2 (7b, 13b, 70b), Llama 3 (8b, 70b), Qwen2 (7b, 72b), Mixtral-8x7b, Mixtral-8x22b, and so on. Notable configurations include Llama2-70b and Llama3-70b models on a single MI300X GPU, and the Llama3.1 405b model can fit on one single node with 8 MI300X GPUs. .. _mi300x-vllm-gpu-memory-utilization: @@ -917,7 +917,7 @@ ROCm library tuning involves optimizing the performance of routine computational operations (such as ``GEMM``) provided by ROCm libraries like :ref:`hipBLASLt `, :ref:`Composable Kernel `, :ref:`MIOpen `, and :ref:`RCCL `. This tuning aims -to maximize efficiency and throughput on Instinct MI300X accelerators to gain +to maximize efficiency and throughput on Instinct MI300X GPUs to gain improved application performance. .. _mi300x-library-gemm: @@ -1451,7 +1451,7 @@ you can only use a fraction of the potential bandwidth on the node. The following figure shows an :doc:`MI300X node-level architecture ` of a system with AMD EPYC processors in a dual-socket configuration and eight -AMD Instinct MI300X accelerators. The MI300X OAMs attach to the host system via +AMD Instinct MI300X GPUs. The MI300X OAMs attach to the host system via PCIe Gen 5 x16 links (yellow lines). The GPUs use seven high-bandwidth, low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system. @@ -1460,7 +1460,7 @@ low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected .. figure:: ../../../data/shared/mi300-node-level-arch.png - MI300 series node-level architecture showing 8 fully interconnected MI300X + MI300 Series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIe switches via re-timers and HGX connectors. @@ -1653,7 +1653,7 @@ Auto-tunable kernel configuration involves adjusting memory access and computati resources assigned to each compute unit. It encompasses the usage of :ref:`LDS `, register, and task scheduling on a compute unit. -The accelerator or GPU contains global memory, local data share (LDS), and +The GPU contains global memory, local data share (LDS), and registers. Global memory has high access latency, but is large. LDS access has much lower latency, but is smaller. It is a fast on-CU software-managed memory that can be used to efficiently share data between all work items in a block. @@ -1666,11 +1666,11 @@ Register access is the fastest yet smallest among the three. Schematic representation of a CU in the CDNA2 or CDNA3 architecture. The following is a list of kernel arguments used for tuning performance and -resource allocation on AMD accelerators, which helps in optimizing the +resource allocation on AMD GPUs, which helps in optimizing the efficiency and throughput of various computational kernels. ``num_stages=n`` - Adjusts the number of pipeline stages for different types of kernels. On AMD accelerators, set ``num_stages`` + Adjusts the number of pipeline stages for different types of kernels. 
On AMD GPUs, set ``num_stages`` according to the following rules: * For kernels with a single GEMM, set to ``2``. @@ -1697,15 +1697,15 @@ efficiency and throughput of various computational kernels. * The occupancy of the kernel is limited by VGPR usage, and * The current VGPR usage is only a few above a boundary in - :ref:`Occupancy related to VGPR usage in an Instinct MI300X accelerator `. + :ref:`Occupancy related to VGPR usage in an Instinct MI300X GPU `. .. _mi300x-occupancy-vgpr-table: .. figure:: ../../../data/shared/occupancy-vgpr.png - :alt: Occupancy related to VGPR usage in an Instinct MI300X accelerator. + :alt: Occupancy related to VGPR usage in an Instinct MI300X GPU. :align: center - Occupancy related to VGPRs usage on an Instinct MI300X accelerator + Occupancy related to VGPR usage on an Instinct MI300X GPU For example, according to the table, each Execution Unit (EU) has 512 available VGPRs, which are allocated in blocks of 16. If the current VGPR usage is 170, @@ -1730,7 +1730,7 @@ VGPR usage so that it might fit 3 waves per EU. - ``matrix_instr_nonkdim = 32``: ``mfma_32x32`` is used. - For GEMM kernels on an MI300X accelerator, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large + For GEMM kernels on an MI300X GPU, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large tile/GEMM sizes. @@ -1749,7 +1749,7 @@ the number of CUs a kernel can distribute its task across. XCD-level system architecture showing 40 compute units, each with 32 KB L1 cache, a unified compute system with 4 ACE compute - accelerators, shared 4MB of L2 cache, and a hardware scheduler (HWS). + engines, shared 4MB of L2 cache, and a hardware scheduler (HWS). You can query hardware resources with the command ``rocminfo`` in the ``/opt/rocm/bin`` directory. For instance, query the number of CUs, number of diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst index 68d7f66e7..77d583be6 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.0-20250812.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: .. list-table:: :header-rows: 1 @@ -47,7 +47,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for -MI300X series accelerators. +MI300X Series GPUs.
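As a quick orientation before the version notes below, a typical way to start this prebuilt environment is sketched here. The image reference ``rocm/vllm:<tag>`` is a placeholder -- use the exact repository and tag from the Docker Hub page linked above -- and the device and security options are the usual ones for exposing AMD GPUs to a container.

.. code-block:: shell

   # Pull the prebuilt ROCm vLLM image; replace <tag> with the tag listed on Docker Hub.
   docker pull rocm/vllm:<tag>

   # Run the container with GPU access (ROCm containers need /dev/kfd and /dev/dri).
   docker run -it --rm \
       --device=/dev/kfd --device=/dev/dri \
       --group-add video \
       --ipc=host --network=host \
       --security-opt seccomp=unconfined \
       rocm/vllm:<tag>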
What's new ========== @@ -139,7 +139,7 @@ page provides reference throughput and serving measurements for inferencing popu The performance data presented in `Performance results with AMD ROCm software `_ only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -424,7 +424,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst index a68618338..d368d0207 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.10.1-20250909.rst @@ -21,8 +21,8 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: .. list-table:: :header-rows: 1 @@ -38,7 +38,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for -MI300X series accelerators. +MI300X Series GPUs. What's new ========== @@ -430,7 +430,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for a brief introduction to vLLM and optimization strategies. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst index 8a03f95d2..ad42d5a1d 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.4.3.rst @@ -1,7 +1,7 @@ :orphan: ..
meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the unified ROCm Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -18,9 +18,9 @@ vLLM inference performance testing The `ROCm vLLM Docker `_ image offers a prebuilt, optimized environment designed for validating large language model -(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This +(LLM) inference performance on the AMD Instinct™ MI300X GPU. This ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the -MI300X accelerator and includes the following components: +MI300X GPU and includes the following components: * `ROCm 6.2.0 `_ @@ -31,7 +31,7 @@ MI300X accelerator and includes the following components: * Tuning files (in CSV format) With this Docker image, you can quickly validate the expected inference -performance numbers on the MI300X accelerator. This topic also provides tips on +performance numbers on the MI300X GPU. This topic also provides tips on optimizing performance with popular AI models. .. _vllm-benchmark-vllm: @@ -51,7 +51,7 @@ Getting started =============== Use the following procedures to reproduce the benchmark results on an -MI300X accelerator with the prebuilt vLLM Docker image. +MI300X GPU with the prebuilt vLLM Docker image. .. _vllm-benchmark-get-started: @@ -267,7 +267,7 @@ Options .. _vllm-benchmark-run-benchmark-v043: -Running the benchmark on the MI300X accelerator +Running the benchmark on the MI300X GPU ----------------------------------------------- Here are some examples of running the benchmark with various options. @@ -328,7 +328,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - To learn how to run community models from Hugging Face on AMD GPUs, see :doc:`Running models from Hugging Face `. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst index e3111c7fa..153cf305e 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.4.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the unified ROCm Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -18,9 +18,9 @@ vLLM inference performance testing The `ROCm vLLM Docker `_ image offers a prebuilt, optimized environment designed for validating large language model -(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This +(LLM) inference performance on the AMD Instinct™ MI300X GPU. 
This ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the -MI300X accelerator and includes the following components: +MI300X GPU and includes the following components: * `ROCm 6.2.1 `_ @@ -31,7 +31,7 @@ MI300X accelerator and includes the following components: * Tuning files (in CSV format) With this Docker image, you can quickly validate the expected inference -performance numbers on the MI300X accelerator. This topic also provides tips on +performance numbers on the MI300X GPU. This topic also provides tips on optimizing performance with popular AI models. .. hlist:: @@ -74,7 +74,7 @@ Getting started =============== Use the following procedures to reproduce the benchmark results on an -MI300X accelerator with the prebuilt vLLM Docker image. +MI300X GPU with the prebuilt vLLM Docker image. .. _vllm-benchmark-get-started: @@ -332,7 +332,7 @@ Options .. _vllm-benchmark-run-benchmark-v064: -Running the benchmark on the MI300X accelerator +Running the benchmark on the MI300X GPU ----------------------------------------------- Here are some examples of running the benchmark with various options. @@ -398,7 +398,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - To learn how to run community models from Hugging Face on AMD GPUs, see :doc:`Running models from Hugging Face `. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst index 996a7ddb8..259d02049 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.6.6.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -18,9 +18,9 @@ LLM inference performance validation on AMD Instinct MI300X The `ROCm vLLM Docker `_ image offers a prebuilt, optimized environment for validating large language model (LLM) -inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM +inference performance on the AMD Instinct™ MI300X GPU. This ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the MI300X -accelerator and includes the following components: +GPU and includes the following components: * `ROCm 6.3.1 `_ @@ -29,7 +29,7 @@ accelerator and includes the following components: * `PyTorch 2.7.0 (2.7.0a0+git3a58512) `_ With this Docker image, you can quickly validate the expected inference -performance numbers for the MI300X accelerator. This topic also provides tips on +performance numbers for the MI300X GPU. This topic also provides tips on optimizing performance with popular AI models. For more information, see the lists of :ref:`available models for MAD-integrated benchmarking ` and :ref:`standalone benchmarking `. @@ -47,7 +47,7 @@ Getting started =============== Use the following procedures to reproduce the benchmark results on an -MI300X accelerator with the prebuilt vLLM Docker image. +MI300X GPU with the prebuilt vLLM Docker image. .. 
_vllm-benchmark-get-started: @@ -377,7 +377,7 @@ Options and available models .. _vllm-benchmark-run-benchmark-v066: -Running the benchmark on the MI300X accelerator +Running the benchmark on the MI300X GPU ----------------------------------------------- Here are some examples of running the benchmark with various options. @@ -443,7 +443,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - To learn how to run community models from Hugging Face on AMD GPUs, see :doc:`Running models from Hugging Face `. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst index 2bd3910a0..99f028eb8 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.7.3-20250325.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerator. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPU. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: * `ROCm {{ unified_docker.rocm_version }} `_ @@ -37,7 +37,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for - MI300X series accelerators. + MI300X Series GPUs. .. _vllm-benchmark-available-models-v073: @@ -110,7 +110,7 @@ vLLM inference performance testing The performance data presented in `Performance results with AMD ROCm software `_ only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. Advanced features and known issues ================================== @@ -122,7 +122,7 @@ vLLM inference performance testing =============== Use the following procedures to reproduce the benchmark results on an - MI300X accelerator with the prebuilt vLLM Docker image. + MI300X GPU with the prebuilt vLLM Docker image. .. _vllm-benchmark-get-started: @@ -311,7 +311,7 @@ Further reading see ``_. 
- To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - To learn how to run community models from Hugging Face on AMD GPUs, see :doc:`Running models from Hugging Face `. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst index e096f965d..a87fbe8f0 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.3-20250415.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -18,9 +18,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: * `ROCm {{ unified_docker.rocm_version }} `_ @@ -32,7 +32,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for - MI300X series accelerators. + MI300X Series GPUs. .. _vllm-benchmark-available-models-v083: @@ -105,7 +105,7 @@ vLLM inference performance testing The performance data presented in `Performance results with AMD ROCm software `_ only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. Advanced features and known issues ================================== @@ -327,7 +327,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - To learn how to run community models from Hugging Face on AMD GPUs, see :doc:`Running models from Hugging Face `. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst index 79e6c12dd..5a352d22a 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250513.rst @@ -1,7 +1,7 @@ :orphan: .. 
meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: * `ROCm {{ unified_docker.rocm_version }} `_ @@ -37,7 +37,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for - MI300X series accelerators. + MI300X Series GPUs. .. _vllm-benchmark-available-models-v085-20250513: @@ -114,7 +114,7 @@ vLLM inference performance testing The performance data presented in `Performance results with AMD ROCm software `_ only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. Advanced features and known issues ================================== @@ -333,7 +333,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst index 0bcdf3ed4..0016f9559 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.8.5-20250521.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. 
This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: * `ROCm {{ unified_docker.rocm_version }} `_ @@ -37,7 +37,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for - MI300X series accelerators. + MI300X Series GPUs. .. _vllm-benchmark-available-models-v085-20250521: @@ -114,7 +114,7 @@ vLLM inference performance testing The performance data presented in `Performance results with AMD ROCm software `_ should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. + Instinct MI325X and MI300X GPUs or ROCm software. Advanced features and known issues ================================== @@ -333,7 +333,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst index 81b7edfec..7c575f9da 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.0.1-20250605.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: * `ROCm {{ unified_docker.rocm_version }} `_ @@ -37,7 +37,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for - MI300X series accelerators. + MI300X Series GPUs. .. _vllm-benchmark-available-models-v0901-20250605: @@ -113,7 +113,7 @@ vLLM inference performance testing The performance data presented in `Performance results with AMD ROCm software `_ only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. 
Advanced features and known issues ================================== @@ -332,7 +332,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X GPUs, see `AMD Instinct MI300X system optimization `_ - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst index a482c27c7..393946582 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250702.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: * `ROCm {{ unified_docker.rocm_version }} `_ @@ -37,7 +37,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for - MI300X series accelerators. + MI300X Series GPUs. .. _vllm-benchmark-available-models-20250702: @@ -113,7 +113,7 @@ vLLM inference performance testing The performance data presented in `Performance results with AMD ROCm software `_ only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. Advanced features and known issues ================================== @@ -332,7 +332,7 @@ Further reading see ``_. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `_ + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_ - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. 
diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst index 9f6d001ad..4480e363c 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.9.1-20250715.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate @@ -23,9 +23,9 @@ vLLM inference performance testing The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers a prebuilt, optimized environment for validating large language model (LLM) - inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM - Docker image integrates vLLM and PyTorch tailored specifically for MI300X series - accelerators and includes the following components: + inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM + Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series + GPUs and includes the following components: .. list-table:: :header-rows: 1 @@ -47,7 +47,7 @@ vLLM inference performance testing With this Docker image, you can quickly test the :ref:`expected inference performance numbers ` for -MI300X series accelerators. +MI300X Series GPUs. What's new ========== @@ -145,7 +145,7 @@ page provides reference throughput and latency measurements for inferencing popu The performance data presented in `Performance results with AMD ROCm software `_ only reflects the latest version of this inference benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -429,7 +429,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst index 21ee1b647..414a54b42 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst @@ -1,5 +1,5 @@ .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm PyTorch Docker image. 
:keywords: model, MAD, automation, dashboarding, validate, pytorch @@ -15,7 +15,7 @@ PyTorch inference performance testing {% set model_groups = data.pytorch_inference_benchmark.model_groups %} The `ROCm PyTorch Docker `_ image offers a prebuilt, - optimized environment for testing model inference performance on AMD Instinct™ MI300X series + optimized environment for testing model inference performance on AMD Instinct™ MI300X Series GPUs. This guide demonstrates how to use the AMD Model Automation and Dashboarding (MAD) tool with the ROCm PyTorch container to test inference performance on various models efficiently. @@ -175,7 +175,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`../../inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst index 17e5ea54b..4bbace2e7 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst @@ -22,7 +22,7 @@ improved efficiency and throughput. `SGLang `__ is a high-performance inference and serving engine for large language models (LLMs) and vision models. The ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__ - bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series + bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X Series GPUs. It includes the following software components: .. list-table:: @@ -37,7 +37,7 @@ improved efficiency and throughput. {% endfor %} The following guides on setting up and running SGLang and Mooncake for disaggregated -distributed inference on a Slurm cluster using AMD Instinct MI300X series GPUs backed by +distributed inference on a Slurm cluster using AMD Instinct MI300X Series GPUs backed by Mellanox CX-7 NICs. Prerequisites @@ -236,7 +236,7 @@ Further reading - See the base upstream Docker image on `Docker Hub `__. - To learn more about system settings and management practices to configure your system for - MI300X series GPUs, see `AMD Instinct MI300X system optimization `__. + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `__. - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst index 1722b2018..b955c0c11 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst @@ -1,5 +1,5 @@ .. 
meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and SGLang + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and SGLang :keywords: model, MAD, automation, dashboarding, validate ***************************************************************** @@ -15,8 +15,8 @@ SGLang inference performance testing DeepSeek-R1-Distill-Qwen-32B `SGLang `__ is a high-performance inference and serving engine for large language models (LLMs) and vision models. The ROCm-enabled `SGLang Docker image <{{ docker.docker_hub_url }}>`__ - bundles SGLang with PyTorch, optimized for AMD Instinct MI300X series - accelerators. It includes the following software components: + bundles SGLang with PyTorch, optimized for AMD Instinct MI300X Series + GPUs. It includes the following software components: .. list-table:: :header-rows: 1 @@ -255,7 +255,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - MI300X series accelerators, see `AMD Instinct MI300X system optimization `__. + MI300X Series GPUs, see `AMD Instinct MI300X system optimization `__. - For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`. diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst index 66e7f6621..d8b26325b 100644 --- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst +++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/vllm.rst @@ -1,5 +1,5 @@ .. meta:: - :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the ROCm vLLM Docker image. + :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image. :keywords: model, MAD, automation, dashboarding, validate ********************************** @@ -457,7 +457,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for a brief introduction to vLLM and optimization strategies. diff --git a/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst b/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst index fc5bc7732..65d6ac909 100644 --- a/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst +++ b/docs/how-to/rocm-for-ai/inference/deploy-your-model.rst @@ -44,9 +44,9 @@ Validating vLLM performance --------------------------- ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM -on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV +on the MI300X GPU. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV format. For more information, see the guide to -`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X accelerator `_ +`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X GPU `_ on the ROCm GitHub repository. .. 
_rocm-for-ai-serve-hugging-face-tgi: @@ -61,7 +61,7 @@ The `Hugging Face Text Generation Inference `__. TGI walkthrough diff --git a/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst b/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst index fe84e33d9..0ff11c8a9 100644 --- a/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst +++ b/docs/how-to/rocm-for-ai/inference/hugging-face-models.rst @@ -10,7 +10,7 @@ Running models from Hugging Face transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in developing and deploying AI solutions. -This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs. +This section describes how to run popular community transformer models from Hugging Face on AMD GPUs. .. _rocm-for-ai-hugging-face-transformers: @@ -62,11 +62,11 @@ Using Hugging Face with Optimum-AMD Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack. -For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the +For a deeper dive into using Hugging Face libraries on AMD GPUs, refer to the `Optimum-AMD `_ page on Hugging Face for guidance on using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration. -Hugging Face libraries natively support AMD Instinct accelerators. For other +Hugging Face libraries natively support AMD Instinct GPUs. For other :doc:`ROCm-capable hardware `, support is currently not validated, but most features are expected to work without issues. @@ -139,7 +139,7 @@ To enable `GPTQ `_, hosted wheels are availabl pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ - Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable. + Or, to install from source for AMD GPUs supporting ROCm, specify the ``ROCM_VERSION`` environment variable. .. code-block:: shell diff --git a/docs/how-to/rocm-for-ai/inference/index.rst b/docs/how-to/rocm-for-ai/inference/index.rst index 3c211882b..6eb705141 100644 --- a/docs/how-to/rocm-for-ai/inference/index.rst +++ b/docs/how-to/rocm-for-ai/inference/index.rst @@ -9,7 +9,7 @@ AI inference is a process of deploying a trained machine learning model to make Understanding the ROCm™ software platform’s architecture and capabilities is vital for running AI inference. By leveraging the ROCm platform's capabilities, you can harness the power of high-performance computing and efficient resource management to run inference workloads, leading to faster predictions and classifications on real-time data. -Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X accelerators and illustrate how to deploy trained models in production environments. +Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. 
This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X GPUs and illustrate how to deploy trained models in production environments. The AI Developer Hub contains `AMD ROCm tutorials `_ for training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs. diff --git a/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst b/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst index d1e81d1a9..d23b9a351 100644 --- a/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst +++ b/docs/how-to/rocm-for-ai/inference/llm-inference-frameworks.rst @@ -60,7 +60,7 @@ Installing vLLM vllm-rocm \ bash - 3. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command. + 3. Inside the container, start the API server to run on a single GPU on port 8000 using the following command. .. code-block:: shell @@ -113,7 +113,7 @@ Installing vLLM python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 & 4. To run multiple instances of API Servers, specify different ports for each server, and use ``ROCR_VISIBLE_DEVICES`` to - isolate each instance to a different accelerator. + isolate each instance to a different GPU. For example, to run two API servers, one on port 8000 using GPU 0 and 1, one on port 8001 using GPU 2 and 3, use a a command like the following. @@ -140,7 +140,7 @@ Installing vLLM See :ref:`mi300x-vllm-optimization` for performance optimization tips. ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM - on the MI300X accelerator. The Docker image includes ROCm, vLLM, and PyTorch. + on the MI300X GPU. The Docker image includes ROCm, vLLM, and PyTorch. For more information, see :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. .. _fine-tuning-llms-tgi: @@ -178,7 +178,7 @@ Install TGI .. tab-item:: TGI on a single-accelerator system :sync: single - 2. Inside the container, launch a model using TGI server on a single accelerator. + 2. Inside the container, launch a model using TGI server on a single GPU. .. code-block:: shell @@ -199,7 +199,7 @@ Install TGI .. tab-item:: TGI on a multi-accelerator system - 2. Inside the container, launch a model using TGI server on multiple accelerators (4 in this case). + 2. Inside the container, launch a model using TGI server on multiple GPUs (four in this case). .. code-block:: shell diff --git a/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst b/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst index 739a9c8e8..3d4e44182 100644 --- a/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst +++ b/docs/how-to/rocm-for-ai/system-setup/multi-node-setup.rst @@ -1,6 +1,6 @@ .. meta:: :description: Multi-node setup for AI training - :keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training + :keywords: gpu, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training .. _rocm-for-ai-multi-node-setup: @@ -21,7 +21,7 @@ Before starting, ensure your environment meets the following requirements: * Multi-node networking: your cluster should have a configured multi-node network. 
For setup instructions, see the `Multi-node network configuration for AMD Instinct - accelerators + GPUs `__ guide in the Instinct documentation. @@ -54,8 +54,8 @@ Compile and install the RoCE library If you're using Broadcom NICs, you need to compile and install the RoCE (RDMA over Converged Ethernet) library. See `RoCE cluster network configuration guide -for AMD Instinct accelerators -`__ +for AMD Instinct GPUs +`__ for more information. See the `Ethernet networking guide for AMD @@ -315,6 +315,6 @@ Megatron-LM Further reading =============== -* `Multi-node network configuration for AMD Instinct accelerators `__ +* `Multi-node network configuration for AMD Instinct GPUs `__ * `Ethernet networking guide for AMD Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source `__ diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst index eec785b7b..5871eb8ae 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst @@ -10,7 +10,7 @@ MaxText is a high-performance, open-source framework built on the Google JAX machine learning library to train LLMs at scale. The MaxText framework for ROCm is an optimized fork of the upstream ``__ enabling efficient AI workloads -on AMD MI300X series GPUs. +on AMD MI300X Series GPUs. The MaxText for ROCm training Docker image provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs, @@ -69,7 +69,7 @@ Supported models ================ The following models are pre-optimized for performance on AMD Instinct MI300 -series GPUs. Some instructions, commands, and available training +Series GPUs. Some instructions, commands, and available training configurations in this documentation might vary by model -- select one to get started. @@ -347,7 +347,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst index ebd55be17..584ec53b1 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst @@ -17,13 +17,13 @@ Training a model with Megatron-LM on ROCm The `Megatron-LM framework for ROCm `_ is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series GPUs, Megatron-LM delivers enhanced +Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama, DeepSeek, and Mixtral, enabling developers to train next-generation AI models more efficiently. -AMD provides ready-to-use Docker images for MI300X series GPUs containing +AMD provides ready-to-use Docker images for MI300X Series GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. 
It contains the following software components to accelerate training workloads: @@ -61,7 +61,7 @@ workloads: ================ The following models are supported for training performance benchmarking with Megatron-LM and ROCm - on AMD Instinct MI300X series GPUs. + on AMD Instinct MI300X Series GPUs. Some instructions, commands, and training recommendations in this documentation might vary by model -- select one to get started. @@ -138,7 +138,7 @@ Environment setup ================= Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series GPUs with the AMD Megatron-LM Docker +reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker image. .. _amd-megatron-lm-requirements: @@ -533,7 +533,7 @@ Run training Use the following example commands to set up the environment, configure :ref:`key options `, and run training on -MI300X series GPUs with the AMD Megatron-LM environment. +MI300X Series GPUs with the AMD Megatron-LM environment. Single node training -------------------- diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst index de9e44a8c..e78bceb9a 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst @@ -16,7 +16,7 @@ environment for the MPT-30B model using the ``rocm/pytorch-training:v25.5`` base `Docker image `_ and the `LLM Foundry `_ framework. This environment packages the following software components to train -on AMD Instinct MI300X series accelerators: +on AMD Instinct MI300X Series GPUs: +--------------------------+--------------------------------+ | Software component | Version | @@ -182,7 +182,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst index 8b8dd65bd..4ec711c59 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4.rst @@ -17,10 +17,10 @@ MaxText is a high-performance, open-source framework built on the Google JAX machine learning library to train LLMs at scale. The MaxText framework for ROCm is an optimized fork of the upstream ``__ enabling efficient AI workloads -on AMD MI300X series accelerators. +on AMD MI300X Series GPUs. The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.4``) image -provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators, +provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs, including essential components like JAX, XLA, ROCm libraries, and MaxText utilities. It includes the following software components: @@ -53,7 +53,7 @@ MaxText provides the following key features to train large language models effic .. 
_amd-maxtext-model-support-v254: -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. +The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. * Llama 3.1 8B diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst index e1581423e..221f1a88a 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5.rst @@ -17,10 +17,10 @@ MaxText is a high-performance, open-source framework built on the Google JAX machine learning library to train LLMs at scale. The MaxText framework for ROCm is an optimized fork of the upstream ``__ enabling efficient AI workloads -on AMD MI300X series accelerators. +on AMD MI300X Series GPUs. The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.5``) image -provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators, +provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs, including essential components like JAX, XLA, ROCm libraries, and MaxText utilities. It includes the following software components: @@ -53,7 +53,7 @@ MaxText provides the following key features to train large language models effic .. _amd-maxtext-model-support-v255: -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. +The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. * Llama 3.3 70B diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst index c18b1dfea..1a2bfee91 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst @@ -17,12 +17,12 @@ Training a model with ROCm Megatron-LM The ROCm Megatron-LM framework is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X -accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI +GPUs, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to :ref:`support models ` like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater efficiency. See the GitHub repository at ``__. -For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential +For ease of use, AMD provides a ready-to-use Docker image for MI300X GPUs containing essential components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the following software to accelerate training workloads: @@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e .. _amd-megatron-lm-model-support-24-12: -The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator. +The following models are pre-optimized for performance on the AMD Instinct MI300X GPU. 
* Llama 2 7B @@ -208,14 +208,14 @@ Use the following script to run the RCCL test for four MI300X GPU nodes. Modify .. _mi300x-amd-megatron-lm-training-v2412: -Start training on MI300X accelerators +Start training on MI300X GPUs ===================================== The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1. Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker +reproduce the benchmark results on the MI300X GPUs with the AMD Megatron-LM Docker image. .. _amd-megatron-lm-requirements-v2412: diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst index e039aff8a..be3bb7d64 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.3.rst @@ -15,13 +15,13 @@ Training a model with Megatron-LM for ROCm The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD -GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers +GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and DeepSeek, enabling developers to train next-generation AI models more efficiently. See the GitHub repository at ``__. -AMD provides a ready-to-use Docker image for MI300X accelerators containing +AMD provides a ready-to-use Docker image for MI300X GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. It contains the following software components to accelerate training workloads: @@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e .. _amd-megatron-lm-model-support-25-3: -The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator. +The following models are pre-optimized for performance on the AMD Instinct MI300X GPU. * Llama 2 7B @@ -123,7 +123,7 @@ The pre-built ROCm Megatron-LM environment allows users to quickly validate syst training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2. Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker +reproduce the benchmark results on the MI300X GPUs with the AMD Megatron-LM Docker image. .. _amd-megatron-lm-requirements-v253: @@ -334,7 +334,7 @@ Multi-node training inside a Docker, either install the drivers inside the Docker container or pass the network drivers from the host while creating the Docker container. 
-Start training on AMD Instinct accelerators +Start training on AMD Instinct GPUs =========================================== The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate @@ -345,8 +345,8 @@ can expect the container to perform in the model configurations described in the following section, but other configurations are not validated by AMD. Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI300X series -accelerators with the AMD Megatron-LM Docker image. +to train models, and reproduce the benchmark results on MI300X Series +GPUs with the AMD Megatron-LM Docker image. .. tab-set:: diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst index 9d7c7ecd6..308335f68 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.4.rst @@ -15,13 +15,13 @@ Training a model with Megatron-LM for ROCm The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD -GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers +GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and DeepSeek, enabling developers to train next-generation AI models more efficiently. See the GitHub repository at ``__. -AMD provides a ready-to-use Docker image for MI300X series accelerators containing +AMD provides a ready-to-use Docker image for MI300X Series GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. It contains the following software components to accelerate training workloads: @@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e .. _amd-megatron-lm-model-support-25-4: -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. +The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. * Llama 3.1 8B @@ -105,7 +105,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `__ only reflects the :doc:`latest version of this training benchmarking environment <../megatron-lm>`. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -124,7 +124,7 @@ The prebuilt ROCm Megatron-LM environment allows users to quickly validate syste training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2. Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker +reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker image. .. 
_amd-megatron-lm-requirements-v254: @@ -367,7 +367,7 @@ Multi-node training # Specify which RDMA interfaces to use for communication export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 -Start training on AMD Instinct accelerators +Start training on AMD Instinct GPUs =========================================== The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate @@ -378,8 +378,8 @@ can expect the container to perform in the model configurations described in the following section, but other configurations are not validated by AMD. Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI300X series -accelerators with the AMD Megatron-LM Docker image. +to train models, and reproduce the benchmark results on MI300X Series +GPUs with the AMD Megatron-LM Docker image. .. tab-set:: diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst index f2a81b7c6..4144e0cfd 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.5.rst @@ -16,13 +16,13 @@ Training a model with Megatron-LM for ROCm The `Megatron-LM framework for ROCm `_ is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced +Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama, DeepSeek, and Mixtral, enabling developers to train next-generation AI models more efficiently. -AMD provides a ready-to-use Docker image for MI300X series accelerators containing +AMD provides a ready-to-use Docker image for MI300X Series GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. It contains the following software components to accelerate training workloads: @@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e .. _amd-megatron-lm-model-support-v255: -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. +The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.5-benchmark-models.yaml @@ -131,7 +131,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `__ only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. 
System validation ================= @@ -154,7 +154,7 @@ Environment setup ================= Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker +reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker image. .. _amd-megatron-lm-requirements-v255: @@ -536,7 +536,7 @@ Run training Use the following example commands to set up the environment, configure :ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. +MI300X Series GPUs with the AMD Megatron-LM environment. Single node training ^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst index 32d72311b..9d749dd84 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.6.rst @@ -16,13 +16,13 @@ Training a model with Megatron-LM for ROCm The `Megatron-LM framework for ROCm `__ is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced +Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama, DeepSeek, and Mixtral, enabling developers to train next-generation AI models more efficiently. -AMD provides ready-to-use Docker images for MI300X series accelerators containing +AMD provides ready-to-use Docker images for MI300X Series GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. It contains the following software components to accelerate training workloads: @@ -65,7 +65,7 @@ workloads: .. _amd-megatron-lm-model-support-v256: - The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. + The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. Supported models ================ @@ -124,7 +124,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `__ only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -147,7 +147,7 @@ Environment setup ================= Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker +reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker image. .. _amd-megatron-lm-requirements-v256: @@ -589,7 +589,7 @@ Run training Use the following example commands to set up the environment, configure :ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. +MI300X Series GPUs with the AMD Megatron-LM environment. 
Single node training -------------------- diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst index b804abdf7..00105adb1 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.7.rst @@ -22,13 +22,13 @@ Training a model with Megatron-LM for ROCm The `Megatron-LM framework for ROCm `_ is a specialized fork of the robust Megatron-LM, designed to enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD -Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced +Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI workloads. It is purpose-built to support models like Llama, DeepSeek, and Mixtral, enabling developers to train next-generation AI models more efficiently. -AMD provides ready-to-use Docker images for MI300X series accelerators containing +AMD provides ready-to-use Docker images for MI300X Series GPUs containing essential components, including PyTorch, ROCm libraries, and Megatron-LM utilities. It contains the following software components to accelerate training workloads: @@ -66,7 +66,7 @@ workloads: ================ The following models are supported for training performance benchmarking with Megatron-LM and ROCm - on AMD Instinct MI300X series accelerators. + on AMD Instinct MI300X Series GPUs. Some instructions, commands, and training recommendations in this documentation might vary by model -- select one to get started. @@ -120,7 +120,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `__ only reflects the latest version of this training benchmarking environment. - The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software. + The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -143,7 +143,7 @@ Environment setup ================= Use the following instructions to set up the environment, configure the script to train models, and -reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker +reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker image. .. _amd-megatron-lm-requirements-v257: @@ -592,7 +592,7 @@ Run training Use the following example commands to set up the environment, configure :ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. +MI300X Series GPUs with the AMD Megatron-LM environment. 
Single node training -------------------- diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst index 86ccd2e17..91cb48862 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7.rst @@ -15,7 +15,7 @@ Training a model with Primus and Megatron-LM `Primus `__ is a unified and flexible LLM training framework designed to streamline training. It streamlines LLM -training on AMD Instinct accelerators using a modular, reproducible configuration paradigm. +training on AMD Instinct GPUs using a modular, reproducible configuration paradigm. Primus is backend-agnostic and supports multiple training engines -- including Megatron. .. note:: @@ -25,7 +25,7 @@ Primus is backend-agnostic and supports multiple training engines -- including M workloads from Megatron-LM to Primus with Megatron, see :doc:`megatron-lm-primus-migration-guide`. -For ease of use, AMD provides a ready-to-use Docker image for MI300 series accelerators +For ease of use, AMD provides a ready-to-use Docker image for MI300 Series GPUs containing essential components for Primus and Megatron-LM. .. note:: @@ -53,7 +53,7 @@ containing essential components for Primus and Megatron-LM. Supported models ================ -The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. +The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. Some instructions, commands, and training examples in this documentation might vary by model -- select one to get started. @@ -120,7 +120,7 @@ system's configuration. ================= Use the following instructions to set up the environment, configure the script to train models, and - reproduce the benchmark results on MI300X series accelerators with the ``{{ docker.pull_tag }}`` image. + reproduce the benchmark results on MI300X Series GPUs with the ``{{ docker.pull_tag }}`` image. .. _amd-primus-megatron-lm-requirements-v257: @@ -231,7 +231,7 @@ Run training Use the following example commands to set up the environment, configure :ref:`key options `, and run training on -MI300X series accelerators with the AMD Megatron-LM environment. +MI300X Series GPUs with the AMD Megatron-LM environment. Single node training -------------------- diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst index a0e31be9e..be26bb3a7 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3.rst @@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models. The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.3``) image provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following +model on AMD Instinct MI325X and MI300X GPUs. 
It includes the following software components to accelerate training workloads: +--------------------------+--------------------------------+ @@ -44,7 +44,7 @@ software components to accelerate training workloads: Supported models ================ -The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator. +The following models are pre-optimized for performance on the AMD Instinct MI300X GPU. * Llama 3.1 8B @@ -237,7 +237,7 @@ Along with the following datasets: * `bghira/pseudo-camera-10k `_ -Start training on AMD Instinct accelerators +Start training on AMD Instinct GPUs =========================================== The prebuilt PyTorch with ROCm training environment allows users to quickly validate @@ -248,8 +248,8 @@ can expect the container to perform in the model configurations described in the following section, but other configurations are not validated by AMD. Use the following instructions to set up the environment, configure the script -to train models, and reproduce the benchmark results on MI300X series -accelerators with the AMD PyTorch training Docker image. +to train models, and reproduce the benchmark results on MI300X Series +GPUs with the AMD PyTorch training Docker image. Once your environment is set up, use the following commands and examples to start benchmarking. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst index 772c6dc4a..a2148143e 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.4.rst @@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models. The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.4``) image provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following +model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate training workloads: +--------------------------+--------------------------------+ @@ -44,7 +44,7 @@ software components to accelerate training workloads: Supported models ================ -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. +The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. * Llama 3.1 8B @@ -76,7 +76,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `_ should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. + Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -260,7 +260,7 @@ the following section, but other configurations are not validated by AMD. Use the following instructions to set up the environment, configure the script to train models, and reproduce the benchmark results on MI325X and MI300X -accelerators with the AMD PyTorch training Docker image. +GPUs with the AMD PyTorch training Docker image. Once your environment is set up, use the following commands and examples to start benchmarking. 
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst index e68a1092b..c984056fa 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.5.rst @@ -19,7 +19,7 @@ model training with GPU-optimized components for transformer-based models. The `PyTorch for ROCm training Docker `_ (``rocm/pytorch-training:v25.5``) image provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following +model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate training workloads: +--------------------------+--------------------------------+ @@ -45,7 +45,7 @@ software components to accelerate training workloads: Supported models ================ -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. +The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. * Llama 3.3 70B @@ -79,7 +79,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `_ should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. + Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst index f9bc57a43..8499a4b47 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.6.rst @@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models. The `PyTorch for ROCm training Docker `_ (``rocm/pytorch-training:v25.6``) image provides a prebuilt optimized environment for fine-tuning and pretraining a -model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate +model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate training workloads: +--------------------------+--------------------------------+ @@ -44,7 +44,7 @@ training workloads: Supported models ================ -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. +The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.6-benchmark-models.yaml @@ -99,7 +99,7 @@ The following models are pre-optimized for performance on the AMD Instinct MI325 The performance data presented in `Performance results with AMD ROCm software `_ should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. + Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -444,7 +444,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. 
- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst index 43b9a02e5..36a7d490d 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.7.rst @@ -22,7 +22,7 @@ model training with GPU-optimized components for transformer-based models. {% set docker = dockers[0] %} The `PyTorch for ROCm training Docker <{{ docker.docker_hub_url }}>`__ (``{{ docker.pull_tag }}``) image provides a prebuilt optimized environment for fine-tuning and pretraining a - model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate + model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate training workloads: .. list-table:: @@ -41,7 +41,7 @@ model training with GPU-optimized components for transformer-based models. Supported models ================ -The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. +The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. Some instructions, commands, and training recommendations in this documentation might vary by model -- select one to get started. @@ -124,7 +124,7 @@ popular AI models. The performance data presented in `Performance results with AMD ROCm software `_ should not be interpreted as the peak performance achievable by AMD - Instinct MI325X and MI300X accelerators or ROCm software. + Instinct MI325X and MI300X GPUs or ROCm software. System validation ================= @@ -555,7 +555,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst index 65ac5e50c..abd02fdcc 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-megatron.rst @@ -17,7 +17,7 @@ Primus is backend-agnostic and supports multiple training engines -- including M To learn how to migrate workloads from Megatron-LM to Primus with Megatron, see :doc:`previous-versions/megatron-lm-primus-migration-guide`. -For ease of use, AMD provides a ready-to-use Docker image for MI300 series GPUs +For ease of use, AMD provides a ready-to-use Docker image for MI300 Series GPUs containing essential components for Primus and Megatron-LM. 
This Docker is powered by Primus Turbo optimizations for performance; this release adds support for Primus Turbo with optimized attention and grouped GEMM kernels. @@ -47,7 +47,7 @@ with optimized attention and grouped GEMM kernels. Supported models ================ -The following models are pre-optimized for performance on AMD Instinct MI300X series GPUs. +The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs. Some instructions, commands, and training examples in this documentation might vary by model -- select one to get started. @@ -114,7 +114,7 @@ Environment setup {% set docker = dockers[0] %} Use the following instructions to set up the environment, configure the script to train models, and - reproduce the benchmark results on MI300X series GPUs with the ``{{ docker.pull_tag }}`` image. + reproduce the benchmark results on MI300X Series GPUs with the ``{{ docker.pull_tag }}`` image. .. _amd-primus-megatron-lm-requirements: @@ -234,7 +234,7 @@ Run training Use the following example commands to set up the environment, configure :ref:`key options `, and run training on -MI300X series GPUs with the AMD Megatron-LM environment. +MI300X Series GPUs with the AMD Megatron-LM environment. Single node training -------------------- @@ -649,7 +649,7 @@ Further reading Framework for Large Models on AMD GPUs `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst index 5c99776ee..c2f90dd97 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst @@ -21,7 +21,7 @@ Primus now supports the PyTorch torchtitan backend. {% set dockers = data.dockers %} {% set docker = dockers[0] %} For ease of use, AMD provides a ready-to-use Docker image -- ``{{ - docker.pull_tag }}`` -- for MI300X series GPUs containing essential + docker.pull_tag }}`` -- for MI300X Series GPUs containing essential components for Primus and PyTorch training with Primus Turbo optimizations. @@ -296,7 +296,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. - To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst index 88842418e..abfc65376 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst @@ -576,7 +576,7 @@ Further reading - To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide `__. 
- To learn more about system settings and management practices to configure your system for - AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization `_. + AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization `_. - For a list of other ready-made Docker images for AI with ROCm, see `AMD Infinity Hub `_. diff --git a/docs/how-to/rocm-for-ai/training/scale-model-training.rst b/docs/how-to/rocm-for-ai/training/scale-model-training.rst index 405206aaa..2b2ce23e3 100644 --- a/docs/how-to/rocm-for-ai/training/scale-model-training.rst +++ b/docs/how-to/rocm-for-ai/training/scale-model-training.rst @@ -6,9 +6,9 @@ Scaling model training ********************** -To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, a single accelerator or GPU cannot store all the -model parameters required for training. This immense scale presents a fundamental challenge: no single GPU or -accelerator can simultaneously store and process the entire model's parameters during training. PyTorch +To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, a single GPU cannot store all the +model parameters required for training. This immense scale presents a fundamental challenge: no single GPU +can simultaneously store and process the entire model's parameters during training. PyTorch provides an answer to this computational constraint through its distributed training frameworks. .. _rocm-for-ai-pytorch-distributed: @@ -26,9 +26,9 @@ Features in ``torch.distributed`` are categorized into three main components: - `Collective communication `_ In this topic, the focus is on the distributed data-parallelism strategy as it’s the most popular. To get started with DDP, -you need to first understand how to coordinate the model and its training data across multiple accelerators or GPUs. +you need to first understand how to coordinate the model and its training data across multiple GPUs. -The DDP workflow on multiple accelerators or GPUs is as follows: +The DDP workflow on multiple GPUs is as follows: #. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples. @@ -46,7 +46,7 @@ In DDP training, each process or worker owns a replica of the model and processe See the following developer blogs for more in-depth explanations and examples. -* `Multi GPU training with DDP — PyTorch Tutorials `_ +* `Multi GPU training with DDP — PyTorch Tutorials `_ * `Building a decoder transformer model on AMD GPUs — ROCm Blogs `_ @@ -82,7 +82,7 @@ the training pillar. See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs `_ for a detailed example of -training with DeepSpeed on an AMD accelerator or GPU. +training with DeepSpeed on an AMD GPU. .. _rocm-for-ai-automatic-mixed-precision: @@ -95,7 +95,7 @@ can take to reduce training time and memory usage through `automatic mixed preci See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs `_ -for more information about running AMP on an AMD accelerator. +for more information about running AMP on an AMD Instinct-Series GPU. .. _rocm-for-ai-fine-tune: @@ -107,7 +107,7 @@ example, LoRA, QLoRA, PEFT, and FSDP. Learn more about challenges and solutions for model fine-tuning in :doc:`../fine-tuning/index`. -The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU. 
+The following developer blogs showcase examples of fine-tuning a model on an AMD GPU. * Fine-tuning Llama2 with LoRA diff --git a/docs/how-to/rocm-for-hpc/index.rst b/docs/how-to/rocm-for-hpc/index.rst index 064680ab4..36e3e592b 100644 --- a/docs/how-to/rocm-for-hpc/index.rst +++ b/docs/how-to/rocm-for-hpc/index.rst @@ -7,14 +7,14 @@ Using ROCm for HPC ****************** The ROCm open-source software stack is optimized to extract high-performance -computing (HPC) workload performance from AMD Instinct™ accelerators +computing (HPC) workload performance from AMD Instinct™ GPUs while maintaining compatibility with industry software frameworks. ROCm enhances support and access for developers by providing streamlined and improved tools that significantly increase productivity. Being open-source, ROCm fosters innovation, differentiation, and collaboration within the developer community, making it a powerful and accessible solution for leveraging the full -potential of AMD accelerators' capabilities in diverse computational +potential of AMD GPUs' capabilities in diverse computational applications. * For more information, see :doc:`What is ROCm? <../../what-is-rocm>`. @@ -24,7 +24,7 @@ applications. and operating system support. Some of the most popular HPC frameworks are part of the ROCm platform, including -those to help parallelize operations across multiple accelerators and servers, +those to help parallelize operations across multiple GPUs and servers, handle memory hierarchies, and solve linear systems. .. image:: ../../data/how-to/rocm-for-hpc/hpc-stack-2024_6_20.png @@ -38,7 +38,7 @@ science, genomics, geophysics, molecular dynamics, and physics computing. Refer to the resources in the following table for instructions on building, running, and deploying these applications on ROCm-capable systems with AMD -Instinct accelerators. Each build container provides parameters to specify +Instinct GPUs. Each build container provides parameters to specify different source code branches, release versions of ROCm, OpenMPI, UCX, and Ubuntu versions. @@ -96,7 +96,7 @@ Ubuntu versions. * - - `QUDA `_ - Library designed for efficient lattice QCD computations on - accelerators. It includes optimized Dirac operators and a variety of + GPUs. It includes optimized Dirac operators and a variety of fermion solvers and conjugate gradient (CG) implementations, enhancing performance and accuracy in lattice QCD simulations. diff --git a/docs/how-to/system-optimization/index.rst b/docs/how-to/system-optimization/index.rst index d8e46b52a..60e6e9a1e 100644 --- a/docs/how-to/system-optimization/index.rst +++ b/docs/how-to/system-optimization/index.rst @@ -1,6 +1,6 @@ .. meta:: :description: Learn about AMD hardware optimization for HPC-specific and workstation workloads. - :keywords: high-performance computing, HPC, Instinct accelerators, Radeon, + :keywords: high-performance computing, HPC, Instinct GPUs, Radeon, tuning, tuning guide, AMD, ROCm ******************* diff --git a/docs/how-to/tuning-guides/mi300x/index.rst b/docs/how-to/tuning-guides/mi300x/index.rst index 61bc3c4ee..0a7b98ef4 100644 --- a/docs/how-to/tuning-guides/mi300x/index.rst +++ b/docs/how-to/tuning-guides/mi300x/index.rst @@ -1,7 +1,7 @@ :orphan: .. meta:: - :description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance. + :description: How to configure MI300X GPUs to fully leverage their capabilities and achieve optimal performance. 
:keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning ************************ @@ -10,9 +10,9 @@ AMD MI300X tuning guides The tuning guides in this section provide a comprehensive summary of the necessary steps to properly configure your system for AMD Instinct™ MI300X -accelerators. They include detailed instructions on system settings and +GPUs. They include detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of -these accelerators, thereby achieving optimal performance. +these GPUs, thereby achieving optimal performance. * :doc:`/how-to/rocm-for-ai/inference-optimization/workload` * `AMD Instinct MI300X system optimization `_ diff --git a/docs/index.md b/docs/index.md index 99812010f..be6026ef3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,7 +8,7 @@ myst: # AMD ROCm documentation ROCm is an open-source software platform optimized to extract HPC and AI workload -performance from AMD Instinct accelerators and AMD Radeon GPUs while maintaining +performance from AMD Instinct GPUs and AMD Radeon GPUs while maintaining compatibility with industry software frameworks. For more information, see [What is ROCm?](./what-is-rocm.rst) @@ -64,7 +64,7 @@ ROCm documentation is organized into the following categories: * [ROCm libraries](./reference/api-libraries.md) * [ROCm tools, compilers, and runtimes](./reference/rocm-tools.md) -* [Accelerator and GPU hardware specifications](./reference/gpu-arch-specs.rst) +* [GPU hardware specifications](./reference/gpu-arch-specs.rst) * [Data types and precision support](./reference/precision-support.rst) * [Graph safe support](./reference/graph-safe-support.rst) diff --git a/docs/reference/gpu-arch-specs.rst b/docs/reference/gpu-arch-specs.rst index 7b71435d2..833d1c747 100644 --- a/docs/reference/gpu-arch-specs.rst +++ b/docs/reference/gpu-arch-specs.rst @@ -1,17 +1,17 @@ .. meta:: - :description: AMD Instinct™ accelerator, AMD Radeon PRO™, and AMD Radeon™ GPU architecture information + :description: AMD Instinct™ GPU, AMD Radeon PRO™, and AMD Radeon™ GPU architecture information :keywords: Instinct, Radeon, accelerator, GCN, CDNA, RDNA, GPU, architecture, VRAM, Compute Units, Cache, Registers, LDS, Register File -Accelerator and GPU hardware specifications +GPU hardware specifications =========================================== -The following tables provide an overview of the hardware specifications for AMD Instinct™ accelerators, and AMD Radeon™ PRO and Radeon™ GPUs. +The following tables provide an overview of the hardware specifications for AMD Instinct™ GPUs, and AMD Radeon™ PRO and Radeon™ GPUs. For more information about ROCm hardware compatibility, see the ROCm `Compatibility matrix `_. .. tab-set:: - .. tab-item:: AMD Instinct accelerators + .. tab-item:: AMD Instinct GPUs .. list-table:: :header-rows: 1 diff --git a/docs/reference/gpu-atomics-operation.rst b/docs/reference/gpu-atomics-operation.rst index faa9a8320..de8a6e15b 100644 --- a/docs/reference/gpu-atomics-operation.rst +++ b/docs/reference/gpu-atomics-operation.rst @@ -1,5 +1,5 @@ .. 
meta:: - :description: AMD Instinct accelerator, AMD Radeon PRO, and AMD Radeon GPU + :description: AMD Instinct GPU, AMD Radeon PRO, and AMD Radeon GPU atomics operations information :keywords: Atomics operations, atomic bitwise functions, atomics add, atomics subtraction, atomics exchange, atomics min, atomics max @@ -15,8 +15,8 @@ access to the same memory location could lead to incorrect or undefined behavior. This topic summarizes the support of atomic read-modify-write -(atomicRMW) operations on AMD GPUs and accelerators. This includes gfx9, gfx10, -gfx11, and gfx12 targets and the following series of Instinct™ series: +(atomicRMW) operations on AMD GPUs. This includes gfx9, gfx10, +gfx11, and gfx12 targets and the following Instinct™ Series: - MI100 @@ -79,10 +79,10 @@ Scopes of operations: Support summary ================================================================================ -AMD Instinct accelerators +AMD Instinct GPUs -------------------------------------------------------------------------------- -**MI300 and MI350 series** +**MI300 and MI350 Series** - All atomicRMW operations are forwarded out to the Infinity Fabric. - Infinity Fabric supports common integer and bitwise atomics, FP32 atomic add, @@ -95,7 +95,7 @@ AMD Instinct accelerators It will seem like atomics to the wave, but the CPU sees it as a non-atomic load-op-store sequence. This downgrades system-scope atomics to device-scope. -**MI200 series** +**MI200 Series** - L2 cache and Infinity Fabric both support common integer and bitwise atomics. - L2 cache supports FP32 atomic add, packed-FP16 atomic add, and FP64 add, @@ -272,10 +272,10 @@ The tables selectors or options are the following: - Second-level option: - "No PCIe atomics" means the system does not support PCIe atomics between - the accelerator and peer/host-memory. + the GPU and peer/host-memory. - "PCIe atomics" means the system supports PCIe atomics between the - accelerator and peer/host-memory. + GPU and peer/host-memory. - The third-level option is the memory granularity of the memory target. @@ -306,11 +306,11 @@ The integer type atomic operations that are supported by different hardware. - Max -AMD Instinct accelerators +AMD Instinct GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The integer type atomic operations that are supported by different AMD -Instinct accelerators listed in the following table. +Instinct GPUs listed in the following table. .. @@ -481,11 +481,11 @@ The bitwise atomic operations that are supported by different hardware. 128-bit bitwise Exchange and CAS are not supported on AMD GPUs -AMD Instinct accelerators +AMD Instinct GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The bitwise atomic operations that are supported by different AMD Instinct -accelerators listed in the following table. +GPUs listed in the following table. .. @@ -659,11 +659,11 @@ The float types atomic operations that are supported by different hardware. - Add -AMD Instinct accelerators +AMD Instinct GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The float type atomic operations that are supported by different AMD Instinct -accelerators listed in the following table. +GPUs listed in the following table. .. 
diff --git a/docs/reference/precision-support.rst b/docs/reference/precision-support.rst index 8ee81e4b3..a80e4a5f8 100644 --- a/docs/reference/precision-support.rst +++ b/docs/reference/precision-support.rst @@ -9,7 +9,7 @@ Data types and precision support ************************************************************* -This topic summarizes the data types supported on AMD GPUs and accelerators and +This topic summarizes the data types supported on AMD GPUs and ROCm libraries, along with corresponding :doc:`HIP ` data types. Integral types diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index bfaef7ffe..cbc2e7bce 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -97,9 +97,9 @@ subtrees: subtrees: - entries: - file: how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst - title: Use a single accelerator + title: Use a single GPU - file: how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst - title: Use multiple accelerators + title: Use multiple GPUs - file: how-to/rocm-for-ai/inference/index.rst title: Inference @@ -180,7 +180,7 @@ subtrees: - file: conceptual/gpu-arch/mi300-mi200-performance-counters.rst title: MI300 and MI200 performance counters - file: conceptual/gpu-arch/mi350-performance-counters.rst - title: MI350 series performance counters + title: MI350 Series performance counters - file: conceptual/gpu-arch/mi250.md title: MI250 microarchitecture subtrees: diff --git a/tools/autotag/templates/extra_components/6.1.2.md b/tools/autotag/templates/extra_components/6.1.2.md index 257ac6efe..c0004895a 100644 --- a/tools/autotag/templates/extra_components/6.1.2.md +++ b/tools/autotag/templates/extra_components/6.1.2.md @@ -46,4 +46,4 @@ ROCm SMI for ROCm 6.1.2 #### Fixes * Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112). -* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware. +* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-Series hardware. diff --git a/tools/autotag/templates/highlights/5.3.2.md b/tools/autotag/templates/highlights/5.3.2.md index 61e4478c1..09bcd0d20 100644 --- a/tools/autotag/templates/highlights/5.3.2.md +++ b/tools/autotag/templates/highlights/5.3.2.md @@ -33,10 +33,10 @@ Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV #### AMD Instinct™ MI200 firmware updates -Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators. +Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 GPUs. An updated firmware maintenance bundle consisting of an installation tool and images specific to -AMD Instinct™ MI200 accelerators is under planning and will be available soon. +AMD Instinct™ MI200 GPUs is under planning and will be available soon. #### Known issue with rocThrust and rocPRIM libraries diff --git a/tools/autotag/templates/highlights/5.4.1.md b/tools/autotag/templates/highlights/5.4.1.md index d52a32ae4..465f67031 100644 --- a/tools/autotag/templates/highlights/5.4.1.md +++ b/tools/autotag/templates/highlights/5.4.1.md @@ -50,12 +50,12 @@ fixed in this release. #### AMD Instinct™ MI200 firmware IFWI maintenance update #3 -This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators. +This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 GPUs. 
-After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded +After prolonged periods of operation, certain MI200 Instinct™ GPUs may perform in a degraded way resulting in application failures. -In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware +In this package, AMD delivers a new firmware version for MI200 GPUs and a firmware installation tool – AMD FW FLASH 1.2. | GPU | Productionp part number | SKU | IFWI name | diff --git a/tools/autotag/templates/highlights/5.7.0.md b/tools/autotag/templates/highlights/5.7.0.md index 76219eb2c..fb30103d8 100644 --- a/tools/autotag/templates/highlights/5.7.0.md +++ b/tools/autotag/templates/highlights/5.7.0.md @@ -10,10 +10,10 @@ New features include: * AddressSanitizer for host and device code (GPU) is now available as a beta Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major releases in the ROCm 5 -series. This release is Linux-only. +Series. This release is Linux-only. :::{important} -The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. +The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 Series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure. ::: diff --git a/tools/autotag/templates/highlights/6.0.0.md b/tools/autotag/templates/highlights/6.0.0.md index fe6540493..802fb470d 100644 --- a/tools/autotag/templates/highlights/6.0.0.md +++ b/tools/autotag/templates/highlights/6.0.0.md @@ -3,7 +3,7 @@ ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library support, and improved developer experience. This includes initial enablement of the AMD Instinct™ -MI300 series. Future releases will further enable and optimize this new platform. Key features include: +MI300 Series. Future releases will further enable and optimize this new platform. Key features include: * Improved performance in areas like lower precision math and attention layers. * New hipSPARSELt library to accelerate AI workloads via AMD's sparse matrix core technique. @@ -18,7 +18,7 @@ the [Changelog](https://rocm.docs.amd.com/en/docs-6.0.0/about/CHANGELOG.html). ### OS and GPU support changes -AMD Instinct™ MI300A and MI300X Accelerator support has been enabled for limited operating +AMD Instinct™ MI300A and MI300X GPU support has been enabled for limited operating systems. * Ubuntu 22.04.3 (MI300A and MI300X) diff --git a/tools/autotag/templates/highlights/6.1.1.md b/tools/autotag/templates/highlights/6.1.1.md index 24cd0d77e..16d05eec8 100644 --- a/tools/autotag/templates/highlights/6.1.1.md +++ b/tools/autotag/templates/highlights/6.1.1.md @@ -3,7 +3,7 @@ ROCm™ 6.1.1 introduces minor fixes and improvements to some tools and librarie ### OS support -* ROCm 6.1.1 now supports Oracle Linux. It has been tested against version 8.9 (kernel 5.15.0-205) with AMD Instinct MI300X accelerators. +* ROCm 6.1.1 now supports Oracle Linux. It has been tested against version 8.9 (kernel 5.15.0-205) with AMD Instinct MI300X GPUs. * ROCm 6.1.1 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).
diff --git a/tools/autotag/templates/highlights/6.2.0.md b/tools/autotag/templates/highlights/6.2.0.md index a6323705e..4df7e35a6 100644 --- a/tools/autotag/templates/highlights/6.2.0.md +++ b/tools/autotag/templates/highlights/6.2.0.md @@ -28,7 +28,7 @@ This section introduces notable new features and improvements in ROCm 6.2. See t ROCm 6.2.0 introduces the following new components to the ROCm software stack. - **Omniperf** -- A kernel-level profiling tool for machine learning and high-performance computing (HPC) workloads - running on AMD Instinct accelerators. Omniperf offers comprehensive profiling and advanced analysis via command line + running on AMD Instinct GPUs. Omniperf offers comprehensive profiling and advanced analysis via command line or a GUI dashboard. For more information, see [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest). @@ -141,7 +141,7 @@ For more information, see [Model quantization techniques](https://rocm.docs.amd. #### Improved vLLM support -ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding +ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct GPUs, adding capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama. ROCm 6.2.0 adds support for the following vLLM features: @@ -177,12 +177,12 @@ To enable these experimental new features, see Use the `rocm/vllm` branch when cloning the GitHub repo. The `vllm/ROCm_performance.md` document outlines all the accessible features, and the `vllm/Dockerfile.rocm` file can be used. -### Enhanced performance tuning on AMD Instinct accelerators +### Enhanced performance tuning on AMD Instinct GPUs ROCm is pre-tuned for high-performance computing workloads including large language models, generative AI, and scientific computing. -The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes +The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct GPUs. It includes detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these -accelerators for optimal performance. For more information, see +GPUs for optimal performance. For more information, see [AMD MI300X tuning guides](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/tuning-guides/mi300x/index.html) and [AMD MI300A system optimization](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/system-optimization/mi300x.html). diff --git a/tools/autotag/templates/highlights/6.2.2.md b/tools/autotag/templates/highlights/6.2.2.md index f66e2c098..f20abf3b8 100644 --- a/tools/autotag/templates/highlights/6.2.2.md +++ b/tools/autotag/templates/highlights/6.2.2.md @@ -22,7 +22,7 @@ The following is a significant fix introduced in ROCm 6.2.2. ### Fixed Instinct MI300X error recovery failure -Improved the reliability of AMD Instinct MI300X accelerators in scenarios involving +Improved the reliability of AMD Instinct MI300X GPUs in scenarios involving uncorrectable errors. Previously, error recovery did not occur as expected, potentially leaving the system in an undefined state. This fix ensures that error recovery functions as expected, maintaining system stability. 
diff --git a/tools/autotag/templates/highlights/6.2.4.md b/tools/autotag/templates/highlights/6.2.4.md index 4d8af7762..2f893bd34 100644 --- a/tools/autotag/templates/highlights/6.2.4.md +++ b/tools/autotag/templates/highlights/6.2.4.md @@ -32,7 +32,7 @@ ROCm documentation continues to be updated to provide clearer and more comprehen a wider variety of user needs and use cases. * Added a new GPU cluster networking guide. See - [Cluster network performance validation for AMD Instinct accelerators](https://rocm.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html). + [Cluster network performance validation for AMD Instinct GPUs](https://rocm.docs.amd.com/projects/gpu-cluster-networking/en/latest/index.html). This documentation provides guidelines on validating network configurations in single-node and multi-node environments to attain optimal speed and bandwidth diff --git a/tools/autotag/templates/highlights/6.3.0.md b/tools/autotag/templates/highlights/6.3.0.md index f34343e2d..beb92ee5a 100644 --- a/tools/autotag/templates/highlights/6.3.0.md +++ b/tools/autotag/templates/highlights/6.3.0.md @@ -138,7 +138,7 @@ wider variety of user needs and use cases. documentation](https://rocm.docs.amd.com/projects/Tensile/en/docs-6.3.0/src/index.html). - New documentation has been added to explain the advantages of enabling the IOMMU in passthrough - mode for Instinct accelerators and Radeon GPUs. See [Input-Output Memory Management + mode for Instinct GPUs and Radeon GPUs. See [Input-Output Memory Management Unit](https://rocm.docs.amd.com/en/docs-6.3.0/conceptual/iommu.html). - The HIP documentation has been updated and includes the following new topics: diff --git a/tools/autotag/templates/highlights/6.3.1.md b/tools/autotag/templates/highlights/6.3.1.md index 2f8ecc330..5883116a2 100644 --- a/tools/autotag/templates/highlights/6.3.1.md +++ b/tools/autotag/templates/highlights/6.3.1.md @@ -26,9 +26,9 @@ documentation to verify compatibility and system requirements. The following are notable new features and improvements in ROCm 6.3.1. For changes to individual components, see [Detailed component changes](#detailed-component-changes). -### Per queue resiliency for Instinct MI300 accelerators +### Per queue resiliency for Instinct MI300 GPUs -The AMDGPU driver now includes enhanced resiliency for misbehaving applications on AMD Instinct MI300 accelerators. This helps isolate the impact of misbehaving applications, ensuring other workloads running on the same accelerator are unaffected. +The AMDGPU driver now includes enhanced resiliency for misbehaving applications on AMD Instinct MI300 GPUs. This helps isolate the impact of misbehaving applications, ensuring other workloads running on the same GPU are unaffected. ### ROCm Runfile Installer @@ -38,7 +38,7 @@ ROCm 6.3.1 introduces the ROCm Runfile Installer, with initial support for Ubunt ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases. -* Added documentation on training a model with ROCm Megatron-LM. AMD offers a Docker image for MI300X accelerators +* Added documentation on training a model with ROCm Megatron-LM. AMD offers a Docker image for MI300X GPUs containing essential components to get started, including ROCm libraries, PyTorch, and Megatron-LM utilities. See [Training a model using ROCm Megatron-LM](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/train-a-model.html) to get started. 
diff --git a/tools/autotag/templates/known_issues/6.2.0.md b/tools/autotag/templates/known_issues/6.2.0.md index 81351a9e5..f0f19d497 100644 --- a/tools/autotag/templates/known_issues/6.2.0.md +++ b/tools/autotag/templates/known_issues/6.2.0.md @@ -45,9 +45,9 @@ bash -c "echo taskset -p \$\$" See [issue #3493](https://github.com/ROCm/ROCm/issues/3493) on GitHub. -### Display issues on servers with Instinct MI300-series accelerators when loading AMDGPU driver +### Display issues on servers with Instinct MI300-Series GPUs when loading AMDGPU driver -AMD Instinct MI300-series accelerators and third-party GPUs such as the Matrox G200 have an issue impacting video +AMD Instinct MI300-Series GPUs and third-party GPUs such as the Matrox G200 have an issue impacting video output. The issue was reproduced on a Dell server model PowerEdge XE9680. Servers from other vendors utilizing Matrox G200 cards may be impacted as well. This issue was found with ROCm 6.2.0 but is present in older ROCm versions. diff --git a/tools/autotag/templates/known_issues/6.2.1.md b/tools/autotag/templates/known_issues/6.2.1.md index f6f601d95..3d048f20f 100644 --- a/tools/autotag/templates/known_issues/6.2.1.md +++ b/tools/autotag/templates/known_issues/6.2.1.md @@ -5,7 +5,7 @@ individual components are listed in the [Detailed component changes](detailed-co ### Instinct MI300X GPU recovery failure on uncorrectable errors -For the AMD Instinct MI300X accelerator, GPU recovery resets triggered by uncorrectable errors (UE) might not complete +For the AMD Instinct MI300X GPU, GPU recovery resets triggered by uncorrectable errors (UE) might not complete successfully, which can result in the system being left in an undefined state. A system reboot is needed to recover from this state. Additionally, error logging might fail in these situations, hindering diagnostics. diff --git a/tools/autotag/templates/known_issues/6.3.0.md b/tools/autotag/templates/known_issues/6.3.0.md index 4351a9582..fcd2729fb 100644 --- a/tools/autotag/templates/known_issues/6.3.0.md +++ b/tools/autotag/templates/known_issues/6.3.0.md @@ -5,13 +5,13 @@ issues related to individual components, review the [Detailed component changes] ### Instinct MI300X reports incorrect raw GPU timestamps -On MI300X accelerators, the command processor firmware reports incorrect raw GPU timestamps. This +On MI300X GPUs, the command processor firmware reports incorrect raw GPU timestamps. This issue is under investigation and will be addressed in a future release. See [GitHub issue #4079](https://github.com/ROCm/ROCm/issues/4079). -### Instinct MI300 series: backward weights convolution performance issue +### Instinct MI300 Series: backward weights convolution performance issue A performance issue affects certain tensor shapes during backward weights convolution when using -FP16 or FP32 data types on Instinct MI300 series accelerators. This issue will be addressed in a future ROCm release. +FP16 or FP32 data types on Instinct MI300 Series GPUs. This issue will be addressed in a future ROCm release. See [GitHub issue #4080](https://github.com/ROCm/ROCm/issues/4080). To mitigate the issue during model training, set the following environment variables: @@ -77,7 +77,7 @@ This issue will be addressed in a future ROCm release. See [GitHub issue #4085]( Canny edge detection kernels might access out-of-bounds memory locations while computing gradient intensities on edge pixels. This issue is isolated to -Canny-specific use cases on Instinct MI300 series accelerators. 
This issue is +Canny-specific use cases on Instinct MI300 Series GPUs. This issue is resolved in the [MIVisionX `develop` branch](https://github.com/ROCm/mivisionx) and will be part of a future ROCm release. See [GitHub issue #4086](https://github.com/ROCm/ROCm/issues/4086). diff --git a/tools/autotag/templates/resolved_issues/6.3.1.md b/tools/autotag/templates/resolved_issues/6.3.1.md index ac41d81da..d12a3127a 100644 --- a/tools/autotag/templates/resolved_issues/6.3.1.md +++ b/tools/autotag/templates/resolved_issues/6.3.1.md @@ -3,9 +3,9 @@ The following are previously known issues resolved in this release. For resolved issues related to individual components, review the [Detailed component changes](#detailed-component-changes). -### Instinct MI300 series: backward weights convolution performance issue +### Instinct MI300 Series: backward weights convolution performance issue -Fixed a performance issue affecting certain tensor shapes during backward weights convolution when using FP16 or FP32 data types on Instinct MI300 series accelerators. See [GitHub issue #4080](https://github.com/ROCm/ROCm/issues/4080). +Fixed a performance issue affecting certain tensor shapes during backward weights convolution when using FP16 or FP32 data types on Instinct MI300 Series GPUs. See [GitHub issue #4080](https://github.com/ROCm/ROCm/issues/4080). ### ROCm Compute Profiler and ROCm Systems Profiler post-upgrade issues diff --git a/tools/autotag/templates/resolved_issues/6.3.2.md b/tools/autotag/templates/resolved_issues/6.3.2.md index 0f57c86ad..97dfd4616 100644 --- a/tools/autotag/templates/resolved_issues/6.3.2.md +++ b/tools/autotag/templates/resolved_issues/6.3.2.md @@ -19,7 +19,7 @@ This issue has been fixed in the ROCm 6.3.2 release. See [GitHub issue #4085](ht An issue where Canny edge detection kernels accessed out-of-bounds memory locations while computing gradient intensities on edge pixels has been fixed. This issue was isolated to -Canny-specific use cases on Instinct MI300 series accelerators. See [GitHub issue #4086](https://github.com/ROCm/ROCm/issues/4086). +Canny-specific use cases on Instinct MI300 Series GPUs. See [GitHub issue #4086](https://github.com/ROCm/ROCm/issues/4086). ### AMD VCN instability with rocDecode diff --git a/tools/autotag/templates/support/6.3.1.md b/tools/autotag/templates/support/6.3.1.md index ea458ca45..e7c51a62a 100644 --- a/tools/autotag/templates/support/6.3.1.md +++ b/tools/autotag/templates/support/6.3.1.md @@ -1,8 +1,8 @@ ## Operating system and hardware support changes -ROCm 6.3.1 adds support for Debian 12 (kernel: 6.1). Debian is supported only on AMD Instinct accelerators. See the installation instructions at [Debian native installation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.3.1/install/native-install/debian.html). +ROCm 6.3.1 adds support for Debian 12 (kernel: 6.1). Debian is supported only on AMD Instinct GPUs. See the installation instructions at [Debian native installation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.3.1/install/native-install/debian.html). -ROCm 6.3.1 enables support for AMD Instinct MI325X accelerator. For more information, see [AMD Instinct™ MI325X Accelerators](https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html). +ROCm 6.3.1 enables support for AMD Instinct MI325X GPU. For more information, see [AMD Instinct™ MI325X GPUs](https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html). 
See the [Compatibility matrix](https://rocm.docs.amd.com/en/docs-6.3.1/compatibility/compatibility-matrix.html) diff --git a/tools/autotag/templates/support/6.3.2.md b/tools/autotag/templates/support/6.3.2.md index c9d003052..7b9758ad4 100644 --- a/tools/autotag/templates/support/6.3.2.md +++ b/tools/autotag/templates/support/6.3.2.md @@ -1,6 +1,6 @@ ## Operating system and hardware support changes -ROCm 6.3.2 adds support for Azure Linux 3.0 (kernel: 6.6). Azure Linux is supported only on AMD Instinct accelerators. For more information, see [Azure Linux installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html). +ROCm 6.3.2 adds support for Azure Linux 3.0 (kernel: 6.6). Azure Linux is supported only on AMD Instinct GPUs. For more information, see [Azure Linux installation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html). See the [Compatibility matrix](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html)