diff --git a/docs/.sphinx/_toc.yml.in b/docs/.sphinx/_toc.yml.in
index f78aad5ba..d536f9f32 100644
--- a/docs/.sphinx/_toc.yml.in
+++ b/docs/.sphinx/_toc.yml.in
@@ -134,6 +134,10 @@ subtrees:
       - url: https://rocmdocs.amd.com/projects/rdc/en/{branch}/
         title: ROCm Datacenter Tool
     - file: reference/gpu_arch
+      subtrees:
+      - entries:
+        - file: reference/gpu_arch/mi250
+          title: MI250
   - caption: Understand ROCm
     entries:
       - title: Compiler Disambiguation
diff --git a/docs/data/reference/gpu_arch/image.001.png b/docs/data/reference/gpu_arch/image.001.png
new file mode 100644
index 000000000..befcd73c4
Binary files /dev/null and b/docs/data/reference/gpu_arch/image.001.png differ
diff --git a/docs/data/reference/gpu_arch/image.002.png b/docs/data/reference/gpu_arch/image.002.png
new file mode 100644
index 000000000..ecdadc327
Binary files /dev/null and b/docs/data/reference/gpu_arch/image.002.png differ
diff --git a/docs/data/reference/gpu_arch/image.003.png b/docs/data/reference/gpu_arch/image.003.png
new file mode 100644
index 000000000..611dcf786
Binary files /dev/null and b/docs/data/reference/gpu_arch/image.003.png differ
diff --git a/docs/reference/gpu_arch.md b/docs/reference/gpu_arch.md
index 12197c60f..913443764 100644
--- a/docs/reference/gpu_arch.md
+++ b/docs/reference/gpu_arch.md
@@ -17,3 +17,18 @@
 - [AMD CDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
 - [AMD Vega Architecture Whitepaper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
 - [AMD RDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)
+
+## Architecture Guides
+
+:::::{grid} 1 1 1 1
+:gutter: 1
+
+:::{grid-item-card} AMD Instinct MI200
+A review of the hardware aspects of the AMD Instinct MI250 accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.
+
+- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
+- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
+- [Guide](./gpu_arch/mi250.md)
+
+:::
+:::::
diff --git a/docs/reference/gpu_arch/mi250.md b/docs/reference/gpu_arch/mi250.md
new file mode 100644
index 000000000..6aa289b6b
--- /dev/null
+++ b/docs/reference/gpu_arch/mi250.md
@@ -0,0 +1,141 @@
# AMD Instinct Hardware

This chapter briefly reviews hardware aspects of the AMD Instinct MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

## AMD CDNA 2 Micro-architecture

The micro-architecture of the AMD Instinct MI250 accelerators is based on the
AMD CDNA 2 architecture, which targets compute applications such as HPC,
artificial intelligence (AI), and machine learning (ML) that run on everything
from individual servers to the world's largest exascale supercomputers. The
overall system architecture is designed for extreme scalability and compute
performance.

{numref}`mi250-gcd` shows the components of a single Graphics Compute Die (GCD)
of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™
interfaces and their physical links that connect the GPU die to the other
system-level components of the node (see also the Node-level Architecture
section below). Both interfaces can drive four AMD Infinity Fabric links. One
of the AMD Infinity Fabric links of the controller at the bottom can be
configured as a PCIe link.

Each of the AMD Infinity Fabric links between GPUs can run at up to 25 GT/sec,
which corresponds to a peak transfer bandwidth of 50 GB/sec for a 16-wide link
(two bytes per transaction). The Node-level Architecture section has more
details on the number of AMD Infinity Fabric links and the resulting transfer
rates between the system-level components.
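
The quoted link bandwidth follows directly from the transfer rate and the link
width. The short sketch below only reproduces that arithmetic; the rate and
width are the values stated above, not values queried from hardware.

```cpp
#include <cstdio>

int main() {
    // Values stated above for one AMD Infinity Fabric link between GPUs.
    const double transfers_per_sec  = 25e9; // 25 GT/sec
    const double bytes_per_transfer = 2.0;  // 16-wide link -> two bytes per transaction

    // Peak bandwidth of a single link, per direction.
    const double peak_gb_per_sec = transfers_per_sec * bytes_per_transfer / 1e9;
    std::printf("Peak per-link bandwidth: %.0f GB/sec per direction\n", peak_gb_per_sec);
    return 0;
}
```
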
To the left and the right are memory controllers that attach the High Bandwidth
Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which
offers a peak memory bandwidth of 1.6 TB/sec per GCD.

The execution units of the GPU are depicted in {numref}`mi250-gcd` as Compute
Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further
subdivided into four SIMD units that process SIMD instructions of 16 data
elements per instruction (for the FP64 data type). This enables the CU to
process 64 work items (a so-called “wavefront”) at a peak clock frequency of
1.7 GHz. Therefore, the theoretical maximum FP64 peak performance for vector
instructions is 22.6 TFLOPS per GCD, or 45.3 TFLOPS for the two GCDs of an
MI250 (see {numref}`mi250-perf`). The MI250 compute units also provide
specialized execution units (also called matrix cores), which are geared toward
executing matrix operations like matrix-matrix multiplications. For FP64, the
peak performance of these units amounts to 45.3 TFLOPS per GCD, or 90.5 TFLOPS
per MI250.

:::{figure-md} mi250-gcd

<img src="../../data/reference/gpu_arch/image.001.png" alt="Structure of a single GCD in the AMD Instinct MI250 accelerator.">

Structure of a single GCD in the AMD Instinct MI250 accelerator.
:::

```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
:header-rows: 1
:name: mi250-perf

* - Computation and Data Type
  - FLOPS/CLOCK/CU
  - Peak TFLOPS (TOPS for INT8)
* - Matrix FP64
  - 256
  - 90.5
* - Vector FP64
  - 128
  - 45.3
* - Matrix FP32
  - 256
  - 90.5
* - Packed FP32
  - 256
  - 90.5
* - Vector FP32
  - 128
  - 45.3
* - Matrix FP16
  - 1024
  - 362.1
* - Matrix BF16
  - 1024
  - 362.1
* - Matrix INT8
  - 1024
  - 362.1

```

{numref}`mi250-perf` summarizes the aggregated peak performance of the AMD
Instinct MI250 OCP Open Accelerator Module (OAM; OCP is short for Open Compute
Platform) and its two GCDs for different data types and execution units. The
middle column lists the number of operations a single compute unit can retire
per clock cycle when a SIMD (or matrix) instruction is issued in every cycle.
The third column lists the resulting theoretical peak performance of the whole
OAM module at the 1.7 GHz peak clock; the short sketch at the end of this
section reproduces this arithmetic for the FP64 rows. The theoretical
aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).

:::{figure-md} mi250-arch

<img src="../../data/reference/gpu_arch/image.002.png" alt="Dual-GCD architecture of the AMD Instinct MI250 accelerators.">

Dual-GCD architecture of the AMD Instinct MI250 accelerators.
:::

{numref}`mi250-arch` shows the block diagram of an OAM package that consists
of two GCDs, each of which constitutes one GPU device in the system. The two
GCDs in the package are connected via four AMD Infinity Fabric links running at
a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer bandwidth
between the two GCDs of an OAM in each direction, or 400 GB/sec bidirectional
peak transfer bandwidth.
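
As a quick cross-check of the peak rates in {numref}`mi250-perf`, the sketch
below recomputes the FP64 numbers from the CU count, the peak clock, and the
FLOPS/clock/CU column. The constants are the MI250 values quoted in this
section; nothing is queried from the hardware.

```cpp
#include <cstdio>

// MI250 values quoted in this section (not queried from hardware).
constexpr double kCusPerGcd    = 104.0; // active compute units per GCD
constexpr double kGcdsPerOam   = 2.0;   // two GCDs per MI250 OAM
constexpr double kPeakClockGhz = 1.7;   // peak engine clock

// FLOPS per clock cycle per CU, as listed in the table above.
constexpr double kVectorFp64 = 128.0;   // 4 SIMDs x 16 lanes x 2 (FMA)
constexpr double kMatrixFp64 = 256.0;

// FLOPS/clock/CU x number of CUs x clock (GHz) yields GFLOPS; divide by 1000 for TFLOPS.
double peak_tflops(double flops_per_clock_per_cu, double num_cus) {
    return flops_per_clock_per_cu * num_cus * kPeakClockGhz / 1000.0;
}

int main() {
    std::printf("Vector FP64: %.1f TFLOPS per GCD, %.1f TFLOPS per OAM\n",
                peak_tflops(kVectorFp64, kCusPerGcd),
                peak_tflops(kVectorFp64, kCusPerGcd * kGcdsPerOam));
    std::printf("Matrix FP64: %.1f TFLOPS per GCD, %.1f TFLOPS per OAM\n",
                peak_tflops(kMatrixFp64, kCusPerGcd),
                peak_tflops(kMatrixFp64, kCusPerGcd * kGcdsPerOam));
    return 0;
}
```

Compiled with any C++ compiler, this prints 22.6/45.3 TFLOPS for vector FP64
and 45.3/90.5 TFLOPS for matrix FP64, matching the table above.
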
## Node-level Architecture

{numref}`mi250-block` shows the node-level architecture of a system with AMD
EPYC processors in a dual-socket configuration and four AMD Instinct MI250
accelerators. The MI250 OAMs attach to the host processors via PCIe Gen 4 x16
links (yellow lines). Each GCD maintains its own PCIe x16 link to the host part
of the system; depending on the server platform, it attaches to the AMD EPYC
processor either directly or through an optional PCIe switch that makes more
PCIe lanes available for additional components such as network interfaces and
storage devices. Note that some platforms may offer an x8 interface to the
GCDs, which reduces the available host-to-GPU bandwidth.

:::{figure-md} mi250-block

<img src="../../data/reference/gpu_arch/image.003.png" alt="Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor.">

Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation
AMD EPYC processor.
:::

Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network allows
for direct data exchange between the GPU dies via AMD Infinity Fabric links
(black, green, and red lines). Each of these 16-wide links connects to one of
the two GPU dies in an MI250 OAM and operates at 25 GT/sec, which corresponds
to a theoretical peak transfer rate of 50 GB/sec per link (or 100 GB/sec
bidirectional peak transfer bandwidth). GCDs 2 and 6, as well as GCDs 0 and 4,
connect via two XGMI links, indicated by the thicker red lines in
{numref}`mi250-block`.
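
Because each GCD appears to software as its own GPU device, the four-OAM node
described above exposes eight devices to the runtime. The following is a
minimal HIP sketch (not part of the original guide) that enumerates the visible
devices and checks whether peer-to-peer access is available from device 0 to
each of the others; the device count, naming, and ordering depend on the
platform and on device-visibility environment settings.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int device_count = 0;
    if (hipGetDeviceCount(&device_count) != hipSuccess || device_count == 0) {
        std::fprintf(stderr, "No HIP devices found\n");
        return 1;
    }

    for (int i = 0; i < device_count; ++i) {
        hipDeviceProp_t props;
        hipGetDeviceProperties(&props, i);
        // On an MI250 node, each GCD shows up as one device with its own CUs and HBM.
        std::printf("Device %d: %s, %d CUs\n", i, props.name, props.multiProcessorCount);

        if (i == 0) continue;
        // Check whether device 0 can access device i directly (P2P over the fabric).
        int can_access = 0;
        hipDeviceCanAccessPeer(&can_access, 0, i);
        std::printf("  P2P access from device 0: %s\n", can_access ? "yes" : "no");
    }
    return 0;
}
```

Built with `hipcc`, the sketch prints one line per visible GCD; on MI250-based
systems the peer-access check is expected to succeed for devices reachable over
the Infinity Fabric P2P links described above.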