Initial MI250 Guide (#1976)

* Initial MI250 Guide

* Limit line length to 80 columns

* References using MyST

* Move to figure-md and numref

* Add MI250 to TOC
Author: Nagy-Egri Máté Ferenc
Date: 2023-03-22 15:45:00 +01:00
Committed by: GitHub
Parent: 2f51e147f2
Commit: e9ee6b9874
6 changed files with 159 additions and 0 deletions


@@ -134,6 +134,10 @@ subtrees:
      - url: https://rocmdocs.amd.com/projects/rdc/en/{branch}/
        title: ROCm Datacenter Tool
      - file: reference/gpu_arch
        subtrees:
          - entries:
              - file: reference/gpu_arch/mi250
                title: MI250
  - caption: Understand ROCm
    entries:
      - title: Compiler Disambiguation

Binary file added (image, 103 KiB; not shown)

Binary file added (image, 59 KiB; not shown)

Binary file added (image, 41 KiB; not shown)


@@ -17,3 +17,17 @@
- [AMD CDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
- [AMD Vega Architecture Whitepaper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
- [AMD RDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)
## Architecture Guides
:::::{grid} 1 1 1 1
:gutter: 1
:::{grid-item-card} AMD Instinct MI200
This chapter briefly reviews hardware aspects of the AMD Instinct MI250 accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.
- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
- [Guide](./gpu_arch/mi250.md)
:::


@@ -0,0 +1,141 @@
# AMD Instinct Hardware
This chapter briefly reviews hardware aspects of the AMD Instinct MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.
## AMD CDNA 2 Micro-architecture
The micro-architecture of the AMD Instinct MI250 accelerators is based on the
AMD CDNA 2 architecture, which targets compute applications such as HPC,
artificial intelligence (AI), and machine learning (ML) that run on everything
from individual servers to the world's largest exascale supercomputers. The
overall system architecture is designed for extreme scalability and compute
performance.
{numref}`mi250-gcd` shows the components of a single Graphics Compute Die
(GCD) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity
Fabric™ interfaces and their physical links that are used to connect the GPU
die to the other system-level components of the node (see also the Node-level
Architecture section). Both interfaces can drive four AMD Infinity Fabric
links. One of the AMD Infinity Fabric links of the controller at the bottom
can be configured as a PCIe link. Each of the AMD Infinity Fabric links
between GPUs can run at up to 25 GT/sec, which correlates to a peak transfer
bandwidth of 50 GB/sec for a 16-wide link (two bytes per transaction). The
Node-level Architecture section has more details on the number of AMD Infinity
Fabric links and the resulting transfer rates between the system-level
components.
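As a quick sanity check of these link figures, the per-link bandwidth follows
directly from the transfer rate and the link width. The snippet below is an
illustrative back-of-the-envelope sketch in Python, not part of any ROCm tool;
the variable names are made up for this example.

```python
# Peak bandwidth of one 16-wide AMD Infinity Fabric link at 25 GT/sec.
transfer_rate_gt_per_sec = 25  # giga-transfers per second
bytes_per_transfer = 2         # a 16-wide link moves two bytes per transaction

peak_gb_per_sec = transfer_rate_gt_per_sec * bytes_per_transfer
print(f"Peak per-link bandwidth: {peak_gb_per_sec} GB/sec")  # 50 GB/sec
```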
To the left and the right are memory controllers that attach the High Bandwidth
Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which offers
a peak memory bandwidth of 1.6 TB/sec per GCD.
The execution units of the GPU are depicted in {numref}`mi250-gcd` as Compute
Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further
subdivided into four SIMD units that process SIMD instructions of 16 data
elements per instruction (for the FP64 data type). This enables the CU to
process 64 work items (a so-called “wavefront”) at a peak clock frequency of
1.7 GHz. Therefore, the theoretical maximum FP64 peak performance per GCD is
22.6 TFLOPS for vector instructions (45.3 TFLOPS for the two GCDs of an MI250
module). The MI250 compute units also provide specialized execution units
(also called matrix cores), which are geared toward executing matrix
operations like matrix-matrix multiplications. For FP64, the peak performance
of these units amounts to 45.3 TFLOPS per GCD (90.5 TFLOPS per module).
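The vector figure follows from the CU count, SIMD width, and peak clock quoted
above, if a fused multiply-add is counted as two FLOPs (a counting assumption,
not something stated in this guide). A minimal, illustrative sketch:

```python
# Illustrative FP64 vector peak estimate for one MI250 GCD.
cus_per_gcd = 104    # active compute units per GCD
simds_per_cu = 4     # SIMD units per CU
fp64_lanes = 16      # FP64 elements per SIMD instruction
flops_per_lane = 2   # assumes an FMA counts as two FLOPs
clock_ghz = 1.7      # peak engine clock

tflops_per_gcd = (cus_per_gcd * simds_per_cu * fp64_lanes
                  * flops_per_lane * clock_ghz) / 1e3
print(f"Vector FP64 per GCD: {tflops_per_gcd:.1f} TFLOPS")      # ~22.6
print(f"Vector FP64 per OAM: {2 * tflops_per_gcd:.1f} TFLOPS")  # ~45.3
```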
:::{figure-md} mi250-gcd
<img src="../../data/reference/gpu_arch/image.001.png" alt="Structure of a single GCD in the AMD Instinct MI250 accelerator.">
Structure of a single GCD in the AMD Instinct MI250 accelerator.
:::
```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
:header-rows: 1
:name: mi250-perf
* - Computation and Data Type
- FLOPS/CLOCK/CU
- Peak TFLOPS
* - Matrix FP64
- 256
- 90.5
* - Vector FP64
- 128
- 45.3
* - Matrix FP32
- 256
- 90.5
* - Packed FP32
- 256
- 90.5
* - Vector FP32
- 128
- 45.3
* - Matrix FP16
- 1024
- 362.1
* - Matrix BF16
- 1024
- 362.1
* - Matrix INT8
- 1024
- 362.1
```
{numref}`mi250-perf` summarizes the aggregated peak performance of the AMD
Instinct MI250 Open Compute Platform (OCP) Open Accelerator Module (OAM) and
its two GCDs for different data types and execution units. The middle column
lists the throughput of a single compute unit in FLOPS per clock cycle,
assuming that a SIMD (or matrix) instruction is retired in every clock cycle.
The third column lists the theoretical peak performance of the whole OAM
module. The theoretical aggregated peak memory bandwidth of the GPU is
3.2 TB/sec (1.6 TB/sec per GCD).
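The third column of {numref}`mi250-perf` can be reproduced from the middle
column, the 208 CUs of the two GCDs, and the 1.7 GHz peak clock. The snippet
below is only an illustrative consistency check, not an official calculator
(for INT8 the unit is strictly TOPS rather than TFLOPS).

```python
# Reproduce the "Peak TFLOPS" column of the MI250 OAM table from
# FLOPS/clock/CU, the total CU count (2 GCDs x 104 CUs), and the peak clock.
total_cus = 2 * 104
clock_ghz = 1.7

flops_per_clock_per_cu = {
    "Matrix FP64": 256,
    "Vector FP64": 128,
    "Matrix FP32": 256,
    "Packed FP32": 256,
    "Vector FP32": 128,
    "Matrix FP16": 1024,
    "Matrix BF16": 1024,
    "Matrix INT8": 1024,
}

for name, flops_per_clock in flops_per_clock_per_cu.items():
    peak_tflops = flops_per_clock * total_cus * clock_ghz / 1e3
    print(f"{name}: {peak_tflops:.1f} peak TFLOPS")
```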
:::{figure-md} mi250-arch
<img src="../../data/reference/gpu_arch/image.002.png" alt="Dual-GCD architecture of the AMD Instinct MI250 accelerators.">
Dual-GCD architecture of the AMD Instinct MI250 accelerators.
:::
{numref}`mi250-arch` shows the block diagram of an OAM package that consists
of two GCDs, each of which constitutes one GPU device in the system. The two
GCDs in the package are connected via four AMD Infinity Fabric links running
at a theoretical peak rate of 25 GT/sec, which gives a peak transfer bandwidth
of 200 GB/sec between the two GCDs of an OAM (or 400 GB/sec of bidirectional
peak transfer bandwidth).
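Aggregating the four in-package links reproduces the quoted GCD-to-GCD
numbers; again, this is only an illustrative sketch.

```python
# Peak GCD-to-GCD bandwidth inside one MI250 OAM (four Infinity Fabric links).
links_between_gcds = 4
gb_per_sec_per_link = 25 * 2  # 25 GT/sec x 2 bytes per transaction

unidirectional = links_between_gcds * gb_per_sec_per_link
print(f"{unidirectional} GB/sec peak, "
      f"{2 * unidirectional} GB/sec bidirectional")  # 200 / 400 GB/sec
```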
## Node-level Architecture
{numref}`mi250-block` shows the node-level architecture of a system that is
based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host
system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own
PCIe x16 link to the host part of the system. Depending on the server
platform, the GCD can attach to the AMD EPYC processor directly or via an
optional PCIe switch. Note that some platforms may offer an x8 interface to
the GCDs, which reduces the available host-to-GPU bandwidth.
:::{figure-md} mi250-block
<img src="../../data/reference/gpu_arch/image.003.png" alt="Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor.">
Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation
AMD EPYC processor.
:::
{numref}`mi250-block` shows the node-level architecture of a system with AMD
EPYC processors in a dual-socket configuration and four AMD Instinct MI250
accelerators. The MI250 OAMs attach to the host processors via PCIe Gen 4 x16
links (yellow lines). Depending on the system design, a PCIe switch may exist
to make more PCIe lanes available for additional components like network
interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to
the host part of the system or to the PCIe switch. Note that some platforms
may offer an x8 interface to the GCDs, which reduces the available
host-to-GPU bandwidth.
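To put the x16 versus x8 attachment into perspective, a rough PCIe Gen 4
estimate can be computed from the generic PCIe figures of 16 GT/sec per lane
with 128b/130b encoding (these figures are not taken from this guide; the
sketch is illustrative only).

```python
# Rough peak PCIe Gen 4 host-link bandwidth per direction (GB/sec).
def pcie_gen4_peak_gb_per_sec(lanes: int) -> float:
    gt_per_sec_per_lane = 16         # PCIe Gen 4 raw transfer rate
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    # One bit per transfer per lane; divide by 8 to convert bits to bytes.
    return lanes * gt_per_sec_per_lane * encoding_efficiency / 8

print(f"x16: {pcie_gen4_peak_gb_per_sec(16):.1f} GB/sec")  # ~31.5
print(f"x8:  {pcie_gen4_peak_gb_per_sec(8):.1f} GB/sec")   # ~15.8
```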
Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network
allows for direct data exchange between the GPU dies via AMD Infinity Fabric
links (black, green, and red lines). Each of these 16-wide links connects to
one of the two GPU dies in the MI250 OAM and operates at 25 GT/sec, which
corresponds to a theoretical peak transfer rate of 50 GB/sec per link (or 100
GB/sec bidirectional peak transfer bandwidth). The GCD pairs 2 and 6 as well
as GCDs 0 and 4 connect via two XGMI links, as indicated by the thicker red
line in {numref}`mi250-block`.
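Based only on the link counts stated above (one link between most GCD pairs,
two XGMI links between GCDs 0 and 4 and between GCDs 2 and 6), the peak P2P
bandwidth between two GCDs scales with the number of links; a small
illustrative sketch:

```python
# Peak P2P bandwidth between GCDs for single- and dual-link connections.
gb_per_sec_per_link = 50  # one 16-wide Infinity Fabric link at 25 GT/sec

for links in (1, 2):
    peak = links * gb_per_sec_per_link
    print(f"{links} link(s): {peak} GB/sec peak, "
          f"{2 * peak} GB/sec bidirectional")
```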