diff --git a/docs/.sphinx/_toc.yml.in b/docs/.sphinx/_toc.yml.in
index edcb24af7..2307aee1a 100644
--- a/docs/.sphinx/_toc.yml.in
+++ b/docs/.sphinx/_toc.yml.in
@@ -147,6 +147,8 @@ subtrees:
       - entries:
         - file: reference/gpu_arch/mi250
          title: MI250
+        - file: reference/gpu_arch/mi100
+          title: MI100
   - caption: Understand ROCm
     entries:
       - title: Compiler Disambiguation
diff --git a/docs/data/reference/gpu_arch/image.004.png b/docs/data/reference/gpu_arch/image.004.png
new file mode 100644
index 000000000..0b802dfc8
Binary files /dev/null and b/docs/data/reference/gpu_arch/image.004.png differ
diff --git a/docs/data/reference/gpu_arch/image.005.png b/docs/data/reference/gpu_arch/image.005.png
new file mode 100644
index 000000000..f8a6f64b1
Binary files /dev/null and b/docs/data/reference/gpu_arch/image.005.png differ
diff --git a/docs/data/reference/gpu_arch/image.006.png b/docs/data/reference/gpu_arch/image.006.png
new file mode 100644
index 000000000..88edc6588
Binary files /dev/null and b/docs/data/reference/gpu_arch/image.006.png differ
diff --git a/docs/reference/gpu_arch.md b/docs/reference/gpu_arch.md
index 913443764..4c94a6172 100644
--- a/docs/reference/gpu_arch.md
+++ b/docs/reference/gpu_arch.md
@@ -20,14 +20,27 @@
 ## Architecture Guides
 
-:::::{grid} 1 1 1 1
+:::::{grid} 1 1 2 2
 :gutter: 1
 
 :::{grid-item-card} AMD Instinct MI200
-This chapter briefly reviews hardware aspects of the AMD Instinct MI250 accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.
+This chapter briefly reviews hardware aspects of the AMD Instinct™ MI250
+accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.
 
 - [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
 - [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
 - [Guide](./gpu_arch/mi250.md)
 
 :::
+
+:::{grid-item-card} AMD Instinct MI100
+This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100
+accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.
+
+- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
+- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
+- [Guide](./gpu_arch/mi100.md)
+
+:::
+
 :::::
diff --git a/docs/reference/gpu_arch/mi100.md b/docs/reference/gpu_arch/mi100.md
new file mode 100644
index 000000000..9e4da6b0a
--- /dev/null
+++ b/docs/reference/gpu_arch/mi100.md
@@ -0,0 +1,110 @@
+# AMD Instinct™ MI100 Hardware
+
+In this chapter, we briefly review hardware aspects of the AMD Instinct™ MI100
+accelerators and the CDNA architecture that is the foundation of these GPUs.
+
+## System Architecture
+
+{numref}`mi100-arch` shows the node-level architecture of a system that
+comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators.
+The two EPYC processors are connected to each other with AMD Infinity Fabric™
+links, which provide high-bandwidth (up to 18 GT/sec), coherent connections such
+that each processor can access the available node memory as a single
+shared-memory domain in a non-uniform memory architecture (NUMA) fashion. In a
+2P, or dual-socket, configuration, three AMD Infinity Fabric™ links are
+available to connect the processors, and one PCIe Gen 4 x16 link per processor
+can attach additional I/O devices such as the host adapters for the network
+fabric.
+
+:::{figure-md} mi100-arch
+
+<img src="../../data/reference/gpu_arch/image.004.png" alt="Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.">
+
+Node-level system architecture with two AMD EPYC™ processors and eight AMD
+Instinct™ accelerators.
+:::
+
+In a typical node configuration, each processor can host up to four AMD
+Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec,
+which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive
+of four accelerators can participate in a fully connected, coherent AMD
+Infinity Fabric™ network that connects the four accelerators using 23 GT/sec
+links that run at a higher frequency than the inter-processor links. This
+inter-GPU link can be established in certified server systems if the GPUs are
+mounted in neighboring PCIe slots by installing the AMD Infinity Fabric™ bridge
+for the AMD Instinct™ accelerators.
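+
+The hive and its peer-to-peer links are visible to software. The following
+minimal sketch (not part of the hardware description above; it assumes a
+working ROCm/HIP installation, and device ordering is system specific) queries
+which accelerator pairs can reach each other directly:
+
+```cpp
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+int main() {
+  int deviceCount = 0;
+  (void)hipGetDeviceCount(&deviceCount);
+  for (int i = 0; i < deviceCount; ++i) {
+    for (int j = 0; j < deviceCount; ++j) {
+      if (i == j) continue;
+      int canAccess = 0;
+      // Non-zero means device i can map and access device j's memory
+      // directly, e.g., over the Infinity Fabric links within a hive.
+      (void)hipDeviceCanAccessPeer(&canAccess, i, j);
+      std::printf("GPU %d -> GPU %d: %s\n", i, j,
+                  canAccess ? "peer" : "no peer");
+    }
+  }
+  return 0;
+}
+```
+
+On a fully connected hive, every pair within the same four-GPU group is
+expected to report peer access.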
+
+## Micro-architecture
+
+The micro-architecture of the AMD Instinct accelerators is based on the AMD
+CDNA architecture, which targets compute applications such as high-performance
+computing (HPC) and AI & machine learning (ML) that run on everything from
+individual servers to the world’s largest exascale supercomputers. The overall
+system architecture is designed for extreme scalability and compute
+performance.
+
+:::{figure-md} mi100-block
+
+<img src="../../data/reference/gpu_arch/image.005.png" alt="Structure of the AMD Instinct accelerator (MI100 generation).">
+
+Structure of the AMD Instinct accelerator (MI100 generation).
+:::
+
+{numref}`mi100-block` shows the AMD Instinct accelerator with its PCIe Gen 4 x16
+link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host
+processor(s). It also shows the three AMD Infinity Fabric ports that provide
+high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
+hive as shown in {numref}`mi100-arch`.
+
+On the left and right of the floor plan, the High Bandwidth Memory (HBM)
+attaches via the GPU’s memory controller. The MI100 generation of the AMD
+Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
+of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of
+the attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
+
+The execution units of the GPU are depicted in {numref}`mi100-block` as Compute
+Units (CU). There are a total of 120 compute units that are physically organized
+into eight Shader Engines (SE) with fifteen compute units per shader engine.
+Each compute unit is further subdivided into four SIMD units that process SIMD
+instructions of 16 data elements per instruction. This enables the CU to process
+64 data elements (a so-called ‘wavefront’) at a peak clock frequency of 1.5 GHz.
+Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
+(`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`).
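+
+This peak-performance arithmetic can be reproduced from the device properties
+reported by the HIP runtime. A minimal sketch (assuming device 0 is an MI100;
+`clockRate` is reported in kHz):
+
+```cpp
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+int main() {
+  hipDeviceProp_t prop;
+  (void)hipGetDeviceProperties(&prop, 0);  // device 0 assumed to be an MI100
+  // 4 SIMD units x 16 elements per instruction = 64 FP64 FLOPS/clock/CU.
+  const double flopsPerClockPerCu = 4.0 * 16.0;
+  const double clockGhz = prop.clockRate / 1.0e6;  // kHz -> GHz
+  const double peakTflops =
+      flopsPerClockPerCu * prop.multiProcessorCount * clockGhz / 1.0e3;
+  std::printf("CUs: %d, clock: %.2f GHz, peak FP64: %.1f TFLOPS\n",
+              prop.multiProcessorCount, clockGhz, peakTflops);
+  return 0;
+}
+```
+
+With 120 CUs at 1.5 GHz, this reproduces the 11.5 TFLOPS quoted above.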
+
+:::{figure-md} mi100-gcd
+
+<img src="../../data/reference/gpu_arch/image.006.png" alt="Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture">
+
+Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA
+architecture
+:::
+
+{numref}`mi100-gcd` shows the block diagram of a single CU of an AMD Instinct™
+MI100 accelerator and summarizes how instructions flow through the execution
+engines. The CU fetches the instructions via a 32 KB instruction cache and moves
+them forward to execution via a dispatcher. The CU can handle up to ten
+wavefronts at a time and feed their instructions into the execution unit. The
+execution unit contains 256 vector general-purpose registers (VGPR) and 800
+scalar general-purpose registers (SGPR). The VGPRs and SGPRs are dynamically
+allocated to the executing wavefronts. A wavefront can access a maximum of 102
+scalar registers. Excess scalar-register usage will cause register spilling and
+thus may affect execution performance.
+
+A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting
+occupancy; that is, the number of concurrently active wavefronts in the CU. For
+instance, with 119 VGPRs used, only two wavefronts can be active in the CU at
+the same time. Because each SIMD instruction has a latency of four cycles,
+occupancy should be as high as possible so that the compute unit can improve
+execution efficiency by scheduling instructions from multiple wavefronts.
+
+:::{table} Peak-performance capabilities of MI100 for different data types.
+:name: mi100-perf
+
+| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
+| :------------------------ | :------------: | ----------: |
+| Vector FP64               |       64       |        11.5 |
+| Matrix FP32               |      256       |        46.1 |
+| Vector FP32               |      128       |        23.1 |
+| Matrix FP16               |      1024      |       184.6 |
+| Matrix BF16               |      512       |        92.3 |
+
+:::
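+
+Whether a kernel stays within this register budget can be checked at runtime.
+As a minimal sketch (`k` is a hypothetical kernel added only for illustration;
+the reported numbers depend on the compiler and ROCm version),
+`hipFuncGetAttributes` returns the per-thread register count and
+`hipOccupancyMaxActiveBlocksPerMultiprocessor` estimates how many blocks can
+be resident on a CU:
+
+```cpp
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+// Hypothetical kernel, used only to illustrate the queries below.
+__global__ void k(float* x) { x[threadIdx.x] *= 2.0f; }
+
+int main() {
+  hipFuncAttributes attr;
+  (void)hipFuncGetAttributes(&attr, reinterpret_cast<const void*>(k));
+  int maxBlocksPerCu = 0;
+  // Upper bound on resident blocks per CU for 256-thread blocks; lower
+  // occupancy (fewer resident wavefronts) follows from higher VGPR usage.
+  (void)hipOccupancyMaxActiveBlocksPerMultiprocessor(
+      &maxBlocksPerCu, reinterpret_cast<const void*>(k),
+      /*blockSize=*/256, /*dynSharedMemPerBlk=*/0);
+  std::printf("registers per thread: %d, max resident blocks per CU: %d\n",
+              attr.numRegs, maxBlocksPerCu);
+  return 0;
+}
+```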