MI100 architecture guide (#1994)
* Initial MI100 docs
* Try changing style to fix MD004
* Disable MD004
* Disable MD005
* Move to {table} from {list-table}
* Don't disable few MD styles
committed by GitHub (parent 519707db4f, commit 286f120d9a)

@@ -20,14 +20,27 @@

## Architecture Guides

:::::{grid} 1 1 2 2
:gutter: 1

:::{grid-item-card} AMD Instinct MI200
This chapter briefly reviews hardware aspects of the AMD Instinct™ MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
- [Guide](./gpu_arch/mi250.md)

:::

:::{grid-item-card} AMD Instinct MI100
This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100
accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.

- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
- [Guide](./gpu_arch/mi100.md)

:::

:::::

110 docs/reference/gpu_arch/mi100.md Normal file
@@ -0,0 +1,110 @@

# AMD Instinct™ MI100 Hardware

This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100
accelerators and the CDNA architecture that is the foundation of these GPUs.

## System Architecture

{numref}`mi100-arch` shows the node-level architecture of a system that
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators.
The two EPYC processors are connected to each other with the AMD Infinity™
fabric, which provides high-bandwidth (up to 18 GT/sec), coherent links so that
each processor can access the available node memory as a single shared-memory
domain in a non-uniform memory architecture (NUMA) fashion. In a 2P, or
dual-socket, configuration, three AMD Infinity™ fabric links connect the
processors, and one PCIe Gen 4 x16 link per processor can attach additional I/O
devices such as the host adapters for the network fabric.

:::{figure-md} mi100-arch

<img src="../../data/reference/gpu_arch/image.004.png" alt="Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.">

Node-level system architecture with two AMD EPYC™ processors and eight AMD
Instinct™ accelerators.
:::

In a typical node configuration, each processor can host up to four AMD
Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec,
which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive
of four accelerators can participate in a fully connected, coherent AMD
Infinity™ fabric that connects the four accelerators using 23 GT/sec AMD
Infinity fabric links that run at a higher frequency than the inter-processor
links. This inter-GPU link can be established in certified server systems if the
GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity
Fabric™ bridge for the AMD Instinct™ accelerators.

## Micro-architecture

The micro-architecture of the AMD Instinct accelerators is based on the AMD CDNA
architecture, which targets compute applications such as high-performance
computing (HPC) and AI & machine learning (ML) that run on everything from
individual servers to the world’s largest exascale supercomputers. The overall
system architecture is designed for extreme scalability and compute performance.

:::{figure-md} mi100-block

<img src="../../data/reference/gpu_arch/image.005.png" alt="Structure of the AMD Instinct accelerator (MI100 generation).">

Structure of the AMD Instinct accelerator (MI100 generation).
:::

{numref}`mi100-block` shows the AMD Instinct accelerator with its PCIe Gen 4 x16
link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host
processor(s). It also shows the three AMD Infinity Fabric ports that provide
high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
hive as shown in {numref}`mi100-arch`.

On the left and right of the floor plan, the High Bandwidth Memory (HBM)
attaches via the GPU’s memory controller. The MI100 generation of the AMD
Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of
the attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
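
The quoted bandwidth follows from the interface width and clock. A minimal
back-of-the-envelope sketch, assuming HBM2's double-data-rate signaling (two
transfers per memory clock):

```python
# Sanity-check the quoted HBM2 peak bandwidth (a sketch; the factor of
# two assumes HBM2's double-data-rate signaling).
bus_width_bits = 4096      # four HBM2 stacks, 1024 bits each
memory_clock_ghz = 1.2     # memory clock frequency quoted in the text
transfers_per_clock = 2    # double data rate

bandwidth_gb_s = bus_width_bits / 8 * memory_clock_ghz * transfers_per_clock
print(f"{bandwidth_gb_s:.1f} GB/sec")  # 1228.8 GB/sec, i.e. ~1.228 TB/sec
```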

The execution units of the GPU are depicted in {numref}`mi100-block` as Compute
Units (CU). There are a total of 120 compute units that are physically organized
into eight Shader Engines (SE) with fifteen compute units per shader engine.
Each compute unit is further sub-divided into four SIMD units that process SIMD
instructions of 16 data elements per instruction. This enables the CU to process
64 data elements (a so-called ‘wavefront’) at a peak clock frequency of 1.5 GHz.
Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
(`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`).
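
The peak-performance formula can be reproduced directly. A small sketch using
only the numbers quoted above (the 1.5 GHz clock is the rounded figure):

```python
# Reproduce the FP64 peak-performance estimate from the formula above
# (a sketch; uses the rounded 1.5 GHz peak clock quoted in the text).
simd_units_per_cu = 4
elements_per_instruction = 16
compute_units = 120
peak_clock_ghz = 1.5

peak_fp64_tflops = (simd_units_per_cu * elements_per_instruction
                    * compute_units * peak_clock_ghz) / 1000
print(f"{peak_fp64_tflops:.1f} TFLOPS")  # 11.5 TFLOPS
```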

:::{figure-md} mi100-gcd

<img src="../../data/reference/gpu_arch/image.006.png" alt="Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture">

Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA
architecture.
:::

{numref}`mi100-gcd` shows the block diagram of a single CU of an AMD Instinct™
MI100 accelerator and summarizes how instructions flow through the execution
engines. The CU fetches the instructions via a 32 KB instruction cache and moves
them forward to execution via a dispatcher. The CU can handle up to ten
wavefronts at a time and feed their instructions into the execution unit. The
execution unit contains 256 vector general-purpose registers (VGPR) and 800
scalar general-purpose registers (SGPR). The VGPR and SGPR are dynamically
allocated to the executing wavefronts. A wavefront can access a maximum of 102
scalar registers. Excess scalar-register usage will cause register spilling and
thus may affect execution performance.

A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting
occupancy; that is, the number of concurrently active wavefronts in the CU. For
instance, with 119 VGPRs used, only two wavefronts can be active in the CU at
the same time. With the instruction latency of four cycles per SIMD instruction,
the occupancy should be as high as possible so that the compute unit can
improve execution efficiency by scheduling instructions from multiple
wavefronts.
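
The VGPR-driven occupancy limit reduces to integer arithmetic. A sketch of the
trade-off described above (`max_active_wavefronts` is a hypothetical helper;
real hardware additionally rounds register allocations to a fixed granularity):

```python
# Sketch of the VGPR/occupancy trade-off described above. Assumes a
# 256-entry register file and the 10-wavefront CU limit quoted earlier;
# real hardware also allocates registers at a fixed granularity.
VGPR_FILE_SIZE = 256   # vector registers available per register file
MAX_WAVEFRONTS = 10    # CU wavefront limit quoted above

def max_active_wavefronts(vgprs_per_wavefront: int) -> int:
    """Wavefronts whose registers fit, capped at the hardware limit."""
    if vgprs_per_wavefront == 0:
        return MAX_WAVEFRONTS
    return min(VGPR_FILE_SIZE // vgprs_per_wavefront, MAX_WAVEFRONTS)

print(max_active_wavefronts(119))  # 2, matching the example in the text
print(max_active_wavefronts(24))   # 10, capped by the wavefront limit
```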

:::{table} Peak-performance capabilities of MI100 for different data types.
:name: mi100-perf

| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
| :------------------------ | :------------: | ----------: |
| Vector FP64               |       64       |        11.5 |
| Matrix FP32               |      256       |        46.1 |
| Vector FP32               |      128       |        23.1 |
| Matrix FP16               |      1024      |       184.6 |
| Matrix BF16               |      512       |        92.3 |
:::
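
Each Peak TFLOPS entry is the per-CU throughput scaled by the CU count and
clock. A sketch that recomputes the column, assuming the unrounded 1.502 GHz
peak engine clock (1.5 GHz above is the rounded figure):

```python
# Recompute the table's Peak TFLOPS column (a sketch; assumes the
# unrounded 1.502 GHz peak engine clock rather than the rounded 1.5 GHz).
compute_units = 120
peak_clock_ghz = 1.502

flops_per_clock_per_cu = {
    "Vector FP64": 64,
    "Matrix FP32": 256,
    "Vector FP32": 128,
    "Matrix FP16": 1024,
    "Matrix BF16": 512,
}

for name, flops in flops_per_clock_per_cu.items():
    tflops = flops * compute_units * peak_clock_ghz / 1000
    print(f"{name}: {tflops:.1f} TFLOPS")  # matches the table row by row
```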