MI100 architecture guide (#1994)
* Initial MI100 docs
* Try changing style to fix MD004
* Disable MD004
* Disable MD005
* Move to {table} from {list-table}
* Don't disable few MD styles
committed by GitHub (parent 519707db4f, commit 286f120d9a)

@@ -20,14 +20,27 @@

## Architecture Guides

:::::{grid} 1 1 2 2
:gutter: 1

:::{grid-item-card} AMD Instinct MI200
This chapter briefly reviews hardware aspects of the AMD Instinct™ MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
- [Guide](./gpu_arch/mi250.md)

:::

:::{grid-item-card} AMD Instinct MI100
This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100
accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.

- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
- [Guide](./gpu_arch/mi100.md)

:::

:::::

110 docs/reference/gpu_arch/mi100.md Normal file
@@ -0,0 +1,110 @@

# AMD Instinct™ MI100 Hardware

This chapter briefly reviews hardware aspects of the AMD Instinct™ MI100
accelerators and the CDNA architecture that is the foundation of these GPUs.

## System Architecture

{numref}`mi100-arch` shows the node-level architecture of a system that
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators.
The two EPYC processors are connected to each other with the AMD Infinity™
fabric, which provides high-bandwidth (up to 18 GT/sec), coherent links so that
each processor can access the available node memory as a single shared-memory
domain in a non-uniform memory architecture (NUMA) fashion. In a 2P, or
dual-socket, configuration, three AMD Infinity™ fabric links connect the
processors, and one PCIe Gen 4 x16 link per processor can attach additional I/O
devices such as the host adapters for the network fabric.

:::{figure-md} mi100-arch

<img src="../../data/reference/gpu_arch/image.004.png" alt="Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.">

Node-level system architecture with two AMD EPYC™ processors and eight AMD
Instinct™ accelerators.
:::

In a typical node configuration, each processor can host up to four AMD
Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec,
which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive
of four accelerators can participate in a fully connected, coherent AMD
Infinity™ fabric that connects the four accelerators using 23 GT/sec AMD
Infinity fabric links that run at a higher frequency than the inter-processor
links. This inter-GPU link can be established in certified server systems if the
GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity
Fabric™ bridge for the AMD Instinct™ accelerators.

## Micro-architecture

The micro-architecture of the AMD Instinct accelerators is based on the AMD CDNA
architecture, which targets compute applications such as high-performance
computing (HPC) and AI & machine learning (ML) that run on everything from
individual servers to the world’s largest exascale supercomputers. The overall
system architecture is designed for extreme scalability and compute performance.

:::{figure-md} mi100-block

<img src="../../data/reference/gpu_arch/image.005.png" alt="Structure of the AMD Instinct accelerator (MI100 generation).">

Structure of the AMD Instinct accelerator (MI100 generation).
:::

{numref}`mi100-block` shows the AMD Instinct accelerator with its PCIe Gen 4 x16
link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host
processor(s). It also shows the three AMD Infinity Fabric ports that provide
high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
hive as shown in {numref}`mi100-arch`.

On the left and right of the floor plan, the High Bandwidth Memory (HBM)
attaches via the GPU’s memory controller. The MI100 generation of the AMD
Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of
the attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
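
The quoted bandwidth follows from the interface width and clock. A minimal
back-of-the-envelope sketch, assuming HBM2's double-data-rate signaling (two
transfers per memory clock):

```python
# Sanity-check the quoted HBM2 peak bandwidth (a sketch; the factor of
# two assumes HBM2's double-data-rate signaling).
bus_width_bits = 4096      # four HBM2 stacks, 1024 bits each
memory_clock_ghz = 1.2     # memory clock frequency quoted in the text
transfers_per_clock = 2    # double data rate

bandwidth_gb_s = bus_width_bits / 8 * memory_clock_ghz * transfers_per_clock
print(f"{bandwidth_gb_s:.1f} GB/sec")  # 1228.8 GB/sec, i.e. ~1.228 TB/sec
```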

The execution units of the GPU are depicted in {numref}`mi100-block` as Compute
Units (CU). There are a total of 120 compute units that are physically organized
into eight Shader Engines (SE) with fifteen compute units per shader engine.
Each compute unit is further sub-divided into four SIMD units that process SIMD
instructions of 16 data elements per instruction. This enables the CU to process
64 data elements (a so-called ‘wavefront’) at a peak clock frequency of 1.5 GHz.
Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
(`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`).
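
The peak-performance formula can be reproduced directly. A small sketch using
only the numbers quoted above (the 1.5 GHz clock is the rounded figure):

```python
# Reproduce the FP64 peak-performance estimate from the formula above
# (a sketch; uses the rounded 1.5 GHz peak clock quoted in the text).
simd_units_per_cu = 4
elements_per_instruction = 16
compute_units = 120
peak_clock_ghz = 1.5

peak_fp64_tflops = (simd_units_per_cu * elements_per_instruction
                    * compute_units * peak_clock_ghz) / 1000
print(f"{peak_fp64_tflops:.1f} TFLOPS")  # 11.5 TFLOPS
```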

:::{figure-md} mi100-gcd

<img src="../../data/reference/gpu_arch/image.006.png" alt="Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture">

Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA
architecture.
:::

{numref}`mi100-gcd` shows the block diagram of a single CU of an AMD Instinct™
MI100 accelerator and summarizes how instructions flow through the execution
engines. The CU fetches the instructions via a 32 KB instruction cache and moves
them forward to execution via a dispatcher. The CU can handle up to ten
wavefronts at a time and feed their instructions into the execution unit. The
execution unit contains 256 vector general-purpose registers (VGPR) and 800
scalar general-purpose registers (SGPR). The VGPR and SGPR are dynamically
allocated to the executing wavefronts. A wavefront can access a maximum of 102
scalar registers. Excess scalar-register usage will cause register spilling and
thus may affect execution performance.

A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting
occupancy; that is, the number of concurrently active wavefronts in the CU. For
instance, with 119 VGPRs used, only two wavefronts can be active in the CU at
the same time. With the instruction latency of four cycles per SIMD instruction,
the occupancy should be as high as possible so that the compute unit can
improve execution efficiency by scheduling instructions from multiple
wavefronts.
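
The VGPR-driven occupancy limit reduces to integer arithmetic. A sketch of the
trade-off described above (`max_active_wavefronts` is a hypothetical helper;
real hardware additionally rounds register allocations to a fixed granularity):

```python
# Sketch of the VGPR/occupancy trade-off described above. Assumes a
# 256-entry register file and the 10-wavefront CU limit quoted earlier;
# real hardware also allocates registers at a fixed granularity.
VGPR_FILE_SIZE = 256   # vector registers available per register file
MAX_WAVEFRONTS = 10    # CU wavefront limit quoted above

def max_active_wavefronts(vgprs_per_wavefront: int) -> int:
    """Wavefronts whose registers fit, capped at the hardware limit."""
    if vgprs_per_wavefront == 0:
        return MAX_WAVEFRONTS
    return min(VGPR_FILE_SIZE // vgprs_per_wavefront, MAX_WAVEFRONTS)

print(max_active_wavefronts(119))  # 2, matching the example in the text
print(max_active_wavefronts(24))   # 10, capped by the wavefront limit
```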

:::{table} Peak-performance capabilities of MI100 for different data types.
:name: mi100-perf

| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
| :------------------------ | :------------: | ----------: |
| Vector FP64               |       64       |        11.5 |
| Matrix FP32               |      256       |        46.1 |
| Vector FP32               |      128       |        23.1 |
| Matrix FP16               |      1024      |       184.6 |
| Matrix BF16               |      512       |        92.3 |
:::
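
Each Peak TFLOPS entry is the per-CU throughput scaled by the CU count and
clock. A sketch that recomputes the column, assuming the unrounded 1.502 GHz
peak engine clock (1.5 GHz above is the rounded figure):

```python
# Recompute the table's Peak TFLOPS column (a sketch; assumes the
# unrounded 1.502 GHz peak engine clock rather than the rounded 1.5 GHz).
compute_units = 120
peak_clock_ghz = 1.502

flops_per_clock_per_cu = {
    "Vector FP64": 64,
    "Matrix FP32": 256,
    "Vector FP32": 128,
    "Matrix FP16": 1024,
    "Matrix BF16": 512,
}

for name, flops in flops_per_clock_per_cu.items():
    tflops = flops * compute_units * peak_clock_ghz / 1000
    print(f"{name}: {tflops:.1f} TFLOPS")  # matches the table row by row
```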