Initial MI250 Guide (#1976)
* Initial MI250 Guide
* Limit line length to 80 columns
* References using MyST
* Move to figure-md and numref
* Add MI250 to TOC
commit e9ee6b9874 (parent 2f51e147f2), committed by GitHub
@@ -134,6 +134,10 @@ subtrees:
  - url: https://rocmdocs.amd.com/projects/rdc/en/{branch}/
    title: ROCm Datacenter Tool
  - file: reference/gpu_arch
    subtrees:
    - entries:
      - file: reference/gpu_arch/mi250
        title: MI250
- caption: Understand ROCm
  entries:
  - title: Compiler Disambiguation
BIN  docs/data/reference/gpu_arch/image.001.png  (new file, 103 KiB)
BIN  docs/data/reference/gpu_arch/image.002.png  (new file, 59 KiB)
BIN  docs/data/reference/gpu_arch/image.003.png  (new file, 41 KiB)
@@ -17,3 +17,17 @@
- [AMD CDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
- [AMD Vega Architecture Whitepaper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
- [AMD RDNA Architecture Whitepaper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)

## Architecture Guides

:::::{grid} 1 1 1 1
:gutter: 1

:::{grid-item-card} AMD Instinct MI200
This chapter briefly reviews hardware aspects of the AMD Instinct MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

- [Instruction Set Architecture](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
- [Whitepaper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
- [Guide](./gpu_arch/mi250.md)
:::
docs/reference/gpu_arch/mi250.md (new file, 141 lines)

@@ -0,0 +1,141 @@

# AMD Instinct Hardware

This chapter briefly reviews hardware aspects of the AMD Instinct MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

## AMD CDNA 2 Micro-architecture

The micro-architecture of the AMD Instinct MI250 accelerators is based on the
AMD CDNA 2 architecture, which targets compute applications such as
high-performance computing (HPC), artificial intelligence (AI), and machine
learning (ML) that run on everything from individual servers to the world's
largest exascale supercomputers. The overall system architecture is designed
for extreme scalability and compute performance.

{numref}`mi250-gcd` shows the components of a single Graphics Compute Die (GCD)
of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™
interfaces and their physical links that are used to connect the GPU die to the
other system-level components of the node (see also the Node-level Architecture
section below). Both interfaces can drive four AMD Infinity Fabric links. One
of the AMD Infinity Fabric links of the controller at the bottom can be
configured as a PCIe link. Each of the AMD Infinity Fabric links between GPUs
can run at up to 25 GT/sec, which corresponds to a peak transfer bandwidth of
50 GB/sec for a 16-wide link (two bytes per transaction). The Node-level
Architecture section below has more details on the number of AMD Infinity
Fabric links and the resulting transfer rates between the system-level
components.
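
The 50 GB/sec figure is simply the transfer rate multiplied by the link width
in bytes. The tiny sketch below is an illustrative calculation only, using the
rates quoted above:

```cpp
#include <cstdio>

int main() {
    // Values quoted above (illustrative constants, per Infinity Fabric link).
    const double transfer_rate_gt_per_s = 25.0;  // 25 GT/sec
    const double bytes_per_transfer     = 2.0;   // 16-wide link -> 2 bytes

    // Peak bandwidth per link, per direction.
    printf("Peak link bandwidth: %.0f GB/sec\n",
           transfer_rate_gt_per_s * bytes_per_transfer);
    return 0;
}
```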

To the left and the right are memory controllers that attach the High Bandwidth
Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which
offers a peak memory bandwidth of 1.6 TB/sec per GCD.

The execution units of the GPU are depicted in {numref}`mi250-gcd` as Compute
Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further
subdivided into four SIMD units that process SIMD instructions of 16 data
elements per instruction (for the FP64 data type). This enables the CU to
process 64 work items (a so-called “wavefront”) at a peak clock frequency of
1.7 GHz. Therefore, the theoretical maximum FP64 peak performance for vector
instructions is 45.3 TFLOPS for the full MI250 (both GCDs; see
{numref}`mi250-perf`). The MI250 compute units also provide specialized
execution units (also called matrix cores), which are geared toward executing
matrix operations like matrix-matrix multiplications. For FP64, the peak
performance of these units amounts to 90.5 TFLOPS for the MI250.
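
These peak rates follow directly from the CU count, the FLOPS/clock/CU figures
listed in {numref}`mi250-perf`, and the 1.7 GHz peak clock. The sketch below is
a back-of-the-envelope check only; every constant is a value quoted in this
guide:

```cpp
#include <cstdio>

int main() {
    // Values quoted in this guide (illustrative constants).
    const double cus_per_gcd   = 104;    // active Compute Units per GCD
    const double gcds_per_oam  = 2;      // two GCDs per MI250 OAM
    const double peak_clock_hz = 1.7e9;  // peak engine clock

    // FLOPS/clock/CU from the peak-performance table.
    const double vector_fp64_per_clock_per_cu = 128;
    const double matrix_fp64_per_clock_per_cu = 256;

    const double cus = cus_per_gcd * gcds_per_oam;
    printf("Vector FP64 peak: %.1f TFLOPS\n",
           cus * vector_fp64_per_clock_per_cu * peak_clock_hz / 1e12);
    printf("Matrix FP64 peak: %.1f TFLOPS\n",
           cus * matrix_fp64_per_clock_per_cu * peak_clock_hz / 1e12);
    return 0;
}
```

The two results reproduce the Vector FP64 and Matrix FP64 rows of
{numref}`mi250-perf`.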

:::{figure-md} mi250-gcd

<img src="../../data/reference/gpu_arch/image.001.png" alt="Structure of a single GCD in the AMD Instinct MI250 accelerator.">

Structure of a single GCD in the AMD Instinct MI250 accelerator.
:::

```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
:header-rows: 1
:name: mi250-perf

* - Computation and Data Type
  - FLOPS/CLOCK/CU
  - Peak TFLOPS
* - Matrix FP64
  - 256
  - 90.5
* - Vector FP64
  - 128
  - 45.3
* - Matrix FP32
  - 256
  - 90.5
* - Packed FP32
  - 256
  - 90.5
* - Vector FP32
  - 128
  - 45.3
* - Matrix FP16
  - 1024
  - 362.1
* - Matrix BF16
  - 1024
  - 362.1
* - Matrix INT8
  - 1024
  - 362.1
```

{numref}`mi250-perf` summarizes the aggregated peak performance of the AMD
Instinct MI250 OCP Open Accelerator Module (OAM, OCP is short for Open Compute
Platform) and its two GCDs for different data types and execution units. The
middle column lists the peak number of floating-point operations per clock
cycle of a single compute unit, assuming a SIMD (or matrix) instruction is
retired in each clock cycle. The third column lists the theoretical peak
performance of the OAM module. The theoretical aggregated peak memory bandwidth
of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).

:::{figure-md} mi250-arch

<img src="../../data/reference/gpu_arch/image.002.png" alt="Dual-GCD architecture of the AMD Instinct MI250 accelerators.">

Dual-GCD architecture of the AMD Instinct MI250 accelerators.
:::

{numref}`mi250-arch` shows the block diagram of an OAM package that consists
of two GCDs, each of which constitutes one GPU device in the system. The two
GCDs in the package are connected via four AMD Infinity Fabric links running at
a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer bandwidth
between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of
400 GB/sec between them.
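
Because each GCD is exposed as its own GPU device, a quick way to confirm this
on a running system is to enumerate the HIP devices. The following is a minimal
sketch (built with `hipcc`); the reported values depend on the platform, but an
MI250 GCD is expected to report 104 compute units:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int device_count = 0;
    if (hipGetDeviceCount(&device_count) != hipSuccess || device_count == 0) {
        fprintf(stderr, "No HIP devices found\n");
        return 1;
    }

    // Each GCD of an MI250 OAM appears here as a separate device,
    // so a node with four OAMs lists eight entries.
    for (int id = 0; id < device_count; ++id) {
        hipDeviceProp_t props;
        if (hipGetDeviceProperties(&props, id) != hipSuccess) continue;
        printf("device %d: %s, %d CUs, arch=%s\n",
               id, props.name, props.multiProcessorCount, props.gcnArchName);
    }
    return 0;
}
```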

## Node-level Architecture

{numref}`mi250-block` shows the node-level architecture of a system that is
based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host
system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe
x16 link to the host part of the system. Depending on the server platform, the
GCD can attach to the AMD EPYC processor directly or via an optional PCIe
switch. Note that some platforms may offer an x8 interface to the GCDs, which
reduces the available host-to-GPU bandwidth.

:::{figure-md} mi250-block

<img src="../../data/reference/gpu_arch/image.003.png" alt="Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor.">

Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation
AMD EPYC processor.
:::

{numref}`mi250-block` shows the node-level architecture of a system with AMD
EPYC processors in a dual-socket configuration and four AMD Instinct MI250
accelerators. The MI250 OAMs attach to the host processors via PCIe Gen 4 x16
links (yellow lines). Depending on the system design, a PCIe switch may exist
to make more PCIe lanes available for additional components like network
interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to
the host part of the system or to the PCIe switch. Note that some platforms may
offer an x8 interface to the GCDs, which reduces the available host-to-GPU
bandwidth.

Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network allows
for direct data exchange between the GPU dies via AMD Infinity Fabric links
(black, green, and red lines). Each of these 16-wide links connects to one of
the two GPU dies in the MI250 OAM and operates at 25 GT/sec, which corresponds
to a theoretical peak transfer rate of 50 GB/sec per link (or 100 GB/sec
bidirectional peak transfer bandwidth). The GCD pairs 2 and 6 as well as GCDs 0
and 4 connect via two XGMI links, which is indicated by the thicker red line in
{numref}`mi250-block`.
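
At the programming level, this P2P network is what HIP peer-to-peer access maps
onto. The sketch below is a minimal example (error handling trimmed) that
queries and enables peer access from device 0 to the other GCDs in the node:

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int device_count = 0;
    if (hipGetDeviceCount(&device_count) != hipSuccess || device_count < 2) {
        fprintf(stderr, "Need at least two GPU devices (GCDs)\n");
        return 1;
    }

    // Check which peers GCD 0 can reach directly over Infinity Fabric / XGMI.
    for (int peer = 1; peer < device_count; ++peer) {
        int can_access = 0;
        hipDeviceCanAccessPeer(&can_access, 0, peer);
        printf("device 0 -> device %d: P2P %s\n",
               peer, can_access ? "available" : "not available");

        if (can_access) {
            hipSetDevice(0);
            // Allows device 0 to access the peer's memory directly.
            hipDeviceEnablePeerAccess(peer, 0 /* flags must be 0 */);
        }
    }
    return 0;
}
```

On systems with ROCm installed, `rocm-smi --showtopo` reports the same link
topology (link type and number of hops) from the command line.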