# AMD Instinct Hardware

This chapter briefly reviews hardware aspects of the AMD Instinct MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

## AMD CDNA 2 Micro-architecture

The micro-architecture of the AMD Instinct MI250 accelerators is based on the
AMD CDNA 2 architecture, which targets compute applications such as HPC,
artificial intelligence (AI), and machine learning (ML) that run on everything
from individual servers to the world’s largest exascale supercomputers. The
overall system architecture is designed for extreme scalability and compute
performance.

{numref}`mi250-gcd` shows the components of a single Graphics Compute Die
(GCD) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity
Fabric™ interfaces and their physical links that are used to connect the GPU
die to the other system-level components of the node (see also the Node-level
Architecture section below). Both interfaces can drive four AMD Infinity
Fabric links. One of the AMD Infinity Fabric links of the controller at the
bottom can be configured as a PCIe link. Each of the AMD Infinity Fabric links
between GPUs can run at up to 25 GT/sec, which corresponds to a peak transfer
bandwidth of 50 GB/sec for a 16-wide link (two bytes per transaction). The
Node-level Architecture section below has more details on the number of AMD
Infinity Fabric links and the resulting transfer rates between the
system-level components.

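The quoted per-link bandwidth follows directly from the transfer rate and the
link width, since a 16-wide link moves two bytes per transfer:

```{math}
25\ \text{GT/s} \times 2\ \text{bytes/transfer} = 50\ \text{GB/s per link and direction}
```
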
To the left and the right are memory controllers that attach the High Bandwidth
Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which offers
a peak memory bandwidth of 1.6 TB/sec per GCD.

The execution units of the GPU are depicted in {numref}`mi250-gcd` as Compute
Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further
subdivided into four SIMD units that process SIMD instructions of 16 data
elements per instruction (for the FP64 data type). This enables the CU to
process 64 work items (a so-called “wavefront”) at a peak clock frequency of
1.7 GHz. Therefore, the theoretical maximum FP64 peak performance for vector
instructions is 22.6 TFLOPS per GCD, or 45.3 TFLOPS for the full MI250 OAM
with its two GCDs. The MI250 compute units also provide specialized execution
units (also called matrix cores), which are geared toward executing matrix
operations like matrix-matrix multiplications. For FP64, the peak performance
of these units amounts to 45.3 TFLOPS per GCD (90.5 TFLOPS per OAM).

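The vector figure follows from the CU count, the per-CU throughput listed in
{numref}`mi250-perf` (128 FLOPS per clock per CU for vector FP64, i.e., four
SIMD units of 16 lanes each retiring a fused multiply-add per lane and clock),
and the peak clock frequency:

```{math}
104\ \text{CUs} \times 128\ \text{FLOPS/clock/CU} \times 1.7\ \text{GHz} \approx 22.6\ \text{TFLOPS per GCD}
```

Doubling this for the two GCDs of an OAM gives the 45.3 TFLOPS quoted above;
the matrix cores double the per-CU rate once more, yielding 45.3 TFLOPS per
GCD and 90.5 TFLOPS per OAM for matrix FP64.
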
:::{figure-md} mi250-gcd

<img src="../../data/reference/gpu_arch/image.001.png" alt="Structure of a single GCD in the AMD Instinct MI250 accelerator.">

Structure of a single GCD in the AMD Instinct MI250 accelerator.
:::

```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
:header-rows: 1
:name: mi250-perf

* - Computation and Data Type
  - FLOPS/CLOCK/CU
  - Peak TFLOPS
* - Matrix FP64
  - 256
  - 90.5
* - Vector FP64
  - 128
  - 45.3
* - Matrix FP32
  - 256
  - 90.5
* - Packed FP32
  - 256
  - 90.5
* - Vector FP32
  - 128
  - 45.3
* - Matrix FP16
  - 1024
  - 362.1
* - Matrix BF16
  - 1024
  - 362.1
* - Matrix INT8
  - 1024
  - 362.1

```

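The relationship between the two numeric columns can be reproduced with a few
lines of code. The following illustrative C++ sketch (the row data is
transcribed from the table above rather than queried from hardware, and the
names are ad hoc) multiplies the per-CU throughput by the 208 CUs of the two
GCDs and the 1.7 GHz peak clock:

```cpp
#include <cstdio>

int main() {
    // Peak-rate model for the MI250 OAM, transcribed from the table above.
    constexpr int    kCusPerOam = 2 * 104; // two GCDs with 104 active CUs each
    constexpr double kClockGhz  = 1.7;     // peak engine clock in GHz

    struct Row { const char* op; int flopsPerClockPerCu; };
    const Row rows[] = {
        {"Matrix FP64", 256}, {"Vector FP64", 128}, {"Matrix FP32", 256},
        {"Packed FP32", 256}, {"Vector FP32", 128}, {"Matrix FP16", 1024},
        {"Matrix BF16", 1024}, {"Matrix INT8", 1024},
    };

    for (const Row& r : rows) {
        // FLOPS/clock/CU x CUs x clock (GHz) gives GFLOPS; divide by 1000
        // to obtain the TFLOPS values listed in the table.
        double tflops = r.flopsPerClockPerCu * kCusPerOam * kClockGhz / 1000.0;
        std::printf("%-12s %6.1f peak TFLOPS\n", r.op, tflops);
    }
    return 0;
}
```
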
{numref}`mi250-perf` summarizes the aggregated peak performance of the AMD
Instinct MI250 OCP Open Accelerator Module (OAM, where OCP is short for Open
Compute Platform) and its two GCDs for different data types and execution
units. The middle column lists the number of floating-point operations per
clock cycle that a single compute unit can sustain if a SIMD (or matrix)
instruction is being retired in each clock cycle. The third column lists the
resulting theoretical peak performance of the OAM module. The theoretical
aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per
GCD).

:::{figure-md} mi250-arch

<img src="../../data/reference/gpu_arch/image.002.png" alt="Dual-GCD architecture of the AMD Instinct MI250 accelerators.">

Dual-GCD architecture of the AMD Instinct MI250 accelerators.
:::

{numref}`mi250-arch` shows the block diagram of an OAM package that consists
of two GCDs, each of which constitutes one GPU device in the system. The two
GCDs in the package are connected via four AMD Infinity Fabric links running
at a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer
bandwidth between the two GCDs of an OAM, or 400 GB/sec of bidirectional peak
transfer bandwidth.

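Because each GCD constitutes its own GPU device, one MI250 OAM is visible to
software as two devices. A minimal HIP sketch, assuming a ROCm installation
with the `hipcc` compiler, that enumerates the visible devices and prints
their compute-unit counts could look like this:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Minimal sketch: list all visible GPU devices. On a node with MI250 OAMs,
// each GCD appears as a separate device with its own compute-unit count.
int main() {
    int deviceCount = 0;
    if (hipGetDeviceCount(&deviceCount) != hipSuccess) {
        std::fprintf(stderr, "No HIP devices found or runtime unavailable.\n");
        return 1;
    }

    for (int i = 0; i < deviceCount; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) != hipSuccess) continue;
        // multiProcessorCount reports the number of compute units (CUs);
        // an MI250 GCD is expected to report 104 active CUs.
        std::printf("Device %d: %s, %d CUs\n",
                    i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}
```

The reported device count and order depend on the node configuration and on
environment variables such as `HIP_VISIBLE_DEVICES`.
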
## Node-level Architecture

{numref}`mi250-block` shows the node-level architecture of a system that is
based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the
host system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its
own PCIe x16 link to the host part of the system. Depending on the server
platform, the GCD can attach to the AMD EPYC processor directly or via an
optional PCIe switch. Note that some platforms may offer an x8 interface to
the GCDs, which reduces the available host-to-GPU bandwidth.

:::{figure-md} mi250-block

<img src="../../data/reference/gpu_arch/image.003.png" alt="Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor.">

Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation
AMD EPYC processor.
:::

The system in {numref}`mi250-block` has AMD EPYC processors in a dual-socket
configuration and four AMD Instinct MI250 accelerators. The MI250 OAMs attach
to the host processors via PCIe Gen 4 x16 links (yellow lines). Depending on
the system design, a PCIe switch may exist to make more PCIe lanes available
for additional components like network interfaces and/or storage devices.
Each GCD maintains its own PCIe x16 link to the host part of the system or to
the PCIe switch. Note that some platforms may offer an x8 interface to the
GCDs, which reduces the available host-to-GPU bandwidth.

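Since each GCD maintains its own PCIe link, every device also reports a
distinct PCIe bus address, which helps map runtime device IDs onto the
physical topology in {numref}`mi250-block`. An illustrative HIP sketch (error
handling kept minimal) is shown below:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Illustrative sketch: print the PCIe bus ID of every visible GPU device.
// Each MI250 GCD is expected to show up with its own bus address.
int main() {
    int deviceCount = 0;
    (void)hipGetDeviceCount(&deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        char busId[64] = {0};
        if (hipDeviceGetPCIBusId(busId, (int)sizeof(busId), i) == hipSuccess) {
            std::printf("Device %d: PCIe bus ID %s\n", i, busId);
        }
    }
    return 0;
}
```
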
Between the OAMs and their respective GCDs, a peer-to-peer (P2P) network
allows for direct data exchange between the GPU dies via AMD Infinity Fabric
links (black, green, and red lines). Each of these 16-wide links connects to
one of the two GPU dies in the MI250 OAM and operates at 25 GT/sec, which
corresponds to a theoretical peak transfer rate of 50 GB/sec per link (or 100
GB/sec bidirectional peak transfer bandwidth). The GCD pairs 2 and 6 as well
as GCDs 0 and 4 connect via two XGMI links, which is indicated by the thicker
red line in {numref}`mi250-block`.
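
Whether the runtime can use these links for direct GPU-to-GPU transfers can be
verified programmatically. The sketch below, which assumes devices 0 and 1 are
peers on the same node (the actual numbering is platform dependent), checks
and enables peer access with the HIP runtime API:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Illustrative sketch: check and enable peer-to-peer (P2P) access between
// two GPU devices. On an MI250 node, peers connected by Infinity Fabric
// links can exchange data directly without staging through host memory.
int main() {
    const int dev = 0, peer = 1; // hypothetical device IDs; adjust per system

    int canAccess = 0;
    if (hipDeviceCanAccessPeer(&canAccess, dev, peer) != hipSuccess ||
        !canAccess) {
        std::printf("Device %d cannot access device %d directly.\n", dev, peer);
        return 0;
    }

    // Enable P2P access from 'dev' to 'peer' (one direction; repeat with the
    // roles swapped for bidirectional access).
    (void)hipSetDevice(dev);
    if (hipDeviceEnablePeerAccess(peer, 0) == hipSuccess) {
        std::printf("P2P access from device %d to device %d enabled.\n",
                    dev, peer);
    }
    return 0;
}
```

Once peer access is enabled, device-to-device copies such as `hipMemcpyPeer`
can move data directly between the GCDs over the Infinity Fabric links.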