mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Peter Park 7b883f3af4 Add MI300X tuning guides (#3448)

2024-07-22 17:24:14 -04:00

AMD Instinct™ MI300 series microarchitecture

The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3 architecture, which is designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine learning (ML) workloads. The accelerators are well suited for extreme scalability and compute performance, running on everything from individual servers to the world’s largest exascale supercomputers.

With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the GPU computational elements of the processor along with the lower levels of the cache hierarchy.

The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series.

Figure (mi300-xcd): XCD-level system architecture showing 40 compute units (CUs), each with 32 KB of L1 cache, a unified compute system with 4 ACE compute accelerators, 4 MB of shared L2 cache, and a hardware scheduler (HWS).

On the XCD, four Asynchronous Compute Engines (ACEs) send compute shader workgroups to the Compute Units (CUs). Each XCD has 40 physical CUs: 38 are active and 2 are disabled for yield management. All CUs share a 4 MB L2 cache that coalesces all memory traffic for the die. With fewer than half the CUs of the AMD Instinct MI200 series compute die, the AMD CDNA™ 3 XCD is a smaller building block. However, it uses more advanced packaging, and the processor can include 6 or 8 XCDs for up to 304 CUs, roughly 40% more than the MI250X.
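The CU arithmetic in the paragraph above can be checked with a short sketch. The per-XCD figures come from the text; the MI250X count of 220 CUs is an assumption taken from its public specification and is not stated in this document.

```python
# Per-XCD compute unit (CU) counts, as given in the text above.
PHYSICAL_CUS_PER_XCD = 40
DISABLED_CUS_PER_XCD = 2
ACTIVE_CUS_PER_XCD = PHYSICAL_CUS_PER_XCD - DISABLED_CUS_PER_XCD  # 38

# Assumption (not from this document): the MI250X exposes 220 CUs.
MI250X_CUS = 220


def active_cus(num_xcds: int) -> int:
    """Return the number of active CUs in a package with `num_xcds` XCDs."""
    return num_xcds * ACTIVE_CUS_PER_XCD


for num_xcds in (6, 8):
    total = active_cus(num_xcds)
    gain_pct = (total / MI250X_CUS - 1) * 100
    print(f"{num_xcds} XCDs: {total} active CUs "
          f"({gain_pct:+.0f}% relative to the assumed MI250X count)")
```

Under that assumption, the 8-XCD configuration comes out at about 38% more CUs than the MI250X, which matches the "roughly 40%" figure.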

The MI300 series integrates up to 8 vertically stacked XCDs, 8 stacks of High-Bandwidth Memory 3 (HBM3), and 4 I/O dies (containing system infrastructure), using AMD Infinity Fabric™ technology as the interconnect.

The Matrix Cores inside the CDNA 3 CUs are significantly improved, with an emphasis on AI and machine learning: they increase the throughput of existing data types and add support for new ones. CDNA 2 Matrix Cores support FP16 and BF16, and offer INT8 for inference. Compared to MI250X accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16 and provide a 6.8-times performance gain for INT8. On CDNA 3, FP8 delivers 16 times the performance of FP32, and TF32 delivers 4 times the performance of FP32.

:header-rows: 1
:name: mi300x-perf-table

*
  - Computation and Data Type
  - FLOPS/CLOCK/CU
  - Peak TFLOPS
*
  - Matrix FP64
  - 256
  - 163.4
*
  - Vector FP64
  - 128
  - 81.7
*
  - Matrix FP32
  - 256
  - 163.4
*
  - Vector FP32
  - 256
  - 163.4
*
  - Matrix TF32
  - 1024
  - 653.7
*
  - Matrix FP16
  - 2048
  - 1307.4
*
  - Matrix BF16
  - 2048
  - 1307.4
*
  - Matrix FP8
  - 4096
  - 2614.9
*
  - Matrix INT8
  - 4096
  - 2614.9

The table above summarizes the aggregated peak performance of the AMD Instinct MI300X Open Compute Platform (OCP) Open Accelerator Module (OAM) for different data types and for vector and matrix execution. The middle column lists the peak throughput of a single compute unit, in operations per clock cycle, assuming a SIMD (or matrix) instruction is issued in every clock cycle. The third column lists the theoretical peak performance of the whole OAM. The theoretical aggregated peak memory bandwidth of the GPU is 5.3 TB per second.
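The relationship between the two numeric columns can be sketched as per-CU rate × number of CUs × clock frequency. The 304 CUs come from the text above; the 2100 MHz peak engine clock is an assumption taken from the public MI300X specification and does not appear in this document.

```python
NUM_CUS = 304          # 8 XCDs x 38 active CUs, from the text above
PEAK_CLOCK_HZ = 2.1e9  # assumption: 2100 MHz peak engine clock

# Operations per clock per CU, copied from the middle column of the table.
OPS_PER_CLOCK_PER_CU = {
    "Matrix FP64": 256,
    "Vector FP64": 128,
    "Matrix FP32": 256,
    "Vector FP32": 256,
    "TF32": 1024,
    "Matrix FP16": 2048,
    "Matrix BF16": 2048,
    "Matrix FP8": 4096,
    "Matrix INT8": 4096,
}


def peak_tflops(ops_per_clock_per_cu: int,
                num_cus: int = NUM_CUS,
                clock_hz: float = PEAK_CLOCK_HZ) -> float:
    """Theoretical peak in tera-operations per second for the whole OAM."""
    return ops_per_clock_per_cu * num_cus * clock_hz / 1e12


for name, rate in OPS_PER_CLOCK_PER_CU.items():
    print(f"{name:<12} {rate:>5} ops/clock/CU -> {peak_tflops(rate):7.1f} peak")

# Ratios quoted in the text: FP8 and TF32 relative to matrix FP32.
fp32 = OPS_PER_CLOCK_PER_CU["Matrix FP32"]
print("FP8 / FP32 :", OPS_PER_CLOCK_PER_CU["Matrix FP8"] // fp32)
print("TF32 / FP32:", OPS_PER_CLOCK_PER_CU["TF32"] // fp32)
```

With the assumed clock, the computed values round to the figures in the third column. These are theoretical peaks; sustained performance depends on the workload.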

The following image shows the block diagrams of the APU (left) and the OAM package (right), both connected via the AMD Infinity Fabric™ on-chip network.

Figure (mi300-arch): MI300 series system architecture showing the MI300A (left) with 6 XCDs and 3 CCDs, and the MI300X (right) with 8 XCDs.

Node-level architecture

Figure (mi300-node): MI300 series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIe switches via retimers and HGX connectors.

The image above shows the node-level architecture of a system with AMD EPYC processors in a dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the host system via PCIe Gen 5 x16 links (yellow lines). Each GPU uses seven high-bandwidth, low-latency AMD Infinity Fabric™ links (red lines), one to each of the other GPUs, to form a fully connected 8-GPU system.
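As a rough illustration of what "fully connected" implies, the sketch below enumerates one link per GPU pair and counts the links seen by each GPU. It models only the logical topology described above; the GPU indices are hypothetical labels, not device IDs reported by any tool.

```python
from itertools import combinations

NUM_GPUS = 8  # eight MI300X OAMs per node, from the text above

# In a fully connected topology, every unordered GPU pair shares one link.
links = list(combinations(range(NUM_GPUS), 2))

# Count how many links terminate at each GPU.
links_per_gpu = {
    gpu: sum(gpu in link for link in links) for gpu in range(NUM_GPUS)
}

print("Total GPU-to-GPU links:", len(links))
print("Links per GPU:", links_per_gpu)
```

Each GPU ends up with seven peers, which is consistent with the seven Infinity Fabric links per GPU, and any pair of GPUs can communicate without an intermediate hop.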
