mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-10 07:08:08 -05:00
Revamped PCIe into new format and incorporated style guide (#4051)
* Revamped PCIe into new format and incorporated style guide
* Title case fixed
* Quick fix and changes
* Added RMW to wordlist and updated titles
* Grammatical fixes incorporated
* Sandra's review feedback incorporated
* Removed PCIe3 feature reference
* Leo's feedback incorporated
* Sandra's feedback incorporated
* Replaced execute with run
* Replaced executing with running
* SME review feedback incorporated
* Minor feedback updated
* Sandra's feedback incorporated
* Filename renamed
* File rename changes updated
* Document title updated

---------

Co-authored-by: prbasyal <prbasyal@amd.com>
@@ -315,6 +315,7 @@ RDMA
|
||||
RDNA
|
||||
README
|
||||
RHEL
|
||||
RMW
|
||||
RNN
|
||||
RNNs
|
||||
ROC
|
||||
|
||||
@@ -1,156 +0,0 @@
|
||||
.. meta::
|
||||
:description: How ROCm uses PCIe atomics
|
||||
:keywords: PCIe, PCIe atomics, atomics, BAR memory, AMD, ROCm
|
||||
|
||||
*****************************************************************************
|
||||
How ROCm uses PCIe atomics
|
||||
*****************************************************************************
|
||||
|
||||
ROCm PCIe feature and overview of BAR memory
|
||||
================================================================
|
||||
|
||||
ROCm is an extension of HSA platform architecture, so it shares the queuing model, memory model,
|
||||
signaling and synchronization protocols. Platform atomics are integral to perform queuing and
|
||||
signaling memory operations where there may be multiple-writers across CPU and GPU agents.
|
||||
|
||||
The full list of HSA system architecture platform requirements is here:
|
||||
`HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
|
||||
|
||||
AMD ROCm Software uses the new PCI Express 3.0 (Peripheral Component Interconnect Express [PCIe]
|
||||
3.0) features for atomic read-modify-write transactions which extends inter-processor synchronization
|
||||
mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling
|
||||
memory operations.
|
||||
|
||||
The new PCIe atomic operations operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``,
|
||||
``SWAP`` atomics. The atomic operations are initiated by the I/O device which support 32-bit, 64-bit and
|
||||
128-bit operand which target address have to be naturally aligned to operation sizes.
|
||||
|
||||
In ROCm, platform atomics are used in the following ways:
|
||||
|
||||
* Update HSA queue's read_dispatch_id: 64 bit atomic add used by the command processor on the
|
||||
GPU agent to update the packet ID it processed.
|
||||
* Update HSA queue's write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to
|
||||
support multi-writer queue insertions.
|
||||
* Update HSA Signals -- 64bit atomic ops are used for CPU & GPU synchronization.
|
||||
|
||||
The PCIe 3.0 atomic operations feature allows atomic transactions to be requested by, routed through
|
||||
and completed by PCIe components. Routing and completion do not require software support.
|
||||
Component support for each is detectable via the Device Capabilities 2 (DevCap2) register. Upstream
|
||||
bridges need to have atomic operations routing enabled or the atomic operations will fail even though
|
||||
PCIe endpoint and PCIe I/O devices are capable of atomic operations.
|
||||
|
||||
To do atomic operations routing capability between two or more Root Ports, each associated Root Port
|
||||
must indicate that capability via the atomic operations routing supported bit in the DevCap2 register.
|
||||
|
||||
If your system has a PCIe Express Switch it needs to support atomic operations routing. Atomic
|
||||
operations requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
|
||||
field is set. These requests can only be serviced if the upstream components support atomic operation
|
||||
completion and/or routing to a component which does. Atomic operations routing support=1, routing
|
||||
is supported; atomic operations routing support=0, routing is not supported.
|
||||
|
||||
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there
|
||||
must be a response for Completion containing the result of the operation. Errors associated with the
|
||||
operation (uncorrectable error accessing the target location or carrying out the atomic operation) are
|
||||
signaled to the requester by setting the Completion Status field in the completion descriptor, they are
|
||||
set to Completer Abort (CA) or Unsupported Request (UR).
|
||||
|
||||
To understand more about how PCIe atomic operations work, see
|
||||
`PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
|
||||
|
||||
`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_
|
||||
|
||||
There are also a number of papers which talk about these new capabilities:
|
||||
|
||||
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
|
||||
* `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
|
||||
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
|
||||
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
|
||||
|
||||
Other I/O devices with PCIe atomics support:
|
||||
|
||||
* Mellanox ConnectX-5 InfiniBand Card
|
||||
* Cray Aries Interconnect
|
||||
* Xilinx 7 Series Devices
|
||||
|
||||
Future bus technology with richer I/O atomics operation Support
|
||||
|
||||
* GenZ
|
||||
|
||||
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs
|
||||
with PCIe Generation 3.0 support.
|
||||
|
||||
* Mellanox Bluefield SOC
|
||||
* Cavium Thunder X2
|
||||
|
||||
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU
|
||||
originates two writes to two different targets:
|
||||
|
||||
* Write to another GPU memory
|
||||
* Write to system memory to indicate transfer complete
|
||||
|
||||
They are routed off to different ends of the computer but we want to make sure the write to system
|
||||
memory to indicate transfer complete occurs AFTER the P2P write to the GPU has completed.
|
||||
|
||||
BAR memory overview
|
||||
----------------------------------------------------------------------------------------------------
|
||||
On a Xeon E5-based system, you can turn on above-4GB PCIe addressing in the BIOS. If you do, you need to set the
|
||||
memory-mapped input/output (MMIO) base address (MMIOH base) and range (MMIO high size) in the BIOS.
|
||||
|
||||
On a Supermicro system, check for the following settings in the system BIOS:
|
||||
|
||||
* Advanced->PCIe/PCI/PnP configuration-\> Above 4G Decoding = Enabled
|
||||
* Advanced->PCIe/PCI/PnP Configuration-\>MMIOH Base = 512G
|
||||
* Advanced->PCIe/PCI/PnP Configuration-\>MMIO High Size = 256G
|
||||
|
||||
When Large BAR capability is supported, there is a Large BAR VBIOS that also disables the IO BAR.
|
||||
|
||||
GFX9 and Vega10 have a physical address of up to 44 bits and a 48-bit virtual address:
|
||||
|
||||
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must
|
||||
be placed < 2^44 to support P2P access from other Vega10.
|
||||
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed \< 2^44 to support P2P access from
|
||||
other Vega10.
|
||||
* BAR4 register: Optional, not a boot device.
|
||||
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed \< 4GB.
|
||||
|
||||
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
|
||||
|
||||
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO
|
||||
Series] (rev c1)
|
||||
|
||||
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35
|
||||
|
||||
Flags: bus master, fast devsel, latency 0, IRQ 119
|
||||
|
||||
Memory at bf40000000 (64-bit, prefetchable) [size=256M]
|
||||
|
||||
Memory at bf50000000 (64-bit, prefetchable) [size=2M]
|
||||
|
||||
I/O ports at 3000 [size=256]
|
||||
|
||||
Memory at c7400000 (32-bit, non-prefetchable) [size=256K]
|
||||
|
||||
Expansion ROM at c7440000 [disabled] [size=128K]
|
||||
|
||||
Legend:
|
||||
|
||||
1 : GPU Frame Buffer BAR -- In this example it happens to be 256M, but typically this will be size of the
|
||||
GPU memory (typically 4GB+). This BAR has to be placed \< 2^40 to allow peer-to-peer access from
|
||||
other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed \< 2^44 to allow peer-to-peer
|
||||
access from other GFX9 AMD GPUs.
|
||||
|
||||
2 : Doorbell BAR -- The size of this BAR is typically \< 10MB (currently fixed at 2MB) for this
|
||||
generation GPUs. This BAR has to be placed \< 2^40 to allow peer-to-peer access from other current
|
||||
generation AMD GPUs.
|
||||
|
||||
3 : IO BAR -- This is for legacy VGA and boot device support, but since the GPUs in this project are
|
||||
not VGA devices (headless), this is not a concern even if the SBIOS does not set it up.
|
||||
|
||||
4 : MMIO BAR -- This is required for the AMD Driver SW to access the configuration registers. Since the
|
||||
remainder of the BAR available is only 1 DWORD (32bit), this is placed \< 4GB. This is fixed at 256KB.
|
||||
|
||||
5 : Expansion ROM -- This is required for the AMD Driver SW to access the GPU video-bios. This is
|
||||
currently fixed at 128KB.
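
The placement rules in the legend can be checked mechanically. The following is a minimal sketch, assuming the ``lspci -v`` text format shown in the listing above; the function name ``bars_below_limit`` and the regular expression are illustrative, not part of any ROCm tool.

```python
import re

# Matches lines such as:
#   Memory at bf40000000 (64-bit, prefetchable) [size=256M]
BAR_LINE = re.compile(
    r"Memory at ([0-9a-f]+) \((\d+)-bit, (non-)?prefetchable\) \[size=(\d+)([KMG])\]"
)
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def bars_below_limit(lspci_text, address_bits):
    """Return (base, size_in_bytes, fits) for every memory BAR in the text.

    `fits` is True when the whole BAR lies below 2**address_bits.
    """
    limit = 1 << address_bits
    results = []
    for match in BAR_LINE.finditer(lspci_text):
        base = int(match.group(1), 16)
        size = int(match.group(4)) * UNITS[match.group(5)]
        results.append((hex(base), size, base + size <= limit))
    return results


# Sample lines copied from the Fiji (GFX8) listing above.
sample = """
Memory at bf40000000 (64-bit, prefetchable) [size=256M]
Memory at bf50000000 (64-bit, prefetchable) [size=2M]
Memory at c7400000 (32-bit, non-prefetchable) [size=256K]
"""

gfx8_check = bars_below_limit(sample, 40)  # 40-bit physical address limit
mmio_check = bars_below_limit(sample, 32)  # MMIO BAR must sit below 4GB
```

In this sample, every BAR fits below 2^40, and only the 32-bit MMIO BAR fits below 4GB.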
|
||||
|
||||
For more information, you can review
|
||||
`Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.
|
||||
57
docs/conceptual/pcie-atomics.rst
Normal file
@@ -0,0 +1,57 @@
|
||||
.. meta::
|
||||
:description: How ROCm uses PCIe atomics
|
||||
:keywords: PCIe, PCIe atomics, atomics, Atomic operations, AMD, ROCm
|
||||
|
||||
*****************************************************************************
|
||||
How ROCm uses PCIe atomics
|
||||
*****************************************************************************
|
||||
AMD ROCm is an extension of the Heterogeneous System Architecture (HSA). To meet the requirements of an HSA-compliant system, ROCm supports queuing models, memory models, and signaling and synchronization protocols. ROCm can perform atomic Read-Modify-Write (RMW) transactions that extend inter-processor synchronization mechanisms to Input/Output (I/O) devices starting from Peripheral Component Interconnect Express 3.0 (PCIe™ 3.0). It supports the defined HSA capabilities for queuing and signaling memory operations. To learn more about the requirements of an HSA-compliant system, see the
|
||||
`HSA Platform System Architecture Specification <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
|
||||
|
||||
ROCm uses platform atomics to perform memory operations like queuing, signaling, and synchronization across multiple CPU agents, GPU agents, and I/O devices. Platform atomics ensure that atomic operations run synchronously, without interruptions or conflicts, across multiple shared resources.
|
||||
|
||||
Platform atomics in ROCm
|
||||
==============================
|
||||
Platform atomics enable the set of atomic operations that perform RMW actions across multiple processors, devices, and memory locations so that they run synchronously without interruption. An atomic operation is a sequence of computing instructions run as a single, indivisible unit. These instructions are completed in their entirety without any interruptions. If the instructions can't be completed as a unit without interruption, none of the instructions are run. These operations support 32-bit and 64-bit address formats.
|
||||
|
||||
Some of the operations for which ROCm uses platform atomics are:
|
||||
|
||||
* Update the HSA queue's ``read_dispatch_id``. The command processor on the GPU agent uses a 64-bit atomic add operation to update the ID of the packet it processed.
|
||||
* Update the HSA queue's ``write_dispatch_id``. The CPU and GPU agents use a 64-bit atomic add operation to support multi-writer queue insertions.
|
||||
* Update HSA Signals. A 64-bit atomic operation is used for CPU and GPU synchronization.
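
To illustrate why an atomic add supports multi-writer queue insertion, here is a minimal software simulation. It is only a sketch: the ``AtomicU64`` class is a hypothetical stand-in that uses a lock to mimic a 64-bit platform atomic, and it does not call any ROCm or HSA runtime API.

```python
import threading


class AtomicU64:
    """Hypothetical software stand-in for a 64-bit platform atomic."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        # Read the old value and store old + delta as one indivisible step.
        with self._lock:
            old = self._value
            self._value = (old + delta) & 0xFFFFFFFFFFFFFFFF
            return old

    def load(self):
        with self._lock:
            return self._value


def reserve_packets(write_dispatch_id, count, out):
    # Each fetch_add hands this writer a packet ID no other writer can get.
    for _ in range(count):
        out.append(write_dispatch_id.fetch_add(1))


write_dispatch_id = AtomicU64()
reserved = [[] for _ in range(4)]  # one list of packet IDs per writer
writers = [
    threading.Thread(target=reserve_packets, args=(write_dispatch_id, 1000, slots))
    for slots in reserved
]
for writer in writers:
    writer.start()
for writer in writers:
    writer.join()

all_ids = sorted(packet_id for slots in reserved for packet_id in slots)
```

Because each writer receives the value returned by its own fetch-and-add, no two writers can reserve the same packet ID, even when they run concurrently.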
|
||||
|
||||
|
||||
PCIe for atomic operations
|
||||
----------------------------
|
||||
ROCm requires CPUs that support PCIe atomics. Similarly, all connected I/O devices should also support PCIe atomics for optimum compatibility. PCIe supports the ``CAS`` (Compare and Swap), ``FetchADD``, and ``SWAP`` atomic operations across multiple resources. These atomic operations are initiated by I/O devices and support 32-bit, 64-bit, and 128-bit operands. The target memory address of each atomic operation must be naturally aligned to the operand size. This alignment ensures that the operations are performed efficiently and correctly without failure.
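
The alignment requirement can be expressed as a simple check. The following is a minimal sketch; the helper name ``is_naturally_aligned`` is hypothetical, and operand sizes are given in bytes (4, 8, and 16 for 32-bit, 64-bit, and 128-bit operands).

```python
# Operand sizes in bytes for 32-bit, 64-bit, and 128-bit atomic operations.
ATOMIC_OPERAND_BYTES = (4, 8, 16)


def is_naturally_aligned(address, operand_bytes):
    """Return True when `address` is a multiple of the operand size."""
    if operand_bytes not in ATOMIC_OPERAND_BYTES:
        raise ValueError(f"unsupported operand size: {operand_bytes} bytes")
    return address % operand_bytes == 0


aligned_64 = is_naturally_aligned(0x1008, 8)      # 0x1008 is a multiple of 8
misaligned_128 = is_naturally_aligned(0x1008, 16)  # 0x1008 is not a multiple of 16
```

The same address can be valid for a 64-bit operation and invalid for a 128-bit one, which is why the operand size must be part of the check.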
|
||||
|
||||
When an atomic operation is successful, the requester receives a Completion response that contains the operation result. However, any errors associated with the operation are signaled to the requester by updating the Completion Status field. Issues accessing the target location or running the atomic operation are common errors. Depending upon the error, the Completion Status field is updated to Completer Abort (CA) or Unsupported Request (UR). The field is present in the Completion Descriptor.
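
As a rough illustration of how a requester might interpret the Completion Status field, the sketch below maps the 3-bit status encodings defined in the PCI Express Base Specification to their names. The function name ``decode_completion_status`` is hypothetical, and extracting the field from a real completion header is not shown.

```python
# 3-bit Completion Status encodings from the PCI Express Base Specification.
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def decode_completion_status(status_bits):
    """Return a readable name for a 3-bit Completion Status value."""
    if not 0 <= status_bits <= 0b111:
        raise ValueError("Completion Status is a 3-bit field")
    return COMPLETION_STATUS.get(status_bits, "Reserved")


def atomic_op_failed(status_bits):
    # Any status other than Successful Completion means no result was returned.
    return status_bits != 0b000


ur_name = decode_completion_status(0b001)
ca_name = decode_completion_status(0b100)
```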
|
||||
|
||||
To learn more about the industry standards and specifications of PCIe, see `PCI-SIG Specification <https://pcisig.com/specifications>`_.
|
||||
|
||||
To learn more about PCIe and its capabilities, consult the following white papers:
|
||||
|
||||
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
|
||||
* `PCI Express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
|
||||
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
|
||||
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
|
||||
|
||||
Working with PCIe 3.0 in ROCm
|
||||
-------------------------------
|
||||
Starting with PCIe 3.0, atomic operations can be requested, routed through, and completed by PCIe components. Routing and completion do not require software support. Component support for each can be identified by the Device Capabilities 2 (DevCap2) register. Upstream
|
||||
bridges need to have atomic operations routing enabled. If not enabled, the atomic operations will fail even if the
|
||||
PCIe endpoint and PCIe I/O devices can perform atomic operations.
|
||||
|
||||
If your system uses PCIe switches to connect and enable communication between multiple PCIe components, the switches must also support atomic operations routing.
|
||||
|
||||
To enable atomic operations routing between multiple root ports, each root port must support atomic operations routing. This capability can be identified from the atomic operations routing support bit in the DevCap2 register. If the bit has a value of 1, routing is supported. Atomic operation requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
|
||||
field is set. These requests can only be serviced if the upstream components also support atomic operation completion or if the requests can be routed to a component that supports atomic operation completion.
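
The register checks described above can be sketched as bit tests on raw register values. This is a minimal sketch: the bit positions follow the PCI Express Base Specification (they match the ``PCI_EXP_DEVCAP2_ATOMIC_*`` and ``PCI_EXP_DEVCTL2_ATOMIC_REQ`` constants in the Linux kernel headers), while the function name ``describe_atomic_support`` and the sample values are hypothetical. Reading the registers from a real device is not shown.

```python
# Device Capabilities 2 (DevCap2) bits related to atomic operations.
DEVCAP2_ATOMIC_ROUTE = 1 << 6    # AtomicOp Routing Supported
DEVCAP2_ATOMIC_COMP32 = 1 << 7   # 32-bit AtomicOp Completer Supported
DEVCAP2_ATOMIC_COMP64 = 1 << 8   # 64-bit AtomicOp Completer Supported
DEVCAP2_ATOMIC_COMP128 = 1 << 9  # 128-bit CAS Completer Supported

# Device Control 2 (DevCtl2) bit related to atomic operations.
DEVCTL2_ATOMIC_REQ = 1 << 6      # AtomicOp Requester Enable


def describe_atomic_support(devcap2, devctl2):
    """Decode atomic-operation bits from raw DevCap2 and DevCtl2 values."""
    return {
        "routing_supported": bool(devcap2 & DEVCAP2_ATOMIC_ROUTE),
        "completer_32bit": bool(devcap2 & DEVCAP2_ATOMIC_COMP32),
        "completer_64bit": bool(devcap2 & DEVCAP2_ATOMIC_COMP64),
        "completer_128bit_cas": bool(devcap2 & DEVCAP2_ATOMIC_COMP128),
        "requester_enabled": bool(devctl2 & DEVCTL2_ATOMIC_REQ),
    }


# Hypothetical register values: routing plus 32-bit and 64-bit completion
# supported, and the requester enable bit set.
support = describe_atomic_support(devcap2=0x1C0, devctl2=0x40)
```

On Linux, ``lspci -vvv`` typically reports the same bits in its ``DevCap2`` and ``DevCtl2`` lines, which can be a quicker check than decoding raw values.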
|
||||
|
||||
ROCm uses the PCIe-ID-based ordering technology for peer-to-peer (P2P) data transmission. PCIe-ID-based ordering technology is used when the GPU initiates multiple write operations to different memory locations.
|
||||
|
||||
For more information on changes implemented in PCIe 3.0, see `Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -58,7 +58,7 @@ ROCm documentation is organized into the following categories:
|
||||
* [File structure (Linux FHS)](./conceptual/file-reorg.md)
|
||||
* [GPU isolation techniques](./conceptual/gpu-isolation.md)
|
||||
* [Using CMake](./conceptual/cmake-packages.rst)
|
||||
* [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
|
||||
* [PCIe atomics in ROCm](./conceptual/pcie-atomics.rst)
|
||||
* [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
|
||||
* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
|
||||
:::
|
||||
|
||||
@@ -154,8 +154,8 @@ subtrees:
|
||||
title: GPU isolation techniques
|
||||
- file: conceptual/cmake-packages.rst
|
||||
title: Using CMake
|
||||
- file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
|
||||
title: ROCm & PCIe atomics
|
||||
- file: conceptual/pcie-atomics.rst
|
||||
title: PCIe atomics in ROCm
|
||||
- file: conceptual/ai-pytorch-inception.md
|
||||
title: Inception v3 with PyTorch
|
||||
- file: conceptual/oversubscription.rst
|
||||
|
||||