update release note files (#2617)

---------

Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
This commit is contained in:
Lisa
2023-11-10 15:14:59 -07:00
committed by GitHub
parent 3f855e386c
commit 37c48060f7
27 changed files with 1699 additions and 995 deletions

View File

@@ -3,7 +3,7 @@ Docker image support matrix
******************************************************************
AMD validates and publishes `PyTorch <https://hub.docker.com/r/rocm/pytorch>`_ and
`TensorFlow <https://hub.docker.com/r/rocm/tensorflow>`_ containers on dockerhub. The following
`TensorFlow <https://hub.docker.com/r/rocm/tensorflow>`_ containers on Docker Hub. The following
tags, and associated inventories, are validated with ROCm 5.7.
.. tab-set::

View File

@@ -1,36 +1,58 @@
===========================
*****************************************************************************
How ROCm uses PCIe atomics
===========================
*****************************************************************************
ROCm PCIe feature and overview of BAR memory
======================================================================
================================================================
ROCm is an extension of HSA platform architecture, so it shares the queuing model, memory model,
signaling and synchronization protocols. Platform atomics are integral to perform queuing and
signaling memory operations where there may be multiple-writers across CPU and GPU agents.
ROCm is an extension of HSA platform architecture, so it shares the queueing model, memory model, signaling and synchronization protocols. Platform atomics are integral to perform queuing and signaling memory operations where there may be multiple-writers across CPU and GPU agents.
The full list of HSA system architecture platform requirements are here:
`HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
The full list of HSA system architecture platform requirements are here: `HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
The ROCm platform uses the new PCI Express 3.0 (Peripheral Component Interconnect Express [PCIe]
3.0) features for atomic read-modify-write transactions which extends inter-processor synchronization
mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling
memory operations.
The ROCm Platform uses the new PCI Express 3.0 (PCIe 3.0) features for Atomic Read-Modify-Write Transactions which extends inter-processor synchronization mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling memory operations.
The new PCIe AtomicOps operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``, ``SWAP`` atomics. The AtomicsOps are initiated by the
I/O device which support 32-bit, 64-bit and 128-bit operand which target address have to be naturally aligned to operation sizes.
The new PCIe atomic operations operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``,
``SWAP`` atomics. The atomic operations are initiated by the I/O device which support 32-bit, 64-bit and
128-bit operand which target address have to be naturally aligned to operation sizes.
For ROCm the Platform atomics are used in ROCm in the following ways:
* Update HSA queues read_dispatch_id: 64 bit atomic add used by the command processor on the GPU agent to update the packet ID it processed.
* Update HSA queues write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to support multi-writer queue insertions.
* Update HSA Signals 64bit atomic ops are used for CPU & GPU synchronization.
* Update HSA queue's read_dispatch_id: 64 bit atomic add used by the command processor on the
GPU agent to update the packet ID it processed.
* Update HSA queue's write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to
support multi-writer queue insertions.
* Update HSA Signals -- 64bit atomic ops are used for CPU & GPU synchronization.
The PCIe 3.0 AtomicOp feature allows atomic transactions to be requested by, routed through and completed by PCIe components. Routing and completion does not require software support. Component support for each is detectable via the DEVCAP2 register. Upstream bridges need to have AtomicOp routing enabled or the Atomic Operations will fail even though PCIe endpoint and PCIe I/O devices has the capability to Atomics Operations.
The PCIe 3.0 atomic operations feature allows atomic transactions to be requested by, routed through
and completed by PCIe components. Routing and completion does not require software support.
Component support for each is detectable via the DevCap2 register. Upstream bridges need to have
atomic operations routing enabled or the atomic operations will fail even though PCIe endpoint and
PCIe I/O devices has the capability to atomic operations.
To do AtomicOp routing capability between two or more Root Ports, each associated Root Port must indicate that capability via the AtomicOp routing supported bit in the Device Capabilities 2 register.
To do atomic operations routing capability between two or more Root Ports, each associated Root Port
must indicate that capability via the atomic operations routing supported bit in the Device Capabilities
2 register.
If your system has a PCIe Express Switch it needs to support AtomicsOp routing. AtomicOp requests are permitted only if a components ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE`` field is set. These requests can only be serviced if the upstream components support AtomicOp completion and/or routing to a component which does. AtomicOp Routing Support=1 Routing is supported, AtomicOp Routing Support=0 routing is not supported.
If your system has a PCIe Express Switch it needs to support atomic operations routing. Atomic
operations requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
field is set. These requests can only be serviced if the upstream components support atomic operation
completion and/or routing to a component which does. Atomic operations routing support=1, routing
is supported; atomic operations routing support=0, routing is not supported.
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there must be a response for Completion containing the result of the operation. Errors associated with the operation (uncorrectable error accessing the target location or carrying out the Atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor, they are set to to Completer Abort (CA) or Unsupported Request (UR).
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there
must be a response for Completion containing the result of the operation. Errors associated with the
operation (uncorrectable error accessing the target location or carrying out the atomic operation) are
signaled to the requester by setting the Completion Status field in the completion descriptor, they are
set to to Completer Abort (CA) or Unsupported Request (UR).
To understand more about how PCIe atomic operations work, see `PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
To understand more about how PCIe atomic operations work, see
`PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_
@@ -39,56 +61,60 @@ There are also a number of papers which talk about these new capabilities:
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
* `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
* `Intel PCIe Generation 3 Hotchips Paper <https://www.hotchips.org/wp-content/uploads/hc_archives/hc21/1_sun/HC21.23.1.SystemInterconnectTutorial-Epub/HC21.23.131.Ajanovic-Intel-PCIeGen3.pdf>`_
* `PCIe Generation 4 Base Specification includes Atomics Operation <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
Other I/O devices with PCIe atomics support
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
Future bus technology with richer I/O atomics operation Support
* GenZ
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs with PCIe Generation 3.0 support.
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs
with PCIe Generation 3.0 support.
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU originates two writes to two different targets:
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU
originates two writes to two different targets:
| 1. write to another GPU memory,
# Write to another GPU memory
# Write to system memory to indicate transfer complete
| 2. then write to system memory to indicate transfer complete.
They are routed off to different ends of the computer but we want to make sure the write to system memory to indicate transfer complete occurs AFTER P2P write to GPU has complete.
They are routed off to different ends of the computer but we want to make sure the write to system
memory to indicate transfer complete occurs AFTER P2P write to GPU has complete.
BAR memory overview
***************************************************************************************************
On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing, if so he need to set MMIO Base address ( MMIOH Base) and Range ( MMIO High Size) in the BIOS.
----------------------------------------------------------------------------------------------------
On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing, if so he need to set
memory-mapped input/output (MMIO) base address (MMIOH base) and range (MMIO high size) in the BIOS.
In SuperMicro system in the system bios you need to see the following
In the Supermicro system in the system bios you need to see the following
* Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
* Advanced->PCIe/PCI/PnP configuration-\> Above 4G Decoding = Enabled
* Advanced->PCIe/PCI/PnP Configuration-\>MMIOH Base = 512G
* Advanced->PCIe/PCI/PnP Configuration-\>MMIO High Size = 256G
* Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G
* Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G
When we support Large Bar Capability there is a Large Bar Vbios which also disable the IO bar.
When we support Large Bar Capability there is a Large Bar VBIOS which also disable the IO bar.
For GFX9 and Vega10 which have Physical Address up 44 bit and 48 bit Virtual address.
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must be placed < 2^44 to support P2P access from other Vega10.
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed < 2^44 to support P2P access from other Vega10.
* BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must
be placed < 2^44 to support P2P access from other Vega10.
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed \< 2^44 to support P2P access from
other Vega10.
* BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed \< 4GB.
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO
Series] (rev c1)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35
@@ -106,40 +132,23 @@ Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit P
Legend:
1 : GPU Frame Buffer BAR In this example it happens to be 256M, but typically this will be size of the GPU memory (typically 4GB+). This BAR has to be placed < 2^40 to allow peer-to-peer access from other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed < 2^44 to allow peer-to-peer access from other GFX9 AMD GPUs.
1 : GPU Frame Buffer BAR -- In this example it happens to be 256M, but typically this will be size of the
GPU memory (typically 4GB+). This BAR has to be placed \< 2^40 to allow peer-to-peer access from
other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed \< 2^44 to allow peer-to-peer
access from other GFX9 AMD GPUs.
2 : Doorbell BAR The size of the BAR is typically will be < 10MB (currently fixed at 2MB) for this generation GPUs. This BAR has to be placed < 2^40 to allow peer-to-peer access from other current generation AMD GPUs.
2 : Doorbell BAR -- The size of the BAR is typically will be \< 10MB (currently fixed at 2MB) for this
generation GPUs. This BAR has to be placed \< 2^40 to allow peer-to-peer access from other current
generation AMD GPUs.
3 : IO BAR - This is for legacy VGA and boot device support, but since this the GPUs in this project are not VGA devices (headless), this is not a concern even if the SBIOS does not setup.
3 : IO BAR -- This is for legacy VGA and boot device support, but since this the GPUs in this project are
not VGA devices (headless), this is not a concern even if the SBIOS does not setup.
4 : MMIO BAR This is required for the AMD Driver SW to access the configuration registers. Since the reminder of the BAR available is only 1 DWORD (32bit), this is placed < 4GB. This is fixed at 256KB.
4 : MMIO BAR -- This is required for the AMD Driver SW to access the configuration registers. Since the
reminder of the BAR available is only 1 DWORD (32bit), this is placed \< 4GB. This is fixed at 256KB.
5 : Expansion ROM This is required for the AMD Driver SW to access the GPUs video-bios. This is currently fixed at 128KB.
5 : Expansion ROM -- This is required for the AMD Driver SW to access the GPU video-bios. This is
currently fixed at 128KB.
Excerpts from 'Overview of Changes to PCI Express 3.0'
================================================================
By Mike Jackson, Senior Staff Architect, MindShare, Inc.
***************************************************************************************************
Atomic operations goal:
***************************************************************************************************
Support SMP-type operations across a PCIe network to allow for things like offloading tasks between CPU cores and accelerators like a GPU. The spec says this enables advanced synchronization mechanisms that are particularly useful with multiple producers or consumers that need to be synchronized in a non-blocking fashion. Three new atomic non-posted requests were added, plus the corresponding completion (the address must be naturally aligned with the operand size or the TLP is malformed):
* Fetch and Add uses one operand as the “add” value. Reads the target location, adds the operand, and then writes the result back to the original location.
* Unconditional Swap uses one operand as the “swap” value. Reads the target location and then writes the swap value to it.
* Compare and Swap uses 2 operands: first data is compare value, second is swap value. Reads the target location, checks it against the compare value and, if equal, writes the swap value to the target location.
* AtomicOpCompletion new completion to give the result so far atomic request and indicate that the atomicity of the transaction has been maintained.
Since atomic operations are not locked they don't have the performance downsides of the PCI locked protocol. Compared to locked cycles, they provide “lower latency, higher scalability, advanced synchronization algorithms, and dramatically lower impact on other PCIe traffic.” The lock mechanism can still be used across a bridge to PCI or PCI-X to achieve the desired operation.
Atomic operations can go from device to device, device to host, or host to device. Each completer indicates whether it supports this capability and guarantees atomic access if it does. The ability to route atomic operations is also indicated in the registers for a given port.
ID-based ordering goal:
***************************************************************************************************
Improve performance by avoiding stalls caused by ordering rules. For example, posted writes are never normally allowed to pass each other in a queue, but if they are requested by different functions, we can have some confidence that the requests are not dependent on each other. The previously reserved Attribute bit [2] is now combined with the RO bit to indicate ID ordering with or without relaxed ordering.
This only has meaning for memory requests, and is reserved for Configuration or IO requests. Completers are not required to copy this bit into a completion, and only use the bit if their enable bit is set for this operation.
To read more on PCIe Gen 3 new options https://www.mindshare.com/files/resources/PCIe%203-0.pdf
For more information, you can review
`Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.

View File

@@ -4,31 +4,32 @@ Using CMake
Most components in ROCm support CMake. Projects depending on header-only or
library components typically require CMake 3.5 or higher whereas those wanting
to make use of CMake's HIP language support will require CMake 3.21 or higher.
to make use of the CMake HIP language support will require CMake 3.21 or higher.
Finding dependencies
====================
.. note::
For a complete
reference on how to deal with dependencies in CMake, refer to the CMake docs
on `find_package
<https://cmake.org/cmake/help/latest/command/find_package.html>`_ and the
`Using Dependencies Guide
<https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html>`_
to get an overview of CMake's related facilities.
For a complete
reference on how to deal with dependencies in CMake, refer to the CMake docs
on `find_package
<https://cmake.org/cmake/help/latest/command/find_package.html>`_ and the
`Using Dependencies Guide
<https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html>`_
to get an overview of CMake related facilities.
In short, CMake supports finding dependencies in two ways:
* In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to
find the component in typical install locations and layouts. CMake ships a
few dozen such scripts, but users and projects may ship them as well.
find the component in typical install locations and layouts. CMake ships a
few dozen such scripts, but users and projects may ship them as well.
* In Config mode, it locates a file named ``<packagename>-config.cmake`` or
``<PackageName>Config.cmake`` which describes the installed component in all
regards needed to consume it.
``<PackageName>Config.cmake`` which describes the installed component in all
regards needed to consume it.
ROCm predominantly relies on Config mode, one notable exception being the Module
driving the compilation of HIP programs on Nvidia runtimes. As such, when
driving the compilation of HIP programs on NVIDIA runtimes. As such, when
dependencies are not found in standard system locations, one either has to
instruct CMake to search for package config files in additional folders using
the ``CMAKE_PREFIX_PATH`` variable (a semi-colon separated list of file system
@@ -57,7 +58,7 @@ Using HIP in CMake
==================
ROCm components providing a C/C++ interface support consumption via any
C/C++ toolchain that CMake knows how to drive. ROCm also supports CMake's HIP
C/C++ toolchain that CMake knows how to drive. ROCm also supports the CMake HIP
language features, allowing users to program using the HIP single-source
programming model. When a program (or translation-unit) uses the HIP API without
compiling any GPU device code, HIP can be treated in CMake as a simple C/C++
@@ -70,22 +71,22 @@ Source code written in the HIP dialect of C++ typically uses the `.hip`
extension. When the HIP CMake language is enabled, it will automatically
associate such source files with the HIP toolchain being used.
::
.. code-block:: cpp
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(MyProj LANGUAGES HIP)
add_executable(MyApp Main.hip)
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(MyProj LANGUAGES HIP)
add_executable(MyApp Main.hip)
Should you have existing CUDA code that is from the source compatible subset of
HIP, you can tell CMake that despite their `.cu` extension, they're HIP sources.
Do note that this mostly facilitates compiling kernel code-only source files,
as host-side CUDA API won't compile in this fashion.
::
.. code-block:: cpp
add_library(MyLib MyLib.cu)
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)
add_library(MyLib MyLib.cu)
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)
CMake itself only hosts part of the HIP language support, such as defining
HIP-specific properties, etc. while the other half ships with the HIP
@@ -110,19 +111,20 @@ Illustrated in the example below is a C++ application using MIOpen from CMake.
It calls ``find_package(miopen)``, which provides the ``MIOpen`` imported
target. This can be linked with ``target_link_libraries``
::
.. code-block:: cpp
cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(miopen)
add_library(MyLib ...)
target_link_libraries(MyLib PUBLIC MIOpen)
cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(miopen)
add_library(MyLib ...)
target_link_libraries(MyLib PUBLIC MIOpen)
.. note::
Most libraries are designed as host-only API, so using a GPU device
compiler is not necessary for downstream projects unless they use GPU device
code.
Most libraries are designed as host-only API, so using a GPU device
compiler is not necessary for downstream projects unless they use GPU device
code.
Consuming the HIP API in C++ code
---------------------------------
@@ -131,24 +133,25 @@ Use the HIP API without compiling the GPU device code. As there is no GPU code,
any C or C++ compiler can be used. The ``find_package(hip)`` provides the
``hip::host`` imported target to use HIP in this context.
::
.. code-block:: cpp
cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_executable(MyApp ...)
target_link_libraries(MyApp PRIVATE hip::host)
cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_executable(MyApp ...)
target_link_libraries(MyApp PRIVATE hip::host)
Compiling device code in C++ language mode
------------------------------------------
.. attention::
The workflow detailed here is considered legacy and is shown for
understanding's sake. It pre-dates the existence of HIP language support in
CMake. If source code has HIP device code in it, it is a HIP source file
and should be compiled as such. Only resort to the method below if your
HIP-enabled CMake codepath can't mandate CMake version 3.21.
The workflow detailed here is considered legacy and is shown for
understanding's sake. It pre-dates the existence of HIP language support in
CMake. If source code has HIP device code in it, it is a HIP source file
and should be compiled as such. Only resort to the method below if your
HIP-enabled CMake codepath can't mandate CMake version 3.21.
If code uses the HIP API and compiles GPU device code, it requires using a
device compiler. The compiler for CMake can be set using either the
@@ -160,18 +163,19 @@ compiler that supports AMD GPU targets, which is usually Clang.
The ``find_package(hip)`` provides the ``hip::device`` imported target to add
all the flags necessary for device compilation.
::
.. code-block:: cpp
cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
cmake_policy(VERSION 3.8...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_library(MyLib ...)
target_link_libraries(MyLib PRIVATE hip::device)
target_compile_features(MyLib PRIVATE cxx_std_11)
cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
cmake_policy(VERSION 3.8...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_library(MyLib ...)
target_link_libraries(MyLib PRIVATE hip::device)
target_compile_features(MyLib PRIVATE cxx_std_11)
.. note::
Compiling for the GPU device requires at least C++11.
Compiling for the GPU device requires at least C++11.
This project can then be configured with for eg.
@@ -252,13 +256,12 @@ options.
IDEs supporting CMake (Visual Studio, Visual Studio Code, CLion, etc.) all came
up with their own way to register command-line fragments of different purpose in
a setup'n'forget fashion for quick assembly using graphical front-ends. This is
a setup-and-forget fashion for quick assembly using graphical front-ends. This is
all nice, but configurations aren't portable, nor can they be reused in
Continuous Intergration (CI) pipelines. CMake has condensed existing practice
Continuous Integration (CI) pipelines. CMake has condensed existing practice
into a portable JSON format that works in all IDEs and can be invoked from any
command line. This is
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_
.
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_.
There are two types of preset files: one supplied by the project, called
``CMakePresets.json`` which is meant to be committed to version control,
@@ -275,109 +278,110 @@ Following is an example ``CMakeUserPresets.json`` file which actually compiles
the `amd/rocm-examples <https://github.com/amd/rocm-examples>`_ suite of sample
applications on a typical ROCm installation:
::
.. code-block:: json
{
"version": 3,
"cmakeMinimumRequired": {
"major": 3,
"minor": 21,
"patch": 0
{
"version": 3,
"cmakeMinimumRequired": {
"major": 3,
"minor": 21,
"patch": 0
},
"configurePresets": [
{
"name": "layout",
"hidden": true,
"binaryDir": "${sourceDir}/build/${presetName}",
"installDir": "${sourceDir}/install/${presetName}"
},
"configurePresets": [
{
"name": "layout",
"hidden": true,
"binaryDir": "${sourceDir}/build/${presetName}",
"installDir": "${sourceDir}/install/${presetName}"
},
{
"name": "generator-ninja-multi-config",
"hidden": true,
"generator": "Ninja Multi-Config"
},
{
"name": "toolchain-makefiles-c/c++-amdclang",
"hidden": true,
"cacheVariables": {
"CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
"CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
"CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
}
},
{
"name": "clang-strict-iso-high-warn",
"hidden": true,
"cacheVariables": {
"CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
}
},
{
"name": "ninja-mc-rocm",
"displayName": "Ninja Multi-Config ROCm",
"inherits": [
"layout",
"generator-ninja-multi-config",
"toolchain-makefiles-c/c++-amdclang",
"clang-strict-iso-high-warn"
]
{
"name": "generator-ninja-multi-config",
"hidden": true,
"generator": "Ninja Multi-Config"
},
{
"name": "toolchain-makefiles-c/c++-amdclang",
"hidden": true,
"cacheVariables": {
"CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
"CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
"CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
}
],
"buildPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-debug-verbose",
"displayName": "Debug (verbose)",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"verbose": true
},
{
"name": "ninja-mc-rocm-release-verbose",
"displayName": "Release (verbose)",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"verbose": true
},
{
"name": "clang-strict-iso-high-warn",
"hidden": true,
"cacheVariables": {
"CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
}
],
"testPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
},
{
"name": "ninja-mc-rocm",
"displayName": "Ninja Multi-Config ROCm",
"inherits": [
"layout",
"generator-ninja-multi-config",
"toolchain-makefiles-c/c++-amdclang",
"clang-strict-iso-high-warn"
]
}
],
"buildPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-debug-verbose",
"displayName": "Debug (verbose)",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"verbose": true
},
{
"name": "ninja-mc-rocm-release-verbose",
"displayName": "Release (verbose)",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"verbose": true
}
],
"testPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
]
}
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
}
]
}
.. note::
Getting presets to work reliably on Windows requires some CMake improvements
and/or support from compiler vendors. (Refer to
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
.)
Getting presets to work reliably on Windows requires some CMake improvements
and/or support from compiler vendors. (Refer to
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
.)

View File

@@ -326,7 +326,7 @@ maintainers and installs all the required dependencies, including:
## Testing the PyTorch installation
You can use PyTorch unit tests to validate your PyTorch installation. If you used a
**prebuilt PyTorch Docker image from AMD ROCm DockerHub** or installed an
**prebuilt PyTorch Docker image from AMD ROCm Docker Hub** or installed an
**official wheels package**, validation tests are not necessary.
If you want to manually run unit tests to validate your PyTorch installation fully, follow these steps: