mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-07 22:03:58 -05:00
AMD GPU Docs System optimization migration changes in ROCm Docs Develop (#4538)
* AMD GPU Docs System optimization migration changes in ROCm Docs (#296)
* System optimization migration changes in ROCm
* Linting issue fixed
* Linking corrected
* Minor change
* Link updated to Instinct.docs.amd.com
* ROCm docs grid updated by removing IOMMU.rst, pcie-atomics, and oversubscription pages
* Files removed and reference fixed
* Reference text updated
* GPU atomics from 6.4.0 removed
@@ -1,63 +0,0 @@
.. meta::
   :description: Input-Output Memory Management Unit (IOMMU)
   :keywords: IOMMU, DMA, PCIe, xGMI, AMD, ROCm

****************************************************************
Input-Output Memory Management Unit (IOMMU)
****************************************************************

The I/O Memory Management Unit (IOMMU) provides memory remapping services for I/O devices. It adds support for address translation and system memory access protection on direct memory access (DMA) transfers from peripheral devices.

The IOMMU's memory remapping services:

* provide private I/O space for devices used in a guest virtual machine.
* prevent unauthorized DMA requests to system memory and to memory-mapped I/O (MMIO).
* help in debugging memory access issues.
* facilitate peer-to-peer DMA.

The IOMMU also provides interrupt remapping, which is used by devices that support multiple interrupts and for interrupt delivery on hardware platforms with a large number of cores.

.. note::

   AMD Instinct accelerators are connected via xGMI links and don't use PCI/PCIe for peer-to-peer DMA. Because PCI/PCIe is not used for peer-to-peer DMA, there are no device physical addressing limitations or platform root port limitations. However, because non-GPU devices such as RDMA NICs use PCIe for peer-to-peer DMA, there might still be physical addressing and platform root port limitations when these non-GPU devices interact with other devices, including GPUs.

Linux supports the IOMMU in both virtualized environments and on bare metal.

The IOMMU is enabled by default but can be disabled or put into passthrough mode through the Linux kernel command line:
.. list-table::
   :widths: 15 15 70
   :header-rows: 1

   * - IOMMU Mode
     - Kernel command
     - Description
   * - Enabled
     - Default setting
     - Recommended for AMD Radeon GPUs that need peer-to-peer DMA.

       The IOMMU is enabled in remapping mode. Each device gets its own I/O virtual address space. All devices on Linux register their DMA addressing capabilities, and the kernel ensures that any address space mapped for DMA is mapped within the device's DMA addressing limits. Only address space explicitly mapped by the devices is mapped into virtual address space. Attempts to access an unmapped page generate an IOMMU page fault.
   * - Passthrough
     - ``iommu=pt``
     - Recommended for AMD Instinct accelerators and for AMD Radeon GPUs that don't need peer-to-peer DMA.

       Interrupt remapping is enabled but I/O remapping is disabled. The entire platform shares a common platform address space for system memory and MMIO spaces, ensuring compatibility with drivers from external vendors, while still supporting CPUs with a large number of cores.
   * - Disabled
     - ``iommu=off``
     - Not recommended.

       The IOMMU is disabled and the entire platform shares a common platform address space for system memory and MMIO spaces.

       This mode should only be used with older Linux distributions whose kernels are not configured to support peer-to-peer DMA with an IOMMU. In these cases, the IOMMU needs to be disabled to use peer-to-peer DMA.
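Passthrough mode is selected on the kernel command line. As a sketch of one common way to apply it (the GRUB defaults path and the regeneration command are distribution-dependent assumptions; Debian-based systems use ``update-grub`` instead of ``grub2-mkconfig``):

```shell
# Add iommu=pt to the kernel command line via the GRUB defaults file
# (the /etc/default/grub path and grub2-mkconfig invocation vary by distro).
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&iommu=pt /' /etc/default/grub
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

# After rebooting, confirm the option is active:
cat /proc/cmdline
```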
The IOMMU also provides virtualized access to the MMIO portions of the platform address space for peer-to-peer DMA.

Because peer-to-peer DMA is not officially part of the PCI/PCIe specification, its behavior varies between hardware platforms.

AMD CPUs earlier than AMD Zen only supported peer-to-peer DMA for writes. On AMD Zen CPUs and later, peer-to-peer DMA is fully supported.

To use peer-to-peer DMA on Linux, enable the following options in your Linux kernel configuration:

* ``CONFIG_PCI_P2PDMA``
* ``CONFIG_DMABUF_MOVE_NOTIFY``
* ``CONFIG_HSA_AMD_P2P``
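Whether a running kernel was built with these options can be checked against its build configuration; a quick sketch (the ``/boot/config-*`` path is an assumption, since some distributions expose ``/proc/config.gz`` instead):

```shell
# List the peer-to-peer DMA options in the running kernel's build config.
grep -E 'CONFIG_PCI_P2PDMA|CONFIG_DMABUF_MOVE_NOTIFY|CONFIG_HSA_AMD_P2P' \
    "/boot/config-$(uname -r)"
```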
@@ -1,34 +0,0 @@
.. meta::
   :description: Learn what causes oversubscription.
   :keywords: warning, log, gpu, performance penalty, help

*******************************************************************
Oversubscription of hardware resources in AMD Instinct accelerators
*******************************************************************

When an AMD Instinct™ MI series accelerator enters an oversubscribed state, the ``amdgpu`` driver outputs the following message:

``amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.``

Oversubscription occurs when application demands exceed the available hardware resources. In an oversubscribed state, the hardware scheduler tries to manage resource usage in a round-robin fashion. However, this can result in reduced performance, as resources might be occupied by applications or queues not actively submitting work. The granularity of hardware resources occupied by an inactive queue can be on the order of milliseconds, during which the accelerator or GPU is effectively blocked and unable to process work submitted by other queues.

What triggers oversubscription?
===============================

The system enters an oversubscribed state when one of the following conditions is met:

* **Hardware queue limit exceeded**: The number of user-mode compute queues requested by applications exceeds the hardware limit of 24 queues for current Instinct accelerators.

* **Virtual memory context slots exceeded**: The number of user processes exceeds the number of available virtual memory context slots, which is 11 for current Instinct accelerators.

* **Multiple processes using cooperative workgroups**: More than one process attempts to use the cooperative workgroup feature, leading to resource contention.
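When oversubscription is suspected, the warning quoted above can be searched for directly in the kernel log (a diagnostic sketch; reading the kernel ring buffer typically requires root):

```shell
# Search the kernel log for the amdgpu oversubscription warning.
sudo dmesg | grep -i "oversubscribed" || echo "No oversubscription warning logged"
```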
@@ -1,57 +0,0 @@
.. meta::
   :description: How ROCm uses PCIe atomics
   :keywords: PCIe, PCIe atomics, atomics, Atomic operations, AMD, ROCm

*****************************************************************************
How ROCm uses PCIe atomics
*****************************************************************************

AMD ROCm is an extension of the Heterogeneous System Architecture (HSA). To meet the requirements of an HSA-compliant system, ROCm supports queuing models, memory models, and signaling and synchronization protocols. ROCm can perform atomic Read-Modify-Write (RMW) transactions that extend inter-processor synchronization mechanisms to Input/Output (I/O) devices, starting from Peripheral Component Interconnect Express 3.0 (PCIe™ 3.0). It supports the defined HSA capabilities for queuing and signaling memory operations. To learn more about the requirements of an HSA-compliant system, see the
`HSA Platform System Architecture Specification <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.

ROCm uses platform atomics to perform memory operations like queuing, signaling, and synchronization across multiple CPUs, GPU agents, and I/O devices. Platform atomics ensure that atomic operations run synchronously, without interruptions or conflicts, across multiple shared resources.

Platform atomics in ROCm
==============================

Platform atomics enable the set of atomic operations that perform RMW actions across multiple processors, devices, and memory locations so that they run synchronously without interruption. An atomic operation is a sequence of computing instructions run as a single, indivisible unit. These instructions are completed in their entirety without any interruptions; if they can't be completed as a unit without interruption, none of them are run. These operations support 32-bit and 64-bit address formats.

Some of the operations for which ROCm uses platform atomics are:

* Updating the HSA queue's ``read_dispatch_id``: the command processor on the GPU agent uses a 64-bit atomic add operation to update the ID of the packet it has processed.
* Updating the HSA queue's ``write_dispatch_id``: the CPU and GPU agents use a 64-bit atomic add operation to support multi-writer queue insertions.
* Updating HSA signals: a 64-bit atomic operation is used for CPU and GPU synchronization.

PCIe for atomic operations
----------------------------
ROCm requires CPUs that support PCIe atomics. Similarly, all connected I/O devices should also support PCIe atomics for optimum compatibility. PCIe supports the ``CAS`` (Compare and Swap), ``FetchADD``, and ``SWAP`` atomic operations across multiple resources. These atomic operations are initiated by I/O devices and support 32-bit, 64-bit, and 128-bit operands. The target memory address where an atomic operation is performed must be aligned to the size of the operand. This alignment ensures that the operations are performed efficiently and correctly without failure.

When an atomic operation is successful, the requester receives a completion response along with the operation result. Any errors associated with the operation are signaled to the requester by updating the Completion Status field. Issues accessing the target location or running the atomic operation are common errors. Depending upon the error, the Completion Status field is updated to Completer Abort (CA) or Unsupported Request (UR). The field is present in the Completion Descriptor.

To learn more about the industry standards and specifications of PCIe, see the `PCI-SIG Specifications <https://pcisig.com/specifications>`_.

To learn more about PCIe and its capabilities, consult the following white papers:

* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
* `PCI Express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_

Working with PCIe 3.0 in ROCm
-------------------------------

Starting with PCIe 3.0, atomic operations can be requested, routed through, and completed by PCIe components. Routing and completion do not require software support. Component support for each can be identified by the Device Capabilities 2 (DevCap2) register. Upstream bridges need to have atomic operations routing enabled; if it is not enabled, atomic operations will fail even if the PCIe endpoint and PCIe I/O devices can perform atomic operations.

If your system uses PCIe switches to connect and enable communication between multiple PCIe components, the switches must also support atomic operations routing.

To enable atomic operations routing between multiple root ports, each root port must support atomic operation routing. This capability can be identified from the atomic operations routing support bit in the DevCap2 register: if the bit has a value of 1, routing is supported. Atomic operation requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE`` field is set. These requests can only be serviced if the upstream components also support atomic operation completion or if the requests can be routed to a component that supports atomic operation completion.
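These capability and control bits can be inspected from user space. As a sketch (assumes the ``pciutils`` package is installed; full capability output generally requires root):

```shell
# AtomicOpsCap (reported under DevCap2) lists supported operand sizes and
# routing support; AtomicOpsCtl (under DevCtl2) shows the requester enable bit.
sudo lspci -vvv | grep -E 'AtomicOpsCap|AtomicOpsCtl'
```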
ROCm uses PCIe ID-based ordering for peer-to-peer (P2P) data transmission. ID-based ordering is used when the GPU initiates multiple write operations to different memory locations.

For more information on changes implemented in PCIe 3.0, see `Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.
@@ -23,7 +23,7 @@ There are two ways to handle this:
 * Ensure that the high MMIO aperture is within the physical addressing limits of the devices in the system. For example, if the devices have a 44-bit physical addressing limit, set the ``MMIO High Base`` and ``MMIO High size`` options in the BIOS such that the aperture is within the 44-bit address range, and ensure that the ``Above 4G Decoding`` option is Enabled.

-* Enable the Input-Output Memory Management Unit (IOMMU). When the IOMMU is enabled in non-passthrough mode, it will create a virtual I/O address space for each device on the system. It also ensures that all virtual addresses created in that space are within the physical addressing limits of the device. For more information on IOMMU, see :doc:`../conceptual/iommu`.
+* Enable the Input-Output Memory Management Unit (IOMMU). When the IOMMU is enabled in non-passthrough mode, it will create a virtual I/O address space for each device on the system. It also ensures that all virtual addresses created in that space are within the physical addressing limits of the device. For more information on IOMMU, see `Input-Output Memory Management Unit (IOMMU) <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/conceptual/iommu.html>`_.

 .. _bar-configuration:
   see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.

 - To learn more about system settings and management practices to configure your system for
-  MI300X accelerators, see :doc:`../../system-optimization/mi300x`.
+  MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.

 - To learn how to run LLM models from Hugging Face or your own model, see
   :doc:`Running models from Hugging Face <hugging-face-models>`.
@@ -8,108 +8,21 @@ System optimization
 *******************

 This guide outlines system setup and tuning suggestions for AMD hardware to
-optimize performance for specific types of workloads or use-cases.
+optimize performance for specific types of workloads or use-cases. The contents are structured according to the hardware:

-High-performance computing workloads
-====================================
+.. grid:: 2

-High-performance computing (HPC) workloads have unique requirements. The default
-hardware and BIOS configurations for OEM platforms may not provide optimal
-performance for HPC workloads. To enable optimal HPC settings on a per-platform
-and per-workload level, this chapter describes:
+   .. grid-item-card:: AMD RDNA

-* BIOS settings that can impact performance
-* Hardware configuration best practices
-* Supported versions of operating systems
-* Workload-specific recommendations for optimal BIOS and operating system
-  settings
+      * :doc:`AMD RDNA2 system optimization <w6000-v620>`

-There is also a discussion on the AMD Instinct™ software development
-environment, including information on how to install and run the DGEMM, STREAM,
-HPCG, and HPL benchmarks. This guide provides a good starting point but is
-not tested exhaustively across all compilers.
+   .. grid-item-card:: AMD Instinct

-Knowledge prerequisites to better understand this document and to perform tuning
-for HPC applications include:
+      * `AMD Instinct MI300X <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
+      * `AMD Instinct MI300A <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300a.html>`_
+      * `AMD Instinct MI200 <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi200.html>`_
+      * `AMD Instinct MI100 <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi100.html>`_

-* Experience in configuring servers
-* Administrative access to the server's Management Interface (BMC)
-* Administrative access to the operating system
-* Familiarity with the OEM server's BMC (strongly recommended)
-* Familiarity with the OS specific tools for configuration, monitoring, and
-  troubleshooting (strongly recommended)
-
-This document provides guidance on tuning systems with various AMD Instinct
-accelerators for HPC workloads. The following sections don't comprise an
-all-inclusive guide, and some items referred to may have similar, but different,
-names in various OEM systems (for example, OEM-specific BIOS settings). The
-following sections also provide suggestions on items that should be the initial
-focus of additional, application-specific tuning.
-
-While this guide is a good starting point, developers are encouraged to perform
-their own performance testing for additional tuning.
-
-.. list-table::
-   :header-rows: 1
-   :stub-columns: 1
-
-   * - System optimization guide
-     - Architecture reference
-     - White papers
-   * - :doc:`AMD Instinct MI300X <mi300x>`
-     - `AMD Instinct MI300 instruction set architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`_
-     - `CDNA 3 architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf>`_
-   * - :doc:`AMD Instinct MI300A <mi300a>`
-     - `AMD Instinct MI300 instruction set architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`_
-     - `CDNA 3 architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf>`_
-   * - :doc:`AMD Instinct MI200 <mi200>`
-     - `AMD Instinct MI200 instruction set architecture <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`_
-     - `CDNA 2 architecture <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>`_
-   * - :doc:`AMD Instinct MI100 <mi100>`
-     - `AMD Instinct MI100 instruction set architecture <https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf>`_
-     - `CDNA architecture <https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf>`_
-
-Workstation workloads
-=====================
-
-Workstation workloads, much like those for HPC, have a unique set of
-requirements: a blend of both graphics and compute, certification, stability and
-others.
-
-The document covers specific software requirements and processes needed to use
-these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning
-tasks.
-
-The main purpose of this document is to help users utilize the RDNA™ 2 GPUs to
-their full potential.
-
-.. list-table::
-   :header-rows: 1
-   :stub-columns: 1
-
-   * - System optimization guide
-     - Architecture reference
-     - White papers
-   * - :doc:`AMD Radeon PRO W6000 and V620 <w6000-v620>`
-     - `AMD RDNA 2 instruction set architecture <https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf>`_
-     - `RDNA 2 architecture <https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf>`_
@@ -1,475 +0,0 @@
---
myst:
  html_meta:
    "description": "AMD Instinct MI100 system settings optimization guide."
    "keywords": "Instinct, MI100, microarchitecture, AMD, ROCm"
---

# AMD Instinct MI100 system optimization

## System settings

This chapter reviews system settings that are required to configure the system
for AMD Instinct™ MI100 accelerators and that can improve performance of the
GPUs. It is advised to configure the system for the best possible host
configuration according to the high-performance computing tuning guides for AMD
EPYC™ 7002 Series and EPYC™ 7003 Series processors, depending on the processor
generation of the system.

In addition to the BIOS settings listed below ({ref}`mi100-bios-settings`), the
following settings will also have to be applied via the command line (see
{ref}`mi100-os-settings`):

* Core C-states
* AMD-IOPM-UTIL (on AMD EPYC™ 7002 series processors)
* IOMMU (if needed)

(mi100-bios-settings)=

### System BIOS settings

For maximum MI100 GPU performance on systems with AMD EPYC™ 7002 series
processors (codename "Rome") and AMI System BIOS, the following configuration of
System BIOS settings has been validated. These settings must be used for the
qualification process and should be set as default values for the system BIOS.
Analogous settings for other non-AMI System BIOS providers could be set
similarly. For systems with Intel processors, some settings may not apply or may
not be available as listed in the following table.
```{list-table} Recommended settings for the system BIOS in a GIGABYTE platform.
:header-rows: 1
:name: mi100-bios

*
  - BIOS Setting Location
  - Parameter
  - Value
  - Comments
*
  - Advanced / PCI Subsystem Settings
  - Above 4G Decoding
  - Enabled
  - GPU Large BAR Support
*
  - AMD CBS / CPU Common Options
  - Global C-state Control
  - Auto
  - Global C-States
*
  - AMD CBS / CPU Common Options
  - CCD/Core/Thread Enablement
  - Accept
  - Global C-States
*
  - AMD CBS / CPU Common Options / Performance
  - SMT Control
  - Disable
  - Global C-States
*
  - AMD CBS / DF Common Options / Memory Addressing
  - NUMA nodes per socket
  - NPS 1,2,4
  - NUMA Nodes (NPS)
*
  - AMD CBS / DF Common Options / Memory Addressing
  - Memory interleaving
  - Auto
  - NUMA Nodes (NPS)
*
  - AMD CBS / DF Common Options / Link
  - 4-link xGMI max speed
  - 18 Gbps
  - Set AMD CPU xGMI speed to highest rate supported
*
  - AMD CBS / DF Common Options / Link
  - 3-link xGMI max speed
  - 18 Gbps
  - Set AMD CPU xGMI speed to highest rate supported
*
  - AMD CBS / NBIO Common Options
  - IOMMU
  - Disable
  -
*
  - AMD CBS / NBIO Common Options
  - PCIe Ten Bit Tag Support
  - Enable
  -
*
  - AMD CBS / NBIO Common Options
  - Preferred IO
  - Manual
  -
*
  - AMD CBS / NBIO Common Options
  - Preferred IO Bus
  - "Use lspci to find pci device id"
  -
*
  - AMD CBS / NBIO Common Options
  - Enhanced Preferred IO Mode
  - Enable
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - Determinism Control
  - Manual
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - Determinism Slider
  - Power
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - cTDP Control
  - Manual
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - cTDP
  - 240
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - Package Power Limit Control
  - Manual
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - Package Power Limit
  - 240
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - xGMI Link Width Control
  - Manual
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - xGMI Force Link Width
  - 2
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - xGMI Force Link Width Control
  - Force
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - APBDIS
  - 1
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - DF C-states
  - Auto
  -
*
  - AMD CBS / NBIO Common Options / SMU Common Options
  - Fixed SOC P-state
  - P0
  -
*
  - AMD CBS / UMC Common Options / DDR4 Common Options
  - Enforce POR
  - Accept
  -
*
  - AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
  - Overclock
  - Enabled
  -
*
  - AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
  - Memory Clock Speed
  - 1600 MHz
  - Set to max Memory Speed, if using 3200 MHz DIMMs
*
  - AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller
    Configuration / DRAM Power Options
  - Power Down Enable
  - Disabled
  - RAM Power Down
*
  - AMD CBS / Security
  - TSME
  - Disabled
  - Memory Encryption
```
#### NBIO link clock frequency

The NBIOs (4x per AMD EPYC™ processor) are the serializers/deserializers (also
known as "SerDes") that convert and prepare the I/O signals for the processor's
128 external I/O interface lanes (32 per NBIO).

LCLK (short for link clock frequency) controls the link speed of the internal
bus that connects the NBIO silicon with the data fabric. All data between the
processor and its PCIe lanes flows to the data fabric based on these LCLK
frequency settings. The link clock frequency of the NBIO components needs to be
forced to the maximum frequency for optimal PCIe performance.

For AMD EPYC™ 7002 series processors, this setting cannot be modified via
configuration options in the server BIOS alone. Instead, the AMD-IOPM-UTIL (see
Section 3.2.3) must be run at every server boot to disable Dynamic Power
Management for all PCIe Root Complexes and NBIOs within the system and to lock
the logic into the highest performance operational mode.

For AMD EPYC™ 7003 series processors, configuring all NBIOs to be in "Enhanced
Preferred I/O" mode is sufficient to enable the highest link clock frequency for
the NBIO components.

#### Memory configuration

For the memory addressing modes, especially the number of NUMA nodes per
socket/processor (NPS), the recommended setting is to follow the guidance of the
high-performance computing tuning guides for AMD EPYC™ 7002 Series and AMD
EPYC™ 7003 Series processors to provide the optimal configuration for host-side
computation.

If the system is set to one NUMA domain per socket/processor (NPS1),
bidirectional copy bandwidth between host memory and GPU memory may be slightly
higher (up to about 16% more) than with four NUMA domains per socket/processor
(NPS4). For memory bandwidth sensitive applications using MPI, NPS4 is
recommended. For applications that are not optimized for NUMA locality, NPS1 is
the recommended setting.
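The NPS mode configured in the BIOS can be cross-checked from the operating system, for example (`lscpu` is part of `util-linux` and available on most distributions):

```shell
# Show how many NUMA nodes Linux sees; e.g., NPS4 on a dual-socket system
# yields 8 nodes, NPS1 yields 2.
lscpu | grep -i "NUMA node(s)"
```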
(mi100-os-settings)=

### Operating system settings

#### CPU core states - C-states

There are several core states (C-states) that an AMD EPYC CPU can idle within:

* C0: active. This is the active state while running an application.
* C1: idle.
* C2: idle and power gated. This is a deeper sleep state and will have a
  greater latency when moving back to the C0 state, compared to when the
  CPU is coming out of C1.

Disabling C2 is important for running with a high-performance, low-latency
network. To disable power-gating on all cores, run the following on Linux
systems:

```shell
cpupower idle-set -d 2
```

Note that the `cpupower` tool must be installed, as it is not part of the base
packages of most Linux® distributions. The required package varies with the
respective Linux distribution.

::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu

```shell
sudo apt install linux-tools-common
```

:::

:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL

```shell
sudo yum install cpupowerutils
```

:::

:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES

```shell
sudo zypper install cpupower
```

:::
::::
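After disabling the C2 state, the change can be verified by listing the idle states again (`cpupower idle-info` ships with the same package installed above; the exact output format depends on the host):

```shell
# List all processor idle states; C2 should now be reported as disabled.
cpupower idle-info
```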
#### AMD-IOPM-UTIL

This section applies to AMD EPYC™ 7002 processors, to optimize advanced
Dynamic Power Management (DPM) in the I/O logic (see the NBIO description above)
for performance. Certain I/O workloads may benefit from disabling this power
management. This utility disables DPM for all PCIe root complexes in the
system and locks the logic into the highest performance operational mode.

Disabling I/O DPM will reduce the latency and/or improve the throughput of
low-bandwidth messages for PCIe InfiniBand NICs and GPUs. Other workloads
with low-bandwidth, bursty PCIe I/O characteristics may benefit as well if
multiple such PCIe devices are installed in the system.

The actions of the utility do not persist across reboots. There is no need to
change any existing firmware settings when using this utility. The "Preferred
I/O" and "Enhanced Preferred I/O" settings should remain unchanged at enabled.

```{tip}
The recommended method to use the utility is either to create a system
start-up script, for example, a one-shot `systemd` service unit, or to run the
utility when starting up a job scheduler on the system. The installer
packages (see
[Power Management Utility](https://developer.amd.com/iopm-utility/)) will
create and enable a `systemd` service unit for you. This service unit is
configured to run in one-shot mode. This means that even when the service
unit runs as expected, the status of the service unit will show inactive.
This is the expected behavior when the utility runs normally. If the service
unit shows failed, the utility did not run as expected. The output in either
case can be shown with the `systemctl status` command.

Stopping the service unit has no effect since the utility does not leave
anything running. To undo the effects of the utility, disable the service
unit with the `systemctl disable` command and reboot the system.

The utility does not have any command-line options, and it must be run with
super-user permissions.
```
||||
|
||||
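As a hedged sketch of what such a one-shot unit might look like (the `ExecStart` path `/opt/amd/amd-iopm-util` is a placeholder, not the path the installer packages use — they create the real unit for you):

```shell
# Write a sample one-shot unit to /tmp for illustration only; a real unit
# would live in /etc/systemd/system and be enabled with `systemctl enable`.
cat > /tmp/amd-iopm-util.service <<'EOF'
[Unit]
Description=Disable I/O DPM on all PCIe root complexes (one-shot)

[Service]
Type=oneshot
ExecStart=/opt/amd/amd-iopm-util

[Install]
WantedBy=multi-user.target
EOF

grep '^Type=' /tmp/amd-iopm-util.service
```

Because `Type=oneshot` has no long-running process, `systemctl status` reports such a unit as inactive after a successful run, matching the behavior described in the tip.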
#### Systems with 256 CPU threads - IOMMU configuration

For systems that have 256 logical CPU cores or more (for example, a
dual-socket configuration of 64-core AMD EPYC™ 7763 processors with SMT
enabled), setting the input-output memory management unit (IOMMU)
configuration to "disabled" can limit the number of available logical cores
to 255. The reason is that the Linux® kernel disables X2APIC in this case and
falls back to the Advanced Programmable Interrupt Controller (APIC), which
can enumerate a maximum of only 255 logical cores.

If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
"enable", apply the following steps to make all logical cores of the system
available:

* In the server BIOS, set IOMMU to "Enabled".
* When configuring the Grub boot loader, add the following argument for the
  Linux kernel: `iommu=pt`
* Update Grub to use the modified configuration:

  ```shell
  sudo grub2-mkconfig -o /boot/grub2/grub.cfg
  ```

* Reboot the system.
* Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`:

  ```none
  [...]
  [    0.000000] Kernel command line: [...] iommu=pt
  [...]
  ```

Once the system is properly configured, ROCm software can be installed.

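The `dmesg` check above can also be scripted. A small sketch that looks for `iommu=pt` on the kernel command line, shown here against a sample string (on a live system, read `/proc/cmdline` instead):

```shell
# Sample command line; on a live system use: cmdline="$(cat /proc/cmdline)"
cmdline="BOOT_IMAGE=/boot/vmlinuz-5.15.0 root=/dev/sda2 ro iommu=pt quiet"

if printf '%s\n' "$cmdline" | grep -qw 'iommu=pt'; then
  echo "IOMMU passthrough mode is set"
else
  echo "iommu=pt missing from kernel command line"
fi
```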
## System management

For a complete guide on how to install, manage, or uninstall ROCm on Linux,
refer to {doc}`Quick-start (Linux)<rocm-install-on-linux:install/quick-start>`.
To verify that the installation was successful, refer to the
{doc}`post-install instructions<rocm-install-on-linux:install/post-install>`
and [system tools](../../reference/rocm-tools.md). Should verification fail,
consult the [System Debugging Guide](../system-debugging.md).

(mi100-hw-verification)=

### Hardware verification with ROCm

The AMD ROCm™ platform ships with tools to query the system structure. To query
the GPU hardware, the `rocm-smi` command is available. It can show available
GPUs in the system with their device ID and their respective firmware (or VBIOS)
versions:

![rocm-smi output](../../data/how-to/tuning-guides/tuning013.png "rocm-smi output")

Another important query shows the system structure, the localization of the
GPUs in the system, and the fabric connections between the system components:

![rocm-smi --showtopo output](../../data/how-to/tuning-guides/tuning014.png "rocm-smi --showtopo output")

The previous command shows the system structure in four blocks:

* The first block of the output shows the distance between the GPUs, similar
  to what the `numactl` command outputs for the NUMA domains of a system. The
  weight is a qualitative measure for the "distance" data must travel to reach
  one GPU from another one. While the values do not carry a special (physical)
  meaning, the higher the value, the more hops are needed to reach the
  destination from the source GPU.
* The second block has a matrix for the number of hops required to send data
  from one GPU to another. For the GPUs in the local hive, this number is one,
  while for the others it is three (one hop to leave the hive, one hop across
  the processors, and one hop within the destination hive).
* The third block outputs the link types between the GPUs. This can either be
  "XGMI" for AMD Infinity Fabric™ links or "PCIE" for PCIe Gen4 links.
* The fourth block reveals the localization of a GPU with respect to the NUMA
  organization of the shared memory of the AMD EPYC™ processors.

To query the compute capabilities of the GPU devices, the `rocminfo` command is
available with the AMD ROCm™ platform. It lists specific details about the GPU
devices, including but not limited to the number of compute units, width of the
SIMD pipelines, memory information, and Instruction Set Architecture:

![rocminfo output](../../data/how-to/tuning-guides/tuning015.png "rocminfo output")

For a complete list of architecture (LLVM target) names, refer to
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>` support.

### Testing inter-device bandwidth

{ref}`mi100-hw-verification` showed how the `rocm-smi --showtopo` command
reports the system structure and how the GPUs are located and connected within
it. For more details, the `rocm-bandwidth-test` tool can run benchmarks to
show the effective link bandwidth between the components of the system.

The ROCm Bandwidth Test program can be installed with the following
package-manager commands:

::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu

```shell
sudo apt install rocm-bandwidth-test
```

:::

:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL

```shell
sudo yum install rocm-bandwidth-test
```

:::

:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES

```shell
sudo zypper install rocm-bandwidth-test
```

:::
::::

Alternatively, the tool can be downloaded and built from
[source](https://github.com/ROCm/rocm_bandwidth_test).

The output will list the available compute devices (CPUs and GPUs):

![rocm-bandwidth-test output (1/3)](../../data/how-to/tuning-guides/tuning016.png "rocm-bandwidth-test output (1/3)")

The output will also show a matrix that contains a "1" if a device can
communicate with another device (CPU and GPU) of the system, and it will show
the NUMA distance (similar to `rocm-smi`):

![rocm-bandwidth-test output (2/3)](../../data/how-to/tuning-guides/tuning017.png "rocm-bandwidth-test output (2/3)")

![rocm-bandwidth-test output (2/3)](../../data/how-to/tuning-guides/tuning018.png "rocm-bandwidth-test output (2/3)")

The output also contains the measured bandwidth for unidirectional and
bidirectional transfers between the devices (CPU and GPU):

![rocm-bandwidth-test output (3/3)](../../data/how-to/tuning-guides/tuning019.png "rocm-bandwidth-test output (3/3)")
@@ -1,459 +0,0 @@

---
myst:
  html_meta:
    "description": "Learn about AMD Instinct MI200 system settings and performance tuning."
    "keywords": "Instinct, MI200, microarchitecture, AMD, ROCm"
---

# AMD Instinct MI200 system optimization

## System settings

This chapter reviews system settings that are required to configure the system
for AMD Instinct MI250 accelerators and improve the performance of the GPUs. It
is advised to configure the system for the best possible host configuration
according to the *High Performance Computing (HPC) Tuning Guide for AMD EPYC
7003 Series Processors*.

Configure the system BIOS settings as explained in {ref}`mi200-bios-settings`,
and apply the following settings via the command line as explained in
{ref}`mi200-os-settings`:

* Core C states
* Input-output memory management unit (IOMMU), if needed

(mi200-bios-settings)=

### System BIOS settings

For maximum MI250 GPU performance on systems with AMD EPYC™ 7003-series
processors (codename "Milan") and AMI System BIOS, the following configuration
of system BIOS settings has been validated. These settings must be used for the
qualification process and should be set as default values for the system BIOS.
Analogous settings for other System BIOS providers can be applied similarly.
For systems with Intel processors, some settings may not apply or may not be
available, as listed in the following table.

```{list-table}
:header-rows: 1
:name: mi200-bios

* - BIOS Setting Location
  - Parameter
  - Value
  - Comments
* - Advanced / PCI Subsystem Settings
  - Above 4G Decoding
  - Enabled
  - GPU Large BAR Support
* - Advanced / PCI Subsystem Settings
  - SR-IOV Support
  - Disabled
  - Disable Single Root IO Virtualization
* - AMD CBS / CPU Common Options
  - Global C-state Control
  - Auto
  - Global C-States
* - AMD CBS / CPU Common Options
  - CCD/Core/Thread Enablement
  - Accept
  - Global C-States
* - AMD CBS / CPU Common Options / Performance
  - SMT Control
  - Disable
  - Global C-States
* - AMD CBS / DF Common Options / Memory Addressing
  - NUMA nodes per socket
  - NPS 1,2,4
  - NUMA Nodes (NPS)
* - AMD CBS / DF Common Options / Memory Addressing
  - Memory interleaving
  - Auto
  - NUMA Nodes (NPS)
* - AMD CBS / DF Common Options / Link
  - 4-link xGMI max speed
  - 18 Gbps
  - Set AMD CPU xGMI speed to highest rate supported
* - AMD CBS / NBIO Common Options
  - IOMMU
  - Disable
  -
* - AMD CBS / NBIO Common Options
  - PCIe Ten Bit Tag Support
  - Auto
  -
* - AMD CBS / NBIO Common Options
  - Preferred IO
  - Bus
  -
* - AMD CBS / NBIO Common Options
  - Preferred IO Bus
  - "Use lspci to find pci device id"
  -
* - AMD CBS / NBIO Common Options
  - Enhanced Preferred IO Mode
  - Enable
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - Determinism Control
  - Manual
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - Determinism Slider
  - Power
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - cTDP Control
  - Manual
  - Set cTDP to the maximum supported by the installed CPU
* - AMD CBS / NBIO Common Options / SMU Common Options
  - cTDP
  - 280
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - Package Power Limit Control
  - Manual
  - Set Package Power Limit to the maximum supported by the installed CPU
* - AMD CBS / NBIO Common Options / SMU Common Options
  - Package Power Limit
  - 280
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - xGMI Link Width Control
  - Manual
  - Set AMD CPU xGMI width to 16 bits
* - AMD CBS / NBIO Common Options / SMU Common Options
  - xGMI Force Link Width
  - 2
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - xGMI Force Link Width Control
  - Force
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - APBDIS
  - 1
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - DF C-states
  - Enabled
  -
* - AMD CBS / NBIO Common Options / SMU Common Options
  - Fixed SOC P-state
  - P0
  -
* - AMD CBS / UMC Common Options / DDR4 Common Options
  - Enforce POR
  - Accept
  -
* - AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
  - Overclock
  - Enabled
  -
* - AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
  - Memory Clock Speed
  - 1600 MHz
  - Set to max Memory Speed, if using 3200 MHz DIMMs
* - AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller
    Configuration / DRAM Power Options
  - Power Down Enable
  - Disabled
  - RAM Power Down
* - AMD CBS / Security
  - TSME
  - Disabled
  - Memory Encryption
```

#### NBIO link clock frequency

The NBIOs (four per AMD EPYC™ processor) are the serializers/deserializers
(also known as "SerDes") that convert and prepare the I/O signals for the
processor's 128 external I/O interface lanes (32 per NBIO).

LCLK (short for link clock frequency) controls the link speed of the internal
bus that connects the NBIO silicon with the data fabric. All data between the
processor and its PCIe lanes flows through the data fabric based on these LCLK
frequency settings. The link clock frequency of the NBIO components needs to
be forced to the maximum frequency for optimal PCIe performance.

For AMD EPYC™ 7003 series processors, configuring all NBIOs to be in "Enhanced
Preferred I/O" mode is sufficient to enable the highest link clock frequency
for the NBIO components.

#### Memory configuration

For setting the memory addressing modes, especially the number of NUMA nodes
per socket/processor (NPS), follow the guidance of the *High Performance
Computing (HPC) Tuning Guide for AMD EPYC 7003 Series Processors* to provide
the optimal configuration for host-side computation. For most HPC workloads,
NPS=4 is the recommended value.

(mi200-os-settings)=

### Operating system settings

#### CPU core states - C-states

There are several core states (C-states) that an AMD EPYC CPU can idle within:

* C0: active. This is the active state while running an application.
* C1: idle.
* C2: idle and power gated. This is a deeper sleep state and will have a
  greater latency when moving back to the C0 state, compared to when the CPU
  is coming out of C1.

Disabling C2 is important for running with a high-performance, low-latency
network. To disable power gating on all cores, run the following on Linux
systems:

```shell
cpupower idle-set -d 2
```

Note that the `cpupower` tool must be installed, as it is not part of the base
packages of most Linux® distributions. The package needed varies with the
respective Linux distribution.

::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu

```shell
sudo apt install linux-tools-common
```

:::

:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL

```shell
sudo yum install cpupowerutils
```

:::

:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES

```shell
sudo zypper install cpupower
```

:::
::::

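If `cpupower` is unavailable, the same effect can be sketched directly against the cpuidle sysfs interface. Simulated under `/tmp` below; on a real system the root is `/sys/devices/system/cpu`, the writes require root privileges, and the setup lines are not needed:

```shell
# Simulate two cores, each with a state2 (C2) "disable" knob.
root=/tmp/cpu-sim
mkdir -p "$root/cpu0/cpuidle/state2" "$root/cpu1/cpuidle/state2"
echo 0 > "$root/cpu0/cpuidle/state2/disable"
echo 0 > "$root/cpu1/cpuidle/state2/disable"

# Equivalent of `cpupower idle-set -d 2`: write 1 to every core's state2 knob.
for f in "$root"/cpu*/cpuidle/state2/disable; do
  echo 1 > "$f"
done

cat "$root/cpu0/cpuidle/state2/disable"
```

Unlike a BIOS setting, this change (like the `cpupower` command) does not persist across reboots.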
#### AMD-IOPM-UTIL

This section applies to AMD EPYC™ 7002 processors and describes how to optimize
advanced Dynamic Power Management (DPM) in the I/O logic (see the NBIO
description above) for performance. Certain I/O workloads may benefit from
disabling this power management. This utility disables DPM for all PCIe root
complexes in the system and locks the logic into the highest-performance
operational mode.

Disabling I/O DPM will reduce the latency and/or improve the throughput of
low-bandwidth messages for PCIe InfiniBand NICs and GPUs. Other workloads with
low-bandwidth, bursty PCIe I/O characteristics may benefit as well if multiple
such PCIe devices are installed in the system.

The actions of the utility do not persist across reboots. There is no need to
change any existing firmware settings when using this utility. The "Preferred
I/O" and "Enhanced Preferred I/O" settings should remain enabled.

```{tip}
The recommended way to use the utility is either to create a system start-up
script, for example, a one-shot `systemd` service unit, or to run the utility
when starting up a job scheduler on the system. The installer packages (see
[Power Management Utility](https://developer.amd.com/iopm-utility/)) will
create and enable a `systemd` service unit for you. This service unit is
configured to run in one-shot mode, which means that even when the service
unit runs as expected, its status shows as inactive. This is the expected
behavior when the utility runs normally. If the service unit shows as failed,
the utility did not run as expected. The output in either case can be shown
with the `systemctl status` command.

Stopping the service unit has no effect since the utility does not leave
anything running. To undo the effects of the utility, disable the service
unit with the `systemctl disable` command and reboot the system.

The utility does not have any command-line options, and it must be run with
super-user permissions.
```

#### Systems with 256 CPU threads - IOMMU configuration

For systems that have 256 logical CPU cores or more (for example, a
dual-socket configuration of 64-core AMD EPYC™ 7763 processors with SMT
enabled), setting the input-output memory management unit (IOMMU)
configuration to "disabled" can limit the number of available logical cores
to 255. The reason is that the Linux® kernel disables X2APIC in this case and
falls back to the Advanced Programmable Interrupt Controller (APIC), which
can enumerate a maximum of only 255 logical cores.

If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
"enable", apply the following steps to make all logical cores of the system
available:

* In the server BIOS, set IOMMU to "Enabled".
* When configuring the Grub boot loader, add the following argument for the
  Linux kernel: `iommu=pt`
* Update Grub to use the modified configuration:

  ```shell
  sudo grub2-mkconfig -o /boot/grub2/grub.cfg
  ```

* Reboot the system.
* Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`:

  ```none
  [...]
  [    0.000000] Kernel command line: [...] iommu=pt
  [...]
  ```

Once the system is properly configured, ROCm software can be installed.

## System management

For a complete guide on how to install, manage, or uninstall ROCm on Linux,
refer to {doc}`Quick-start (Linux)<rocm-install-on-linux:install/quick-start>`.
To verify that the installation was successful, refer to the
{doc}`post-install instructions<rocm-install-on-linux:install/post-install>`
and [system tools](../../reference/rocm-tools.md). Should verification fail,
consult the [System Debugging Guide](../system-debugging.md).

(mi200-hw-verification)=

### Hardware verification with ROCm

The AMD ROCm™ platform ships with tools to query the system structure. To query
the GPU hardware, the `rocm-smi` command is available. It can show available
GPUs in the system with their device ID and their respective firmware (or VBIOS)
versions:

![rocm-smi output](../../data/how-to/tuning-guides/tuning001.png "rocm-smi output")

To see the system structure, the localization of the GPUs in the system, and
the fabric connections between the system components, use:

![rocm-smi --showtopo output](../../data/how-to/tuning-guides/tuning002.png "rocm-smi --showtopo output")

* The first block of the output shows the distance between the GPUs, similar
  to what the `numactl` command outputs for the NUMA domains of a system. The
  weight is a qualitative measure for the "distance" data must travel to reach
  one GPU from another one. While the values do not carry a special (physical)
  meaning, the higher the value, the more hops are needed to reach the
  destination from the source GPU.
* The second block has a matrix named "Hops between two GPUs", where 1 means
  the two GPUs are directly connected with XGMI, 2 means both GPUs are linked
  to the same CPU socket and GPU communications will go through the CPU, and
  3 means both GPUs are linked to different CPU sockets so communications will
  go through both CPU sockets. This number is one for all GPUs in this case
  since they are all connected to each other through the Infinity Fabric
  links.
* The third block outputs the link types between the GPUs. This can either be
  "XGMI" for AMD Infinity Fabric links or "PCIE" for PCIe Gen4 links.
* The fourth block reveals the localization of a GPU with respect to the NUMA
  organization of the shared memory of the AMD EPYC processors.

To query the compute capabilities of the GPU devices, use the `rocminfo`
command. It lists specific details about the GPU devices, including but not
limited to the number of compute units, width of the SIMD pipelines, memory
information, and Instruction Set Architecture (ISA):

![rocminfo output](../../data/how-to/tuning-guides/tuning003.png "rocminfo output")

For a complete list of architecture (LLVM target) names, refer to GPU OS
support for {doc}`Linux<rocm-install-on-linux:reference/system-requirements>`
and {doc}`Windows<rocm-install-on-windows:reference/system-requirements>`.

### Testing inter-device bandwidth

{ref}`mi200-hw-verification` showed how the `rocm-smi --showtopo` command
reports the system structure and how the GPUs are located and connected within
it. For more details, the `rocm-bandwidth-test` tool can run benchmarks to
show the effective link bandwidth between the components of the system.

The ROCm Bandwidth Test program can be installed with the following
package-manager commands:

::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu

```shell
sudo apt install rocm-bandwidth-test
```

:::

:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL

```shell
sudo yum install rocm-bandwidth-test
```

:::

:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES

```shell
sudo zypper install rocm-bandwidth-test
```

:::
::::

Alternatively, the tool can be downloaded and built from
[source](https://github.com/ROCm/rocm_bandwidth_test).

The output will list the available compute devices (CPUs and GPUs), including
their device ID and PCIe ID:

![rocm-bandwidth-test output (1/3)](../../data/how-to/tuning-guides/tuning004.png "rocm-bandwidth-test output (1/3)")

The output will also show a matrix that contains a "1" if a device can
communicate with another device (CPU and GPU) of the system, and it will show
the NUMA distance (similar to `rocm-smi`):

![rocm-bandwidth-test output (2/3)](../../data/how-to/tuning-guides/tuning005.png "rocm-bandwidth-test output (2/3)")

The output also contains the measured bandwidth for unidirectional and
bidirectional transfers between the devices (CPU and GPU):

![rocm-bandwidth-test output (3/3)](../../data/how-to/tuning-guides/tuning006.png "rocm-bandwidth-test output (3/3)")
@@ -1,452 +0,0 @@

.. meta::
   :description: Learn about AMD Instinct MI300A system settings and performance tuning.
   :keywords: AMD, Instinct, MI300A, HPC, tuning, BIOS settings, NBIO, ROCm,
              environment variable, performance, accelerator, GPU, EPYC, GRUB,
              operating system

***************************************************
AMD Instinct MI300A system optimization
***************************************************

This topic discusses the operating system settings and system management
commands for the AMD Instinct MI300A accelerator to help you optimize its
performance.

System settings
========================================

This section reviews the system settings required to configure an MI300A SOC
system and optimize its performance.

The MI300A system-on-a-chip (SOC) design requires you to review and
potentially adjust your OS configuration as explained in the
:ref:`operating-system-settings-label` section. These settings are critical
for performance because the OS on an accelerated processing unit (APU) is
responsible for memory management across the CPU and GPU accelerators. In the
APU memory model, system settings are available to limit GPU memory
allocation. This limit is important because legacy software often determines
the amount of allowable memory at start-up time by probing discrete memory
until it is exhausted. If left unchecked, this practice can starve the OS of
resources.

System BIOS settings
-----------------------------------

System BIOS settings are preconfigured for optimal performance by the platform
vendor, so you do not need to adjust them when using MI300A. If you have any
questions regarding these settings, contact your MI300A platform vendor.

GRUB settings
-----------------------------------

The ``/etc/default/grub`` file is used to configure the GRUB bootloader on
modern Linux distributions. Linux uses the string assigned to
``GRUB_CMDLINE_LINUX`` in this file as its command-line parameters during
boot.

Appending strings using the Linux command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is recommended that you append the following string to
``GRUB_CMDLINE_LINUX``.

``pci=realloc=off``
   This setting disables the automatic reallocation of PCI resources, so
   Linux is able to unambiguously detect all GPUs on the MI300A-based system.
   It's used when Single Root I/O Virtualization (SR-IOV) Base Address
   Registers (BARs) have not been allocated by the BIOS. This can help avoid
   potential issues with certain hardware configurations.

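The append can be scripted. A hedged sketch that edits the quoted value with ``sed``, shown here against a sample file under ``/tmp`` (on a real system, back up ``/etc/default/grub`` first and operate on that path instead):

.. code-block:: shell

   # Create a sample grub file; on a real system, edit a backup copy of
   # /etc/default/grub instead.
   printf 'GRUB_CMDLINE_LINUX="quiet splash"\n' > /tmp/grub.sample

   # Append pci=realloc=off inside the quoted GRUB_CMDLINE_LINUX value.
   sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="\1 pci=realloc=off"/' /tmp/grub.sample

   grep '^GRUB_CMDLINE_LINUX' /tmp/grub.sample

After editing the real file, GRUB must be regenerated as described in the "Update GRUB" section for the change to take effect.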
Validating the IOMMU setting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

IOMMU is a system-specific I/O mapping mechanism for DMA mapping and
isolation. IOMMU is turned off by default in the operating system settings
for optimal performance.

To verify IOMMU is turned off, first install the ``acpica-tools`` package
using your package manager.

.. code-block:: shell

   sudo apt install acpica-tools

Then confirm that the following commands do not return any results.

.. code-block:: shell

   sudo acpidump | grep IVRS
   sudo acpidump | grep DMAR

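The two checks can be combined into a single pass/fail report. A sketch that captures the table names into a variable, left empty here to simulate a system with IOMMU off (on a real system, populate it from ``sudo acpidump`` as shown in the comment):

.. code-block:: shell

   # On a real system: tables="$(sudo acpidump | grep -E 'IVRS|DMAR')"
   tables=""

   if [ -z "$tables" ]; then
     echo "No IVRS/DMAR tables found: IOMMU appears to be off"
   else
     echo "IOMMU ACPI tables present"
   fi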
Update GRUB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use this command to update GRUB to use the modified configuration:

.. code-block:: shell

   sudo grub2-mkconfig -o /boot/grub2/grub.cfg

On some Red Hat-based systems, the ``grub2-mkconfig`` command might not be
available. In this case, use ``grub-mkconfig`` instead. Verify that you have
the correct version by using the following command:

.. code-block:: shell

   grub-mkconfig --version

.. _operating-system-settings-label:

Operating system settings
-----------------------------------

The operating system provides several options to customize and tune
performance. For more information about supported operating systems, see the
:doc:`Compatibility matrix <../../compatibility/compatibility-matrix>`.

If you are using a distribution other than RHEL or SLES, the latest Linux
kernel is recommended. Performance considerations for Zen 4, the core
architecture in the MI300A, require Linux kernel version 5.18 or higher.

This section describes performance-based settings.

* **Enable transparent huge pages**

  To enable transparent huge pages, use one of the following methods:

  * From the command line, run the following command:

    .. code-block:: shell

       echo always > /sys/kernel/mm/transparent_hugepage/enabled

  * Set the Linux kernel parameter ``transparent_hugepage`` as follows in the
    relevant ``.cfg`` file for your system.

    .. code-block:: cfg

       transparent_hugepage=always

* **Increase the amount of allocatable memory**

  By default, when using a device allocator via HIP, it is only possible to
  allocate 96 GiB out of a possible 128 GiB of memory on the MI300A. This
  limitation does not affect host allocations. To increase the available
  system memory, load the ``amdttm`` module with new values for
  ``pages_limit`` and ``page_pool_size``. These numbers correspond to the
  number of 4 KiB pages of memory. To make 128 GiB of memory available across
  all four devices, for a total amount of 512 GiB, set ``pages_limit`` and
  ``page_pool_size`` to ``134217728``. For a two-socket system, divide these
  values by two. After setting these values, reload the AMDGPU driver.

  First, review the current settings using this shell command:

  .. code-block:: shell

     cat /sys/module/amdttm/parameters/pages_limit

  To set the amount of allocatable memory to all available memory on all four
  APU devices, run these commands:

  .. code-block:: shell

     sudo modprobe amdttm pages_limit=134217728 page_pool_size=134217728
     sudo modprobe amdgpu

  These settings can also be hardcoded in the ``/etc/modprobe.d/amdttm.conf``
  file or specified as boot parameters.

  To use the hardcoded method, the filesystem must already be set up when the
  kernel driver is loaded. To hardcode the settings, add the following lines
  to ``/etc/modprobe.d/amdttm.conf``:

  .. code-block:: shell

     options amdttm pages_limit=134217728
     options amdttm page_pool_size=134217728

  If the filesystem is not already set up when the kernel driver is loaded,
  then the options must be specified as boot parameters. To specify the
  settings as boot parameters when loading the kernel, use this example as a
  guideline:

  .. code-block:: shell

     vmlinux-[...] amdttm.pages_limit=134217728 amdttm.page_pool_size=134217728 [...]

  To verify the new settings and confirm the change, use this command:

  .. code-block:: shell

     cat /sys/module/amdttm/parameters/pages_limit

  .. note::

     The system settings for ``pages_limit`` and ``page_pool_size`` are
     calculated by multiplying the per-APU limit of 4 KiB pages, which is
     ``33554432``, by the number of APUs on the node. The limit for a system
     with two APUs is ``33554432 x 2``, or ``67108864``. This means the
     ``modprobe`` command for two APUs is
     ``sudo modprobe amdttm pages_limit=67108864 page_pool_size=67108864``.

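  The arithmetic in the note can be checked with plain shell arithmetic; a
  small sketch that derives both values from the APU count (``apus=4``
  reproduces the four-APU numbers above, ``apus=2`` the two-APU ones):

  .. code-block:: shell

     # 128 GiB per APU expressed in 4 KiB pages: (128 * 1024 * 1024 KiB) / 4 KiB
     apus=4
     pages_per_apu=$((128 * 1024 * 1024 / 4))   # 33554432
     pages_limit=$((pages_per_apu * apus))
     echo "pages_limit=$pages_limit page_pool_size=$pages_limit"
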
* **Limit the maximum and single memory allocations on the GPU**

  Many AI-related applications were originally developed on discrete GPUs. Some of these applications
  have fixed problem sizes associated with the targeted GPU size, and some attempt to determine the
  system memory limits by allocating chunks until failure. These techniques can cause issues on an
  APU with a shared memory space.

  To allow these applications to run on the APU without further changes,
  ROCm supports a default memory policy that restricts the percentage of the GPU memory that can be allocated.
  The following environment variables control this feature:

  * ``GPU_MAX_ALLOC_PERCENT``
  * ``GPU_SINGLE_ALLOC_PERCENT``

  These settings can be added to the default shell environment or the user environment. The effect of the memory allocation
  settings varies depending on the system, configuration, and task. They might require adjustment, especially when performing GPU benchmarks. Setting these values to ``100``
  lets the GPU allocate any amount of free memory. However, the risk of encountering
  an operating system out-of-memory (OOM) condition increases when almost
  all the available memory is used.

  Before setting either of these items to 100 percent,
  carefully consider the expected CPU workload allocation and the anticipated OS usage.
  For instance, if the OS requires 8 GB on a 128 GB system, setting these
  variables to ``100`` authorizes a single
  workload to allocate up to 120 GB of memory. Unless the system has swap space configured,
  any over-allocation attempts are handled by the OS out-of-memory (OOM) policies.
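Before exporting the variables, it can help to sanity-check a candidate percentage against the expected OS reservation. A sketch with hypothetical sizes:

```shell
# Hypothetical sizing check before raising the allocation limits.
TOTAL_GIB=128      # total APU memory (placeholder value)
OS_RESERVED_GIB=8  # memory the OS and CPU workloads need (placeholder value)
PERCENT=90         # candidate GPU_MAX_ALLOC_PERCENT value

GPU_ALLOWED_GIB=$((TOTAL_GIB * PERCENT / 100))
HEADROOM_GIB=$((TOTAL_GIB - GPU_ALLOWED_GIB))

# Only accept the percentage if the OS reservation still fits.
if [ "$HEADROOM_GIB" -ge "$OS_RESERVED_GIB" ]; then
  echo "export GPU_MAX_ALLOC_PERCENT=${PERCENT}"
  echo "export GPU_SINGLE_ALLOC_PERCENT=${PERCENT}"
fi
```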
* **Disable NUMA (non-uniform memory access) balancing**

  ROCm uses information from the compiled application to ensure an affinity exists
  between the GPU agent processes and their CPU hosts or co-processing agents.
  Because the APU has OS threads,
  including threads with memory management duties, the default kernel NUMA policies can
  adversely impact workload performance without additional tuning.

  .. note::

     At the kernel level, ``pci=realloc`` can also be set to ``off`` as an additional tuning measure.

  To disable NUMA balancing, use one of the following methods:

  * From the command line, run the following command:

    .. code-block:: shell

       echo 0 > /proc/sys/kernel/numa_balancing

  * Set the following Linux kernel parameters in the
    relevant ``.cfg`` file for your system:

    .. code-block:: cfg

       pci=realloc=off numa_balancing=disable
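The check-then-disable flow can be sketched as follows; a temporary file stands in for ``/proc/sys/kernel/numa_balancing`` so the example runs without root:

```shell
# Stand-in for /proc/sys/kernel/numa_balancing so the sketch runs unprivileged.
NUMA_BALANCING_FILE=$(mktemp)
echo 1 > "$NUMA_BALANCING_FILE"   # assume balancing starts out enabled

# Disable NUMA balancing, then report the resulting state.
echo 0 > "$NUMA_BALANCING_FILE"
if [ "$(cat "$NUMA_BALANCING_FILE")" = "0" ]; then
  STATE=disabled
else
  STATE=enabled
fi
echo "NUMA balancing is ${STATE}"
rm -f "$NUMA_BALANCING_FILE"
```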
* **Enable compaction**

  Compaction is necessary for proper MI300A operation because the APU dynamically shares memory
  between the CPU and GPU. Compaction can be done proactively, which reduces
  allocation costs, or performed during allocation, in which case it is part of the background activities.
  Without compaction, MI300A application performance eventually degrades as fragmentation increases.
  In RHEL distributions, compaction is disabled by default. In Ubuntu, it's enabled by default.

  To enable compaction, enter the following commands on the command line:

  .. code-block:: shell

     echo 20 > /proc/sys/vm/compaction_proactiveness
     echo 1 > /proc/sys/vm/compact_unevictable_allowed
.. _mi300a-processor-affinity:

* **Change affinity of ROCm helper threads**

  Changing the affinity prevents internal ROCm threads from having their CPU core affinity mask
  set to all available CPU cores. With this setting, the threads inherit their parent's
  CPU core affinity mask. Before adjusting this setting, ensure you thoroughly understand
  your system topology and how the application, runtime environment, and batch system
  set the thread-to-core affinity. If you have any questions regarding this setting,
  contact your MI300A platform vendor or the AMD support team.

  To enable this setting, enter the following command:

  .. code-block:: shell

     export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0

* **CPU core states (C-states)**

  The system BIOS handles these settings for the MI300A.
  They don't need to be configured in the operating system.
System management
=================

For a complete guide on installing, managing, and uninstalling ROCm on Linux, see
:doc:`Quick-start (Linux)<rocm-install-on-linux:install/quick-start>`. To verify that the
installation was successful, see the
:doc:`Post-installation instructions<rocm-install-on-linux:install/post-install>` and
:doc:`ROCm tools <../../reference/rocm-tools>` guides. If verification
fails, consult the :doc:`System debugging guide <../system-debugging>`.
.. _hw-verification-rocm-label:

Hardware verification with ROCm
-------------------------------

ROCm includes tools to query the system structure. To query
the GPU hardware, use the ``rocm-smi`` command.

``rocm-smi`` reports statistics per socket, so the power results combine CPU and GPU utilization.
In an idle state on a multi-socket system, some power imbalances are expected because
the distribution of OS threads can keep some APU devices at higher power states.

.. note::

   The MI300A VRAM settings show as ``N/A``.

.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-output.png
   :alt: Output from the rocm-smi command

The ``rocm-smi --showhw`` command shows the available system
GPUs along with their device IDs and firmware details.

In the MI300A hardware settings, the system BIOS handles the UMC RAS. The
ROCm-supplied GPU driver does not manage this setting.
This results in a value of ``DISABLED`` for the ``UMC RAS`` setting.

.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-showhw-output.png
   :alt: Output from the ``rocm-smi --showhw`` command

To see the system structure, the localization of the GPUs in the system, and the
fabric connections between the system components, use the ``rocm-smi --showtopo`` command.

* The first block of the output shows the distance between the GPUs. The weight is a qualitative
  measure of the "distance" data must travel to reach one GPU from another.
  While the values do not have a precise physical meaning, the higher the value, the
  more hops are required to reach the destination from the source GPU.
* The second block contains a matrix named "Hops between two GPUs", where ``1`` means
  the two GPUs are directly connected with xGMI, ``2`` means both GPUs are linked to the
  same CPU socket and GPU communications go through the CPU, and ``3`` means
  both GPUs are linked to different CPU sockets so communications go
  through both CPU sockets.
* The third block indicates the link types between the GPUs. This can either be
  ``XGMI`` for AMD Infinity Fabric links or ``PCIE`` for PCIe Gen4 links.
* The fourth block reveals the localization of a GPU with respect to the NUMA organization
  of the shared memory of the AMD EPYC processors.

.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-showtopo-output.png
   :alt: Output from the ``rocm-smi --showtopo`` command
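The hop values in the second block can be mapped to link descriptions mechanically. An illustrative helper function (not part of any ROCm tool):

```shell
# Illustrative helper: translate a "Hops between two GPUs" value into
# the link description used above. Not part of rocm-smi itself.
describe_hops() {
  case "$1" in
    1) echo "direct xGMI link" ;;
    2) echo "via one CPU socket" ;;
    3) echo "via two CPU sockets" ;;
    *) echo "unknown" ;;
  esac
}

describe_hops 1   # direct xGMI link
describe_hops 3   # via two CPU sockets
```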
Testing inter-device bandwidth
------------------------------

The ``rocm-smi --showtopo`` command from the :ref:`hw-verification-rocm-label` section
displays the system structure and shows how the GPUs are located and connected within this
structure. For more information, use the :doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`, which can run benchmarks to
show the effective link bandwidth between the system components.

For information on how to install the ROCm Bandwidth Test, see :doc:`Building the environment <rocm_bandwidth_test:install/install>`.

The output lists the available compute devices (CPUs and GPUs), including
their device IDs and PCIe IDs:

.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-bandwidth-test-output.png
   :alt: Output from the rocm-bandwidth-test utility

It also displays the measured bandwidth for unidirectional and
bidirectional transfers between the devices on the CPU and GPU:

.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-peak-bandwidth-output.png
   :alt: Bandwidth information from the rocm-bandwidth-test utility
Abbreviations
=============

APBDIS
   Algorithmic Performance Boost Disable

APU
   Accelerated Processing Unit

BAR
   Base Address Register

BIOS
   Basic Input/Output System

CBS
   Common BIOS Settings

CCD
   Compute Core Die

CDNA
   Compute DNA

CLI
   Command Line Interface

CPU
   Central Processing Unit

cTDP
   Configurable Thermal Design Power

DF
   Data Fabric

DMA
   Direct Memory Access

GPU
   Graphics Processing Unit

GRUB
   Grand Unified Bootloader

HBM
   High Bandwidth Memory

HPC
   High Performance Computing

IOMMU
   Input-Output Memory Management Unit

ISA
   Instruction Set Architecture

NBIO
   North Bridge Input/Output

NUMA
   Non-Uniform Memory Access

OOM
   Out of Memory

PCI
   Peripheral Component Interconnect

PCIe
   PCI Express

POR
   Power-On Reset

RAS
   Reliability, Availability and Serviceability

SMI
   System Management Interface

SMT
   Simultaneous Multi-Threading

SOC
   System On Chip

SR-IOV
   Single Root I/O Virtualization

TSME
   Transparent Secure Memory Encryption

UMC
   Unified Memory Controller

VRAM
   Video RAM

xGMI
   Inter-chip Global Memory Interconnect
.. meta::
   :description: Learn about AMD Instinct MI300X system settings and performance tuning.
   :keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
              environment variable, performance, accelerator, GPU, EPYC, GRUB,
              operating system

***************************************
AMD Instinct MI300X system optimization
***************************************

This document covers essential system settings and management practices required
to configure your system effectively. Ensuring that your system operates
correctly is the first step before delving into advanced performance tuning.

The main topics of discussion in this document are:

* :ref:`System settings <mi300x-system-settings>`

* :ref:`System BIOS settings <mi300x-bios-settings>`

* :ref:`GRUB settings <mi300x-grub-settings>`

* :ref:`Operating system settings <mi300x-os-settings>`

* :ref:`System management <mi300x-system-management>`

.. _mi300x-system-settings:

System settings
===============

This guide discusses system settings that are required to configure your system
for AMD Instinct™ MI300X accelerators. It is important to ensure a system is
functioning correctly before trying to improve its overall performance. In this
section, the settings discussed mostly ensure proper functionality of your
Instinct-based system. Some settings discussed are known to improve performance
for most applications running on an MI300X system. See
:doc:`../rocm-for-ai/inference-optimization/workload` for how to improve performance for
specific applications or workloads.
.. _mi300x-bios-settings:

System BIOS settings
--------------------

AMD EPYC 9004-based systems
^^^^^^^^^^^^^^^^^^^^^^^^^^^

For maximum MI300X GPU performance on systems with AMD EPYC™ 9004-series
processors and AMI System BIOS, the following configuration
of system BIOS settings has been validated. These settings must be used for the
qualification process and should be set as default values in the system BIOS.
Analogous settings for other non-AMI System BIOS providers could be set
similarly. For systems with Intel processors, some settings may not apply or be
available as listed in the following table.

Each row in the table details a setting, but the specific location within the
BIOS setup menus may be different, or the option may not be present.

.. list-table::
   :header-rows: 1

   * - BIOS setting location
     - Parameter
     - Value
     - Comments
   * - Advanced / PCI subsystem settings
     - Above 4G decoding
     - Enabled
     - GPU large BAR support.
   * - Advanced / PCI subsystem settings
     - SR-IOV support
     - Enabled
     - Enable single root I/O virtualization.
   * - AMD CBS / CPU common options
     - Global C-state control
     - Auto
     - Global C-states -- do not disable this menu item.
   * - AMD CBS / CPU common options
     - CCD/Core/Thread enablement
     - Accept
     - May be necessary to enable the SMT control menu.
   * - AMD CBS / CPU common options / performance
     - SMT control
     - Disable
     - Set to Auto if the primary application is not compute-bound.
   * - AMD CBS / DF common options / memory addressing
     - NUMA nodes per socket
     - Auto
     - Auto = NPS1. At this time, the other options for NUMA nodes per socket
       should not be used.
   * - AMD CBS / DF common options / memory addressing
     - Memory interleaving
     - Auto
     - Depends on the NUMA nodes per socket (NPS) setting.
   * - AMD CBS / DF common options / link
     - 4-link xGMI max speed
     - 32 Gbps
     - Auto results in the speed being set to the lower of the max speed the
       motherboard is designed to support and the max speed of the CPU in use.
   * - AMD CBS / NBIO common options
     - IOMMU
     - Enabled
     -
   * - AMD CBS / NBIO common options
     - PCIe ten bit tag support
     - Auto
     -
   * - AMD CBS / NBIO common options / SMU common options
     - Determinism control
     - Manual
     -
   * - AMD CBS / NBIO common options / SMU common options
     - Determinism slider
     - Power
     -
   * - AMD CBS / NBIO common options / SMU common options
     - cTDP control
     - Manual
     - Set cTDP to the maximum supported by the installed CPU.
   * - AMD CBS / NBIO common options / SMU common options
     - cTDP
     - 400
     - Value in watts.
   * - AMD CBS / NBIO common options / SMU common options
     - Package power limit control
     - Manual
     - Set the package power limit to the maximum supported by the installed CPU.
   * - AMD CBS / NBIO common options / SMU common options
     - Package power limit
     - 400
     - Value in watts.
   * - AMD CBS / NBIO common options / SMU common options
     - xGMI link width control
     - Manual
     - Allows the xGMI link width to be set manually.
   * - AMD CBS / NBIO common options / SMU common options
     - xGMI force width control
     - Force
     -
   * - AMD CBS / NBIO common options / SMU common options
     - xGMI force link width
     - 2
     - * 0: Force xGMI link width to x2
       * 1: Force xGMI link width to x8
       * 2: Force xGMI link width to x16
   * - AMD CBS / NBIO common options / SMU common options
     - xGMI max speed
     - Auto
     - Auto results in the speed being set to the lower of the max speed the
       motherboard is designed to support and the max speed of the CPU in use.
   * - AMD CBS / NBIO common options / SMU common options
     - APBDIS
     - 1
     - Disable DF (data fabric) P-states.
   * - AMD CBS / NBIO common options / SMU common options
     - DF C-states
     - Auto
     -
   * - AMD CBS / NBIO common options / SMU common options
     - Fixed SOC P-state
     - P0
     -
   * - AMD CBS / security
     - TSME
     - Disabled
     - Memory encryption.
.. _mi300x-grub-settings:

GRUB settings
-------------

In any modern Linux distribution, the ``/etc/default/grub`` file is used to
configure GRUB. In this file, the string assigned to ``GRUB_CMDLINE_LINUX`` holds
the command-line parameters that Linux uses during boot.

Appending strings via the Linux command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is recommended to append the following strings to ``GRUB_CMDLINE_LINUX``.

``pci=realloc=off``
   With this setting, Linux is able to unambiguously detect all GPUs of the
   MI300X-based system because this setting disables the automatic reallocation
   of PCI resources. It's used when Single Root I/O Virtualization (SR-IOV) Base
   Address Registers (BARs) have not been allocated by the BIOS. This can help
   avoid potential issues with certain hardware configurations.

``iommu=pt``
   The ``iommu=pt`` setting enables IOMMU pass-through mode. When in pass-through
   mode, the adapter does not need to use DMA translation to the memory, which can
   improve performance.

   IOMMU is a system-specific I/O mapping mechanism and can be used for DMA mapping
   and isolation. This can be beneficial for virtualization and device assignment
   to virtual machines. It is recommended to enable IOMMU support.

   For a system that has AMD host CPUs, add this to ``GRUB_CMDLINE_LINUX``:

   .. code-block:: text

      iommu=pt

   Otherwise, if the system has Intel host CPUs, add this instead to
   ``GRUB_CMDLINE_LINUX``:

   .. code-block:: text

      intel_iommu=on iommu=pt

``modprobe.blacklist=amdgpu``
   For some system configurations, the ``amdgpu`` driver needs to be blocked during kernel initialization to avoid an issue where, after boot, the GPUs are not listed when running the ``rocm-smi`` or ``amd-smi`` command.

   Alternatively, configuring the AMD recommended system optimized BIOS settings might remove the need for this setting. Some manufacturers and users might not implement the recommended system optimized BIOS settings.

   If you experience the mentioned issue, add this to ``GRUB_CMDLINE_LINUX``:

   .. code-block:: text

      modprobe.blacklist=amdgpu

   After the change, the ``amdgpu`` module must be loaded to support the ROCm framework
   software tools and utilities. Run the following command to load the ``amdgpu`` module:

   .. code-block:: shell

      sudo modprobe amdgpu
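Editing ``GRUB_CMDLINE_LINUX`` can be scripted. A sketch, assuming the standard ``/etc/default/grub`` format and using a scratch file so nothing real is modified:

```shell
# Sketch: append iommu=pt to GRUB_CMDLINE_LINUX. A scratch copy stands in
# for /etc/default/grub so the example runs without touching the real file.
GRUB_FILE=$(mktemp)
echo 'GRUB_CMDLINE_LINUX="quiet splash"' > "$GRUB_FILE"

# Insert the parameter before the closing quote, unless it is already present.
grep -q 'iommu=pt' "$GRUB_FILE" || \
  sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 iommu=pt"/' "$GRUB_FILE"

RESULT=$(cat "$GRUB_FILE")
echo "$RESULT"
rm -f "$GRUB_FILE"
```

Remember to regenerate the GRUB configuration afterwards, as described in the next section.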
Update GRUB
-----------

Update GRUB to use the modified configuration:

.. code-block:: shell

   sudo grub2-mkconfig -o /boot/grub2/grub.cfg

On some Debian systems, the ``grub2-mkconfig`` command may not be available. Instead,
check for the presence of ``grub-mkconfig``. Additionally, verify that you have the
correct version by using the following command:

.. code-block:: shell

   grub-mkconfig --version
.. _mi300x-os-settings:

Operating system settings
-------------------------

CPU core states (C-states)
^^^^^^^^^^^^^^^^^^^^^^^^^^

There are several core states (C-states) that an AMD EPYC CPU can idle within:

* **C0**: active. This is the active state while running an application.

* **C1**: idle. This state consumes less power compared to C0, but can quickly
  return to the active state (C0) with minimal latency.

* **C2**: idle and power-gated. This is a deeper sleep state and will have greater
  latency when moving back to the active (C0) state as compared to when the CPU
  is coming out of C1.

Disabling C2 is important for running with a high-performance, low-latency
network. To disable the C2 state, install the ``cpupower`` tool using your Linux
distribution's package manager. ``cpupower`` is not a base package in most Linux
distributions. The specific package to be installed varies per Linux
distribution.

.. tab-set::

   .. tab-item:: Ubuntu
      :sync: ubuntu

      .. code-block:: shell

         sudo apt install linux-tools-common

   .. tab-item:: RHEL
      :sync: rhel

      .. code-block:: shell

         sudo yum install cpupowerutils

   .. tab-item:: SLES
      :sync: sles

      .. code-block:: shell

         sudo zypper install cpupower

Now, to disable power-gating on all cores on Linux systems, run the following command:

.. code-block:: shell

   cpupower idle-set -d 2
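To spot-check that a C-state stays disabled, Linux exposes per-state ``disable`` flags under ``/sys/devices/system/cpu/<cpu>/cpuidle/``; note that whether ``state2`` corresponds to C2 depends on the platform. A sketch using a scratch directory in place of the real sysfs tree so it runs anywhere:

```shell
# Scratch stand-in for /sys/devices/system/cpu/cpu0/cpuidle/state2/disable.
SYSFS=$(mktemp -d)
mkdir -p "$SYSFS/cpu0/cpuidle/state2"
echo 1 > "$SYSFS/cpu0/cpuidle/state2/disable"   # 1 = state disabled

DISABLED=$(cat "$SYSFS/cpu0/cpuidle/state2/disable")
if [ "$DISABLED" = "1" ]; then
  echo "C2 is disabled on cpu0"
fi
rm -rf "$SYSFS"
```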
``/proc`` and ``/sys`` file system settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _mi300x-disable-numa:

Disable NUMA auto-balancing
'''''''''''''''''''''''''''

The NUMA balancing feature allows the OS to scan memory and attempt to migrate
it to a DIMM that is logically closer to the cores accessing it. This causes an
overhead because the OS is second-guessing your NUMA allocations, but it may be
useful if NUMA locality of accesses is very poor. Applications can therefore, in
general, benefit from disabling NUMA balancing; however, there are workloads where
doing so is detrimental to performance. Test this setting
by toggling the ``numa_balancing`` value and running the application: compare
the performance of one run with this set to ``0`` and another run with this set to
``1``.

Run the command ``cat /proc/sys/kernel/numa_balancing`` to check the current
NUMA (non-uniform memory access) balancing setting. An output of ``0`` indicates the
setting is disabled. If there is no output, or the output is ``1``, run the command
``sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'`` to disable it.

For these settings, the ``env_check.sh`` script automates setting, resetting,
and checking your environments. Find the script at
`<https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh>`__.

Run the script as follows to set or reset the settings:

``./env_check.sh [set/reset/check]``

.. tip::

   Use ``./env_check.sh -h`` for help info.
Automate disabling NUMA auto-balancing using Cron
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :ref:`mi300x-disable-numa` section describes the command to disable NUMA
auto-balancing. To automate the command with Cron:

#. Edit the ``crontab`` configuration file for the root user:

   .. code-block:: shell

      sudo crontab -e

#. Add the following Cron entry to run the command at boot:

   .. code-block:: shell

      @reboot sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

#. Save the file and exit the text editor.

#. Optionally, restart the system to apply changes by issuing ``sudo reboot``.

#. Verify your new configuration:

   .. code-block:: shell

      cat /proc/sys/kernel/numa_balancing

   The ``/proc/sys/kernel/numa_balancing`` file controls NUMA balancing in the
   Linux kernel. If the value in this file is set to ``0``, NUMA balancing
   is disabled. If the value is set to ``1``, NUMA balancing is enabled.

.. note::

   Disabling NUMA balancing should be done cautiously and for
   specific reasons, such as performance optimization or addressing
   particular issues. Always test the impact of disabling NUMA balancing in
   a controlled environment before applying changes to a production system.
.. _mi300x-env-vars:

Environment variables
^^^^^^^^^^^^^^^^^^^^^

HIP provides the environment variable ``HIP_FORCE_DEV_KERNARG=1``, which
can place the arguments of HIP kernels directly in device memory to reduce the
latency of accessing those kernel arguments. It can improve performance by 2 to
3 µs for some kernels.

It is recommended to set the following environment variable:

.. code-block:: shell

   export HIP_FORCE_DEV_KERNARG=1

.. note::

   This is the default option as of ROCm 6.2.
Change affinity of ROCm helper threads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This change prevents internal ROCm threads from having their CPU core affinity mask
set to all available CPU cores. With this setting, the threads inherit their parent's
CPU core affinity mask. If you have any questions regarding this setting,
contact your MI300X platform vendor. To enable this setting, enter the following command:

.. code-block:: shell

   export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
IOMMU configuration -- systems with 256 CPU threads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For systems that have 256 logical CPU cores or more, setting the input-output
memory management unit (IOMMU) configuration to ``disabled`` can limit the
number of available logical cores to 255. The reason is that the Linux kernel
disables x2APIC in this case and falls back to the Advanced Programmable Interrupt
Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.

If SMT is enabled by setting ``CCD/Core/Thread Enablement > SMT Control`` to
``enable``, you can apply the following steps to the system to enable all
(logical) cores of the system:

#. In the server BIOS, set IOMMU to ``Enabled``.

#. When configuring the GRUB boot loader, add the following argument for the Linux kernel: ``iommu=pt``.

#. Update GRUB.

#. Reboot the system.

#. Verify IOMMU passthrough mode by inspecting the kernel log via ``dmesg``:

   .. code-block:: shell

      dmesg | grep iommu

   .. code-block:: shell

      [...]
      [    0.000000] Kernel command line: [...] iommu=pt
      [...]

Once the system is properly configured, ROCm software can be
:doc:`installed <rocm-install-on-linux:index>`.
.. _mi300x-system-management:

System management
=================

To optimize system performance, it's essential to first understand the existing
system configuration parameters and settings. ROCm offers several CLI tools that
can provide system-level information, offering valuable insights for
optimizing user applications.

For a complete guide on how to install, manage, or uninstall ROCm on Linux, refer to
:doc:`rocm-install-on-linux:install/quick-start`. To verify that the
installation was successful, refer to the
:doc:`rocm-install-on-linux:install/post-install` guide.
Should verification fail, consult :doc:`/how-to/system-debugging`.
.. _mi300x-hardware-verification-with-rocm:
|
||||
|
||||
Hardware verification with ROCm
|
||||
-------------------------------
|
||||
|
||||
The ROCm platform provides tools to query the system structure. These include
|
||||
:ref:`ROCm SMI <mi300x-rocm-smi>` and :ref:`ROCm Bandwidth Test <mi300x-bandwidth-test>`.
|
||||
|
||||
.. _mi300x-rocm-smi:
|
||||
|
||||
ROCm SMI
|
||||
^^^^^^^^
|
||||
|
||||
To query your GPU hardware, use the ``rocm-smi`` command. ROCm SMI lists
|
||||
GPUs available to your system -- with their device ID and their respective
|
||||
firmware (or VBIOS) versions.
|
||||
|
||||
The following screenshot shows that all 8 GPUs of MI300X are recognized by ROCm.
|
||||
Performance of an application could be otherwise suboptimal if, for example, out
|
||||
of the 8 GPUs only 5 of them are recognized.
|
||||
|
||||
.. image:: ../../data/how-to/tuning-guides/rocm-smi-showhw.png
|
||||
:align: center
|
||||
:alt: ``rocm-smi --showhw`` output
|
||||
|
||||
To see the system structure, the localization of the GPUs in the system, and the
|
||||
fabric connections between the system components, use the command
|
||||
``rocm-smi --showtopo``.
|
||||
|
||||
.. image:: ../../data/how-to/tuning-guides/rocm-smi-showtopo.png
|
||||
:align: center
|
||||
:alt: ``rocm-smi --showtopo`` output
|
||||
|
||||
The first block of the output shows the distance between the GPUs similar to
|
||||
what the ``numactl`` command outputs for the NUMA domains of a system. The
|
||||
weight is a qualitative measure for the “distance” data must travel to reach one
|
||||
GPU from another one. While the values do not carry a special, or "physical"
|
||||
meaning, the higher the value the more hops are needed to reach the destination
|
||||
from the source GPU. This information has performance implication for a
|
||||
GPU-based application that moves data among GPUs. You can choose a minimum
|
||||
distance among GPUs to be used to make the application more performant.
|
||||
|
||||
The second block has a matrix named *Hops between two GPUs*, where:
|
||||
|
||||
* ``1`` means the two GPUs are directly connected with xGMI,
|
||||
|
||||
* ``2`` means both GPUs are linked to the same CPU socket and GPU communications
|
||||
will go through the CPU, and
|
||||
|
||||
* ``3`` means both GPUs are linked to different CPU sockets so communications will
|
||||
go through both CPU sockets. This number is one for all GPUs in this case
|
||||
since they are all connected to each other through the Infinity Fabric links.
|
||||
|
||||
The third block outputs the link types between the GPUs. This can either be
|
||||
``XGMI`` for AMD Infinity Fabric links or ``PCIE`` for PCIe Gen5 links.
|
||||
|
||||
The fourth block reveals the localization of a GPU with respect to the NUMA
|
||||
organization of the shared memory of the AMD EPYC processors.
|
||||
|
||||
To query the compute capabilities of the GPU devices, use the ``rocminfo``
command. It lists specific details about the GPU devices, including but not
limited to the number of compute units, width of the SIMD pipelines, memory
information, and instruction set architecture (ISA). The following is the
truncated output of the command:

.. image:: ../../data/how-to/tuning-guides/rocminfo.png
   :align: center
   :alt: rocminfo.txt example
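Rather than reading the full dump by eye, individual fields can be pulled out
with a short script. This Python sketch parses a hypothetical
``rocminfo``-style excerpt; the sample text below is made up for illustration,
not captured output:

```python
# Illustrative excerpt in the "key: value" style that rocminfo prints;
# this sample text is invented for the example, not real tool output.
SAMPLE = """\
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
  Compute Unit:            304
  Wavefront Size:          64
"""

def parse_fields(text):
    """Split `key: value` lines into a dict, trimming whitespace."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

info = parse_fields(SAMPLE)
print(info["Name"], info["Compute Unit"])
```

The same approach works for any of the key-value fields in the real dump,
assuming the excerpt you feed it keeps the one-field-per-line layout.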
For a complete list of architecture names (such as CDNA3) and LLVM target names
(such as gfx942 for MI300X), refer to the
:doc:`Supported GPUs section of the System requirements for Linux page <rocm-install-on-linux:reference/system-requirements>`.
Deterministic clock
'''''''''''''''''''

Use the command ``rocm-smi --setperfdeterminism 1900`` to set the maximum clock
speed to 1900 MHz instead of the default 2100 MHz. This can reduce the chance
of a PCC event lowering the attainable GPU clocks. This setting will not be
required for new IFWI releases with the production PRC feature. Restore this
setting to its default value with the ``rocm-smi -r`` command.
.. _mi300x-bandwidth-test:

ROCm Bandwidth Test
^^^^^^^^^^^^^^^^^^^

The section Hardware verification with ROCm showed how the command
``rocm-smi --showtopo`` can be used to view the system structure and how the
GPUs are connected. For more details on the link bandwidth,
``rocm-bandwidth-test`` can run benchmarks to show the effective link bandwidth
between the components of the system.

You can install ROCm Bandwidth Test, which can test inter-device bandwidth,
using the following package manager commands:
.. tab-set::

   .. tab-item:: Ubuntu
      :sync: ubuntu

      .. code-block:: shell

         sudo apt install rocm-bandwidth-test

   .. tab-item:: RHEL
      :sync: rhel

      .. code-block:: shell

         sudo yum install rocm-bandwidth-test

   .. tab-item:: SLES
      :sync: sles

      .. code-block:: shell

         sudo zypper install rocm-bandwidth-test

Alternatively, you can download the source code from
`<https://github.com/ROCm/rocm_bandwidth_test>`__ and build from source.
The output lists the available compute devices (CPUs and GPUs), including
their device IDs and PCIe IDs. The following screenshot shows the beginning of
the output of running ``rocm-bandwidth-test``, with the devices present in the
system.

.. image:: ../../data/how-to/tuning-guides/rocm-bandwidth-test.png
   :align: center
   :alt: rocm-bandwidth-test sample output

The output also shows a matrix that contains a ``1`` if a device can
communicate with another device (CPU or GPU) of the system, and it shows the
NUMA distance, similar to ``rocm-smi``.
Inter-device distance:

.. figure:: ../../data/how-to/tuning-guides/rbt-inter-device-access.png
   :align: center
   :alt: rocm-bandwidth-test inter-device distance

   Inter-device distance

Inter-device NUMA distance:

.. figure:: ../../data/how-to/tuning-guides/rbt-inter-device-numa-distance.png
   :align: center
   :alt: rocm-bandwidth-test inter-device NUMA distance

   Inter-device NUMA distance

The output also contains the measured bandwidth for unidirectional and
bidirectional transfers between the devices (CPU and GPU):

Unidirectional bandwidth:

.. figure:: ../../data/how-to/tuning-guides/rbt-unidirectional-bandwidth.png
   :align: center
   :alt: rocm-bandwidth-test unidirectional bandwidth

   Unidirectional bandwidth

Bidirectional bandwidth:

.. figure:: ../../data/how-to/tuning-guides/rbt-bidirectional-bandwidth.png
   :align: center
   :alt: rocm-bandwidth-test bidirectional bandwidth

   Bidirectional bandwidth
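Once measured, a bandwidth matrix like the ones above is easy to mine
programmatically. This Python sketch scans a hypothetical unidirectional
bandwidth matrix (the GB/s values below are invented for illustration, not
real ``rocm-bandwidth-test`` measurements) for the fastest device pair:

```python
# Hypothetical unidirectional bandwidth matrix (GB/s) between 4 devices,
# in the spirit of rocm-bandwidth-test output; all values are made up.
BW = [
    [0.0, 48.0, 36.0, 36.0],
    [48.0, 0.0, 36.0, 36.0],
    [36.0, 36.0, 0.0, 48.0],
    [36.0, 36.0, 48.0, 0.0],
]

def best_pair(bw):
    """Return (src, dst, GB/s) for the fastest measured device pair."""
    best = (0, 1, bw[0][1])
    for i, row in enumerate(bw):
        for j, rate in enumerate(row):
            if i != j and rate > best[2]:
                best = (i, j, rate)
    return best

src, dst, rate = best_pair(BW)
print(f"fastest link: {src} -> {dst} at {rate} GB/s")
```

The same scan over the real matrix would flag which device pairs sit on the
fastest links, which is useful when deciding where to place
communication-heavy ranks.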
Abbreviations
=============

AMI
   American Megatrends International

APBDIS
   Algorithmic Performance Boost Disable

ATS
   Address Translation Services

BAR
   Base Address Register

BIOS
   Basic Input/Output System

CBS
   Common BIOS Settings

CLI
   Command Line Interface

CPU
   Central Processing Unit

cTDP
   Configurable Thermal Design Power

DDR5
   Double Data Rate 5 DRAM

DF
   Data Fabric

DIMM
   Dual In-line Memory Module

DMA
   Direct Memory Access

DPM
   Dynamic Power Management

GPU
   Graphics Processing Unit

GRUB
   Grand Unified Bootloader

HPC
   High Performance Computing

IOMMU
   Input-Output Memory Management Unit

ISA
   Instruction Set Architecture

LCLK
   Link Clock Frequency

NBIO
   North Bridge Input/Output

NUMA
   Non-Uniform Memory Access

PCC
   Power Consumption Control

PCI
   Peripheral Component Interconnect

PCIe
   PCI Express

POR
   Power-On Reset

SIMD
   Single Instruction, Multiple Data

SMT
   Simultaneous Multi-threading

SMI
   System Management Interface

SOC
   System On Chip

SR-IOV
   Single Root I/O Virtualization

TP
   Tensor Parallelism

TSME
   Transparent Secure Memory Encryption

X2APIC
   Extended Advanced Programmable Interrupt Controller

xGMI
   Inter-chip Global Memory Interconnect
@@ -7,6 +7,36 @@ myst:

# AMD RDNA2 system optimization

## Workstation workloads

Workstation workloads, much like those for HPC, have a unique set of
requirements: a blend of both graphics and compute, certification, stability,
and others.

This document covers specific software requirements and processes needed to use
these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning
tasks.

The main purpose of this document is to help users utilize the RDNA™ 2 GPUs to
their full potential.

```{list-table}
:header-rows: 1
:stub-columns: 1

* - System Guide
  - Architecture reference
  - White papers
* - [System settings](#system-settings)
  - [AMD RDNA 2 instruction set architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf)
  - [RDNA 2 architecture](https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf)
```

## System settings

This chapter reviews system settings that are required to configure the system
docs/how-to/tuning-guides/mi300x/index.rst (19 lines, Normal file)
@@ -0,0 +1,19 @@
.. meta::
   :description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
   :keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning

************************
AMD MI300X tuning guides
************************

The tuning guides in this section provide a comprehensive summary of the
necessary steps to properly configure your system for AMD Instinct™ MI300X
accelerators. They include detailed instructions on system settings and
application tuning suggestions to help you fully leverage the capabilities of
these accelerators, thereby achieving optimal performance.

* :doc:`../../rocm-for-ai/inference/vllm-benchmark`
* :doc:`../../rocm-for-ai/inference-optimization/workload`
* `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
@@ -53,13 +53,10 @@ ROCm documentation is organized into the following categories:
:class-body: rocm-card-banner rocm-hue-8

* [GPU architecture overview](./conceptual/gpu-arch.md)
* [Input-Output Memory Management Unit (IOMMU)](./conceptual/iommu.rst)
* [File structure (Linux FHS)](./conceptual/file-reorg.md)
* [GPU isolation techniques](./conceptual/gpu-isolation.md)
* [Using CMake](./conceptual/cmake-packages.rst)
* [PCIe atomics in ROCm](./conceptual/pcie-atomics.rst)
* [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
:::

:::{grid-item-card} Reference
@@ -101,14 +101,6 @@ subtrees:
      title: System optimization
      subtrees:
        - entries:
          - file: how-to/system-optimization/mi300x.rst
            title: AMD Instinct MI300X
          - file: how-to/system-optimization/mi300a.rst
            title: AMD Instinct MI300A
          - file: how-to/system-optimization/mi200.md
            title: AMD Instinct MI200
          - file: how-to/system-optimization/mi100.md
            title: AMD Instinct MI100
          - file: how-to/system-optimization/w6000-v620.md
            title: AMD RDNA 2
    - file: how-to/gpu-performance/mi300x.rst
@@ -164,20 +156,14 @@ subtrees:
          title: AMD Instinct MI100/CDNA1 ISA
        - url: https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna-white-paper.pdf
          title: White paper
    - file: conceptual/iommu.rst
      title: Input-Output Memory Management Unit (IOMMU)
    - file: conceptual/file-reorg.md
      title: File structure (Linux FHS)
    - file: conceptual/gpu-isolation.md
      title: GPU isolation techniques
    - file: conceptual/cmake-packages.rst
      title: Using CMake
    - file: conceptual/pcie-atomics.rst
      title: PCIe atomics in ROCm
    - file: conceptual/ai-pytorch-inception.md
      title: Inception v3 with PyTorch
    - file: conceptual/oversubscription.rst
      title: Oversubscription of hardware resources

- caption: Reference
  entries: