AMD GPU Docs System optimization migration changes in ROCm Docs Develop (#4538)

* AMD GPU Docs System optimization migration changes in ROCm Docs (#296)

* System optimization migration changes in ROCm

* Linting issue fixed

* Linking corrected

* Minor change

* Link updated to Instinct.docs.amd.com

* ROCm docs grid updated by removing IOMMU.rst, pcie-atomics, and oversubscription pages

* Files removed and reference fixed

* Reference text updated

* GPU atomics from 6.4.0 removed
This commit is contained in:
Pratik Basyal
2025-03-27 16:38:10 -04:00
committed by GitHub
parent 4bee895a1b
commit a0faccba37
14 changed files with 60 additions and 2491 deletions

View File

@@ -1,63 +0,0 @@
.. meta::
:description: Input-Output Memory Management Unit (IOMMU)
:keywords: IOMMU, DMA, PCIe, xGMI, AMD, ROCm
****************************************************************
Input-Output Memory Management Unit (IOMMU)
****************************************************************
The I/O Memory Management Unit (IOMMU) provides memory remapping services for I/O devices. It adds support for address translation and system memory access protection on direct memory access (DMA) transfers from peripheral devices.
The IOMMU's memory remapping services:
* provide private I/O space for devices used in a guest virtual machine.
* prevent unauthorized DMA requests to system memory and to memory-mapped I/O (MMIO).
* help in debugging memory access issues.
* facilitate peer-to-peer DMA.
The IOMMU also provides interrupt remapping, which is used by devices that support multiple interrupts and for interrupt delivery on hardware platforms with a large number of cores.
.. note::
AMD Instinct accelerators are connected via XGMI links and don't use PCI/PCIe for peer-to-peer DMA. Because PCI/PCIe is not used for peer-to-peer DMA, there are no device physical addressing limitations or platform root port limitations. However, because non-GPU devices such as RDMA NICs use PCIe for peer-to-peer DMA, there might still be physical addressing and platform root port limitations when these non-GPU devices interact with other devices, including GPUs.
Linux supports IOMMU in both virtualized environments and bare metal.
The IOMMU is enabled by default but can be disabled or put into passthrough mode through the Linux kernel command line:
.. list-table::
:widths: 15 15 70
:header-rows: 1
* - IOMMU Mode
- Kernel command
- Description
* - Enabled
- Default setting
- Recommended for AMD Radeon GPUs that need peer-to-peer DMA.
The IOMMU is enabled in remapping mode. Each device gets its own I/O virtual address space. All devices on Linux register their DMA addressing capabilities, and the kernel will ensure that any address space mapped for DMA is mapped within the device's DMA addressing limits. Only address space explicitly mapped by the devices will be mapped into virtual address space. Attempts to access an unmapped page will generate an IOMMU page fault.
* - Passthrough
- ``iommu=pt``
- Recommended for AMD Instinct Accelerators and for AMD Radeon GPUs that don't need peer-to-peer DMA.
Interrupt remapping is enabled but I/O remapping is disabled. The entire platform shares a common platform address space for system memory and MMIO spaces, ensuring compatibility with drivers from external vendors, while still supporting CPUs with a large number of cores.
* - Disabled
- ``iommu=off``
- Not recommended.
The IOMMU is disabled and the entire platform shares a common platform address space for system memory and MMIO spaces.
This mode should only be used with older Linux distributions with kernels that are not configured to support peer-to-peer DMA with an IOMMU. In these cases, the IOMMU needs to be disabled to use peer-to-peer DMA.
The IOMMU also provides virtualized access to the MMIO portions of the platform address space for peer-to-peer DMA.
Because peer-to-peer DMA is not officially part of the PCI/PCIe specification, the behavior of peer-to-peer DMA varies between hardware platforms.
AMD CPUs earlier than the Zen architecture supported peer-to-peer DMA only for writes. On Zen and later CPUs, peer-to-peer DMA is fully supported.
To use peer-to-peer DMA on Linux, enable the following options in your Linux kernel configuration:
* ``CONFIG_PCI_P2PDMA``
* ``CONFIG_DMABUF_MOVE_NOTIFY``
* ``CONFIG_HSA_AMD_P2P``
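A quick way to confirm these options on a running system is to check the kernel command line and the running kernel's configuration. This is a sketch; the configuration file path (``/boot/config-$(uname -r)``) varies by distribution.

.. code-block:: shell

   # Show any IOMMU mode passed on the kernel command line
   grep -o 'iommu=[^ ]*' /proc/cmdline

   # Verify that the peer-to-peer DMA options are enabled in the running kernel
   grep -E 'CONFIG_PCI_P2PDMA|CONFIG_DMABUF_MOVE_NOTIFY|CONFIG_HSA_AMD_P2P' /boot/config-$(uname -r)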

View File

@@ -1,34 +0,0 @@
.. meta::
:description: Learn what causes oversubscription.
:keywords: warning, log, gpu, performance penalty, help
*******************************************************************
Oversubscription of hardware resources in AMD Instinct accelerators
*******************************************************************
When an AMD Instinct™ MI series accelerator enters an oversubscribed state, the ``amdgpu`` driver outputs the following
message.
``amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.``
Oversubscription occurs when application demands exceed the available hardware resources. In an oversubscribed
state, the hardware scheduler tries to manage resource usage in a round-robin fashion. However,
this can result in reduced performance, as resources might be occupied by applications or queues not actively
submitting work. The granularity of hardware resources occupied by an inactive queue can be in the order of
milliseconds, during which the accelerator or GPU is effectively blocked and unable to process work submitted by other
queues.
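To check whether a system has entered this state, you can search the kernel log for the driver message. This is a minimal example; the exact wording can vary between driver versions.

.. code-block:: shell

   sudo dmesg | grep -i "runlist is getting oversubscribed"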
What triggers oversubscription?
===============================
The system enters an oversubscribed state when one of the following conditions is met:
* **Hardware queue limit exceeded**: The number of user-mode compute queues requested by applications exceeds the
hardware limit of 24 queues for current Instinct accelerators.
* **Virtual memory context slots exceeded**: The number of user processes exceeds the number of available virtual memory
context slots, which is 11 for current Instinct accelerators.
* **Multiple processes using cooperative workgroups**: More than one process attempts to use the cooperative workgroup
feature, leading to resource contention.

View File

@@ -1,57 +0,0 @@
.. meta::
:description: How ROCm uses PCIe atomics
:keywords: PCIe, PCIe atomics, atomics, Atomic operations, AMD, ROCm
*****************************************************************************
How ROCm uses PCIe atomics
*****************************************************************************
AMD ROCm is an extension of the Heterogeneous System Architecture (HSA). To meet the requirements of an HSA-compliant system, ROCm supports queuing models, memory models, and signaling and synchronization protocols. ROCm can perform atomic Read-Modify-Write (RMW) transactions that extend inter-processor synchronization mechanisms to Input/Output (I/O) devices starting from Peripheral Component Interconnect Express 3.0 (PCIe™ 3.0). It supports the defined HSA capabilities for queuing and signaling memory operations. To learn more about the requirements of an HSA-compliant system, see the
`HSA Platform System Architecture Specification <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
ROCm uses platform atomics to perform memory operations like queuing, signaling, and synchronization across multiple CPU, GPU agents, and I/O devices. Platform atomics ensure that atomic operations run synchronously, without interruptions or conflicts, across multiple shared resources.
Platform atomics in ROCm
==============================
Platform atomics enable the set of atomic operations that perform RMW actions across multiple processors, devices, and memory locations so that they run synchronously without interruption. An atomic operation is a sequence of computing instructions run as a single, indivisible unit. These instructions are completed in their entirety without any interruptions. If the instructions can't be completed as a unit without interruption, none of the instructions are run. These operations support 32-bit and 64-bit address formats.
Some of the operations for which ROCm uses platform atomics are:
* Updating the HSA queue's ``read_dispatch_id``: the command processor on the GPU agent uses a 64-bit atomic add operation to update the ID of the packet it has processed.
* Updating the HSA queue's ``write_dispatch_id``: the CPU and GPU agents use a 64-bit atomic add operation to support multi-writer queue insertions.
* Updating HSA signals: a 64-bit atomic operation is used for CPU and GPU synchronization.
PCIe for atomic operations
----------------------------
ROCm requires CPUs that support PCIe atomics, and all connected I/O devices should also support PCIe atomics for optimal compatibility. PCIe supports the ``CAS`` (Compare and Swap), ``FetchADD``, and ``SWAP`` atomic operations across multiple resources. These atomic operations are initiated by I/O devices and operate on 32-bit, 64-bit, or 128-bit operands. The target memory address of an atomic operation must be aligned to the size of the operand; this alignment ensures that the operations complete correctly and efficiently.
When an atomic operation is successful, the requester receives a completion response along with the operation result. Any errors associated with the operation are signaled to the requester through the Completion Status field in the Completion Descriptor. Common errors include issues accessing the target location or running the atomic operation. Depending on the error, the Completion Status field is set to Completer Abort (CA) or Unsupported Request (UR).
To learn more about the industry standards and specifications of PCIe, see `PCI-SIG Specification <https://pcisig.com/specifications>`_.
To learn more about PCIe and its capabilities, consult the following white papers:
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
* `PCI Express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
Working with PCIe 3.0 in ROCm
-------------------------------
Starting with PCIe 3.0, atomic operations can be requested, routed through, and completed by PCIe components. Routing and completion do not require software support. Component support for each can be identified by the Device Capabilities 2 (DevCap2) register. Upstream
bridges need to have atomic operations routing enabled. If not enabled, the atomic operations will fail even if the
PCIe endpoint and PCIe I/O devices can perform atomic operations.
If your system uses PCIe switches to connect and enable communication between multiple PCIe components, the switches must also support atomic operations routing.
To enable atomic operations routing between multiple root ports, each root port must support atomic operation routing. This capability is indicated by the atomic operations routing support bit in the DevCap2 register: if the bit is set to 1, routing is supported. Atomic operation requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
field is set. These requests can only be serviced if the upstream components also support atomic operation completion or if the requests can be routed to a component that supports atomic operation completion.
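On Linux, one way to inspect the atomic operation capability and control bits advertised in a device's DevCap2 and DevCtl2 registers is with ``lspci``. The following is a sketch; replace the bus/device/function address with the device you want to inspect, and note that the exact field names shown depend on your ``lspci`` version.

.. code-block:: shell

   # Replace 0000:03:00.0 with the address of the device to inspect
   sudo lspci -s 0000:03:00.0 -vvv | grep -i atomicops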
ROCm uses the PCIe-ID-based ordering technology for peer-to-peer (P2P) data transmission. PCIe-ID-based ordering technology is used when the GPU initiates multiple write operations to different memory locations.
For more information on changes implemented in PCIe 3.0, see `Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.

View File

@@ -23,7 +23,7 @@ There are two ways to handle this:
* Ensure that the high MMIO aperture is within the physical addressing limits of the devices in the system. For example, if the devices have a 44-bit physical addressing limit, set the ``MMIO High Base`` and ``MMIO High size`` options in the BIOS such that the aperture is within the 44-bit address range, and ensure that the ``Above 4G Decoding`` option is Enabled.
* Enable the Input-Output Memory Management Unit (IOMMU). When the IOMMU is enabled in non-passthrough mode, it will create a virtual I/O address space for each device on the system. It also ensures that all virtual addresses created in that space are within the physical addressing limits of the device. For more information on IOMMU, see :doc:`../conceptual/iommu`.
* Enable the Input-Output Memory Management Unit (IOMMU). When the IOMMU is enabled in non-passthrough mode, it will create a virtual I/O address space for each device on the system. It also ensures that all virtual addresses created in that space are within the physical addressing limits of the device. For more information on IOMMU, see `Input-Output Memory Management Unit (IOMMU) <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/conceptual/iommu.html>`_.
.. _bar-configuration:

View File

@@ -304,7 +304,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X accelerators, see :doc:`../../system-optimization/mi300x`.
  MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- To learn how to run LLM models from Hugging Face or your own model, see
:doc:`Running models from Hugging Face <hugging-face-models>`.

View File

@@ -8,108 +8,21 @@ System optimization
*******************
This guide outlines system setup and tuning suggestions for AMD hardware to
optimize performance for specific types of workloads or use-cases.
optimize performance for specific types of workloads or use-cases. The contents are structured according to the hardware:
High-performance computing workloads
====================================
.. grid:: 2
High-performance computing (HPC) workloads have unique requirements. The default
hardware and BIOS configurations for OEM platforms may not provide optimal
performance for HPC workloads. To enable optimal HPC settings on a per-platform
and per-workload level, this chapter describes:
.. grid-item-card:: AMD RDNA
* BIOS settings that can impact performance
* Hardware configuration best practices
* Supported versions of operating systems
* Workload-specific recommendations for optimal BIOS and operating system
settings
* :doc:`AMD RDNA2 system optimization <w6000-v620>`
There is also a discussion on the AMD Instinct™ software development
environment, including information on how to install and run the DGEMM, STREAM,
HPCG, and HPL benchmarks. This guide provides a good starting point but is
not tested exhaustively across all compilers.
.. grid-item-card:: AMD Instinct
Knowledge prerequisites to better understand this document and to perform tuning
for HPC applications include:
* `AMD Instinct MI300X <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
* `AMD Instinct MI300A <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300a.html>`_
* `AMD Instinct MI200 <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi200.html>`_
* `AMD Instinct MI100 <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi100.html>`_
* Experience in configuring servers
* Administrative access to the server's Management Interface (BMC)
* Administrative access to the operating system
* Familiarity with the OEM server's BMC (strongly recommended)
* Familiarity with the OS specific tools for configuration, monitoring, and
troubleshooting (strongly recommended)
This document provides guidance on tuning systems with various AMD Instinct
accelerators for HPC workloads. The following sections don't comprise an
all-inclusive guide, and some items referred to may have similar, but different,
names in various OEM systems (for example, OEM-specific BIOS settings). The
following sections also provide suggestions on items that should be the initial
focus of additional, application-specific tuning.
While this guide is a good starting point, developers are encouraged to perform
their own performance testing for additional tuning.
.. list-table::
:header-rows: 1
:stub-columns: 1
* - System optimization guide
- Architecture reference
- White papers
* - :doc:`AMD Instinct MI300X <mi300x>`
- `AMD Instinct MI300 instruction set architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`_
- `CDNA 3 architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf>`_
* - :doc:`AMD Instinct MI300A <mi300a>`
- `AMD Instinct MI300 instruction set architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`_
- `CDNA 3 architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf>`_
* - :doc:`AMD Instinct MI200 <mi200>`
- `AMD Instinct MI200 instruction set architecture <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`_
- `CDNA 2 architecture <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>`_
* - :doc:`AMD Instinct MI100 <mi100>`
- `AMD Instinct MI100 instruction set architecture <https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf>`_
- `CDNA architecture <https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf>`_
Workstation workloads
=====================
Workstation workloads, much like those for HPC, have a unique set of
requirements: a blend of graphics and compute, certification, stability, and
more.
The document covers specific software requirements and processes needed to use
these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning
tasks.
The main purpose of this document is to help users utilize the RDNA™ 2 GPUs to
their full potential.
.. list-table::
:header-rows: 1
:stub-columns: 1
* - System optimization guide
- Architecture reference
- White papers
* - :doc:`AMD Radeon PRO W6000 and V620 <w6000-v620>`
- `AMD RDNA 2 instruction set architecture <https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf>`_
- `RDNA 2 architecture <https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf>`_

View File

@@ -1,475 +0,0 @@
---
myst:
html_meta:
"description": "AMD Instinct MI100 system settings optimization guide."
"keywords": "Instinct, MI100, microarchitecture, AMD, ROCm"
---
# AMD Instinct MI100 system optimization
## System settings
This chapter reviews system settings that are required to configure the system
for AMD Instinct™ MI100 accelerators and that can improve performance of the
GPUs. It is advised to configure the system for best possible host configuration
according to the high-performance computing tuning guides for AMD EPYC™
7002 Series and EPYC™ 7003 Series processors, depending on the processor generation of the
system.
In addition to the BIOS settings listed below ({ref}`mi100-bios-settings`), the
following settings must also be enacted via the command line (see
{ref}`mi100-os-settings`):
* Core C states
* AMD-IOPM-UTIL (on AMD EPYC™ 7002 series processors)
* IOMMU (if needed)
(mi100-bios-settings)=
### System BIOS settings
For maximum MI100 GPU performance on systems with AMD EPYC™ 7002 series
processors (codename "Rome") and AMI System BIOS, the following configuration of
System BIOS settings has been validated. These settings must be used for the
qualification process and should be set as default values for the system BIOS.
Analogous settings for other non-AMI System BIOS providers could be set
similarly. For systems with Intel processors, some settings may not apply or be
available as listed in the following table.
```{list-table} Recommended settings for the system BIOS in a GIGABYTE platform.
:header-rows: 1
:name: mi100-bios
*
- BIOS Setting Location
- Parameter
- Value
- Comments
*
- Advanced / PCI Subsystem Settings
- Above 4G Decoding
- Enabled
- GPU Large BAR Support
*
- AMD CBS / CPU Common Options
- Global C-state Control
- Auto
- Global C-States
*
- AMD CBS / CPU Common Options
- CCD/Core/Thread Enablement
- Accept
- Global C-States
*
- AMD CBS / CPU Common Options / Performance
- SMT Control
- Disable
- Global C-States
*
- AMD CBS / DF Common Options / Memory Addressing
- NUMA nodes per socket
- NPS 1,2,4
- NUMA Nodes (NPS)
*
- AMD CBS / DF Common Options / Memory Addressing
- Memory interleaving
- Auto
- Numa Nodes (NPS)
*
- AMD CBS / DF Common Options / Link
- 4-link xGMI max speed
- 18 Gbps
- Set AMD CPU xGMI speed to highest rate supported
*
- AMD CBS / DF Common Options / Link
- 3-link xGMI max speed
- 18 Gbps
- Set AMD CPU xGMI speed to highest rate supported
*
- AMD CBS / NBIO Common Options
- IOMMU
- Disable
-
*
- AMD CBS / NBIO Common Options
- PCIe Ten Bit Tag Support
- Enable
-
*
- AMD CBS / NBIO Common Options
- Preferred IO
- Manual
-
*
- AMD CBS / NBIO Common Options
- Preferred IO Bus
- "Use lspci to find pci device id"
-
*
- AMD CBS / NBIO Common Options
- Enhanced Preferred IO Mode
- Enable
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Determinism Control
- Manual
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Determinism Slider
- Power
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- cTDP Control
- Manual
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- cTDP
- 240
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Package Power Limit Control
- Manual
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Package Power Limit
- 240
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- xGMI Link Width Control
- Manual
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- xGMI Force Link Width
- 2
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- xGMI Force Link Width Control
- Force
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- APBDIS
- 1
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- DF C-states
- Auto
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Fixed SOC P-state
- P0
-
*
- AMD CBS / UMC Common Options / DDR4 Common Options
- Enforce POR
- Accept
-
*
- AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
- Overclock
- Enabled
-
*
- AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
- Memory Clock Speed
- 1600 MHz
- Set to max Memory Speed, if using 3200 MHz DIMMs
*
- AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller
Configuration / DRAM Power Options
- Power Down Enable
- Disabled
- RAM Power Down
*
- AMD CBS / Security
- TSME
- Disabled
- Memory Encryption
```
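The `Preferred IO Bus` setting above expects a PCI bus address found with `lspci`. As a sketch, one way to locate the GPU's address is to filter by the AMD PCI vendor ID (`1002`):

```shell
# List AMD devices together with their PCI bus addresses
lspci -d 1002: -nn
```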
#### NBIO link clock frequency
The NBIOs (4x per AMD EPYC™ processor) are the serializers/deserializers (also
known as "SerDes") that convert and prepare the I/O signals for the processor's
128 external I/O interface lanes (32 per NBIO).
LCLK (short for link clock frequency) controls the link speed of the internal
bus that connects the NBIO silicon with the data fabric. All data between the
processor and its PCIe lanes flows to the data fabric based on these LCLK
frequency settings. The link clock frequency of the NBIO components needs to be
forced to the maximum frequency for optimal PCIe performance.
For AMD EPYC™ 7002 series processors, this setting cannot be modified via
configuration options in the server BIOS alone. Instead, the AMD-IOPM-UTIL utility (see
the AMD-IOPM-UTIL section below) must be run at every server boot to disable Dynamic Power
Management for all PCIe Root Complexes and NBIOs within the system and to lock
the logic into the highest performance operational mode.
For AMD EPYC™ 7003 series processors, configuring all NBIOs to be in "Enhanced
Preferred I/O" mode is sufficient to enable highest link clock frequency for the
NBIO components.
#### Memory configuration
For the memory addressing modes, especially the
number of NUMA nodes per socket/processor (NPS), the recommended setting is
to follow the guidance of the high-performance computing tuning guides
for AMD EPYC™ 7002 Series and AMD EPYC™ 7003 Series processors to provide the optimal
configuration for host side computation.
If the system is set to one NUMA domain per socket/processor (NPS1),
bidirectional copy bandwidth between host memory and GPU memory may be
slightly higher (up to about 16% more) than with four NUMA domains per
socket/processor (NPS4). For memory bandwidth-sensitive applications using MPI, NPS4
is recommended. For applications that are not optimized for NUMA locality,
NPS1 is the recommended setting.
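To confirm how many NUMA domains the platform currently exposes to the operating system, one option is to query the CPU topology (a quick check, assuming `lscpu` is available):

```shell
lscpu | grep -i "numa node"
```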
(mi100-os-settings)=
### Operating system settings
#### CPU core states - C-states
There are several core states (C-states) that an AMD EPYC CPU can idle within:
* C0: active. This is the active state while running an application.
* C1: idle
* C2: idle and power gated. This is a deeper sleep state and will have a
greater latency when moving back to the C0 state, compared to when the
CPU is coming out of C1.
Disabling C2 is important for running with a high performance, low-latency
network. To disable power-gating on all cores, run the following on Linux
systems:
```shell
cpupower idle-set -d 2
```
Note that the `cpupower` tool must be installed, as it is not part of the base
packages of most Linux® distributions. The package needed varies with the
respective Linux distribution.
::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu
```shell
sudo apt install linux-tools-common
```
:::
:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL
```shell
sudo yum install cpupowerutils
```
:::
:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES
```shell
sudo zypper install cpupower
```
:::
::::
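After the tool is installed and C2 has been disabled, the current idle-state configuration can be reviewed as follows (a sketch; the list of states and their numbering differ between CPU generations):

```shell
cpupower idle-info
```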
#### AMD-IOPM-UTIL
This section applies to AMD EPYC™ 7002 processors to optimize advanced
Dynamic Power Management (DPM) in the I/O logic (see NBIO description above)
for performance. Certain I/O workloads may benefit from disabling this power
management. This utility disables DPM for all PCI-e root complexes in the
system and locks the logic into the highest performance operational mode.
Disabling I/O DPM will reduce the latency and/or improve the throughput of
low-bandwidth messages for PCI-e InfiniBand NICs and GPUs. Other workloads
with low-bandwidth bursty PCI-e I/O characteristics may benefit as well if
multiple such PCI-e devices are installed in the system.
The actions of the utility do not persist across reboots. There is no need to
change any existing firmware settings when using this utility. The "Preferred
I/O" and "Enhanced Preferred I/O" settings should remain unchanged at enabled.
```{tip}
The recommended method to use the utility is either to create a system
start-up script, for example, a one-shot `systemd` service unit, or run the
utility when starting up a job scheduler on the system. The installer
packages (see
[Power Management Utility](https://developer.amd.com/iopm-utility/)) will
create and enable a `systemd` service unit for you. This service unit is
configured to run in one-shot mode. This means that even when the service
unit runs as expected, the status of the service unit will show inactive.
This is the expected behavior when the utility runs normally. If the service
unit shows failed, the utility did not run as expected. The output in either
case can be shown with the `systemctl status` command.
Stopping the service unit has no effect since the utility does not leave
anything running. To undo the effects of the utility, disable the service
unit with the `systemctl disable` command and reboot the system.
The utility does not have any command-line options, and it must be run with
super-user permissions.
```
#### Systems with 256 CPU threads - IOMMU configuration
For systems that have 256 logical CPU cores or more (e.g., 64-core AMD EPYC™
7763 in a dual-socket configuration and SMT enabled), setting the input-output
memory management unit (IOMMU) configuration to "disabled" can limit the number
of available logical cores to 255. The reason is that the Linux® kernel disables
X2APIC in this case and falls back to Advanced Programmable Interrupt Controller
(APIC), which can only enumerate a maximum of 255 (logical) cores.
If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
"enable", the following steps can be applied to the system to enable all
(logical) cores of the system:
* In the server BIOS, set IOMMU to "Enabled".
* When configuring the Grub boot loader, add the following argument for the
Linux kernel: `iommu=pt`
* Update Grub to use the modified configuration:
```shell
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```
* Reboot the system.
* Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`:
```none
[...]
[ 0.000000] Kernel command line: [...] iommu=pt
[...]
```
Once the system is properly configured, ROCm software can be
installed.
## System management
For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
{doc}`Quick-start (Linux)<rocm-install-on-linux:install/quick-start>`. To verify that the installation was
successful, refer to the
{doc}`post-install instructions<rocm-install-on-linux:install/post-install>` and
[system tools](../../reference/rocm-tools.md). Should verification
fail, consult the [System Debugging Guide](../system-debugging.md).
(mi100-hw-verification)=
### Hardware verification with ROCm
The AMD ROCm™ platform ships with tools to query the system structure. To query
the GPU hardware, the `rocm-smi` command is available. It can show available
GPUs in the system with their device ID and their respective firmware (or VBIOS)
versions:
![rocm-smi --showhw output on an 8*MI100 system](../../data/how-to/tuning-guides/tuning001.png "'rocm-smi --showhw' output on an 8*MI100 system")
Another important query is to show the system structure, the localization of the
GPUs in the system, and the fabric connections between the system components:
![mi100-smi-showtopo output on an 8*MI100 system](../../data/how-to/tuning-guides/tuning002.png "'mi100-smi-showtopo' output on an 8*MI100 system")
The previous command shows the system structure in four blocks:
* The first block of the output shows the distance between the GPUs similar to
what the `numactl` command outputs for the NUMA domains of a system. The
weight is a qualitative measure for the "distance" data must travel to reach
one GPU from another one. While the values do not carry a special (physical)
meaning, the higher the value the more hops are needed to reach the
destination from the source GPU.
* The second block has a matrix for the number of hops required to send data
from one GPU to another. For the GPUs in the local hive, this number is one,
while for the others it is three (one hop to leave the hive, one hop across
the processors, and one hop within the destination hive).
* The third block outputs the link types between the GPUs. This can either be
"XGMI" for AMD Infinity Fabric™ links or "PCIE" for PCIe Gen4 links.
* The fourth block reveals the localization of a GPU with respect to the NUMA
organization of the shared memory of the AMD EPYC™ processors.
To query the compute capabilities of the GPU devices, the `rocminfo` command is
available with the AMD ROCm™ platform. It lists specific details about the GPU
devices, including but not limited to the number of compute units, width of the
SIMD pipelines, memory information, and Instruction Set Architecture:
![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system")
For a complete list of architecture (LLVM target) names, refer to
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>` support.
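To read the LLVM target name directly from a system, one option is to filter the `rocminfo` output for the `gfx` identifier (a minimal sketch):

```shell
rocminfo | grep -i "gfx"
```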
### Testing inter-device bandwidth
{ref}`mi100-hw-verification` showed how the `rocm-smi --showtopo` command displays
the system structure and how the GPUs are located and connected within that
structure. For more details, the `rocm-bandwidth-test` tool can run benchmarks to
show the effective link bandwidth between the components of the system.
The ROCm Bandwidth Test program can be installed with the following
package-manager commands:
::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu
```shell
sudo apt install rocm-bandwidth-test
```
:::
:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL
```shell
sudo yum install rocm-bandwidth-test
```
:::
:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES
```shell
sudo zypper install rocm-bandwidth-test
```
:::
::::
Alternatively, the source code can be downloaded and built from
[source](https://github.com/ROCm/rocm_bandwidth_test).
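In many setups, simply running the tool without arguments produces the device list and the bandwidth matrices discussed below (a sketch):

```shell
# Run the default set of unidirectional and bidirectional benchmarks
rocm-bandwidth-test
```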
The output will list the available compute devices (CPUs and GPUs):
![rocm-bandwidth-test output fragment on an 8*MI100 system listing devices](../../data/how-to/tuning-guides/tuning004.png "'rocm-bandwidth-test' output fragment on an 8*MI100 system listing devices")
The output will also show a matrix that contains a "1" if a device can
communicate with another device (CPU or GPU) in the system, and it will show the
NUMA distance (similar to `rocm-smi`):
![rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device access matrix](../../data/how-to/tuning-guides/tuning005.png "'rocm-bandwidth-test' output fragment on an 8*MI100 system showing inter-device access matrix")
![rocm-bandwidth-test output fragment on an 8*MI100 system showing inter-device NUMA distance](../../data/how-to/tuning-guides/tuning006.png "'rocm-bandwidth-test' output fragment on an 8*MI100 system showing inter-device NUMA distance")
The output also contains the measured bandwidth for unidirectional and
bidirectional transfers between the devices (CPU and GPU):
![rocm-bandwidth-test output fragment on an 8*MI100 system showing uni- and bidirectional bandwidths](../../data/how-to/tuning-guides/tuning004.png "'rocm-bandwidth-test' output fragment on an 8*MI100 system showing uni- and bidirectional bandwidths")

View File

@@ -1,459 +0,0 @@
---
myst:
html_meta:
"description": "Learn about AMD Instinct MI200 system settings and performance tuning."
"keywords": "Instinct, MI200, microarchitecture, AMD, ROCm"
---
# AMD Instinct MI200 system optimization
## System settings
This chapter reviews system settings that are required to configure the system
for AMD Instinct MI250 accelerators and improve the performance of the GPUs. It
is advised to configure the system for the best possible host configuration
according to the *High Performance Computing (HPC) Tuning Guide for AMD EPYC
7003 Series Processors*.
Configure the system BIOS settings as explained in {ref}`mi200-bios-settings` and
enact the settings given below via the command line as explained in
{ref}`mi200-os-settings`:
* Core C states
* input-output memory management unit (IOMMU), if needed
(mi200-bios-settings)=
### System BIOS settings
For maximum MI250 GPU performance on systems with AMD EPYC™ 7003-series
processors (codename "Milan") and AMI System BIOS, the following configuration
of system BIOS settings has been validated. These settings must be used for the
qualification process and should be set as default values for the system BIOS.
Analogous settings for other non-AMI System BIOS providers could be set
similarly. For systems with Intel processors, some settings may not apply or be
available as listed in the following table.
```{list-table}
:header-rows: 1
:name: mi200-bios
*
- BIOS Setting Location
- Parameter
- Value
- Comments
*
- Advanced / PCI Subsystem Settings
- Above 4G Decoding
- Enabled
- GPU Large BAR Support
*
- Advanced / PCI Subsystem Settings
- SR-IOV Support
- Disabled
- Disable Single Root IO Virtualization
*
- AMD CBS / CPU Common Options
- Global C-state Control
- Auto
- Global C-States
*
- AMD CBS / CPU Common Options
- CCD/Core/Thread Enablement
- Accept
- Global C-States
*
- AMD CBS / CPU Common Options / Performance
- SMT Control
- Disable
- Global C-States
*
- AMD CBS / DF Common Options / Memory Addressing
- NUMA nodes per socket
- NPS 1,2,4
- NUMA Nodes (NPS)
*
- AMD CBS / DF Common Options / Memory Addressing
- Memory interleaving
- Auto
- Numa Nodes (NPS)
*
- AMD CBS / DF Common Options / Link
- 4-link xGMI max speed
- 18 Gbps
- Set AMD CPU xGMI speed to highest rate supported
*
- AMD CBS / NBIO Common Options
- IOMMU
- Disable
-
*
- AMD CBS / NBIO Common Options
- PCIe Ten Bit Tag Support
- Auto
-
*
- AMD CBS / NBIO Common Options
- Preferred IO
- Bus
-
*
- AMD CBS / NBIO Common Options
- Preferred IO Bus
- "Use lspci to find pci device id"
-
*
- AMD CBS / NBIO Common Options
- Enhanced Preferred IO Mode
- Enable
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Determinism Control
- Manual
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Determinism Slider
- Power
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- cTDP Control
- Manual
- Set cTDP to the maximum supported by the installed CPU
*
- AMD CBS / NBIO Common Options / SMU Common Options
- cTDP
- 280
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Package Power Limit Control
- Manual
- Set Package Power Limit to the maximum supported by the installed CPU
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Package Power Limit
- 280
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- xGMI Link Width Control
- Manual
- Set AMD CPU xGMI width to 16 bits
*
- AMD CBS / NBIO Common Options / SMU Common Options
- xGMI Force Link Width
- 2
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- xGMI Force Link Width Control
- Force
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- APBDIS
- 1
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- DF C-states
- Enabled
-
*
- AMD CBS / NBIO Common Options / SMU Common Options
- Fixed SOC P-state
- P0
-
*
- AMD CBS / UMC Common Options / DDR4 Common Options
- Enforce POR
- Accept
-
*
- AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
- Overclock
- Enabled
-
*
- AMD CBS / UMC Common Options / DDR4 Common Options / Enforce POR
- Memory Clock Speed
- 1600 MHz
- Set to max Memory Speed, if using 3200 MHz DIMMs
*
- AMD CBS / UMC Common Options / DDR4 Common Options / DRAM Controller
Configuration / DRAM Power Options
- Power Down Enable
- Disabled
- RAM Power Down
*
- AMD CBS / Security
- TSME
- Disabled
- Memory Encryption
```
#### NBIO link clock frequency
The NBIOs (4x per AMD EPYC™ processor) are the serializers/deserializers (also
known as "SerDes") that convert and prepare the I/O signals for the processor's
128 external I/O interface lanes (32 per NBIO).
LCLK (short for link clock frequency) controls the link speed of the internal
bus that connects the NBIO silicon with the data fabric. All data between the
processor and its PCIe lanes flows to the data fabric based on these LCLK
frequency settings. The link clock frequency of the NBIO components needs to be
forced to the maximum frequency for optimal PCIe performance.
For AMD EPYC™ 7003 series processors, configuring all NBIOs to be in "Enhanced
Preferred I/O" mode is sufficient to enable highest link clock frequency for the
NBIO components.
#### Memory configuration
For setting the memory addressing modes, especially
the number of NUMA nodes per socket/processor (NPS), follow the guidance of the
"High Performance Computing (HPC) Tuning Guide for AMD EPYC 7003 Series
Processors" to provide the optimal configuration for host side computation. For
most HPC workloads, NPS=4 is the recommended value.
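To verify which NPS configuration is currently in effect, one option is to inspect the NUMA topology from the operating system (a sketch, assuming `numactl` is installed):

```shell
numactl --hardware
```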
(mi200-os-settings)=
### Operating system settings
#### CPU core states - C-states
There are several core states (C-states) that an AMD EPYC CPU can idle within:
* C0: active. This is the active state while running an application.
* C1: idle
* C2: idle and power gated. This is a deeper sleep state and will have a
greater latency when moving back to the C0 state, compared to when the
CPU is coming out of C1.
Disabling C2 is important for running with a high performance, low-latency
network. To disable power-gating on all cores, run the following on Linux
systems:
```shell
cpupower idle-set -d 2
```
Note that the `cpupower` tool must be installed, as it is not part of the base
packages of most Linux® distributions. The package needed varies with the
respective Linux distribution.
::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu
```shell
sudo apt install linux-tools-common
```
:::
:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL
```shell
sudo yum install cpupowerutils
```
:::
:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES
```shell
sudo zypper install cpupower
```
:::
::::
#### AMD-IOPM-UTIL
This section applies to AMD EPYC™ 7002 processors to optimize advanced
Dynamic Power Management (DPM) in the I/O logic (see NBIO description above)
for performance. Certain I/O workloads may benefit from disabling this power
management. This utility disables DPM for all PCI-e root complexes in the
system and locks the logic into the highest performance operational mode.
Disabling I/O DPM will reduce the latency and/or improve the throughput of
low-bandwidth messages for PCI-e InfiniBand NICs and GPUs. Other workloads
with low-bandwidth bursty PCI-e I/O characteristics may benefit as well if
multiple such PCI-e devices are installed in the system.
The actions of the utility do not persist across reboots. There is no need to
change any existing firmware settings when using this utility. The "Preferred
I/O" and "Enhanced Preferred I/O" settings should remain unchanged at enabled.
```{tip}
The recommended method to use the utility is either to create a system
start-up script, for example, a one-shot `systemd` service unit, or run the
utility when starting up a job scheduler on the system. The installer
packages (see
[Power Management Utility](https://developer.amd.com/iopm-utility/)) will
create and enable a `systemd` service unit for you. This service unit is
configured to run in one-shot mode. This means that even when the service
unit runs as expected, the status of the service unit will show inactive.
This is the expected behavior when the utility runs normally. If the service
unit shows failed, the utility did not run as expected. The output in either
case can be shown with the `systemctl status` command.
Stopping the service unit has no effect since the utility does not leave
anything running. To undo the effects of the utility, disable the service
unit with the `systemctl disable` command and reboot the system.
The utility does not have any command-line options, and it must be run with
super-user permissions.
```
#### Systems with 256 CPU threads - IOMMU configuration
For systems that have 256 logical CPU cores or more (e.g., 64-core AMD EPYC™
7763 in a dual-socket configuration and SMT enabled), setting the input-output
memory management unit (IOMMU) configuration to "disabled" can limit the number
of available logical cores to 255. The reason is that the Linux® kernel disables
X2APIC in this case and falls back to Advanced Programmable Interrupt Controller
(APIC), which can only enumerate a maximum of 255 (logical) cores.
If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
"enable", the following steps can be applied to the system to enable all
(logical) cores of the system:
* In the server BIOS, set IOMMU to "Enabled".
* When configuring the Grub boot loader, add the following argument for the
Linux kernel: `iommu=pt`
* Update Grub to use the modified configuration:
```shell
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```
* Reboot the system.
* Verify IOMMU passthrough mode by inspecting the kernel log via `dmesg`:
```none
[...]
[ 0.000000] Kernel command line: [...] iommu=pt
[...]
```
Once the system is properly configured, ROCm software can be
installed.
## System management
For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
{doc}`Quick-start (Linux)<rocm-install-on-linux:install/quick-start>`. To verify that the
installation was successful, refer to the
{doc}`post-install instructions<rocm-install-on-linux:install/post-install>` and
[system tools](../../reference/rocm-tools.md). Should verification
fail, consult the [System Debugging Guide](../system-debugging.md).
(mi200-hw-verification)=
### Hardware verification with ROCm
The AMD ROCm™ platform ships with tools to query the system structure. To query
the GPU hardware, the `rocm-smi` command is available. It can show available
GPUs in the system with their device ID and their respective firmware (or VBIOS)
versions:
![rocm-smi --showhw output on an 8*MI200 system](../../data/how-to/tuning-guides/tuning008.png "'rocm-smi --showhw' output on an 8*MI200 system")
To see the system structure, the localization of the GPUs in the system, and the
fabric connections between the system components, use:
![rocm-smi --showtopo output on an 8*MI200 system](../../data/how-to/tuning-guides/tuning009.png "'rocm-smi --showtopo' output on an 8*MI200 system")
* The first block of the output shows the distance between the GPUs similar to
what the `numactl` command outputs for the NUMA domains of a system. The
weight is a qualitative measure for the "distance" data must travel to reach
one GPU from another one. While the values do not carry a special (physical)
meaning, the higher the value the more hops are needed to reach the
destination from the source GPU.
* The second block has a matrix named "Hops between two GPUs", where 1 means the
two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the
same CPU socket and GPU communications will go through the CPU, and 3 means
both GPUs are linked to different CPU sockets so communications will go
through both CPU sockets. This number is one for all GPUs in this case since
they are all connected to each other through the Infinity Fabric links.
* The third block outputs the link types between the GPUs. This can either be
"XGMI" for AMD Infinity Fabric links or "PCIE" for PCIe Gen4 links.
* The fourth block reveals the localization of a GPU with respect to the NUMA
organization of the shared memory of the AMD EPYC processors.
To query the compute capabilities of the GPU devices, use the `rocminfo` command. It
lists specific details about the GPU devices, including but not limited to the
number of compute units, width of the SIMD pipelines, memory information, and
Instruction Set Architecture (ISA):
![rocminfo output fragment on an 8*MI200 system](../../data/how-to/tuning-guides/tuning010.png "'rocminfo' output fragment on an 8*MI200 system")
For a complete list of architecture (LLVM target) names, refer to GPU OS Support for
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>`.
### Testing inter-device bandwidth
{ref}`mi200-hw-verification` showed how the `rocm-smi --showtopo` command displays
the system structure and how the GPUs are located and connected within that
structure. For more details, the `rocm-bandwidth-test` tool can run benchmarks to
show the effective link bandwidth between the components of the system.
The ROCm Bandwidth Test program can be installed with the following
package-manager commands:
::::{tab-set}
:::{tab-item} Ubuntu
:sync: ubuntu
```shell
sudo apt install rocm-bandwidth-test
```
:::
:::{tab-item} Red Hat Enterprise Linux
:sync: RHEL
```shell
sudo yum install rocm-bandwidth-test
```
:::
:::{tab-item} SUSE Linux Enterprise Server
:sync: SLES
```shell
sudo zypper install rocm-bandwidth-test
```
:::
::::
Alternatively, the source code can be downloaded and built from
[source](https://github.com/ROCm/rocm_bandwidth_test).
The output will list the available compute devices (CPUs and GPUs), including
their device ID and PCIe ID:
![rocm-bandwidth-test output fragment on an 8*MI200 system listing devices](../../data/how-to/tuning-guides/tuning011.png "'rocm-bandwidth-test' output fragment on an 8*MI200 system listing devices")
The output will also show a matrix that contains a "1" if a device can
communicate with another device (CPU or GPU) in the system, and it will show the
NUMA distance (similar to `rocm-smi`):
!['rocm-bandwidth-test' output fragment on an 8*MI200 system showing inter-device access matrix and NUMA distances](../../data/how-to/tuning-guides/tuning012.png "'rocm-bandwidth-test' output fragment on an 8*MI200 system showing inter-device access matrix and NUMA distances")
The output also contains the measured bandwidth for unidirectional and
bidirectional transfers between the devices (CPU and GPU):
!['rocm-bandwidth-test' output fragment on an 8*MI200 system showing uni- and bidirectional bandwidths](../../data/how-to/tuning-guides/tuning013.png "'rocm-bandwidth-test' output fragment on an 8*MI200 system showing uni- and bidirectional bandwidths")

View File

@@ -1,452 +0,0 @@
.. meta::
:description: Learn about AMD Instinct MI300A system settings and performance tuning.
:keywords: AMD, Instinct, MI300A, HPC, tuning, BIOS settings, NBIO, ROCm,
environment variable, performance, accelerator, GPU, EPYC, GRUB,
operating system
***************************************************
AMD Instinct MI300A system optimization
***************************************************
This topic discusses the operating system settings and system management commands for
the AMD Instinct MI300A accelerator and can help you optimize its performance.
System settings
========================================
This section reviews the system settings required to configure a MI300A SOC system and
optimize its performance.
The MI300A system-on-a-chip (SOC) design requires you to review and potentially adjust your OS configuration as explained in
the :ref:`operating-system-settings-label` section. These settings are critical for
performance because the OS on an accelerated processing unit (APU) is responsible for memory management across the CPU and GPU accelerators.
In the APU memory model, system settings are available to limit GPU memory allocation.
This limit is important because legacy software often determines the
amount of allowable memory at start-up time
by probing discrete memory until it is exhausted. If left unchecked, this practice
can starve the OS of resources.
System BIOS settings
-----------------------------------
System BIOS settings are preconfigured for optimal performance from the
platform vendor. This means that you do not need to adjust these settings
when using MI300A. If you have any questions regarding these settings,
contact your MI300A platform vendor.
GRUB settings
-----------------------------------
The ``/etc/default/grub`` file is used to configure the GRUB bootloader on modern Linux distributions.
Linux uses the string assigned to ``GRUB_CMDLINE_LINUX`` in this file as
its command line parameters during boot.
Appending strings using the Linux command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It is recommended that you append the following string to ``GRUB_CMDLINE_LINUX``.
``pci=realloc=off``
This setting disables the automatic reallocation
of PCI resources, so Linux is able to unambiguously detect all GPUs on the
MI300A-based system. It's used when Single Root I/O Virtualization (SR-IOV) Base
Address Registers (BARs) have not been allocated by the BIOS. This can help
avoid potential issues with certain hardware configurations.
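As an illustration only, the resulting line in ``/etc/default/grub`` might look like the following; preserve any parameters that already exist on your system.

.. code-block:: shell

   GRUB_CMDLINE_LINUX="pci=realloc=off"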
Validating the IOMMU setting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IOMMU is a system-specific IO mapping mechanism for DMA mapping
and isolation. IOMMU is turned off by default in the operating system settings
for optimal performance.
To verify IOMMU is turned off, first install the ``acpica-tools`` package using your
package manager.
.. code-block:: shell
sudo apt install acpica-tools
Then confirm that the following commands do not return any results.
.. code-block:: shell
sudo acpidump | grep IVRS
sudo acpidump | grep DMAR
Update GRUB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use this command to update GRUB to use the modified configuration:
.. code-block:: shell
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
On some Red Hat-based systems, the ``grub2-mkconfig`` command might not be available. In this case,
use ``grub-mkconfig`` instead. Verify that you have the
correct version by using the following command:
.. code-block:: shell
grub-mkconfig --version
.. _operating-system-settings-label:
Operating system settings
-----------------------------------
The operating system provides several options to customize and tune performance. For more information
about supported operating systems, see the :doc:`Compatibility matrix <../../compatibility/compatibility-matrix>`.
If you are using a distribution other than RHEL or SLES, the latest Linux kernel is recommended.
Performance considerations for Zen 4, the CPU core architecture in the MI300A,
require a Linux kernel version 5.18 or higher.
This section describes performance-based settings.
* **Enable transparent huge pages**
To enable transparent huge pages, use one of the following methods:
* From the command line, run the following command:
.. code-block:: shell
echo always > /sys/kernel/mm/transparent_hugepage/enabled
* Set the Linux kernel parameter ``transparent_hugepage`` as follows in the
relevant ``.cfg`` file for your system.
.. code-block:: cfg
transparent_hugepage=always
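To confirm that transparent huge pages are active, the setting can be read back; the active value is shown in brackets.

.. code-block:: shell

   cat /sys/kernel/mm/transparent_hugepage/enabled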
* **Increase the amount of allocatable memory**
By default, when using a device allocator via HIP, it is only possible to allocate 96 GiB out of
a possible 128 GiB of memory on the MI300A. This limitation does not affect host allocations.
To increase the available system memory, load the ``amdttm`` module with new values for
``pages_limit`` and ``page_pool_size``. These numbers correspond to the number of 4 KiB pages of memory.
To make 128 GiB of memory available across all four devices, for a total amount of 512 GiB,
set ``pages_limit`` and ``page_pool_size`` to ``134217728``. For a two-socket system, divide these values
by two. After setting these values, reload the AMDGPU driver.
First, review the current settings using this shell command:
.. code-block:: shell
cat /sys/module/amdttm/parameters/pages_limit
To set the amount of allocatable memory to all available memory on all four APU devices, run these commands:
.. code-block:: shell
sudo modprobe amdttm pages_limit=134217728 page_pool_size=134217728
sudo modprobe amdgpu
These settings can also be hardcoded in the ``/etc/modprobe.d/amdttm.conf`` file or specified as boot
parameters.
To use the hardcoded method,
the filesystem must already be set up when the kernel driver is loaded.
To hardcode the settings, add the following lines to ``/etc/modprobe.d/amdttm.conf``:
.. code-block:: shell
options amdttm pages_limit=134217728
options amdttm page_pool_size=134217728
If the filesystem is not already set up when the kernel driver is loaded, then the options
must be specified as boot parameters. To specify the settings
as boot parameters when loading the kernel, use this example as a guideline:
.. code-block:: shell
vmlinux-[...] amdttm.pages_limit=134217728 amdttm.page_pool_size=134217728 [...]
To verify the new settings and confirm the change, use this command:
.. code-block:: shell
cat /sys/module/amdttm/parameters/pages_limit
.. note::
The system settings for ``pages_limit`` and ``page_pool_size`` are calculated by multiplying the
per-APU limit of 4 KiB pages, which is ``33554432``, by the number of APUs on the node. The limit for a system with
two APUs is ``33554432 x 2``, or ``67108864``.
This means the ``modprobe`` command for two APUs is ``sudo modprobe amdttm pages_limit=67108864 page_pool_size=67108864``.
* **Limit the maximum and single memory allocations on the GPU**
Many AI-related applications were originally developed on discrete GPUs. Some of these applications
have fixed problem sizes associated with the targeted GPU size, and some attempt to determine the
system memory limits by allocating chunks until failure. These techniques can cause issues in an
APU with a shared space.
To allow these applications to run on the APU without further changes,
ROCm supports a default memory policy that restricts the percentage of the GPU that can be allocated.
The following environment variables control this feature:
* ``GPU_MAX_ALLOC_PERCENT``
* ``GPU_SINGLE_ALLOC_PERCENT``
These settings can be added to the default shell environment or the user environment. The effect of the memory allocation
settings varies depending on the system, configuration, and task. They might require adjustment, especially when performing GPU benchmarks. Setting these values to ``100``
lets the GPU allocate any amount of free memory. However, the risk of encountering
an operating system out-of-memory (OOM) condition increases when almost
all the available memory is used.
Before setting either of these items to 100 percent,
carefully consider the expected CPU workload allocation and the anticipated OS usage.
For instance, if the OS requires 8 GB on a 128 GB system, setting these
variables to ``100`` authorizes a single
workload to allocate up to 120 GB of memory. Unless the system has swap space configured,
any over-allocation attempts will be handled by the OS's OOM policies.
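For example, to remove the cap for the current shell session (an illustrative setting; choose values appropriate for your workload and the OS headroom discussed above):
.. code-block:: shell
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100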
* **Disable NUMA (Non-uniform memory access) balancing**
ROCm uses information from the compiled application to ensure an affinity exists
between the GPU agent processes and their CPU hosts or co-processing agents.
Because OS threads, including memory management threads, also run on the APU,
the default kernel NUMA policies can
adversely impact workload performance without additional tuning.
.. note::
At the kernel level, ``pci=realloc`` can also be set to ``off`` as an additional tuning measure.
To disable NUMA balancing, use one of the following methods:
* From the command line, run the following command:
.. code-block:: shell
echo 0 > /proc/sys/kernel/numa_balancing
* Set the following Linux kernel parameters in the
relevant ``.cfg`` file for your system.
.. code-block:: cfg
pci=realloc=off numa_balancing=disable
* **Enable compaction**
Compaction is necessary for proper MI300A operation because the APU dynamically shares memory
between the CPU and GPU. Compaction can be performed proactively as a background activity,
which reduces allocation costs, or it can be deferred until allocation time.
Without compaction, the MI300A application performance eventually degrades as fragmentation increases.
In RHEL distributions, compaction is disabled by default. In Ubuntu, it's enabled by default.
To enable compaction, enter the following commands using the command line:
.. code-block:: shell
echo 20 > /proc/sys/vm/compaction_proactiveness
echo 1 > /proc/sys/vm/compact_unevictable_allowed
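These values do not persist across reboots. Assuming your distribution applies ``sysctl`` drop-in files at boot, one way to make them persistent is a configuration file such as the following (the file name is arbitrary):
.. code-block:: cfg
# /etc/sysctl.d/90-compaction.conf (example file name)
vm.compaction_proactiveness = 20
vm.compact_unevictable_allowed = 1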
.. _mi300a-processor-affinity:
* **Change affinity of ROCm helper threads**
Changing the affinity prevents internal ROCm threads from having their CPU core affinity mask
set to all CPU cores available. With this setting, the threads inherit their parent's
CPU core affinity mask. Before adjusting this setting, ensure you thoroughly understand
your system topology and how the application, runtime environment, and batch system
set the thread-to-core affinity. If you have any questions regarding this setting,
contact your MI300A platform vendor or the AMD support team.
To enable this setting, enter the following command:
.. code-block:: shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
* **CPU core states (C-states)**
The system BIOS handles these settings for the MI300A.
They don't need to be configured in the operating system.
System management
========================================
For a complete guide on installing, managing, and uninstalling ROCm on Linux, see
:doc:`Quick-start (Linux)<rocm-install-on-linux:install/quick-start>`. To verify that the
installation was successful, see the
:doc:`Post-installation instructions<rocm-install-on-linux:install/post-install>` and
:doc:`ROCm tools <../../reference/rocm-tools>` guides. If verification
fails, consult the :doc:`System debugging guide <../system-debugging>`.
.. _hw-verification-rocm-label:
Hardware verification with ROCm
-----------------------------------
ROCm includes tools to query the system structure. To query
the GPU hardware, use the ``rocm-smi`` command.
``rocm-smi`` reports statistics per socket, so the power results combine CPU and GPU utilization.
In an idle state on a multi-socket system, some power imbalances are expected because
the distribution of OS threads can keep some APU devices at higher power states.
.. note::
The MI300A VRAM settings show as ``N/A``.
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-output.png
:alt: Output from the rocm-smi command
The ``rocm-smi --showhw`` command shows the available system
GPUs and their device ID and firmware details.
In the MI300A hardware settings, the system BIOS handles the UMC RAS. The
ROCm-supplied GPU driver does not manage this setting.
This results in a value of ``DISABLED`` for the ``UMC RAS`` setting.
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-showhw-output.png
:alt: Output from the ``rocm-smi showhw`` command
To see the system structure, the localization of the GPUs in the system, and the
fabric connections between the system components, use the ``rocm-smi --showtopo`` command.
* The first block of the output shows the distance between the GPUs. The weight is a qualitative
measure of the “distance” data must travel to reach one GPU from another.
While the values do not have a precise physical meaning, the higher the value, the
more hops are required to reach the destination from the source GPU.
* The second block contains a matrix named “Hops between two GPUs”, where ``1`` means
the two GPUs are directly connected with XGMI, ``2`` means both GPUs are linked to the
same CPU socket and GPU communications go through the CPU, and ``3`` means
both GPUs are linked to different CPU sockets so communications go
through both CPU sockets.
* The third block indicates the link types between the GPUs. This can either be
``XGMI`` for AMD Infinity Fabric links or ``PCIE`` for PCIe Gen4 links.
* The fourth block reveals the localization of a GPU with respect to the NUMA organization
of the shared memory of the AMD EPYC processors.
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-smi-showtopo-output.png
:alt: Output from the ``rocm-smi showtopo`` command
Testing inter-device bandwidth
-----------------------------------
The ``rocm-smi --showtopo`` command from the :ref:`hw-verification-rocm-label` section
displays the system structure and shows how the GPUs are located and connected within this
structure. For more information, use the :doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`, which can run benchmarks to
show the effective link bandwidth between the system components.
For information on how to install the ROCm Bandwidth Test, see :doc:`Building the environment <rocm_bandwidth_test:install/install>`.
The output lists the available compute devices (CPUs and GPUs), including
their device ID and PCIe ID:
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-bandwidth-test-output.png
:alt: Output from the rocm-bandwidth-test utility
It also displays the measured bandwidth for unidirectional and
bidirectional transfers between the devices on the CPU and GPU:
.. image:: ../../data/how-to/tuning-guides/mi300a-rocm-peak-bandwidth-output.png
:alt: Bandwidth information from the rocm-bandwidth-test utility
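Once the ROCm Bandwidth Test is installed, running the tool without arguments executes its default set of benchmarks; the exact bandwidth figures will vary by platform and configuration:
.. code-block:: shell
rocm-bandwidth-test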
Abbreviations
=============
APBDIS
Algorithmic Performance Boost Disable
APU
Accelerated processing unit
BAR
Base Address Register
BIOS
Basic Input/Output System
CBS
Common BIOS Settings
CCD
Compute Core Die
CDNA
Compute DNA
CLI
Command Line Interface
CPU
Central Processing Unit
cTDP
Configurable Thermal Design Power
DF
Data Fabric
DMA
Direct Memory Access
GPU
Graphics Processing Unit
GRUB
Grand Unified Bootloader
HBM
High Bandwidth Memory
HPC
High Performance Computing
IOMMU
Input-Output Memory Management Unit
ISA
Instruction Set Architecture
NBIO
North Bridge Input/Output
NUMA
Non-Uniform Memory Access
OOM
Out of Memory
PCI
Peripheral Component Interconnect
PCIe
PCI Express
POR
Power-On Reset
RAS
Reliability, availability and serviceability
SMI
System Management Interface
SMT
Simultaneous Multi-threading
SOC
System On Chip
SR-IOV
Single Root I/O Virtualization
TSME
Transparent Secure Memory Encryption
UMC
Unified Memory Controller
VRAM
Video RAM
xGMI
Inter-chip Global Memory Interconnect

View File

@@ -1,836 +0,0 @@
.. meta::
:description: Learn about AMD Instinct MI300X system settings and performance tuning.
:keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
environment variable, performance, accelerator, GPU, EPYC, GRUB,
operating system
***************************************
AMD Instinct MI300X system optimization
***************************************
This document covers essential system settings and management practices required
to configure your system effectively. Ensuring that your system operates
correctly is the first step before delving into advanced performance tuning.
The main topics of discussion in this document are:
* :ref:`System settings <mi300x-system-settings>`
* :ref:`System BIOS settings <mi300x-bios-settings>`
* :ref:`GRUB settings <mi300x-grub-settings>`
* :ref:`Operating system settings <mi300x-os-settings>`
* :ref:`System management <mi300x-system-management>`
.. _mi300x-system-settings:
System settings
===============
This guide discusses system settings that are required to configure your system
for AMD Instinct™ MI300X accelerators. It is important to ensure a system is
functioning correctly before trying to improve its overall performance. In this
section, the settings discussed mostly ensure proper functionality of your
Instinct-based system. Some settings discussed are known to improve performance
for most applications running on a MI300X system. See
:doc:`../rocm-for-ai/inference-optimization/workload` for how to improve performance for
specific applications or workloads.
.. _mi300x-bios-settings:
System BIOS settings
--------------------
AMD EPYC 9004-based systems
^^^^^^^^^^^^^^^^^^^^^^^^^^^
For maximum MI300X GPU performance on systems with AMD EPYC™ 9004-series
processors and AMI System BIOS, the following configuration
of system BIOS settings has been validated. These settings must be used for the
qualification process and should be set as default values in the system BIOS.
Analogous settings for other non-AMI System BIOS providers could be set
similarly. For systems with Intel processors, some settings may not apply or be
available as listed in the following table.
Each row in the table details a setting but the specific location within the
BIOS setup menus may be different, or the option may not be present.
.. list-table::
:header-rows: 1
* - BIOS setting location
- Parameter
- Value
- Comments
* - Advanced / PCI subsystem settings
- Above 4G decoding
- Enabled
- GPU large BAR support.
* - Advanced / PCI subsystem settings
- SR-IOV support
- Enabled
- Enable single root IO virtualization.
* - AMD CBS / CPU common options
- Global C-state control
- Auto
- Global C-states -- do not disable this menu item.
* - AMD CBS / CPU common options
- CCD/Core/Thread enablement
- Accept
- May be necessary to enable the SMT control menu.
* - AMD CBS / CPU common options / performance
- SMT control
- Disable
- Set to Auto if the primary application is not compute-bound.
* - AMD CBS / DF common options / memory addressing
- NUMA nodes per socket
- Auto
- Auto = NPS1. At this time, the other options for NUMA nodes per socket
should not be used.
* - AMD CBS / DF common options / memory addressing
- Memory interleaving
- Auto
- Depends on NUMA nodes (NPS) setting.
* - AMD CBS / DF common options / link
- 4-link xGMI max speed
- 32 Gbps
- Auto results in the speed being set to the lower of the max speed the
motherboard is designed to support and the max speed of the CPU in use.
* - AMD CBS / NBIO common options
- IOMMU
- Enabled
-
* - AMD CBS / NBIO common options
- PCIe ten bit tag support
- Auto
-
* - AMD CBS / NBIO common options / SMU common options
- Determinism control
- Manual
-
* - AMD CBS / NBIO common options / SMU common options
- Determinism slider
- Power
-
* - AMD CBS / NBIO common options / SMU common options
- cTDP control
- Manual
- Set cTDP to the maximum supported by the installed CPU.
* - AMD CBS / NBIO common options / SMU common options
- cTDP
- 400
- Value in watts.
* - AMD CBS / NBIO common options / SMU common options
- Package power limit control
- Manual
- Set package power limit to the maximum supported by the installed CPU.
* - AMD CBS / NBIO common options / SMU common options
- Package power limit
- 400
- Value in watts.
* - AMD CBS / NBIO common options / SMU common options
- xGMI link width control
- Manual
- Enables manual control of the xGMI link width.
* - AMD CBS / NBIO common options / SMU common options
- xGMI force width control
- Force
-
* - AMD CBS / NBIO common options / SMU common options
- xGMI force link width
- 2
- * 0: Force xGMI link width to x2
* 1: Force xGMI link width to x8
* 2: Force xGMI link width to x16
* - AMD CBS / NBIO common options / SMU common options
- xGMI max speed
- Auto
- Auto results in the speed being set to the lower of the max speed the
motherboard is designed to support and the max speed of the CPU in use.
* - AMD CBS / NBIO common options / SMU common options
- APBDIS
- 1
- Disable DF (data fabric) P-states
* - AMD CBS / NBIO common options / SMU common options
- DF C-states
- Auto
-
* - AMD CBS / NBIO common options / SMU common options
- Fixed SOC P-state
- P0
-
* - AMD CBS / security
- TSME
- Disabled
- Memory encryption
.. _mi300x-grub-settings:
GRUB settings
-------------
In any modern Linux distribution, the ``/etc/default/grub`` file is used to
configure GRUB. In this file, the string assigned to ``GRUB_CMDLINE_LINUX`` is
the command line parameters that Linux uses during boot.
Appending strings via Linux command line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It is recommended to append the following strings in ``GRUB_CMDLINE_LINUX``.
``pci=realloc=off``
With this setting, Linux can unambiguously detect all GPUs of the
MI300X-based system because it disables the automatic reallocation
of PCI resources. It's used when Single Root I/O Virtualization (SR-IOV) Base
Address Registers (BARs) have not been allocated by the BIOS. This can help
avoid potential issues with certain hardware configurations.
``iommu=pt``
The ``iommu=pt`` setting enables IOMMU pass-through mode. When in pass-through
mode, the adapter does not need to use DMA translation to the memory, which can
improve performance.
IOMMU is a system specific IO mapping mechanism and can be used for DMA mapping
and isolation. This can be beneficial for virtualization and device assignment
to virtual machines. It is recommended to enable IOMMU support.
For a system that has AMD host CPUs add this to ``GRUB_CMDLINE_LINUX``:
.. code-block:: text
iommu=pt
Otherwise, if the system has Intel host CPUs add this instead to
``GRUB_CMDLINE_LINUX``:
.. code-block:: text
intel_iommu=on iommu=pt
``modprobe.blacklist=amdgpu``
For some system configurations, the ``amdgpu`` driver needs to be blocked during kernel initialization to avoid an issue where, after boot, the GPUs are not listed when running ``rocm-smi`` or ``amd-smi``.
Alternatively, configuring the AMD-recommended system-optimized BIOS settings might remove the need for this setting; however, some manufacturers and users might not implement those recommended settings.
If you experience the mentioned issue, then add this to ``GRUB_CMDLINE_LINUX``:
.. code-block:: text
modprobe.blacklist=amdgpu
After the change, the ``amdgpu`` module must be loaded to support the ROCm framework
software tools and utilities. Run the following command to load the ``amdgpu`` module:
.. code-block:: text
sudo modprobe amdgpu
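Putting the preceding recommendations together, the ``GRUB_CMDLINE_LINUX`` entry in ``/etc/default/grub`` might look like the following sketch for a system with AMD host CPUs. Keep any options your distribution already sets, and include ``modprobe.blacklist=amdgpu`` only if you encounter the issue described above:
.. code-block:: text
GRUB_CMDLINE_LINUX="pci=realloc=off iommu=pt"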
Update GRUB
-----------
Update GRUB to use the modified configuration:
.. code-block:: shell
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
On some Debian systems, the ``grub2-mkconfig`` command may not be available. Instead,
check for the presence of ``grub-mkconfig``. Additionally, verify that you have the
correct version by using the following command:
.. code-block:: shell
grub-mkconfig --version
.. _mi300x-os-settings:
Operating system settings
-------------------------
CPU core states (C-states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
There are several core states (C-states) that an AMD EPYC CPU can idle within:
* **C0**: active. This is the active state while running an application.
* **C1**: idle. This state consumes less power compared to C0, but can quickly
return to the active state (C0) with minimal latency.
* **C2**: idle and power-gated. This is a deeper sleep state and will have greater
latency when moving back to the active (C0) state as compared to when the CPU
is coming out of C1.
Disabling C2 is important for running with a high performance, low-latency
network. To disable the C2 state, install the ``cpupower`` tool using your Linux
distribution's package manager. ``cpupower`` is not a base package in most Linux
distributions. The specific package to be installed varies per Linux
distribution.
.. tab-set::
.. tab-item:: Ubuntu
:sync: ubuntu
.. code-block:: shell
sudo apt install linux-tools-common
.. tab-item:: RHEL
:sync: rhel
.. code-block:: shell
sudo yum install cpupowerutils
.. tab-item:: SLES
:sync: sles
.. code-block:: shell
sudo zypper install cpupower
Now, to disable power-gating (the C2 state) on all cores, run the following
command:
.. code-block:: shell
cpupower idle-set -d 2
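To confirm that the state was disabled, you can query the cpuidle sysfs interface for a representative core. A value of ``1`` means idle state index 2 (C2, consistent with the command above) is disabled:
.. code-block:: shell
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/disable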
`/proc` and `/sys` file system settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. _mi300x-disable-numa:
Disable NUMA auto-balancing
'''''''''''''''''''''''''''
The NUMA balancing feature allows the OS to scan memory and attempt to migrate
to a DIMM that is logically closer to the cores accessing it. This causes an
overhead because the OS is second-guessing your NUMA allocations but may be
useful if the NUMA locality access is very poor. Applications can therefore, in
general, benefit from disabling NUMA balancing; however, there are workloads where
doing so is detrimental to performance. Test this setting
by toggling the ``numa_balancing`` value and running the application; compare
the performance of one run with this set to ``0`` and another run with this to
``1``.
Run the command ``cat /proc/sys/kernel/numa_balancing`` to check the current
NUMA (Non-Uniform Memory Access) settings. Output ``0`` indicates this
setting is disabled. If there is no output or the output is ``1``, run the command
``sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'`` to disable it.
For these settings, the ``env_check.sh`` script automates setting, resetting,
and checking your environments. Find the script at
`<https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh>`__.
Run the script as follows to set or reset the settings:
``./env_check.sh [set/reset/check]``
.. tip::
Use ``./env_check.sh -h`` for help info.
Automate disabling NUMA auto-balance using Cron
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :ref:`mi300x-disable-numa` section describes the command to disable NUMA
auto-balance. To automate the command with Cron, follow these steps:
#. Edit the ``crontab`` configuration file for the root user:
.. code-block:: shell
sudo crontab -e
#. Add the following Cron entry so the command runs at every reboot:
.. code-block:: shell
@reboot sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
#. Save the file and exit the text editor.
#. Optionally, restart the system to apply changes by issuing ``sudo reboot``.
#. Verify your new configuration.
.. code-block::
cat /proc/sys/kernel/numa_balancing
The ``/proc/sys/kernel/numa_balancing`` file controls NUMA balancing in the
Linux kernel. If the value in this file is set to ``0``, the NUMA balancing
is disabled. If the value is set to ``1``, NUMA balancing is enabled.
.. note::
Disabling NUMA balancing should be done cautiously and for
specific reasons, such as performance optimization or addressing
particular issues. Always test the impact of disabling NUMA balancing in
a controlled environment before applying changes to a production system.
.. _mi300x-env-vars:
Environment variables
^^^^^^^^^^^^^^^^^^^^^
HIP provides the environment variable ``HIP_FORCE_DEV_KERNARG``. Setting it to ``1``
places HIP kernel arguments directly in device memory to reduce the
latency of accessing those kernel arguments. This can improve latency by 2 to
3 µs for some kernels.
It is recommended to set the following environment variable:
.. code-block:: shell
export HIP_FORCE_DEV_KERNARG=1
.. note::
This is the default option as of ROCm 6.2.
Change affinity of ROCm helper threads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This change prevents internal ROCm threads from having their CPU core affinity mask
set to all CPU cores available. With this setting, the threads inherit their parent's
CPU core affinity mask. If you have any questions regarding this setting,
contact your MI300X platform vendor. To enable this setting, enter the following command:
.. code-block:: shell
export HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0
IOMMU configuration -- systems with 256 CPU threads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For systems that have 256 logical CPU cores or more, setting the input-output
memory management unit (IOMMU) configuration to ``disabled`` can limit the
number of available logical cores to 255. The reason is that the Linux kernel
disables X2APIC in this case and falls back to Advanced Programmable Interrupt
Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.
If SMT is enabled by setting ``CCD/Core/Thread Enablement > SMT Control`` to
``enable``, apply the following steps to make all
logical cores of the system available:
#. In the server BIOS, set IOMMU to ``Enabled``.
#. When configuring the GRUB boot loader, add the following argument for the Linux kernel: ``iommu=pt``.
#. Update GRUB.
#. Reboot the system.
#. Verify IOMMU passthrough mode by inspecting the kernel log via ``dmesg``:
.. code-block::
dmesg | grep iommu
.. code-block:: shell
[...]
[ 0.000000] Kernel command line: [...] iommu=pt
[...]
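After rebooting, you can also confirm that more than 255 logical cores are visible to the operating system:
.. code-block:: shell
nproc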
Once the system is properly configured, ROCm software can be
:doc:`installed <rocm-install-on-linux:index>`.
.. _mi300x-system-management:
System management
=================
To optimize system performance, it's essential to first understand the existing
system configuration parameters and settings. ROCm offers several CLI tools that
can provide system-level information, offering valuable insights for
optimizing user applications.
For a complete guide on how to install, manage, or uninstall ROCm on Linux, refer to
:doc:`rocm-install-on-linux:install/quick-start`. For verifying that the
installation was successful, refer to the
:doc:`rocm-install-on-linux:install/post-install`.
Should verification fail, consult :doc:`/how-to/system-debugging`.
.. _mi300x-hardware-verification-with-rocm:
Hardware verification with ROCm
-------------------------------
The ROCm platform provides tools to query the system structure. These include
:ref:`ROCm SMI <mi300x-rocm-smi>` and :ref:`ROCm Bandwidth Test <mi300x-bandwidth-test>`.
.. _mi300x-rocm-smi:
ROCm SMI
^^^^^^^^
To query your GPU hardware, use the ``rocm-smi`` command. ROCm SMI lists
GPUs available to your system -- with their device ID and their respective
firmware (or VBIOS) versions.
The following screenshot shows that all 8 GPUs of MI300X are recognized by ROCm.
Application performance could otherwise be suboptimal if, for example, only 5 out
of the 8 GPUs were recognized.
.. image:: ../../data/how-to/tuning-guides/rocm-smi-showhw.png
:align: center
:alt: ``rocm-smi --showhw`` output
To see the system structure, the localization of the GPUs in the system, and the
fabric connections between the system components, use the command
``rocm-smi --showtopo``.
.. image:: ../../data/how-to/tuning-guides/rocm-smi-showtopo.png
:align: center
:alt: ``rocm-smi --showtopo`` output
The first block of the output shows the distance between the GPUs similar to
what the ``numactl`` command outputs for the NUMA domains of a system. The
weight is a qualitative measure for the “distance” data must travel to reach one
GPU from another. While the values do not carry a precise physical
meaning, the higher the value, the more hops are needed to reach the destination
from the source GPU. This information has performance implications for a
GPU-based application that moves data among GPUs. You can choose a minimum
distance among GPUs to be used to make the application more performant.
The second block has a matrix named *Hops between two GPUs*, where:
* ``1`` means the two GPUs are directly connected with xGMI,
* ``2`` means both GPUs are linked to the same CPU socket and GPU communications
will go through the CPU, and
* ``3`` means both GPUs are linked to different CPU sockets so communications will
go through both CPU sockets. In this case, the number is ``1`` for all GPU pairs
because they are all connected to each other through the Infinity Fabric links.
The third block outputs the link types between the GPUs. This can either be
``XGMI`` for AMD Infinity Fabric links or ``PCIE`` for PCIe Gen5 links.
The fourth block reveals the localization of a GPU with respect to the NUMA
organization of the shared memory of the AMD EPYC processors.
To query the compute capabilities of the GPU devices, use the ``rocminfo`` command. It
lists specific details about the GPU devices, including but not limited to the
number of compute units, width of the SIMD pipelines, memory information, and
instruction set architecture (ISA). The following is the truncated output of the
command:
.. image:: ../../data/how-to/tuning-guides/rocminfo.png
:align: center
:alt: rocminfo.txt example
For a complete list of architecture (such as CDNA3) and LLVM target names
(such as gfx942 for MI300X), refer to the
:doc:`Supported GPUs section of the System requirements for Linux page <rocm-install-on-linux:reference/system-requirements>`.
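To extract just the LLVM target name from the ``rocminfo`` output, a simple filter such as the following can be used; it only assumes the target name (for example, gfx942) appears in the output as described above:
.. code-block:: shell
rocminfo | grep -o -m 1 'gfx[0-9a-f]*'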
Deterministic clock
'''''''''''''''''''
Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock
speed up to 1900 MHz instead of the default 2100 MHz. This can reduce
the chance of a PCC event lowering the attainable GPU clocks. This
setting will not be required for new IFWI releases with the production
PRC feature. Restore this setting to its default value with the
``rocm-smi -r`` command.
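For reference, the two commands mentioned above are:
.. code-block:: shell
rocm-smi --setperfdeterminism 1900   # cap the attainable GPU clock at 1900 MHz
rocm-smi -r                          # restore the default setting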
.. _mi300x-bandwidth-test:
ROCm Bandwidth Test
^^^^^^^^^^^^^^^^^^^
The section Hardware verification with ROCm showed how the command
``rocm-smi --showtopo`` can be used to view the system structure and how the
GPUs are connected. For more details on the link bandwidth,
``rocm-bandwidth-test`` can run benchmarks to show the effective link bandwidth
between the components of the system.
You can install ROCm Bandwidth Test, which can test inter-device bandwidth,
using the following package manager commands:
.. tab-set::
.. tab-item:: Ubuntu
:sync: ubuntu
.. code-block:: shell
sudo apt install rocm-bandwidth-test
.. tab-item:: RHEL
:sync: rhel
.. code-block:: shell
sudo yum install rocm-bandwidth-test
.. tab-item:: SLES
:sync: sles
.. code-block:: shell
sudo zypper install rocm-bandwidth-test
Alternatively, you can download the source code from
`<https://github.com/ROCm/rocm_bandwidth_test>`__ and build from source.
The output will list the available compute devices (CPUs and GPUs), including
their device ID and PCIe ID. The following screenshot is an example of the
beginning part of the output of running ``rocm-bandwidth-test``. It shows the
devices present in the system.
.. image:: ../../data/how-to/tuning-guides/rocm-bandwidth-test.png
:align: center
:alt: rocm-bandwidth-test sample output
The output also shows a matrix that contains a ``1`` if a device can
communicate with another device (CPU or GPU) in the system, as well as the
NUMA distance -- similar to ``rocm-smi``.
Inter-device distance:
.. figure:: ../../data/how-to/tuning-guides/rbt-inter-device-access.png
:align: center
:alt: rocm-bandwidth-test inter-device distance
Inter-device distance
Inter-device NUMA distance:
.. figure:: ../../data/how-to/tuning-guides/rbt-inter-device-numa-distance.png
:align: center
:alt: rocm-bandwidth-test inter-device NUMA distance
Inter-device NUMA distance
The output also contains the measured bandwidth for unidirectional and
bidirectional transfers between the devices (CPU and GPU):
Unidirectional bandwidth:
.. figure:: ../../data/how-to/tuning-guides/rbt-unidirectional-bandwidth.png
:align: center
:alt: rocm-bandwidth-test unidirectional bandwidth
Unidirectional bandwidth
Bidirectional bandwidth:
.. figure:: ../../data/how-to/tuning-guides/rbt-bidirectional-bandwidth.png
:align: center
:alt: rocm-bandwidth-test bidirectional bandwidth
Bidirectional bandwidth
Abbreviations
=============
AMI
American Megatrends International
APBDIS
Algorithmic Performance Boost Disable
ATS
Address Translation Services
BAR
Base Address Register
BIOS
Basic Input/Output System
CBS
Common BIOS Settings
CLI
Command Line Interface
CPU
Central Processing Unit
cTDP
Configurable Thermal Design Power
DDR5
Double Data Rate 5 DRAM
DF
Data Fabric
DIMM
Dual In-line Memory Module
DMA
Direct Memory Access
DPM
Dynamic Power Management
GPU
Graphics Processing Unit
GRUB
Grand Unified Bootloader
HPC
High Performance Computing
IOMMU
Input-Output Memory Management Unit
ISA
Instruction Set Architecture
LCLK
Link Clock Frequency
NBIO
North Bridge Input/Output
NUMA
Non-Uniform Memory Access
PCC
Power Consumption Control
PCI
Peripheral Component Interconnect
PCIe
PCI Express
POR
Power-On Reset
SIMD
Single Instruction, Multiple Data
SMT
Simultaneous Multi-threading
SMI
System Management Interface
SOC
System On Chip
SR-IOV
Single Root I/O Virtualization
TP
Tensor Parallelism
TSME
Transparent Secure Memory Encryption
X2APIC
Extended Advanced Programmable Interrupt Controller
xGMI
Inter-chip Global Memory Interconnect

View File

@@ -7,6 +7,36 @@ myst:
# AMD RDNA2 system optimization
## Workstation workloads
Workstation workloads, much like those for HPC, have a unique set of
requirements: a blend of both graphics and compute, certification, stability and
others.
The document covers specific software requirements and processes needed to use
these GPUs for Single Root I/O Virtualization (SR-IOV) and machine learning
tasks.
The main purpose of this document is to help users utilize the RDNA™ 2 GPUs to
their full potential.
```{list-table}
:header-rows: 1
:stub-columns: 1
* - System Guide
- Architecture reference
- White papers
* - [System settings](#system-settings)
- [AMD RDNA 2 instruction set architecture](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf)
- [RDNA 2 architecture](https://www.amd.com/system/files/documents/rdna2-explained-radeon-pro-W6000.pdf)
```
## System settings
This chapter reviews system settings that are required to configure the system

View File

@@ -0,0 +1,19 @@
.. meta::
:description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
:keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning
************************
AMD MI300X tuning guides
************************
The tuning guides in this section provide a comprehensive summary of the
necessary steps to properly configure your system for AMD Instinct™ MI300X
accelerators. They include detailed instructions on system settings and
application tuning suggestions to help you fully leverage the capabilities of
these accelerators, thereby achieving optimal performance.
* :doc:`../../rocm-for-ai/inference/vllm-benchmark`
* :doc:`../../rocm-for-ai/inference-optimization/workload`
* `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_

View File

@@ -53,13 +53,10 @@ ROCm documentation is organized into the following categories:
:class-body: rocm-card-banner rocm-hue-8
* [GPU architecture overview](./conceptual/gpu-arch.md)
* [Input-Output Memory Management Unit (IOMMU)](./conceptual/iommu.rst)
* [File structure (Linux FHS)](./conceptual/file-reorg.md)
* [GPU isolation techniques](./conceptual/gpu-isolation.md)
* [Using CMake](./conceptual/cmake-packages.rst)
* [PCIe atomics in ROCm](./conceptual/pcie-atomics.rst)
* [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
* [Oversubscription of hardware resources](./conceptual/oversubscription.rst)
:::
:::{grid-item-card} Reference

View File

@@ -101,14 +101,6 @@ subtrees:
title: System optimization
subtrees:
- entries:
- file: how-to/system-optimization/mi300x.rst
title: AMD Instinct MI300X
- file: how-to/system-optimization/mi300a.rst
title: AMD Instinct MI300A
- file: how-to/system-optimization/mi200.md
title: AMD Instinct MI200
- file: how-to/system-optimization/mi100.md
title: AMD Instinct MI100
- file: how-to/system-optimization/w6000-v620.md
title: AMD RDNA 2
- file: how-to/gpu-performance/mi300x.rst
@@ -164,20 +156,14 @@ subtrees:
title: AMD Instinct MI100/CDNA1 ISA
- url: https://www.amd.com/content/dam/amd/en/documents/instinct-business-docs/white-papers/amd-cdna-white-paper.pdf
title: White paper
- file: conceptual/iommu.rst
title: Input-Output Memory Management Unit (IOMMU)
- file: conceptual/file-reorg.md
title: File structure (Linux FHS)
- file: conceptual/gpu-isolation.md
title: GPU isolation techniques
- file: conceptual/cmake-packages.rst
title: Using CMake
- file: conceptual/pcie-atomics.rst
title: PCIe atomics in ROCm
- file: conceptual/ai-pytorch-inception.md
title: Inception v3 with PyTorch
- file: conceptual/oversubscription.rst
title: Oversubscription of hardware resources
- caption: Reference
entries: